Completed
Push — TYPO3_8 ( e4052d )
by Tomas Norre
14:12
created

tx_crawler_lib   F

Complexity

Total Complexity 344

Size/Duplication

Total Lines 2461
Duplicated Lines 0 %

Coupling/Cohesion

Components 1
Dependencies 20

Test Coverage

Coverage 15.16%

Importance

Changes 0
Metric Value
dl 0
loc 2461
ccs 158
cts 1042
cp 0.1516
rs 0.6314
c 0
b 0
f 0
wmc 344
lcom 1
cbo 20
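
The coverage figure can be cross-checked against the raw metric rows. This is a minimal sketch that assumes `ccs` counts covered statements and `cts` counts all coverable statements; the page does not expand these abbreviations, so that reading is an assumption.

```php
<?php
// Assumption: ccs = covered statements, cts = total coverable statements.
$ccs = 158;
$cts = 1042;

// Coverage ratio, rounded to four decimals like the "cp" row.
$cp = round($ccs / $cts, 4);

// The same ratio as a percentage with two decimals, like the "Coverage" row.
$percent = number_format($cp * 100, 2) . '%';

echo $cp, PHP_EOL;
echo $percent, PHP_EOL;
```

Under that assumption the two printed values agree with the `cp` and Coverage rows above.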

59 Methods

Rating   Name   Duplication   Size   Complexity  
A getAccessMode() 0 3 1
A setAccessMode() 0 3 1
A setDisabled() 0 9 3
A getDisabled() 0 7 2
A setProcessFilename() 0 4 1
A getProcessFilename() 0 4 1
A __construct() 0 19 3
A setExtensionSettings() 0 3 1
D checkIfPageShouldBeSkipped() 0 57 16
A getUrlsForPageRow() 0 13 2
A noUnprocessedQueueEntriesForPageWithConfigurationHashExist() 0 7 1
D urlListFromUrlArray() 0 118 21
A drawURLs_PIfilter() 0 11 4
A getPageTSconfigForId() 0 21 4
D getUrlsForPageId() 0 130 26
A getBaseUrlForConfigurationRecord() 0 18 3
C getConfigurationsForBranch() 0 45 11
A hasGroupAccess() 0 11 4
A parseParams() 0 14 3
F expandParameters() 0 110 24
C compileUrls() 0 25 7
B getLogEntriesForPageId() 0 25 6
B getLogEntriesForSetId() 0 24 6
A flushQueue() 0 13 4
A addQueueEntry_callBack() 0 17 3
B addUrl() 0 66 6
C getDuplicateRowsIfExist() 0 40 7
A getCurrentTime() 0 3 1
C readUrl() 0 67 12
A readUrlFromArray() 0 15 1
B readUrl_exec() 0 29 4
D requestUrl() 0 83 19
C getFrontendBasePath() 0 22 8
A executeShellCommand() 0 4 1
B getHttpResponseFromStream() 0 22 5
A log() 0 5 2
A buildRequestHeaderArray() 0 15 4
B getRequestUrlFrom302Header() 0 20 11
A fe_init() 0 17 4
C getPageTreeAndUrls() 0 90 7
D expandExcludeString() 0 44 9
D drawURLs_addRowsForPage() 0 114 15
A getUnprocessedItemsCount() 0 12 1
D CLI_main() 0 39 10
F CLI_main_im() 0 98 17
B CLI_main_flush() 0 37 5
A getConfigurationKeys() 0 4 2
D CLI_run() 0 106 9
A CLI_runHooks() 0 11 4
B CLI_checkAndAcquireNewProcess() 0 57 5
B CLI_releaseProcesses() 0 58 5
A CLI_deleteProcessesMarkedDeleted() 0 3 1
A CLI_checkIfProcessIsActive() 0 17 2
A CLI_buildProcessId() 0 6 2
A microtime() 0 4 1
A CLI_debug() 0 5 2
B sendDirectRequest() 0 25 1
A cleanUpOldQueueEntries() 0 8 1
A initTSFE() 0 18 2

How to fix   Complexity   

Complex Class

Complex classes like tx_crawler_lib often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to finding such a component is to look for fields and methods that share the same prefixes or suffixes. You can also look at the cohesion graph to spot any unconnected or weakly connected components.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

While breaking up the class, it is a good idea to analyze how other classes use tx_crawler_lib, and based on these observations, apply Extract Interface, too.
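
In the method list above, setDisabled(), getDisabled(), setProcessFilename() and getProcessFilename() all revolve around one field, which makes them a plausible candidate for Extract Class. The following is only a sketch of what that could look like: the class name `CrawlerProcessSwitch` is hypothetical, and plain PHP file functions stand in for the TYPO3 utility call used in the original.

```php
<?php
/**
 * Hypothetical extracted class: owns the marker file whose existence
 * means "crawler processing is disabled".
 */
class CrawlerProcessSwitch
{
    /** @var string */
    private $processFilename;

    public function __construct($processFilename)
    {
        $this->processFilename = $processFilename;
    }

    public function getProcessFilename()
    {
        return $this->processFilename;
    }

    public function setDisabled($disabled = true)
    {
        if ($disabled) {
            // Stand-in for GeneralUtility::writeFile(): create an empty marker file.
            file_put_contents($this->processFilename, '');
        } elseif (is_file($this->processFilename)) {
            unlink($this->processFilename);
        }
    }

    public function getDisabled()
    {
        return is_file($this->processFilename);
    }
}

// Usage: the original class would hold one instance and delegate to it.
$switch = new CrawlerProcessSwitch(sys_get_temp_dir() . '/tx_crawler_sketch.proc');
$switch->setDisabled(true);
var_dump($switch->getDisabled());
$switch->setDisabled(false);
var_dump($switch->getDisabled());
```

Keeping the file handling behind one small object would also make the disable flag testable without instantiating the whole crawler class.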

<?php
/***************************************************************
 *  Copyright notice
 *
 *  (c) 2016 AOE GmbH <[email protected]>
 *
 *  All rights reserved
 *
 *  This script is part of the TYPO3 project. The TYPO3 project is
 *  free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  The GNU General Public License can be found at
 *  http://www.gnu.org/copyleft/gpl.html.
 *
 *  This script is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  This copyright notice MUST APPEAR in all copies of the script!
 ***************************************************************/

use TYPO3\CMS\Core\Imaging\Icon;
use TYPO3\CMS\Core\Imaging\IconFactory;

/**
 * Class tx_crawler_lib
 */
class tx_crawler_lib {

    var $setID = 0;
    var $processID ='';
    var $max_CLI_exec_time = 3600;    // One hour is max stalled time for the CLI (if the process has had the status "start" for 3600 seconds, it is regarded as stalled and a new process is started).

    var $duplicateTrack = array();
    var $downloadUrls = array();

    var $incomingProcInstructions = array();
    var $incomingConfigurationSelection = array();


    var $registerQueueEntriesInternallyOnly = array();
    var $queueEntries = array();
    var $urlList = array();

    var $debugMode=FALSE;

    var $extensionSettings=array();

    var $MP = false; // mount point

    protected $processFilename;

    /**
     * Holds the internal access mode; can be 'gui', 'cli' or 'cli_im'
     *
     * @var string
     */
    protected $accessMode;

    /**
     * @var \TYPO3\CMS\Core\Database\DatabaseConnection
     */
    private $db;

    /**
     * @var TYPO3\CMS\Core\Authentication\BackendUserAuthentication
     */
    private $backendUser;

    const CLI_STATUS_NOTHING_PROCCESSED = 0;
    const CLI_STATUS_REMAIN = 1;    //queue not empty
    const CLI_STATUS_PROCESSED = 2;    //(some) queue items were processed
    const CLI_STATUS_ABORTED = 4;    //instance didn't finish
    const CLI_STATUS_POLLABLE_PROCESSED = 8;

    /**
     * Method to get the accessMode; can be gui, cli or cli_im
     *
     * @return string
     */
    public function getAccessMode() {
        return $this->accessMode;
    }

    /**
     * @param string $accessMode
     */
    public function setAccessMode($accessMode) {
        $this->accessMode = $accessMode;
    }

    /**
     * Set disabled status to prevent processes from being processed
     *
     * @param  bool $disabled (optional, defaults to true)
     * @return void
     */
    public function setDisabled($disabled = true) {
        if ($disabled) {
            \TYPO3\CMS\Core\Utility\GeneralUtility::writeFile($this->processFilename, '');
        } else {
            if (is_file($this->processFilename)) {
                unlink($this->processFilename);
            }
        }
    }

    /**
     * Get disable status
     *
     * @return bool true if disabled
     */
    public function getDisabled() {
        if (is_file($this->processFilename)) {
            return true;
        } else {
            return false;
        }
    }

    /**
     * @param string $filenameWithPath
     *
     * @return void
     */
    public function setProcessFilename($filenameWithPath)
    {
        $this->processFilename = $filenameWithPath;
    }

    /**
     * @return string
     */
    public function getProcessFilename()
    {
        return $this->processFilename;
    }



    /************************************
     *
     * Getting URLs based on Page TSconfig
     *
     ************************************/

    public function __construct() {
        $this->db = $GLOBALS['TYPO3_DB'];
        $this->backendUser = $GLOBALS['BE_USER'];
        $this->processFilename = PATH_site.'typo3temp/tx_crawler.proc';

        $settings = unserialize($GLOBALS['TYPO3_CONF_VARS']['EXT']['extConf']['crawler']);
        $settings = is_array($settings) ? $settings : array();

        // read ext_em_conf_template settings and set
        $this->setExtensionSettings($settings);


        // set defaults:
        if (\TYPO3\CMS\Core\Utility\MathUtility::convertToPositiveInteger($this->extensionSettings['countInARun']) == 0) {
            $this->extensionSettings['countInARun'] = 100;
        }

        $this->extensionSettings['processLimit'] = \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($this->extensionSettings['processLimit'],1,99,1);
    }

    /**
     * Sets the extension settings (unserialized counterpart of $TYPO3_CONF_VARS['EXT']['extConf']['crawler']).
     *
     * @param array $extensionSettings
     * @return void
     */
    public function setExtensionSettings(array $extensionSettings) {
        $this->extensionSettings = $extensionSettings;
    }

    /**
     * Check if the given page should be crawled
     *
     * @param array $pageRow
     * @return false|string false if the page should be crawled (not excluded), skipMessage if it should be skipped
     * @author Fabrizio Branca <[email protected]>
     */
    public function checkIfPageShouldBeSkipped(array $pageRow) {

        $skipPage = false;
        $skipMessage = 'Skipped'; // message will be overwritten later

            // if page is hidden
        if (!$this->extensionSettings['crawlHiddenPages']) {
            if ($pageRow['hidden']) {
                $skipPage = true;
                $skipMessage = 'Because page is hidden';
            }
        }

        if (!$skipPage) {
            if (\TYPO3\CMS\Core\Utility\GeneralUtility::inList('3,4', $pageRow['doktype']) || $pageRow['doktype']>=199) {
                $skipPage = true;
                $skipMessage = 'Because doktype is not allowed';
            }
        }

        if (!$skipPage) {
            if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['excludeDoktype'])) {
                foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['excludeDoktype'] as $key => $doktypeList) {
                    if (\TYPO3\CMS\Core\Utility\GeneralUtility::inList($doktypeList, $pageRow['doktype'])) {
                        $skipPage = true;
                        $skipMessage = 'Doktype was excluded by "'.$key.'"';
                        break;
                    }
                }
            }
        }

        if (!$skipPage) {
                // veto hook
            if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pageVeto'])) {
                foreach($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pageVeto'] as $key => $func) {
                    $params = array(
                        'pageRow' => $pageRow
                    );
                    // expects "false" if page is ok and "true" or a skipMessage if this page should _not_ be crawled
                    $veto = \TYPO3\CMS\Core\Utility\GeneralUtility::callUserFunction($func, $params, $this);
                    if ($veto !== false) {
                        $skipPage = true;
                        if (is_string($veto)) {
                            $skipMessage = $veto;
                        } else {
                            $skipMessage = 'Veto from hook "'.htmlspecialchars($key).'"';
                        }
                        // no need to execute other hooks if a previous one returned a veto
                        break;
                    }
                }
            }
        }

        return $skipPage ? $skipMessage : false;
    }

    /**
     * Wrapper method for getUrlsForPageId()
     * It returns an array of configurations and no urls!
     *
     * @param  array  $pageRow       Page record with at least dok-type and uid columns.
     * @param  string $skipMessage
     * @return array                 Result (see getUrlsForPageId())
     * @see getUrlsForPageId()
     */
    public function getUrlsForPageRow(array $pageRow, &$skipMessage = '') {
        $message = $this->checkIfPageShouldBeSkipped($pageRow);

        if ($message === false) {
            $res = $this->getUrlsForPageId($pageRow['uid']);
            $skipMessage = '';
        } else {
            $skipMessage = $message;
            $res = array();
        }

        return $res;
    }

    /**
     * This method is used to check whether there are ANY unprocessed queue entries
     * for a given page_id and the configuration which matches a given hash.
     * If there are none, we can skip an inner detail check
     *
     * @param  int    $uid
     * @param  string $configurationHash
     * @return boolean
     */
    protected function noUnprocessedQueueEntriesForPageWithConfigurationHashExist($uid,$configurationHash) {
        $configurationHash = $this->db->fullQuoteStr($configurationHash,'tx_crawler_queue');
        $res = $this->db->exec_SELECTquery('count(*) as anz','tx_crawler_queue',"page_id=".intval($uid)." AND configuration_hash=".$configurationHash." AND exec_time=0");
        $row = $this->db->sql_fetch_assoc($res);

        return ($row['anz'] == 0);
    }

    /**
     * Creates a list of URLs from input array (and submits them to queue if asked for)
     * See Web > Info module script + "indexed_search"'s crawler hook-client using this!
     *
     * @param    array        Information about URLs from pageRow to crawl.
     * @param    array        Page row
     * @param    integer        Unix time to schedule indexing to, typically time()
     * @param    integer        Number of requests per minute (creates the interleave between requests)
     * @param    boolean        If set, submits the URLs to queue
     * @param    boolean        If set (and $submitCrawlUrls is false), will fill $downloadUrls with entries
     * @param    array        Array which is passed by reference and contains an id per url to ensure we will not crawl duplicates
     * @param    array        Array which will be filled with URLs for download if flag is set.
     * @param    array        Array of processing instructions
     * @return    string        List of URLs (meant for display in backend module)
     *
     */
    function urlListFromUrlArray(
Best Practice: It is generally recommended to explicitly declare the visibility for methods.

Adding explicit visibility (private, protected, or public) is generally recommended to communicate to other developers how, and from where, this method is intended to be used.
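
As a small illustration of the recommendation, the sketch below contrasts an implicit and an explicit declaration. The class and method names are hypothetical and not taken from tx_crawler_lib.

```php
<?php
class VisibilitySketch
{
    // No keyword: PHP treats the method as public, but the intent is unclear.
    function implicitlyPublic()
    {
        return $this->helper();
    }

    // Explicit keyword: callers may rely on this method.
    public function explicitlyPublic()
    {
        return $this->helper();
    }

    // Explicit keyword: internal detail, not callable from outside the class.
    private function helper()
    {
        return 'ok';
    }
}

$sketch = new VisibilitySketch();
echo $sketch->implicitlyPublic(), PHP_EOL;
echo $sketch->explicitlyPublic(), PHP_EOL;
```

Both calls behave the same at runtime; the difference is only in what the declaration tells the reader.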

    array $vv,
    array $pageRow,
    $scheduledTime,
    $reqMinute,
    $submitCrawlUrls,
    $downloadCrawlUrls,
    array &$duplicateTrack,
    array &$downloadUrls,
    array $incomingProcInstructions) {

        // realurl support (thanks to Ingo Renner)
        if (\TYPO3\CMS\Core\Utility\ExtensionManagementUtility::isLoaded('realurl') && $vv['subCfg']['realurl']) {

            /** @var tx_realurl $urlObj */
            $urlObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('tx_realurl');

            if (!empty($vv['subCfg']['baseUrl'])) {
                $urlParts = parse_url($vv['subCfg']['baseUrl']);
                $host = strtolower($urlParts['host']);
                $urlObj->host = $host;

                // First pass, finding configuration OR pointer string:
                $urlObj->extConf = isset($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->host]) ? $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->host] : $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl']['_DEFAULT'];

                // If it turned out to be a string pointer, then look up the real config:
                if (is_string($urlObj->extConf)) {
                    $urlObj->extConf = is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->extConf]) ? $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->extConf] : $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl']['_DEFAULT'];
                }

            }

            if (!$GLOBALS['TSFE']->sys_page) {
                $GLOBALS['TSFE']->sys_page = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Frontend\Page\PageRepository');
            }
            if (!$GLOBALS['TSFE']->csConvObj) {
                $GLOBALS['TSFE']->csConvObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Core\Charset\CharsetConverter');
            }
            if (!$GLOBALS['TSFE']->tmpl->rootLine[0]['uid']) {
                $GLOBALS['TSFE']->tmpl->rootLine[0]['uid'] = $urlObj->extConf['pagePath']['rootpage_id'];
            }
        }

        if (is_array($vv['URLs'])) {
            $configurationHash = md5(serialize($vv));
            $skipInnerCheck = $this->noUnprocessedQueueEntriesForPageWithConfigurationHashExist($pageRow['uid'],$configurationHash);

            foreach($vv['URLs'] as $urlQuery) {

                if ($this->drawURLs_PIfilter($vv['subCfg']['procInstrFilter'], $incomingProcInstructions)) {

                    // Calculate cHash:
                    if ($vv['subCfg']['cHash']) {
                        /* @var $cacheHash \TYPO3\CMS\Frontend\Page\CacheHashCalculator */
                        $cacheHash = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Frontend\Page\CacheHashCalculator');
                        $urlQuery .= '&cHash=' . $cacheHash->generateForParameters($urlQuery);
                    }

                    // Create key by which to determine unique-ness:
                    $uKey = $urlQuery.'|'.$vv['subCfg']['userGroups'].'|'.$vv['subCfg']['baseUrl'].'|'.$vv['subCfg']['procInstrFilter'];

                    // realurl support (thanks to Ingo Renner)
                    $urlQuery = 'index.php' . $urlQuery;
                    if (\TYPO3\CMS\Core\Utility\ExtensionManagementUtility::isLoaded('realurl') && $vv['subCfg']['realurl']) {
                        $params = array(
                            'LD' => array(
                                'totalURL' => $urlQuery
                            ),
                            'TCEmainHook' => true
                        );
                        $urlObj->encodeSpURL($params);
Bug: The variable $urlObj does not seem to be defined for all execution paths leading up to this point.

If you define a variable conditionally, it can happen that it is not defined for all execution paths.

Let’s take a look at an example:

function myFunction($a) {
    switch ($a) {
        case 'foo':
            $x = 1;
            break;

        case 'bar':
            $x = 2;
            break;
    }

    // $x is potentially undefined here.
    echo $x;
}

In the above example, the variable $x is defined if you pass “foo” or “bar” as argument for $a. However, since the switch statement has no default case statement, if you pass any other value, the variable $x would be undefined.

Available Fixes

  1. Check for existence of the variable explicitly:

    function myFunction($a) {
        switch ($a) {
            case 'foo':
                $x = 1;
                break;
    
            case 'bar':
                $x = 2;
                break;
        }
    
        if (isset($x)) { // Make sure it's always set.
            echo $x;
        }
    }
    
  2. Define a default value for the variable:

    function myFunction($a) {
        $x = ''; // Set a default which gets overridden for certain paths.
        switch ($a) {
            case 'foo':
                $x = 1;
                break;
    
            case 'bar':
                $x = 2;
                break;
        }
    
        echo $x;
    }
    
  3. Add a value for the missing path:

    function myFunction($a) {
        switch ($a) {
            case 'foo':
                $x = 1;
                break;
    
            case 'bar':
                $x = 2;
                break;
    
            // We add support for the missing case.
            default:
                $x = '';
                break;
        }
    
        echo $x;
    }
    
                        $urlQuery = $params['LD']['totalURL'];
                    }

                    // Scheduled time:
                    $schTime = $scheduledTime + round(count($duplicateTrack)*(60/$reqMinute));
                    $schTime = floor($schTime/60)*60;

                    if (isset($duplicateTrack[$uKey])) {

                        // if the url key is registered, just display it and do not resubmit it
                        $urlList = '<em><span class="typo3-dimmed">'.htmlspecialchars($urlQuery).'</span></em><br/>';

                    } else {

                        $urlList = '['.date('d.m.y H:i', $schTime).'] '.htmlspecialchars($urlQuery);
                        $this->urlList[] = '['.date('d.m.y H:i', $schTime).'] '.$urlQuery;

                        $theUrl = ($vv['subCfg']['baseUrl'] ? $vv['subCfg']['baseUrl'] : \TYPO3\CMS\Core\Utility\GeneralUtility::getIndpEnv('TYPO3_SITE_URL')) . $urlQuery;

                        // Submit for crawling!
                        if ($submitCrawlUrls) {
                            $added = $this->addUrl(
                            $pageRow['uid'],
                            $theUrl,
                            $vv['subCfg'],
                            $scheduledTime,
                            $configurationHash,
                            $skipInnerCheck
                            );
                            if ($added === false) {
                                $urlList .= ' (Url already existed)';
                            }
                        } elseif ($downloadCrawlUrls) {
                            $downloadUrls[$theUrl] = $theUrl;
                        }

                        $urlList .= '<br />';
                    }
                    $duplicateTrack[$uKey] = TRUE;
                }
            }
        } else {
            $urlList = 'ERROR - no URL generated';
        }

        return $urlList;
Bug: The variable $urlList does not seem to be defined for all execution paths leading up to this point.

If you define a variable conditionally, it can happen that it is not defined for all execution paths.
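
One way to close this gap is to initialise the variable before the branch. The sketch below is a simplified stand-in that models only the control flow, not the original method: `buildUrlListSketch()` and its parameter are hypothetical. It also appends inside the loop, which is a deliberate change from the listing, where each iteration assigns to `$urlList` and so replaces the previous entry.

```php
<?php
/**
 * Hypothetical, simplified model of the control flow in urlListFromUrlArray().
 *
 * @param array|null $urls URLs to list, or null when nothing was generated
 * @return string
 */
function buildUrlListSketch($urls)
{
    // Default value, so every path below returns a defined string.
    $urlList = '';

    if (is_array($urls)) {
        foreach ($urls as $url) {
            // Append instead of assign, so earlier entries are kept.
            $urlList .= htmlspecialchars($url) . '<br />';
        }
    } else {
        $urlList = 'ERROR - no URL generated';
    }

    return $urlList;
}

echo buildUrlListSketch(array()), PHP_EOL;
echo buildUrlListSketch(array('?id=1', '?id=2')), PHP_EOL;
echo buildUrlListSketch(null), PHP_EOL;
```

With the default in place, an empty URL array yields an empty string rather than an undefined-variable notice.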

    }

    /**
     * Returns true if input processing instruction is among registered ones.
     *
     * @param  string $piString                     PI to test
     * @param  array  $incomingProcInstructions     Processing instructions
     * @return boolean                              TRUE if found
     */
    public function drawURLs_PIfilter($piString, array $incomingProcInstructions) {
        if (empty($incomingProcInstructions)) {
            return TRUE;
        }

        foreach($incomingProcInstructions as $pi) {
            if (\TYPO3\CMS\Core\Utility\GeneralUtility::inList($piString, $pi)) {
                return TRUE;
            }
        }
    }


    public function getPageTSconfigForId($id) {
        if(!$this->MP){
            $pageTSconfig = \TYPO3\CMS\Backend\Utility\BackendUtility::getPagesTSconfig($id);
        } else {
            list(,$mountPointId) = explode('-', $this->MP);
            $pageTSconfig = \TYPO3\CMS\Backend\Utility\BackendUtility::getPagesTSconfig($mountPointId);
        }

        // Call a hook to alter configuration
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['getPageTSconfigForId'])) {
            $params = array(
                'pageId' => $id,
                'pageTSConfig' => &$pageTSconfig
            );
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['getPageTSconfigForId'] as $userFunc) {
                \TYPO3\CMS\Core\Utility\GeneralUtility::callUserFunction($userFunc, $params, $this);
            }
        }

        return $pageTSconfig;
    }

    /**
     * This method returns an array of configurations.
     * And no urls!
     *
     * @param  integer $id  Page ID
     * @return array        Configurations from pages and configuration records
     */
    protected function getUrlsForPageId($id) {

        /**
         * Get configuration from tsConfig
         */

        // Get page TSconfig for page ID:
        $pageTSconfig = $this->getPageTSconfigForId($id);

        $res = array();

        if (is_array($pageTSconfig) && is_array($pageTSconfig['tx_crawler.']['crawlerCfg.'])) {
            $crawlerCfg = $pageTSconfig['tx_crawler.']['crawlerCfg.'];

            if (is_array($crawlerCfg['paramSets.'])) {
                foreach($crawlerCfg['paramSets.'] as $key => $values) {
                    if (!is_array($values)) {

                        // Sub configuration for a single configuration string:
                        $subCfg = (array)$crawlerCfg['paramSets.'][$key.'.'];
                        $subCfg['key'] = $key;

                        if (strcmp($subCfg['procInstrFilter'],'')) {
                            $subCfg['procInstrFilter'] = implode(',',\TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',',$subCfg['procInstrFilter']));
                        }
                        $pidOnlyList = implode(',',\TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',',$subCfg['pidsOnly'],1));
Documentation: 1 is of type integer, but the function expects a boolean.

It seems like the type of the argument is not accepted by the function/method which you are calling.

In some cases, in particular if PHP’s automatic type-juggling kicks in, this might be fine. In other cases, however, this might be a bug.

We suggest adding an explicit type cast, as in the following example:

function acceptsInteger($int) { }

$x = '123'; // string "123"

// Instead of
acceptsInteger($x);

// we recommend to use
acceptsInteger((integer) $x);
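
For a literal argument like the one flagged here, passing a real boolean is probably simpler than casting. The sketch below uses a hypothetical stand-in, `trimExplodeSketch()`, that mimics a split-trim-filter helper whose third parameter is a boolean; it is not the TYPO3 implementation.

```php
<?php
/**
 * Hypothetical stand-in: split a string, trim each part and
 * optionally drop empty values.
 *
 * @param string $delimiter
 * @param string $string
 * @param bool   $removeEmptyValues
 * @return array
 */
function trimExplodeSketch($delimiter, $string, $removeEmptyValues = false)
{
    $parts = array_map('trim', explode($delimiter, $string));
    if ($removeEmptyValues) {
        // Drop empty strings (but keep '0') and reindex the result.
        $parts = array_values(array_filter($parts, 'strlen'));
    }
    return $parts;
}

// Passing true (not 1) matches the documented boolean parameter.
$pidOnlyList = implode(',', trimExplodeSketch(',', ' 12, ,34 ,', true));
echo $pidOnlyList, PHP_EOL;
```

The behaviour is the same as with `1`, since PHP would juggle the integer to `true`; the change only makes the call agree with the declared parameter type.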

                            // process configuration if it is not page-specific or if the specific page is the current page:
                        if (!strcmp($subCfg['pidsOnly'],'') || \TYPO3\CMS\Core\Utility\GeneralUtility::inList($pidOnlyList,$id)) {

                                // add trailing slash if not present
                            if (!empty($subCfg['baseUrl']) && substr($subCfg['baseUrl'], -1) != '/') {
                                $subCfg['baseUrl'] .= '/';
                            }

                                // Explode, process etc.:
                            $res[$key] = array();
                            $res[$key]['subCfg'] = $subCfg;
                            $res[$key]['paramParsed'] = $this->parseParams($values);
                            $res[$key]['paramExpanded'] = $this->expandParameters($res[$key]['paramParsed'],$id);
                            $res[$key]['origin'] = 'pagets';

                                // recognize MP value
                            if(!$this->MP){
                                $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'],array('?id='.$id));
                            } else {
                                $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'],array('?id='.$id.'&MP='.$this->MP));
                            }
                        }
                    }
                }

            }
        }

        /**
         * Get configuration from tx_crawler_configuration records
         */

            // get records along the rootline
        $rootLine = \TYPO3\CMS\Backend\Utility\BackendUtility::BEgetRootLine($id);

        foreach ($rootLine as $page) {
            $configurationRecordsForCurrentPage = \TYPO3\CMS\Backend\Utility\BackendUtility::getRecordsByField(
                'tx_crawler_configuration',
                'pid',
                intval($page['uid']),
                \TYPO3\CMS\Backend\Utility\BackendUtility::BEenableFields('tx_crawler_configuration') . \TYPO3\CMS\Backend\Utility\BackendUtility::deleteClause('tx_crawler_configuration')
            );

            if (is_array($configurationRecordsForCurrentPage)) {
                foreach ($configurationRecordsForCurrentPage as $configurationRecord) {

                        // check access to the configuration record
                    if (empty($configurationRecord['begroups']) || $GLOBALS['BE_USER']->isAdmin() || $this->hasGroupAccess($GLOBALS['BE_USER']->user['usergroup_cached_list'], $configurationRecord['begroups'])) {

                        $pidOnlyList = implode(',',\TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',',$configurationRecord['pidsonly'],1));
Documentation: 1 is of type integer, but the function expects a boolean.

547
548
                            // process configuration if it is not page-specific or if the specific page is the current page:
549
                        if (!strcmp($configurationRecord['pidsonly'],'') || \TYPO3\CMS\Core\Utility\GeneralUtility::inList($pidOnlyList,$id)) {
550
                            $key = $configurationRecord['name'];
551
552
                                // don't overwrite previously defined paramSets
553
                            if (!isset($res[$key])) {
554
555
                                    /* @var $TSparserObject \TYPO3\CMS\Core\TypoScript\Parser\TypoScriptParser */
556
                                $TSparserObject = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Core\TypoScript\Parser\TypoScriptParser');
557
                                $TSparserObject->parse($configurationRecord['processing_instruction_parameters_ts']);
558
559
                                $subCfg = array(
560
                                    'procInstrFilter' => $configurationRecord['processing_instruction_filter'],
561
                                    'procInstrParams.' => $TSparserObject->setup,
562
                                    'baseUrl' => $this->getBaseUrlForConfigurationRecord($configurationRecord['base_url'], $configurationRecord['sys_domain_base_url']),
563
                                    'realurl' => $configurationRecord['realurl'],
564
                                    'cHash' => $configurationRecord['chash'],
565
                                    'userGroups' => $configurationRecord['fegroups'],
566
                                    'exclude' => $configurationRecord['exclude'],
567
                                    'key' => $key,
568
                                );
569
570
                                    // add trailing slash if not present
571
                                if (!empty($subCfg['baseUrl']) && substr($subCfg['baseUrl'], -1) != '/') {
572
                                    $subCfg['baseUrl'] .= '/';
573
                                }
574
                                if (!in_array($id, $this->expandExcludeString($subCfg['exclude']))) {
575
                                    $res[$key] = array();
576
                                    $res[$key]['subCfg'] = $subCfg;
577
                                    $res[$key]['paramParsed'] = $this->parseParams($configurationRecord['configuration']);
578
                                    $res[$key]['paramExpanded'] = $this->expandParameters($res[$key]['paramParsed'], $id);
579
                                    $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], array('?id=' . $id));
580
                                    $res[$key]['origin'] = 'tx_crawler_configuration_'.$configurationRecord['uid'];
581
                                }
582
                            }
583
                        }
584
                    }
585
                }
586
            }
587
        }
588
589
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['processUrls']))    {
590
            foreach($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['processUrls'] as $func)    {
591
                $params = array(
592
                    'res' => &$res,
593
                );
594
                \TYPO3\CMS\Core\Utility\GeneralUtility::callUserFunction($func, $params, $this);
595
            }
596
        }
597
598
        return $res;
599
    }

    /**
     * Checks if a domain record exists and returns the base URL based on that record. If not, the given baseUrl string is used.
     *
     * @param  string   $baseUrl
     * @param  integer  $sysDomainUid
     * @return string
     */
    protected function getBaseUrlForConfigurationRecord($baseUrl, $sysDomainUid) {
        $sysDomainUid = intval($sysDomainUid);

        if ($sysDomainUid > 0) {
            $res = $this->db->exec_SELECTquery(
                '*',
                'sys_domain',
                'uid = '.$sysDomainUid .
                \TYPO3\CMS\Backend\Utility\BackendUtility::BEenableFields('sys_domain') .
                \TYPO3\CMS\Backend\Utility\BackendUtility::deleteClause('sys_domain')
            );
            $row = $this->db->sql_fetch_assoc($res);
                // sql_fetch_assoc() yields no array when the record was not found
            if (is_array($row) && $row['domainName'] != '') {
                return 'http://'.$row['domainName'];
            }
        }
        return $baseUrl;
    }
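
Review note: the base URL is resolved in two steps, first the domain record lookup in getBaseUrlForConfigurationRecord(), then the trailing-slash fix-up in the calling code. The sketch below combines both steps without any database access. The function name `sketchResolveBaseUrl` and the idea of passing the already looked-up domain name as a string are assumptions made for illustration; they are not part of tx_crawler_lib.

```php
<?php
// Hypothetical helper, not part of tx_crawler_lib.
// $domainName stands in for the 'domainName' field of a sys_domain record
// ('' when no record was found).
function sketchResolveBaseUrl($baseUrl, $domainName)
{
    // A non-empty domain name takes precedence over the configured base URL.
    $resolved = ($domainName !== '') ? 'http://' . $domainName : $baseUrl;

    // Append a trailing slash unless the result is empty or already ends in one.
    if ($resolved !== '' && substr($resolved, -1) !== '/') {
        $resolved .= '/';
    }

    return $resolved;
}

$fromDomain = sketchResolveBaseUrl('https://fallback.example/', 'www.example.org');
$fromConfig = sketchResolveBaseUrl('https://fallback.example', '');
```

Here `$fromDomain` is built from the domain name and `$fromConfig` falls back to the configured value; both end in a slash.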

    /**
     * Collects the names of all crawler configurations that apply to a branch of the page tree:
     * paramSets from the page TSconfig of the root page plus tx_crawler_configuration records
     * stored on the rootline of the root page or within the subtree.
     *
     * @param  integer $rootid  Page ID of the branch root
     * @param  integer $depth   Number of tree levels to descend
     * @return array            Configuration names
     */
    public function getConfigurationsForBranch($rootid, $depth) {

        $configurationsForBranch = array();

        $pageTSconfig = $this->getPageTSconfigForId($rootid);
        if (is_array($pageTSconfig) && is_array($pageTSconfig['tx_crawler.']['crawlerCfg.']) && is_array($pageTSconfig['tx_crawler.']['crawlerCfg.']['paramSets.']))    {

            $sets = $pageTSconfig['tx_crawler.']['crawlerCfg.']['paramSets.'];
            if(is_array($sets)) {
                foreach($sets as $key=>$value) {
                    if(!is_array($value)) continue;
                    $configurationsForBranch[] = substr($key,-1)=='.'?substr($key,0,-1):$key;
                }

            }
        }
        $pids = array();
        $rootLine = \TYPO3\CMS\Backend\Utility\BackendUtility::BEgetRootLine($rootid);
        foreach($rootLine as $node) {
            $pids[] = $node['uid'];
        }
        /* @var \TYPO3\CMS\Backend\Tree\View\PageTreeView */
        $tree = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Backend\Tree\View\PageTreeView');
        $perms_clause = $GLOBALS['BE_USER']->getPagePermsClause(1);
        $tree->init('AND ' . $perms_clause);
        $tree->getTree($rootid, $depth, '');
        foreach($tree->tree as $node) {
            $pids[] = $node['row']['uid'];
        }

        $res = $this->db->exec_SELECTquery(
            '*',
            'tx_crawler_configuration',
            'pid IN ('.implode(',', $pids).') '.
            \TYPO3\CMS\Backend\Utility\BackendUtility::BEenableFields('tx_crawler_configuration') .
            \TYPO3\CMS\Backend\Utility\BackendUtility::deleteClause('tx_crawler_configuration').' '.
            \TYPO3\CMS\Backend\Utility\BackendUtility::versioningPlaceholderClause('tx_crawler_configuration').' '
        );

        while($row = $this->db->sql_fetch_assoc($res)) {
            $configurationsForBranch[] = $row['name'];
        }
        $this->db->sql_free_result($res);
        return $configurationsForBranch;
    }
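
Review note: the first half of getConfigurationsForBranch() derives configuration names from the `paramSets.` array of the page TSconfig, keeping only keys whose value is a sub-array and dropping the trailing dot. A minimal standalone sketch of that step follows; `sketchParamSetNames` and the sample TSconfig values are hypothetical.

```php
<?php
// Hypothetical helper, not part of tx_crawler_lib.
function sketchParamSetNames(array $paramSets)
{
    $names = array();
    foreach ($paramSets as $key => $value) {
        // Plain string values (the parameter string itself) are skipped;
        // only "name." sub-arrays count as a configuration.
        if (!is_array($value)) {
            continue;
        }
        // Strip the trailing dot that TypoScript arrays carry on sub-keys.
        $names[] = substr($key, -1) === '.' ? substr($key, 0, -1) : $key;
    }
    return $names;
}

// Sample data in the shape of a parsed TSconfig array (values are made up).
$names = sketchParamSetNames(array(
    'language'  => '&L=[0-2]',
    'language.' => array('procInstrFilter' => 'tx_indexedsearch_reindex'),
    'news.'     => array('baseUrl' => 'http://example.org/'),
));
```

With the sample input, `$names` contains `language` and `news`.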

    /**
     * Check if a user has access to an item
     * (e.g. get the group list of the current logged in user from $GLOBALS['TSFE']->gr_list)
     *
     * @see \TYPO3\CMS\Frontend\Page\PageRepository::getMultipleGroupsWhereClause()
     * @param  string $groupList    Comma-separated list of (fe_)group UIDs from a user
     * @param  string $accessList   Comma-separated list of (fe_)group UIDs of the item to access
     * @return bool                 TRUE if at least one of the user's group UIDs is in the access list or the access list is empty
     * @author Fabrizio Branca
     * @since 2009-01-19
     */
    public function hasGroupAccess($groupList, $accessList) {
        if (empty($accessList)) {
            return true;
        }
        foreach(\TYPO3\CMS\Core\Utility\GeneralUtility::intExplode(',', $groupList) as $groupUid) {
            if (\TYPO3\CMS\Core\Utility\GeneralUtility::inList($accessList, $groupUid)) {
                return true;
            }
        }
        return false;
    }
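
Review note: hasGroupAccess() amounts to "empty access list, or non-empty intersection of the two UID lists". The sketch below shows the same decision with plain PHP functions instead of GeneralUtility; `sketchHasGroupAccess` is a hypothetical name, and unlike the original it compares integers rather than list substrings, so it is an approximation and not a drop-in replacement.

```php
<?php
// Hypothetical helper, not part of tx_crawler_lib.
function sketchHasGroupAccess($groupList, $accessList)
{
    // No restriction configured: everybody has access.
    if (empty($accessList)) {
        return true;
    }

    // Normalise both comma-separated lists to integer arrays.
    $userGroups = array_map('intval', explode(',', $groupList));
    $allowedGroups = array_map('intval', explode(',', $accessList));

    // Access is granted as soon as one group appears in both lists.
    return count(array_intersect($userGroups, $allowedGroups)) > 0;
}

$unrestricted = sketchHasGroupAccess('1,2', '');
$matching     = sketchHasGroupAccess('1,2', '2,5');
$denied       = sketchHasGroupAccess('1,2', '3,4');
```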

    /**
     * Parse GET vars of input Query into array with key=>value pairs
     *
     * @param  string  $inputQuery  Input query string
     * @return array                Keys are GET var names, values are the values of the GET vars.
     */
    public function parseParams($inputQuery) {
            // Extract all GET parameters into an ARRAY:
        $paramKeyValues = array();
        $GETparams = explode('&', $inputQuery);

        foreach($GETparams as $paramAndValue)    {
                // pad to two elements so that a parameter without "=" yields an empty value instead of a notice
            list($p,$v) = array_pad(explode('=', $paramAndValue, 2), 2, '');
            if (strlen($p))        {
                $paramKeyValues[rawurldecode($p)] = rawurldecode($v);
            }
        }

        return $paramKeyValues;
    }
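
Review note: a worked example of what parseParams() does to a typical crawler configuration string. The standalone function `sketchParseParams` and the sample query are assumptions for illustration only.

```php
<?php
// Hypothetical standalone variant of parseParams(), not part of tx_crawler_lib.
function sketchParseParams($inputQuery)
{
    $paramKeyValues = array();

    foreach (explode('&', $inputQuery) as $paramAndValue) {
        // Split on the first "=" only, so values may contain "=" themselves.
        $pair = explode('=', $paramAndValue, 2);
        $name = $pair[0];
        $value = isset($pair[1]) ? $pair[1] : '';

        // Empty names (e.g. caused by a leading "&") are ignored.
        if (strlen($name) > 0) {
            $paramKeyValues[rawurldecode($name)] = rawurldecode($value);
        }
    }

    return $paramKeyValues;
}

$parsed = sketchParseParams('&L=[0-2]&tx_news%5Bitem%5D=[_TABLE:tt_content]');
```

The result maps `L` to `[0-2]` and the decoded name `tx_news[item]` to `[_TABLE:tt_content]`; the bracketed values are left untouched here and are only interpreted later by expandParameters().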

    /**
     * Will expand the parameters configuration to individual values. This follows a certain syntax of the value of each parameter.
     * Syntax of values:
     * - Basically: If the value is wrapped in [...] it will be expanded according to the following syntax, otherwise the value is taken literally
     * - Configuration is split by "|" and the parts are processed individually and finally added together
     * - For each configuration part:
     *         - "[int]-[int]" = Integer range, will be expanded to all values in between, values included, starting from low to high (max. 1000). Example "1-34" or "-40--30"
     *         - "_TABLE:[TCA table name];[_PID:[optional page id, default is current page]];[_ENABLELANG:1]" = Look up of table records from PID, filtering out deleted records. Example "_TABLE:tt_content; _PID:123"
     *           _ENABLELANG:1 picks only original records without their language overlays
     *         - Default: Literal value
     *
     * @param    array      $paramArray  Array with key (GET var name) and values (value of GET var which is configuration for expansion)
     * @param    integer    $pid         Current page ID
     * @return   array                   Array with key (GET var name) with the value being an array of all possible values for that key.
     */
    public function expandParameters($paramArray, $pid)    {
        global $TCA;

            // Traverse parameter names:
        foreach($paramArray as $p => $v)    {
            $v = trim($v);

                // If value is encapsulated in square brackets it means there are some ranges of values to find, otherwise the value is literal
            if (substr($v,0,1)==='[' && substr($v,-1)===']')    {
                    // So, find the value inside brackets and reset the paramArray value as an array.
                $v = substr($v,1,-1);
                $paramArray[$p] = array();

                    // Explode parts and traverse them:
                $parts = explode('|',$v);
                foreach($parts as $pV)    {

                        // Look for integer range: (e.g. 1-34 or -40--30 // reads minus 40 to minus 30)
                    if (preg_match('/^(-?[0-9]+)\s*-\s*(-?[0-9]+)$/',trim($pV),$reg))    {    // Integer range:

                            // Swap if first is larger than last:
                        if ($reg[1] > $reg[2])    {
                            $temp = $reg[2];
                            $reg[2] = $reg[1];
                            $reg[1] = $temp;
                        }

                            // Traverse range, add values:
                        $runAwayBrake = 1000;    // Limit to size of range!
                        for($a=$reg[1]; $a<=$reg[2];$a++)    {
                            $paramArray[$p][] = $a;
                            $runAwayBrake--;
                            if ($runAwayBrake<=0)    {
                                break;
                            }
                        }
                    } elseif (substr(trim($pV),0,7)=='_TABLE:')    {

                            // Parse parameters:
                        $subparts = \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(';',$pV);
                        $subpartParams = array();
                        foreach($subparts as $spV)    {
                            list($pKey,$pVal) = \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(':',$spV);
                            $subpartParams[$pKey] = $pVal;
                        }

                            // Table exists:
                        if (isset($TCA[$subpartParams['_TABLE']]))    {
                            $lookUpPid = isset($subpartParams['_PID']) ? intval($subpartParams['_PID']) : $pid;
                            $pidField = isset($subpartParams['_PIDFIELD']) ? trim($subpartParams['_PIDFIELD']) : 'pid';
                            $where = isset($subpartParams['_WHERE']) ? $subpartParams['_WHERE'] : '';
                            $addTable = isset($subpartParams['_ADDTABLE']) ? $subpartParams['_ADDTABLE'] : '';

                            $fieldName = $subpartParams['_FIELD'] ? $subpartParams['_FIELD'] : 'uid';
                            if ($fieldName==='uid' || $TCA[$subpartParams['_TABLE']]['columns'][$fieldName]) {

                                $andWhereLanguage = '';
                                $transOrigPointerField = $TCA[$subpartParams['_TABLE']]['ctrl']['transOrigPointerField'];

                                if ($subpartParams['_ENABLELANG'] && $transOrigPointerField) {
                                    $andWhereLanguage = ' AND ' . $this->db->quoteStr($transOrigPointerField, $subpartParams['_TABLE']) .' <= 0 ';
                                }

                                $where = $this->db->quoteStr($pidField, $subpartParams['_TABLE']) .'='.intval($lookUpPid) . ' ' .
                                    $andWhereLanguage . $where;

                                $rows = $this->db->exec_SELECTgetRows(
                                    $fieldName,
                                    $subpartParams['_TABLE'] . $addTable,
                                    $where . \TYPO3\CMS\Backend\Utility\BackendUtility::deleteClause($subpartParams['_TABLE']),
                                    '',
                                    '',
                                    '',
                                    $fieldName
                                );

                                if (is_array($rows))    {
                                    $paramArray[$p] = array_merge($paramArray[$p],array_keys($rows));
                                }
                            }
                        }
                    } else {    // Just add value:
                        $paramArray[$p][] = $pV;
                    }
                        // Hook for processing own expandParameters place holder
                    if (is_array($GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['crawler/class.tx_crawler_lib.php']['expandParameters'])) {
                        $_params = array(
                            'pObj' => &$this,
                            'paramArray' => &$paramArray,
                            'currentKey' => $p,
                            'currentValue' => $pV,
                            'pid' => $pid
                        );
                        foreach($GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['crawler/class.tx_crawler_lib.php']['expandParameters'] as $key => $_funcRef)    {
                            \TYPO3\CMS\Core\Utility\GeneralUtility::callUserFunction($_funcRef, $_params, $this);
                        }
                    }
                }

                    // Make unique set of values and sort array by key:
                $paramArray[$p] = array_unique($paramArray[$p]);
                ksort($paramArray);
            } else {
                    // Set the literal value as only value in array:
                $paramArray[$p] = array($v);
            }
        }

        return $paramArray;
    }
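
Review note: the integer range branch of expandParameters() is the easiest part to try in isolation. The sketch below reuses the same regular expression and the same cap of 1000 values; `sketchExpandIntegerRange` and its `$limit` parameter are hypothetical, and the `_TABLE:` lookup and the hook are deliberately left out.

```php
<?php
// Hypothetical helper covering only the "[int]-[int]" syntax; not part of tx_crawler_lib.
function sketchExpandIntegerRange($part, $limit = 1000)
{
    // Anything that is not an integer range is returned as a literal value.
    if (!preg_match('/^(-?[0-9]+)\s*-\s*(-?[0-9]+)$/', trim($part), $reg)) {
        return array($part);
    }

    // Order the bounds so that "5-3" and "3-5" expand identically.
    $low = min((int)$reg[1], (int)$reg[2]);
    $high = max((int)$reg[1], (int)$reg[2]);

    // Collect the values, stopping at $limit entries as a runaway brake.
    $values = array();
    for ($a = $low; $a <= $high && count($values) < $limit; $a++) {
        $values[] = $a;
    }

    return $values;
}

$simple   = sketchExpandIntegerRange('1-3');
$negative = sketchExpandIntegerRange('-40--38');
$swapped  = sketchExpandIntegerRange('5-3');
$literal  = sketchExpandIntegerRange('abc');
```

The negative example shows why the pattern allows an optional minus sign on both bounds: `-40--38` reads "minus 40 to minus 38".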

    /**
     * Compiling URLs from parameter array (output of expandParameters())
     * The number of URLs will be the product of the number of parameter values for each key
     *
     * @param  array  $paramArray   Output of expandParameters(): Array with keys (GET var names) and for each an array of values
     * @param  array  $urls         URLs accumulated in this array (for recursion)
     * @return array                URLs accumulated. Once the number of URLs exceeds 'maxCompileUrls', expansion of the current value set is cut short; the method does not return false.
     */
    public function compileUrls($paramArray, $urls = array()) {

        if (count($paramArray) && is_array($urls)) {
                // shift first off stack:
            reset($paramArray);
            $varName = key($paramArray);
            $valueSet = array_shift($paramArray);

                // Traverse value set:
            $newUrls = array();
            foreach($urls as $url) {
                foreach($valueSet as $val) {
                    $newUrls[] = $url.(strcmp($val,'') ? '&'.rawurlencode($varName).'='.rawurlencode($val) : '');

                    if (count($newUrls) >  \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($this->extensionSettings['maxCompileUrls'], 1, 1000000000, 10000)) {
                        break;
                    }
                }
            }
            $urls = $newUrls;
            $urls = $this->compileUrls($paramArray, $urls);
        }

        return $urls;
    }
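
Review note: compileUrls() builds the Cartesian product of all parameter value sets on top of the start URLs. The sketch below shows the same idea iteratively. `sketchCompileUrls` and its `$maxUrls` argument are hypothetical, and the cap is enforced as a hard limit here, which is a simplification of the original's behaviour.

```php
<?php
// Hypothetical iterative variant of compileUrls(); not part of tx_crawler_lib.
function sketchCompileUrls(array $paramArray, array $urls, $maxUrls = 10000)
{
    foreach ($paramArray as $varName => $valueSet) {
        $newUrls = array();
        foreach ($urls as $url) {
            foreach ($valueSet as $val) {
                // Hard cap (simplification): stop expanding this parameter entirely.
                if (count($newUrls) >= $maxUrls) {
                    break 2;
                }
                // An empty value leaves the URL unchanged instead of adding "&name=".
                $val = (string)$val;
                $newUrls[] = $url . ($val !== '' ? '&' . rawurlencode($varName) . '=' . rawurlencode($val) : '');
            }
        }
        // Every parameter multiplies the URL list built so far.
        $urls = $newUrls;
    }

    return $urls;
}

$urls = sketchCompileUrls(
    array('L' => array(0, 1), 'type' => array('', 98)),
    array('?id=42')
);
```

Two values for `L` times two values for `type` give four URLs; the empty `type` value yields the variants without a `type` parameter.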

    /************************************
     *
     * Crawler log
     *
     ************************************/

    /**
     * Return array of records from crawler queue for input page ID
     *
     * @param  integer $id              Page ID for which to look up log entries.
     * @param  string  $filter          Filter: "all" => all entries, "pending" => all that have not yet run, "finished" => all completed ones
     * @param  boolean $doFlush         If TRUE, the matching entries are DELETED(!) instead of returned
     * @param  boolean $doFullFlush     If TRUE (together with $doFlush), entries of all pages are flushed, not only those of $id; the filter still applies
     * @param  integer $itemsPerPage    Maximum number of entries to return; default is 10
     * @return array
     */
    public function getLogEntriesForPageId($id, $filter = '', $doFlush = FALSE, $doFullFlush = FALSE, $itemsPerPage = 10) {
        // FIXME: Write Unit tests for Filters
        switch($filter) {
            case 'pending':
                $addWhere = ' AND exec_time=0';
                break;
            case 'finished':
                $addWhere = ' AND exec_time>0';
                break;
            default:
                $addWhere = '';
                break;
        }

        // FIXME: Write unit test that ensures that the right records are deleted.
        if ($doFlush) {
            $this->flushQueue( ($doFullFlush?'1=1':('page_id='.intval($id))) .$addWhere);
            return array();
        } else {
            return $this->db->exec_SELECTgetRows('*',
                'tx_crawler_queue',
                'page_id=' . intval($id) . $addWhere, '', 'scheduled DESC',
                (intval($itemsPerPage)>0 ? intval($itemsPerPage) : ''));
        }
    }
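
Review note: the FIXME asks for unit tests of the filters. One way to make that easy is to isolate the WHERE clause construction in a pure function, as sketched below. `sketchQueueWhereClause` is a hypothetical helper and not an existing method of the class.

```php
<?php
// Hypothetical pure helper that mirrors the switch statement above; not part of tx_crawler_lib.
function sketchQueueWhereClause($pageId, $filter = '')
{
    // Known filters map to a condition on exec_time; anything else means "all".
    $filterMap = array(
        'pending'  => ' AND exec_time=0',
        'finished' => ' AND exec_time>0',
    );
    $addWhere = isset($filterMap[$filter]) ? $filterMap[$filter] : '';

    // The page ID is cast to an integer before it reaches the SQL string.
    return 'page_id=' . (int)$pageId . $addWhere;
}

$pending  = sketchQueueWhereClause(12, 'pending');
$finished = sketchQueueWhereClause(12, 'finished');
$all      = sketchQueueWhereClause(12, 'all');
```

Because the helper has no database dependency, each filter can be asserted with a simple string comparison.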

    /**
     * Return array of records from crawler queue for input set ID
     *
     * @param    integer    $set_id         Set ID for which to look up log entries.
     * @param    string     $filter         Filter: "all" => all entries, "pending" => all that have not yet run, "finished" => all completed ones
     * @param    boolean    $doFlush        If TRUE, the matching entries are DELETED(!) instead of returned
     * @param    boolean    $doFullFlush    If TRUE (together with $doFlush), the whole queue is flushed regardless of set ID and filter
     * @param    integer    $itemsPerPage   Maximum number of entries to return; default is 10
     * @return    array
     */
    public function getLogEntriesForSetId($set_id,$filter='',$doFlush=FALSE, $doFullFlush=FALSE, $itemsPerPage=10)    {
        // FIXME: Write Unit tests for Filters
        switch($filter)    {
            case 'pending':
                $addWhere = ' AND exec_time=0';
                break;
            case 'finished':
                $addWhere = ' AND exec_time>0';
                break;
            default:
                $addWhere = '';
                break;
        }
        // FIXME: Write unit test that ensures that the right records are deleted.
        if ($doFlush)    {
            $this->flushQueue($doFullFlush?'':('set_id='.intval($set_id).$addWhere));
            return array();
        } else {
            return $this->db->exec_SELECTgetRows('*',
                'tx_crawler_queue',
                'set_id='.intval($set_id).$addWhere,'','scheduled DESC',
                (intval($itemsPerPage)>0 ? intval($itemsPerPage) : ''));
        }
    }

    /**
     * Removes queue entries
     *
     * @param  string $where    SQL filter for the entries which should be removed; an empty string removes all entries
     * @return void
     */
    protected function flushQueue($where='') {

        $realWhere = strlen($where)>0?$where:'1=1';

        if(tx_crawler_domain_events_dispatcher::getInstance()->hasObserver('queueEntryFlush')) {
            $groups = $this->db->exec_SELECTgetRows('DISTINCT set_id','tx_crawler_queue',$realWhere);
                // exec_SELECTgetRows() may return NULL on error, so only traverse a real array
            if (is_array($groups)) {
                foreach($groups as $group) {
                    tx_crawler_domain_events_dispatcher::getInstance()->post('queueEntryFlush',$group['set_id'], $this->db->exec_SELECTgetRows('uid, set_id','tx_crawler_queue',$realWhere.' AND set_id='.intval($group['set_id'])));
                }
            }
        }

        $this->db->exec_DELETEquery('tx_crawler_queue', $realWhere);
    }

    /**
     * Adding call back entries to log (called from hooks typically, see indexed search class "class.crawler.php")
     *
     * @param    integer    $setId      Set ID
     * @param    array      $params     Parameters to pass to call back function
     * @param    string     $callBack   Call back object reference, eg. 'EXT:indexed_search/class.crawler.php:&tx_indexedsearch_crawler'
     * @param    integer    $page_id    Page ID to attach it to
     * @param    integer    $schedule   Time at which to activate
     * @return    void
     */
    public function addQueueEntry_callBack($setId,$params,$callBack,$page_id=0,$schedule=0) {

        if (!is_array($params))    $params = array();
        $params['_CALLBACKOBJ'] = $callBack;

            // Compile value array:
        $fieldArray = array(
            'page_id' => intval($page_id),
            'parameters' => serialize($params),
            'scheduled' => intval($schedule) ? intval($schedule) : $this->getCurrentTime(),
            'exec_time' => 0,
            'set_id' => intval($setId),
            'result_data' => '',
        );

        $this->db->exec_INSERTquery('tx_crawler_queue',$fieldArray);
    }
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
    /************************************
1012
     *
1013
     * URL setting
1014
     *
1015
     ************************************/
1016
1017
    /**
1018
     * Setting a URL for crawling:
1019
     *
1020
     * @param    integer        Page ID
1021
     * @param    string        Complete URL
1022
     * @param    array        Sub configuration array (from TS config)
1023
     * @param    integer        Scheduled-time
1024
     * @param     string        (optional) configuration hash
1025
     * @param     bool        (optional) skip inner duplication check
1026
     * @return    bool        true if the url was added, false if it already existed
1027
     */
1028
    function addUrl (
0 ignored issues
show
Best Practice introduced by
It is generally recommended to explicitly declare the visibility for methods.

Adding explicit visibility (private, protected, or public) is generally recommend to communicate to other developers how, and from where this method is intended to be used.

Loading history...
1029
        $id,
1030
        $url,
1031
        array $subCfg,
1032
        $tstamp,
1033
        $configurationHash='',
1034
        $skipInnerDuplicationCheck=false
1035
    ) {
1036
1037
        $urlAdded = false;
1038
1039
            // Creating parameters:
1040
        $parameters = array(
1041
            'url' => $url
1042
        );
1043
1044
            // fe user group simulation:
1045
        $uGs = implode(',',array_unique(\TYPO3\CMS\Core\Utility\GeneralUtility::intExplode(',',$subCfg['userGroups'],1)));
0 ignored issues
show
Documentation introduced by
1 is of type integer, but the function expects a boolean.

It seems like the type of the argument is not accepted by the function/method which you are calling.

In some cases, in particular if PHP’s automatic type-juggling kicks in this might be fine. In other cases, however this might be a bug.

We suggest to add an explicit type cast like in the following example:

function acceptsInteger($int) { }

$x = '123'; // string "123"

// Instead of
acceptsInteger($x);

// we recommend to use
acceptsInteger((integer) $x);
Loading history...
1046
        if ($uGs)    {
1047
            $parameters['feUserGroupList'] = $uGs;
1048
        }
1049
1050
            // Setting processing instructions
1051
        $parameters['procInstructions'] = \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',',$subCfg['procInstrFilter']);
1052
        if (is_array($subCfg['procInstrParams.']))    {
1053
            $parameters['procInstrParams'] = $subCfg['procInstrParams.'];
1054
        }
1055
1056
1057
            // Compile value array:
1058
        $parameters_serialized = serialize($parameters);
1059
        $fieldArray = array(
1060
            'page_id' => intval($id),
1061
            'parameters' => $parameters_serialized,
1062
            'parameters_hash' => \TYPO3\CMS\Core\Utility\GeneralUtility::shortMD5($parameters_serialized),
1063
            'configuration_hash' => $configurationHash,
1064
            'scheduled' => $tstamp,
1065
            'exec_time' => 0,
1066
            'set_id' => intval($this->setID),
1067
            'result_data' => '',
1068
            'configuration' => $subCfg['key'],
1069
        );
1070
1071
        if ($this->registerQueueEntriesInternallyOnly)    {
0 ignored issues
show
Bug Best Practice introduced by
The expression $this->registerQueueEntriesInternallyOnly of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using ! empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

Loading history...
1072
                //the entries will only be registered and not stored to the database
1073
            $this->queueEntries[] = $fieldArray;
1074
        } else {
1075
1076
            if(!$skipInnerDuplicationCheck){
1077
                    // check if there is already an equal entry
1078
                $rows = $this->getDuplicateRowsIfExist($tstamp,$fieldArray);
1079
            }
1080
1081
            if (count($rows) == 0) {
1082
                $this->db->exec_INSERTquery('tx_crawler_queue', $fieldArray);
1083
                $uid = $this->db->sql_insert_id();
1084
                $rows[] = $uid;
0 ignored issues
show
Bug introduced by
The variable $rows does not seem to be defined for all execution paths leading up to this point.

If you define a variable conditionally, it can happen that it is not defined for all execution paths.

Let’s take a look at an example:

function myFunction($a) {
    switch ($a) {
        case 'foo':
            $x = 1;
            break;

        case 'bar':
            $x = 2;
            break;
    }

    // $x is potentially undefined here.
    echo $x;
}

In the above example, the variable $x is defined if you pass “foo” or “bar” as argument for $a. However, since the switch statement has no default case statement, if you pass any other value, the variable $x would be undefined.

Available Fixes

  1. Check for existence of the variable explicitly:

    function myFunction($a) {
        switch ($a) {
            case 'foo':
                $x = 1;
                break;
    
            case 'bar':
                $x = 2;
                break;
        }
    
        if (isset($x)) { // Make sure it's always set.
            echo $x;
        }
    }
    
  2. Define a default value for the variable:

    function myFunction($a) {
        $x = ''; // Set a default which gets overridden for certain paths.
        switch ($a) {
            case 'foo':
                $x = 1;
                break;
    
            case 'bar':
                $x = 2;
                break;
        }
    
        echo $x;
    }
    
  3. Add a value for the missing path:

    function myFunction($a) {
        switch ($a) {
            case 'foo':
                $x = 1;
                break;
    
            case 'bar':
                $x = 2;
                break;
    
            // We add support for the missing case.
            default:
                $x = '';
                break;
        }
    
        echo $x;
    }
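
Applied to the flagged code, fix 2 would amount to giving $rows a default before the conditional duplicate check. The following is a minimal sketch under that assumption; addEntrySketch() and findDuplicatesSketch() are hypothetical stand-ins for addUrl() and getDuplicateRowsIfExist(), and an in-memory array replaces the database table.

```php
<?php
// Hypothetical stand-in for getDuplicateRowsIfExist(): returns the keys of
// queue entries that carry the same hash.
function findDuplicatesSketch(array $queue, $hash) {
    return array_keys($queue, $hash, true);
}

// Hypothetical stand-in for the database branch of addUrl().
function addEntrySketch(array &$queue, $hash, $skipInnerDuplicationCheck = false) {
    $rows = array(); // default, so $rows is defined on every execution path
    if (!$skipInnerDuplicationCheck) {
        $rows = findDuplicatesSketch($queue, $hash);
    }
    if (count($rows) === 0) {
        $queue[] = $hash;
        return true;  // entry added
    }
    return false;     // duplicate found, nothing added
}
```

With the default in place, count($rows) no longer depends on whether the duplicate check ran.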
    
1085
                $urlAdded = true;
1086
                tx_crawler_domain_events_dispatcher::getInstance()->post('urlAddedToQueue',$this->setID,array('uid' => $uid, 'fieldArray' => $fieldArray));
1087
            }else{
1088
                tx_crawler_domain_events_dispatcher::getInstance()->post('duplicateUrlInQueue',$this->setID,array('rows' => $rows, 'fieldArray' => $fieldArray));
1089
            }
1090
        }
1091
1092
        return $urlAdded;
1093
    }
1094
1095
    /**
1096
     * This method determines duplicates for a queue entry with the same parameters and this timestamp.
1097
     * If the timestamp is in the past, it will check if there is any unprocessed queue entry in the past.
1098
     * If the timestamp is in the future, it will check whether the queued entry has exactly the same timestamp.
1099
     *
1100
     * @param int $tstamp
1101
     * @param string $parameters
Bug introduced by
There is no parameter named $parameters. Was it maybe removed?

This check looks for PHPDoc comments describing methods or function parameters that do not exist on the corresponding method or function.

Consider the following example. The parameter $italy is not defined by the method finale(...).

/**
 * @param array $germany
 * @param array $island
 * @param array $italy
 */
function finale($germany, $island) {
    return "2:1";
}

The most likely cause is that the parameter was removed, but the annotation was not.
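
For the method flagged here, the listing shows the signature getDuplicateRowsIfExist($tstamp, $fieldArray), so the stale annotation is presumably meant to describe $fieldArray. The sketch below pairs a corrected doc comment with a small reflection-based check. Both function names are hypothetical, and the regular expression only covers the simple "@param type $name" form.

```php
<?php
/**
 * @param int   $tstamp
 * @param array $fieldArray
 * @return array
 */
function getDuplicateRowsIfExistSketch($tstamp, $fieldArray) {
    return array(); // body omitted; only the signature matters here
}

// Returns the names of @param annotations that have no matching parameter.
function staleParamAnnotations($functionName) {
    $function = new ReflectionFunction($functionName);
    preg_match_all('/@param\s+\S+\s+\$(\w+)/', (string)$function->getDocComment(), $matches);
    $declared = array();
    foreach ($function->getParameters() as $parameter) {
        $declared[] = $parameter->getName();
    }
    return array_values(array_diff($matches[1], $declared));
}
```

An empty result means every documented parameter exists on the function.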

1102
     * @author Fabrizio Branca
1103
     * @author Timo Schmidt
1104
     * @return array;
Documentation introduced by
The doc-type array; could not be parsed: Expected "|" or "end of type", but got ";" at position 5.

This check marks PHPDoc comments that could not be parsed by our parser. To see which comment annotations we can parse, please refer to our documentation on supported doc-types.
1105
     */
1106
    protected function getDuplicateRowsIfExist($tstamp,$fieldArray){
1107
        $rows = array();
1108
1109
        $currentTime = $this->getCurrentTime();
1110
1111
            //if this entry is scheduled with "now"
1112
        if ($tstamp <= $currentTime) {
1113
            if($this->extensionSettings['enableTimeslot']){
1114
                $timeBegin     = $currentTime - 100;
1115
                $timeEnd     = $currentTime + 100;
1116
                $where         = ' ((scheduled BETWEEN '.$timeBegin.' AND '.$timeEnd.' ) OR scheduled <= '. $currentTime.') ';
1117
            }else{
1118
                $where = 'scheduled <= ' . $currentTime;
1119
            }
1120
        } elseif ($tstamp > $currentTime) {
1121
                //entries with a timestamp in the future need to have the same schedule time
1122
            $where = 'scheduled = ' . $tstamp ;
1123
        }
1124
1125
        if(!empty($where)){
1126
            $result = $this->db->exec_SELECTgetRows(
1127
                'qid',
1128
                'tx_crawler_queue',
1129
                $where.
1130
                ' AND NOT exec_time' .
1131
                ' AND NOT process_id '.
1132
                ' AND page_id='.intval($fieldArray['page_id']).
1133
                ' AND parameters_hash = ' . $this->db->fullQuoteStr($fieldArray['parameters_hash'], 'tx_crawler_queue')
1134
            );
1135
1136
            if (is_array($result)) {
1137
                foreach ($result as $value) {
1138
                    $rows[] = $value['qid'];
1139
                }
1140
            }
1141
        }
1142
1143
1144
        return $rows;
1145
    }
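
The scheduling rule in getDuplicateRowsIfExist() can be read in isolation as a small pure function. This is a sketch only: scheduledWhereSketch() is a hypothetical name, the 100-second window is taken from the listing, and the remaining conditions (exec_time, process_id, page_id, parameters_hash) are left out.

```php
<?php
// Builds only the "scheduled" part of the duplicate lookup.
function scheduledWhereSketch($tstamp, $currentTime, $enableTimeslot = false) {
    $tstamp = (int)$tstamp;
    $currentTime = (int)$currentTime;

    // Entries in the future must share exactly the same schedule time.
    if ($tstamp > $currentTime) {
        return 'scheduled = ' . $tstamp;
    }
    // Entries scheduled for "now" match anything already due; with the
    // timeslot option a window of 100 seconds around "now" is added.
    if ($enableTimeslot) {
        return '((scheduled BETWEEN ' . ($currentTime - 100) . ' AND ' . ($currentTime + 100)
            . ') OR scheduled <= ' . $currentTime . ')';
    }
    return 'scheduled <= ' . $currentTime;
}
```

Because every branch returns a clause, the caller would not need the !empty($where) guard that the listing uses.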
1146
1147
    /**
1148
     * Returns the current system time
1149
     *
1150
     * @author Timo Schmidt <[email protected]>
1151
     * @return int
1152
     */
1153
    public function getCurrentTime(){
1154
        return time();
1155
    }
1156
1157
1158
1159
    /************************************
1160
     *
1161
     * URL reading
1162
     *
1163
     ************************************/
1164
1165
    /**
1166
     * Read URL for single queue entry
1167
     *
1168
     * @param integer $queueId
1169
     * @param boolean $force If set, will process even if exec_time has been set!
1170
     * @return integer
1171
     */
1172
    function readUrl($queueId, $force = FALSE) {
Best Practice introduced by
It is generally recommended to explicitly declare the visibility for methods.

Adding explicit visibility (private, protected, or public) is generally recommended to communicate to other developers how, and from where, this method is intended to be used.
1173
        $ret = 0;
1174
        if ($this->debugMode) {
1175
            \TYPO3\CMS\Core\Utility\GeneralUtility::devlog('crawler-readurl start ' . microtime(true), __FUNCTION__);
1176
        }
1177
        // Get entry:
1178
        list($queueRec) = $this->db->exec_SELECTgetRows('*', 'tx_crawler_queue',
1179
            'qid=' . intval($queueId) . ($force ? '' : ' AND exec_time=0 AND process_scheduled > 0'));
1180
1181
        if (!is_array($queueRec)) {
1182
            return;
1183
        }
1184
1185
        $pageUidRootTypoScript = \AOE\Crawler\Utility\TypoScriptUtility::getPageUidForTypoScriptRootTemplateInRootLine((int)$queueRec['page_id']);
1186
        $this->initTSFE((int)$pageUidRootTypoScript);
1187
1188
        \AOE\Crawler\Utility\SignalSlotUtility::emitSignal(
1189
            __CLASS__,
1190
            \AOE\Crawler\Utility\SignalSlotUtility::SIGNNAL_QUEUEITEM_PREPROCESS,
1191
            array($queueId, &$queueRec)
1192
        );
1193
1194
        // Set exec_time to lock record:
1195
        $field_array = array('exec_time' => $this->getCurrentTime());
1196
1197
        if (isset($this->processID)) {
1198
            //if multiprocessing is used we need to store the id of the process which has handled this entry
1199
            $field_array['process_id_completed'] = $this->processID;
1200
        }
1201
        $this->db->exec_UPDATEquery('tx_crawler_queue', 'qid=' . intval($queueId), $field_array);
1202
1203
        $result = $this->readUrl_exec($queueRec);
1204
        $resultData = unserialize($result['content']);
1205
1206
        //atm there's no need to point to specific pollable extensions
1207
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pollSuccess'])) {
1208
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pollSuccess'] as $pollable) {
1209
                // only check the success value if the instruction is running
1210
                // it is important to name the pollSuccess key same as the procInstructions key
1211
                if (is_array($resultData['parameters']['procInstructions']) && in_array($pollable,
1212
                        $resultData['parameters']['procInstructions'])
1213
                ) {
1214
                    if (!empty($resultData['success'][$pollable]) && $resultData['success'][$pollable]) {
1215
                        $ret |= self::CLI_STATUS_POLLABLE_PROCESSED;
1216
                    }
1217
                }
1218
            }
1219
        }
1220
1221
        // Set result in log which also denotes the end of the processing of this entry.
1222
        $field_array = array('result_data' => serialize($result));
1223
1224
        \AOE\Crawler\Utility\SignalSlotUtility::emitSignal(
1225
            __CLASS__,
1226
            \AOE\Crawler\Utility\SignalSlotUtility::SIGNNAL_QUEUEITEM_POSTPROCESS,
1227
            array($queueId, &$field_array)
1228
        );
1229
1230
        $this->db->exec_UPDATEquery('tx_crawler_queue', 'qid=' . intval($queueId), $field_array);
1231
1232
1233
        if ($this->debugMode) {
1234
            \TYPO3\CMS\Core\Utility\GeneralUtility::devlog('crawler-readurl stop ' . microtime(true), __FUNCTION__);
1235
        }
1236
1237
        return $ret;
1238
    }
1239
1240
    /**
1241
     * Read URL for not-yet-inserted log-entry
1242
     *
1243
     * @param    integer        Queue field array,
1244
     * @return    string
1245
     */
1246
    function readUrlFromArray($field_array)    {
Best Practice introduced by
It is generally recommended to explicitly declare the visibility for methods.
1247
1248
            // Set exec_time to lock record:
1249
        $field_array['exec_time'] = $this->getCurrentTime();
1250
        $this->db->exec_INSERTquery('tx_crawler_queue', $field_array);
1251
        $queueId = $field_array['qid'] = $this->db->sql_insert_id();
1252
1253
        $result = $this->readUrl_exec($field_array);
1254
1255
            // Set result in log which also denotes the end of the processing of this entry.
1256
        $field_array = array('result_data' => serialize($result));
1257
        $this->db->exec_UPDATEquery('tx_crawler_queue','qid='.intval($queueId), $field_array);
1258
1259
        return $result;
1260
    }
1261
1262
    /**
1263
     * Read URL for a queue record
1264
     *
1265
     * @param    array        Queue record
1266
     * @return    string        Result output.
1267
     */
1268
    function readUrl_exec($queueRec)    {
Best Practice introduced by
It is generally recommended to explicitly declare the visibility for methods.
1269
            // Decode parameters:
1270
        $parameters = unserialize($queueRec['parameters']);
1271
        $result = 'ERROR';
1272
        if (is_array($parameters))    {
1273
            if ($parameters['_CALLBACKOBJ'])    {    // Calling object:
1274
                $objRef = $parameters['_CALLBACKOBJ'];
1275
                $callBackObj = &\TYPO3\CMS\Core\Utility\GeneralUtility::getUserObj($objRef);
1276
                if (is_object($callBackObj))    {
1277
                    unset($parameters['_CALLBACKOBJ']);
1278
                    $result = array('content' => serialize($callBackObj->crawler_execute($parameters,$this)));
1279
                } else {
1280
                    $result = array('content' => 'No object: '.$objRef);
1281
                }
1282
            } else {    // Regular FE request:
1283
1284
                    // Prepare:
1285
                $crawlerId = $queueRec['qid'].':'.md5($queueRec['qid'].'|'.$queueRec['set_id'].'|'.$GLOBALS['TYPO3_CONF_VARS']['SYS']['encryptionKey']);
1286
1287
                    // Get result:
1288
                $result = $this->requestUrl($parameters['url'],$crawlerId);
1289
1290
                tx_crawler_domain_events_dispatcher::getInstance()->post('urlCrawled',$queueRec['set_id'],array('url' => $parameters['url'], 'result' => $result));
1291
            }
1292
        }
1293
1294
1295
        return $result;
1296
    }
1297
1298
    /**
1299
     * Gets the content of a URL.
1300
     *
1301
     * @param  string   $originalUrl    URL to read
1302
     * @param  string   $crawlerId      Crawler ID string (qid + hash to verify)
1303
     * @param  integer  $timeout        Timeout time
1304
     * @param  integer  $recursion      Recursion limiter for 302 redirects
1305
     * @return array                    Array with content
1306
     */
1307 2
    function requestUrl($originalUrl, $crawlerId, $timeout=2, $recursion=10) {
Best Practice introduced by
It is generally recommended to explicitly declare the visibility for methods.
1308
1309 2
        if (!$recursion) return false;
1310
1311
            // Parse URL, checking for scheme:
1312 2
        $url = parse_url($originalUrl);
1313
1314 2
        if ($url === FALSE) {
1315
            if (TYPO3_DLOG) \TYPO3\CMS\Core\Utility\GeneralUtility::devLog(sprintf('Could not parse_url() for string "%s"', $url), 'crawler', 4, array('crawlerId' => $crawlerId));
1316
            return FALSE;
1317
        }
1318
1319 2
        if (!in_array($url['scheme'], array('','http','https'))) {
1320
            if (TYPO3_DLOG) \TYPO3\CMS\Core\Utility\GeneralUtility::devLog(sprintf('Scheme does not match for url "%s"', $url), 'crawler', 4, array('crawlerId' => $crawlerId));
1321
            return FALSE;
1322
        }
1323
1324
 	    // direct request
1325 2
        if ($this->extensionSettings['makeDirectRequests']) {
1326 2
            $result = $this->sendDirectRequest($originalUrl, $crawlerId);
1327 2
            return $result;
1328
        }
1329
1330
        $reqHeaders = $this->buildRequestHeaderArray($url, $crawlerId);
1331
1332
            // thanks to Pierrick Caillon for adding proxy support
1333
        $rurl = $url;
1334
1335
        if ($GLOBALS['TYPO3_CONF_VARS']['SYS']['curlUse'] && $GLOBALS['TYPO3_CONF_VARS']['SYS']['curlProxyServer']) {
1336
            $rurl = parse_url($GLOBALS['TYPO3_CONF_VARS']['SYS']['curlProxyServer']);
1337
            $url['path'] = $url['scheme'] . '://' . $url['host'] . ($url['port'] > 0 ? ':' . $url['port'] : '') . $url['path'];
1338
            $reqHeaders = $this->buildRequestHeaderArray($url, $crawlerId);
1339
        }
1340
1341
        $host = $rurl['host'];
1342
1343
        if ($url['scheme'] == 'https') {
1344
            $host = 'ssl://' . $host;
1345
            $port = ($rurl['port'] > 0) ? $rurl['port'] : 443;
1346
        } else {
1347
            $port = ($rurl['port'] > 0) ? $rurl['port'] : 80;
1348
        }
1349
1350
        $startTime = microtime(true);
1351
        $fp = fsockopen($host, $port, $errno, $errstr, $timeout);
1352
1353
        if (!$fp) {
1354
            if (TYPO3_DLOG) \TYPO3\CMS\Core\Utility\GeneralUtility::devLog(sprintf('Error while opening "%s"', $url), 'crawler', 4, array('crawlerId' => $crawlerId));
1355
            return FALSE;
1356
        } else {
1357
                // Request message:
1358
            $msg = implode("\r\n",$reqHeaders)."\r\n\r\n";
1359
            fputs ($fp, $msg);
1360
1361
                // Read response:
1362
            $d = $this->getHttpResponseFromStream($fp);
1363
            fclose ($fp);
1364
1365
            $time = microtime(true) - $startTime;
1366
            $this->log($originalUrl .' '.$time);
1367
1368
                // Implode content and headers:
1369
            $result = array(
1370
                'request' => $msg,
1371
                'headers' => implode('', $d['headers']),
1372
                'content' => implode('', (array)$d['content'])
1373
            );
1374
1375
            if (($this->extensionSettings['follow30x']) && ($newUrl = $this->getRequestUrlFrom302Header($d['headers'],$url['user'],$url['pass']))) {
1376
                $result = array_merge(array('parentRequest'=>$result), $this->requestUrl($newUrl, $crawlerId, $recursion--));
1377
                $newRequestUrl = $this->requestUrl($newUrl, $crawlerId, $timeout, --$recursion);
1378
1379
                if (is_array($newRequestUrl)) {
1380
                    $result = array_merge(array('parentRequest'=>$result), $newRequestUrl);
1381
                } else {
1382
                    if (TYPO3_DLOG) \TYPO3\CMS\Core\Utility\GeneralUtility::devLog(sprintf('Error while opening "%s"', $url), 'crawler', 4, array('crawlerId' => $crawlerId));
1383
                    return FALSE;
1384
                }
1385
            }
1386
1387
            return $result;
1388
        }
1389
    }
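
In the listing, line 1376 appears to pass $recursion-- in the position of $timeout and to request the redirect target once before line 1377 requests it a second time. A bounded redirect loop can be sketched without recursion as follows. followRedirectsSketch() and the $fetch callable are hypothetical, and the response arrays are assumed to carry an optional 'location' key.

```php
<?php
// $fetch is a hypothetical callable: it takes a URL and returns an array
// that contains a 'location' key when the response is a redirect.
function followRedirectsSketch(callable $fetch, $url, $maxRedirects = 10) {
    $chain = array();
    while (true) {
        $response = $fetch($url);
        $chain[] = $url;
        if (empty($response['location'])) {
            // Final response reached: report it with the URLs visited.
            return array('chain' => $chain, 'response' => $response);
        }
        if ($maxRedirects-- <= 0) {
            return false; // redirect limit exhausted
        }
        $url = $response['location'];
    }
}
```

Each target is requested exactly once, and the limit counts followed redirects rather than nested calls.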
1390
1391
    /**
1392
     * Gets the base path of the website frontend.
1393
     * (e.g. if you call http://mydomain.com/cms/index.php in
1394
     * the browser the base path is "/cms/")
1395
     *
1396
     * @return string Base path of the website frontend
1397
     */
1398
    protected function getFrontendBasePath() {
1399
        $frontendBasePath = '/';
1400
1401
        // Get the path from the extension settings:
1402
        if (isset($this->extensionSettings['frontendBasePath']) && $this->extensionSettings['frontendBasePath']) {
1403
            $frontendBasePath = $this->extensionSettings['frontendBasePath'];
1404
        // If empty, try to use config.absRefPrefix:
1405
        } elseif (isset($GLOBALS['TSFE']->absRefPrefix) && !empty($GLOBALS['TSFE']->absRefPrefix)) {
1406
            $frontendBasePath = $GLOBALS['TSFE']->absRefPrefix;
1407
        // If not in CLI mode the base path can be determined from $_SERVER environment:
1408
        } elseif (!defined('TYPO3_cliMode') || !TYPO3_cliMode) {
1409
            $frontendBasePath = \TYPO3\CMS\Core\Utility\GeneralUtility::getIndpEnv('TYPO3_SITE_PATH');
1410
        }
1411
1412
        // Base path must be '/<pathSegments>/':
1413
        if ($frontendBasePath != '/') {
1414
            $frontendBasePath = '/' . ltrim($frontendBasePath, '/');
1415
            $frontendBasePath = rtrim($frontendBasePath, '/') . '/';
1416
        }
1417
1418
        return $frontendBasePath;
1419
    }
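
The normalization step at the end of getFrontendBasePath() can be sketched as a standalone helper. The function name normalizeBasePathSketch() is hypothetical; the two trim calls are the ones used in the listing.

```php
<?php
// Hypothetical helper: brings a base path into the form '/<pathSegments>/'.
function normalizeBasePathSketch($path) {
    if ($path !== '/') {
        $path = '/' . ltrim($path, '/');   // exactly one leading slash
        $path = rtrim($path, '/') . '/';   // exactly one trailing slash
    }
    return $path;
}
```

For example, normalizeBasePathSketch('cms') returns '/cms/', and an empty string collapses to '/'.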
1420
1421
    /**
1422
     * Executes a shell command and returns the outputted result.
1423
     *
1424
     * @param string $command Shell command to be executed
1425
     * @return string Outputted result of the command execution
1426
     */
1427
    protected function executeShellCommand($command) {
1428
        $result = shell_exec($command);
1429
        return $result;
1430
    }
1431
1432
    /**
1433
     * Reads HTTP response from the given stream.
1434
     *
1435
     * @param  resource $streamPointer  Pointer to connection stream.
1436
     * @return array                    Associative array with the following items:
1437
     *                                  headers <array> Response headers sent by server.
1438
     *                                  content <array> Content, with each line as an array item.
1439
     */
1440
    protected function getHttpResponseFromStream($streamPointer) {
1441
        $response = array('headers' => array(), 'content' => array());
1442
1443
        if (is_resource($streamPointer)) {
1444
                // read headers
1445
            while($line = fgets($streamPointer, '2048')) {
1446
                $line = trim($line);
1447
                if ($line !== '') {
1448
                    $response['headers'][] = $line;
1449
                } else {
1450
                    break;
1451
                }
1452
            }
1453
1454
                // read content
1455
            while($line = fgets($streamPointer, '2048')) {
1456
                $response['content'][] = $line;
1457
            }
1458
        }
1459
1460
        return $response;
1461
    }
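
The two read loops in getHttpResponseFromStream() test the return value of fgets() for truthiness, so a final line consisting only of "0" without a newline would seem to end the loop early. The sketch below compares against false instead; readHttpResponseSketch() is a hypothetical name and the structure otherwise follows the listing.

```php
<?php
function readHttpResponseSketch($stream) {
    $response = array('headers' => array(), 'content' => array());
    if (!is_resource($stream)) {
        return $response;
    }
    // Headers end at the first empty line.
    while (($line = fgets($stream, 2048)) !== false) {
        $line = trim($line);
        if ($line === '') {
            break;
        }
        $response['headers'][] = $line;
    }
    // Everything after that is content, kept line by line.
    while (($line = fgets($stream, 2048)) !== false) {
        $response['content'][] = $line;
    }
    return $response;
}
```

It can be tried without a network connection by writing a canned response into a php://memory stream and rewinding it.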
1462
1463
    /**
1464
     * @param message
1465
     */
1466 2
    protected function log($message) {
1467 2
        if (!empty($this->extensionSettings['logFileName'])) {
1468
            @file_put_contents($this->extensionSettings['logFileName'], date('Ymd His') . $message . "\n", FILE_APPEND);
Security Best Practice introduced by
It seems like you do not handle an error condition here. This can introduce security issues, and is generally not recommended.

If you suppress an error, we recommend checking for the error condition explicitly:

// For example instead of
@mkdir($dir);

// Better use
if (@mkdir($dir) === false) {
    throw new \RuntimeException('The directory '.$dir.' could not be created.');
}
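
Applied to log(), this could mean checking the return value of file_put_contents() instead of discarding it. A minimal sketch, assuming a hypothetical standalone function appendLogLineSketch() and assuming that an exception is an acceptable reaction to a failed write. It also puts a space between the timestamp and the message, which the listing does not.

```php
<?php
function appendLogLineSketch($logFileName, $message) {
    $line = date('Ymd His') . ' ' . $message . "\n";
    // The warning stays suppressed, but the result is checked explicitly.
    if (@file_put_contents($logFileName, $line, FILE_APPEND) === false) {
        throw new RuntimeException('Could not write to log file ' . $logFileName);
    }
}
```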
1469
        }
1470 2
    }
1471
1472
    /**
1473
     * Builds HTTP request headers.
1474
     *
1475
     * @param array $url
1476
     * @param string $crawlerId
1477
     *
1478
     * @return array
1479
     */
1480 6
    protected function buildRequestHeaderArray(array $url, $crawlerId) {
1481 6
        $reqHeaders = array();
1482 6
        $reqHeaders[] = 'GET '.$url['path'].($url['query'] ? '?'.$url['query'] : '').' HTTP/1.0';
1483 6
        $reqHeaders[] = 'Host: '.$url['host'];
1484 6
        if (stristr($url['query'],'ADMCMD_previewWS')) {
1485 2
            $reqHeaders[] = 'Cookie: $Version="1"; be_typo_user="1"; $Path=/';
1486
        }
1487 6
        $reqHeaders[] = 'Connection: close';
1488 6
        if ($url['user']!='') {
1489 2
            $reqHeaders[] = 'Authorization: Basic '. base64_encode($url['user'].':'.$url['pass']);
1490
        }
1491 6
        $reqHeaders[] = 'X-T3crawler: '.$crawlerId;
1492 6
        $reqHeaders[] = 'User-Agent: TYPO3 crawler';
1493 6
        return $reqHeaders;
1494
    }
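
buildRequestHeaderArray() reads $url['query'] and $url['user'] directly, although parse_url() leaves out keys for URL parts that are absent. The sketch below fills in defaults first. buildRequestHeadersSketch() is a hypothetical name, it takes the URL as a string rather than a parsed array, and it omits the ADMCMD_previewWS cookie from the listing for brevity.

```php
<?php
function buildRequestHeadersSketch($urlString, $crawlerId) {
    $url = parse_url($urlString);
    if ($url === false || !isset($url['host'])) {
        return array(); // nothing sensible to request
    }
    // Defaults for the optional URL parts, so no undefined index is read.
    $url += array('path' => '/', 'query' => '', 'user' => '', 'pass' => '');

    $headers = array();
    $headers[] = 'GET ' . $url['path'] . ($url['query'] !== '' ? '?' . $url['query'] : '') . ' HTTP/1.0';
    $headers[] = 'Host: ' . $url['host'];
    $headers[] = 'Connection: close';
    if ($url['user'] !== '') {
        $headers[] = 'Authorization: Basic ' . base64_encode($url['user'] . ':' . $url['pass']);
    }
    $headers[] = 'X-T3crawler: ' . $crawlerId;
    $headers[] = 'User-Agent: TYPO3 crawler';
    return $headers;
}
```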
1495
1496
    /**
1497
     * Checks whether the submitted HTTP header contains a redirect location and builds a new crawler URL
1498
     *
1499
     * @param    array        HTTP Header
1500
     * @param    string        HTTP Auth. User
1501
     * @param    string        HTTP Auth. Password
1502
     * @return    string        URL from redirection
1503
     */
1504 12
    protected function getRequestUrlFrom302Header($headers,$user='',$pass='') {
1505 12
        if(!is_array($headers)) return false;
1506 11
        if(!(stristr($headers[0],'301 Moved') || stristr($headers[0],'302 Found') || stristr($headers[0],'302 Moved'))) return false;
1507
1508 9
        foreach($headers as $hl) {
1509 9
            $tmp = explode(": ",$hl);
1510 9
            $header[trim($tmp[0])] = trim($tmp[1]);
Coding Style Comprehensibility introduced by
$header was never initialized. Although not strictly required by PHP, it is generally a good practice to add $header = array(); before regardless.

Adding an explicit array definition is generally preferable to implicit array definition as it guarantees a stable state of the code.

Let’s take a look at an example:

foreach ($collection as $item) {
    $myArray['foo'] = $item->getFoo();

    if ($item->hasBar()) {
        $myArray['bar'] = $item->getBar();
    }

    // do something with $myArray
}

As you can see in this example, the array $myArray is initialized the first time when the foreach loop is entered. You can also see that the value of the bar key is only written conditionally; thus, its value might result from a previous iteration.

This might or might not be intended. To make your intention clear, your code more readable and to avoid accidental bugs, we recommend adding an explicit initialization $myArray = array() either outside or inside the foreach loop.
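
For getRequestUrlFrom302Header(), the recommendation would mean initializing $header before the loop. The sketch below does that and also splits each line on the first colon only. redirectLocationSketch() is a hypothetical name, and the status-line check and the credential handling of the original method are left out.

```php
<?php
function redirectLocationSketch(array $headers) {
    $header = array(); // explicit initialization, as the check recommends
    foreach ($headers as $headerLine) {
        // Limit 2 keeps colons inside the value (e.g. in "http://...") intact.
        $parts = explode(':', $headerLine, 2);
        if (count($parts) === 2) {
            $header[trim($parts[0])] = trim($parts[1]);
        }
    }
    return isset($header['Location']) ? $header['Location'] : false;
}
```

With $header always defined, an empty header list simply yields false.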

1511 9
            if(trim($tmp[0])=='Location') break;
1512
        }
1513 9
        if(!array_key_exists('Location',$header)) return false;
Bug introduced by
The variable $header does not seem to be defined for all execution paths leading up to this point.
1514
1515 6
        if($user!='') {
1516 3
            if(!($tmp = parse_url($header['Location']))) return false;
1517 2
            $newUrl = $tmp['scheme'] . '://' . $user . ':' . $pass . '@' . $tmp['host'] . $tmp['path'];
1518 2
            if($tmp['query']!='') $newUrl .= '?' . $tmp['query'];
1519
        } else {
1520 3
            $newUrl = $header['Location'];
1521
        }
1522 5
        return $newUrl;
1523
    }
1524
1525
1526
1527
1528
1529
1530
1531
1532
    /**************************
1533
     *
1534
     * tslib_fe hooks:
1535
     *
1536
     **************************/
1537
1538
    /**
1539
     * Initialization hook (called after database connection)
1540
     * Takes the "HTTP_X_T3CRAWLER" header and looks up queue record and verifies if the session comes from the system (by comparing hashes)
1541
     *
1542
     * @param    array        Parameters from frontend
1543
     * @param    object        TSFE object (reference under PHP5)
1544
     * @return    void
1545
     */
1546
    function fe_init(&$params, $ref)    {
Unused Code introduced by
The parameter $ref is not used and could be removed.

This check looks for parameters that have been defined for a function or method, but which are not used in the method body.

Best Practice introduced by
It is generally recommended to explicitly declare the visibility for methods.
1547
1548
            // Authenticate crawler request:
1549
        if (isset($_SERVER['HTTP_X_T3CRAWLER']))    {
1550
            list($queueId,$hash) = explode(':', $_SERVER['HTTP_X_T3CRAWLER']);
1551
            list($queueRec) = $this->db->exec_SELECTgetRows('*','tx_crawler_queue','qid='.intval($queueId));
1552
1553
                // If a crawler record was found and hash was matching, set it up:
1554
            if (is_array($queueRec) && $hash === md5($queueRec['qid'].'|'.$queueRec['set_id'].'|'.$GLOBALS['TYPO3_CONF_VARS']['SYS']['encryptionKey']))    {
1555
                $params['pObj']->applicationData['tx_crawler']['running'] = TRUE;
1556
                $params['pObj']->applicationData['tx_crawler']['parameters'] = unserialize($queueRec['parameters']);
1557
                $params['pObj']->applicationData['tx_crawler']['log'] = array();
1558
            } else {
1559
                die('No crawler entry found!');
Coding Style Compatibility introduced by
The method fe_init() contains an exit expression.

An exit expression should only be used in rare cases. For example, if you write a short command line script.

In most cases however, using an exit expression makes the code untestable and often causes incompatibilities with other libraries. Thus, unless you are absolutely sure it is required here, we recommend to refactor your code to avoid its usage.
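
One way to avoid the exit expression is to move the check into a function that returns a boolean and to let the caller decide how to respond. The sketch below mirrors the ID format used by readUrl_exec() and fe_init() in the listing (the qid, a colon, then the MD5 of qid|set_id|encryptionKey). Both function names are hypothetical, the qid comparison is an added assumption, and hash_equals() is swapped in for the === comparison as a timing-safe alternative.

```php
<?php
function buildCrawlerIdSketch($qid, $setId, $encryptionKey) {
    return $qid . ':' . md5($qid . '|' . $setId . '|' . $encryptionKey);
}

// Returns true only if the header value matches the given queue record.
function isValidCrawlerIdSketch($headerValue, array $queueRec, $encryptionKey) {
    $parts = explode(':', $headerValue, 2);
    if (count($parts) !== 2 || (string)$queueRec['qid'] !== $parts[0]) {
        return false;
    }
    $expected = md5($queueRec['qid'] . '|' . $queueRec['set_id'] . '|' . $encryptionKey);
    return hash_equals($expected, $parts[1]);
}
```

A caller could then answer an invalid ID with an error response or an exception, which keeps the method testable.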

1560
            }
1561
        }
1562
    }
1563
1564
1565
1566
    /*****************************
1567
     *
1568
     * Compiling URLs to crawl - tools
1569
     *
1570
     *****************************/
1571
1572
    /**
1573
     * @param    integer        Root page id to start from.
1574
     * @param    integer        Depth of tree, 0=only id-page, 1= on sublevel, 99 = infinite
1575
     * @param    integer        Unix Time when the URL is timed to be visited when put in queue
1576
     * @param    integer        Number of requests per minute (creates the interleave between requests)
1577
     * @param    boolean        If set, submits the URLs to queue in database (real crawling)
1578
     * @param    boolean        If set (and submitcrawlUrls is false) will fill $downloadUrls with entries)
1579
     * @param    array        Array of processing instructions
1580
     * @param    array        Array of configuration keys
1581
     * @return    string        HTML code
1582
     */
1583
    function getPageTreeAndUrls(
Best Practice introduced by
It is generally recommended to explicitly declare the visibility for methods.
1584
        $id,
1585
        $depth,
1586
        $scheduledTime,
1587
        $reqMinute,
1588
        $submitCrawlUrls,
1589
        $downloadCrawlUrls,
1590
        array $incomingProcInstructions,
1591
        array $configurationSelection
1592
    ) {
1593
1594
        global $BACK_PATH;
1595
        global $LANG;
1596
        if (!is_object($LANG)) {
1597
            $LANG = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('language');
1598
            $LANG->init(0);
1599
        }
1600
        $this->scheduledTime = $scheduledTime;
Bug introduced by
The property scheduledTime does not exist. Did you maybe forget to declare it?

In PHP it is possible to write to properties without declaring them. For example, the following is perfectly valid PHP code:

class MyClass { }

$x = new MyClass();
$x->foo = true;

Generally, it is a good practice to explicitly declare properties to avoid accidental typos and provide IDE auto-completion:

class MyClass {
    public $foo;
}

$x = new MyClass();
$x->foo = true;
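
For the properties written in getPageTreeAndUrls(), a declaration block along the following lines would address the finding. The class name, visibility, types, and defaults are assumptions, since the listing does not show how the properties are used elsewhere. Declaring them also avoids the deprecation notice that PHP 8.2 and later emit when a dynamic property is created.

```php
<?php
// Hypothetical reduced class; only the four flagged properties are shown.
class CrawlerLibSketch {
    /** @var int Unix time at which the URLs are scheduled */
    protected $scheduledTime = 0;
    /** @var int Number of requests per minute */
    protected $reqMinute = 0;
    /** @var bool Whether URLs are submitted to the queue */
    protected $submitCrawlUrls = false;
    /** @var bool Whether URLs are collected for download */
    protected $downloadCrawlUrls = false;

    public function configure($scheduledTime, $reqMinute, $submitCrawlUrls, $downloadCrawlUrls) {
        $this->scheduledTime = (int)$scheduledTime;
        $this->reqMinute = (int)$reqMinute;
        $this->submitCrawlUrls = (bool)$submitCrawlUrls;
        $this->downloadCrawlUrls = (bool)$downloadCrawlUrls;
    }

    public function getReqMinute() {
        return $this->reqMinute;
    }
}
```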
1601
        $this->reqMinute = $reqMinute;
Bug introduced by
The property reqMinute does not exist. Did you maybe forget to declare it?
1602
        $this->submitCrawlUrls = $submitCrawlUrls;
Bug introduced by
The property submitCrawlUrls does not exist. Did you maybe forget to declare it?
1603
        $this->downloadCrawlUrls = $downloadCrawlUrls;
Bug introduced by
The property downloadCrawlUrls does not exist. Did you maybe forget to declare it?
1604
        $this->incomingProcInstructions = $incomingProcInstructions;
1605
        $this->incomingConfigurationSelection = $configurationSelection;
1606
1607
        $this->duplicateTrack = array();
1608
        $this->downloadUrls = array();
1609
1610
            // Drawing tree:
1611
            /* @var $tree \TYPO3\CMS\Backend\Tree\View\PageTreeView */
1612
        $tree = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Backend\Tree\View\PageTreeView');
1613
        $perms_clause = $GLOBALS['BE_USER']->getPagePermsClause(1);
1614
        $tree->init('AND ' . $perms_clause);
1615
1616
        $pageinfo = \TYPO3\CMS\Backend\Utility\BackendUtility::readPageAccess($id, $perms_clause);
1617
        /** @var \TYPO3\CMS\Core\Imaging\IconFactory $iconFactory */
1618
        $iconFactory = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance(IconFactory::class);
1619
1620
            // Set root row:
1621
        $tree->tree[] = [
1622
            'row' => $pageinfo,
1623
            'HTML' => $iconFactory->getIconForRecord('pages', $pageinfo, Icon::SIZE_SMALL)->render()
Security Bug introduced by
It seems like $pageinfo defined by \TYPO3\CMS\Backend\Utili...ess($id, $perms_clause) on line 1616 can also be of type false; however, TYPO3\CMS\Core\Imaging\I...ory::getIconForRecord() does only seem to accept array, did you maybe forget to handle an error condition?

This check looks for type mismatches where the missing type is false. This is usually indicative of an error condition.

Consider the following example:

<?php

function getDate($date)
{
    if ($date !== null) {
        return new DateTime($date);
    }

    return false;
}

This function either returns a new DateTime object or false, if there was an error. This is a typical pattern in PHP programming to show that an error has occurred without raising an exception. The calling code should check for this returned false before passing on the value to another function or method that may not be able to handle a false.
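
For the call flagged above, such a check could look like the following sketch. `readPageAccessStub()` is a hypothetical stand-in for `BackendUtility::readPageAccess()` and is assumed to return either a page row array or false; `buildRootRow()` is likewise illustrative and not part of the crawler.

```php
<?php

/**
 * Hypothetical stand-in: returns a page row, or false when the page is not accessible.
 */
function readPageAccessStub($id)
{
    $pages = [1 => ['uid' => 1, 'title' => 'Home']];

    return isset($pages[$id]) ? $pages[$id] : false;
}

/**
 * Builds the root entry for the tree, or returns null when no page row is available.
 */
function buildRootRow($id)
{
    $pageinfo = readPageAccessStub($id);

    if (!is_array($pageinfo)) {
        // Handle the error condition before the value reaches code that expects an array.
        return null;
    }

    return [
        'row' => $pageinfo,
        'HTML' => htmlspecialchars($pageinfo['title']),
    ];
}
```

The caller then decides what a missing root row means, for example rendering an empty tree or an access message.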

1624
        ];
1625
1626
            // Get branch beneath:
1627
        if ($depth)    {
1628
            $tree->getTree($id, $depth, '');
1629
        }
1630
1631
            // Traverse page tree:
1632
        $code = '';
1633
1634
        foreach ($tree->tree as $data) {
1635
1636
            $this->MP = false;
1637
1638
                // recognize mount points
1639
            if($data['row']['doktype'] == 7){
1640
                $mountpage = $this->db->exec_SELECTgetRows('*', 'pages', 'uid = '.$data['row']['uid']);
1641
1642
                    // fetch mounted pages
1643
                $this->MP = $mountpage[0]['mount_pid'].'-'.$data['row']['uid'];
Documentation Bug introduced by
The property $MP was declared of type boolean, but $mountpage[0]['mount_pid...' . $data['row']['uid'] is of type string. Maybe add a type cast?

This check looks for assignments to scalar types that may be of the wrong type.

To ensure the code behaves as expected, it may be a good idea to add an explicit type cast.

$answer = 42;

$correct = false;

$correct = (bool) $answer;
1644
1645
                $mountTree = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Backend\Tree\View\PageTreeView');
1646
                $mountTree->init('AND '.$perms_clause);
1647
                $mountTree->getTree($mountpage[0]['mount_pid'], $depth, '');
1648
1649
                foreach($mountTree->tree as $mountData)    {
1650
                    $code .= $this->drawURLs_addRowsForPage(
1651
                        $mountData['row'],
1652
                        $mountData['HTML'].\TYPO3\CMS\Backend\Utility\BackendUtility::getRecordTitle('pages',$mountData['row'],TRUE)
1653
                    );
1654
                }
1655
1656
                    // replace page when mount_pid_ol is enabled
1657
                if($mountpage[0]['mount_pid_ol']){
1658
                    $data['row']['uid'] = $mountpage[0]['mount_pid'];
1659
                } else {
1660
                        // if the mount_pid_ol is not set the MP must not be used for the mountpoint page
1661
                    $this->MP = false;
1662
                }
1663
            }
1664
1665
            $code .= $this->drawURLs_addRowsForPage(
1666
                $data['row'],
Security Bug introduced by
It seems like $data['row'] can also be of type false; however, tx_crawler_lib::drawURLs_addRowsForPage() does only seem to accept array, did you maybe forget to handle an error condition?
1667
                $data['HTML'] . \TYPO3\CMS\Backend\Utility\BackendUtility::getRecordTitle('pages', $data['row'], TRUE)
Security Bug introduced by
It seems like $data['row'] can also be of type false; however, TYPO3\CMS\Backend\Utilit...ility::getRecordTitle() does only seem to accept array, did you maybe forget to handle an error condition?
1668
            );
1669
        }
1670
1671
        return $code;
1672
    }
1673
1674
    /**
1675
     * Expands exclude string.
1676
     *
1677
     * @param  string $excludeString    Exclude string
1678
     * @return array                    Array of page ids.
1679
     */
1680
    public function expandExcludeString($excludeString) {
1681
            // internal static caches;
1682
        static $expandedExcludeStringCache;
1683
        static $treeCache;
1684
1685
        if (empty($expandedExcludeStringCache[$excludeString])) {
1686
            $pidList = array();
1687
1688
            if (!empty($excludeString)) {
1689
                /* @var $tree \TYPO3\CMS\Backend\Tree\View\PageTreeView */
1690
                $tree = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Backend\Tree\View\PageTreeView');
1691
                $tree->init('AND ' . $this->backendUser->getPagePermsClause(1));
1692
1693
                $excludeParts = \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $excludeString);
1694
1695
                foreach ($excludeParts as $excludePart) {
1696
                    list($pid, $depth) = \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode('+', $excludePart);
1697
1698
                        // default is "page only" = "depth=0"
1699
                    if (empty($depth)) {
1700
                        $depth = ( stristr($excludePart,'+')) ? 99 : 0;
1701
                    }
1702
1703
                    $pidList[] = $pid;
1704
1705
                    if ($depth > 0) {
1706
                        if (empty($treeCache[$pid][$depth])) {
1707
                            $tree->reset();
1708
                            $tree->getTree($pid, $depth);
1709
                            $treeCache[$pid][$depth] = $tree->tree;
1710
                        }
1711
1712
                        foreach ($treeCache[$pid][$depth] as $data) {
1713
                            $pidList[] = $data['row']['uid'];
1714
                        }
1715
                    }
1716
                }
1717
            }
1718
1719
            $expandedExcludeStringCache[$excludeString] = array_unique($pidList);
1720
        }
1721
1722
        return $expandedExcludeStringCache[$excludeString];
1723
    }
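
The `pid+depth` syntax handled by this method can be illustrated without a page tree. The sketch below uses a simplified, hypothetical helper (`parseExcludeString()` does not exist in the crawler) that only reproduces the depth rules visible in the listing: a bare page id means depth 0, a `+` without a usable number means depth 99, and `pid+n` means depth n. It does not resolve subpages and it does not cache.

```php
<?php

/**
 * Hypothetical helper: splits an exclude string into page id / depth pairs.
 */
function parseExcludeString($excludeString)
{
    $result = [];

    foreach (explode(',', $excludeString) as $part) {
        $part = trim($part);
        if ($part === '') {
            continue;
        }

        // Pad so that a part without "+" still yields two elements.
        list($pid, $depth) = array_pad(array_map('trim', explode('+', $part)), 2, '');

        // Mirrors the listing: an empty depth becomes 99 when a "+" is present, else 0.
        if (empty($depth)) {
            $depth = (strpos($part, '+') !== false) ? 99 : 0;
        }

        $result[] = ['pid' => (int)$pid, 'depth' => (int)$depth];
    }

    return $result;
}
```

Because the check uses `empty()`, as the original does, an explicit `12+0` is treated like `12+` and ends up with depth 99.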
1724
1725
    /**
1726
     * Create the rows for display of the page tree
1727
     * For each page a number of rows are shown displaying GET variable configuration
1728
     *
1729
     * @param    array        Page row
1730
     * @param    string        Page icon and title for row
1731
     * @return    string        HTML <tr> content (one or more)
1732
     */
1733
    public function drawURLs_addRowsForPage(array $pageRow, $pageTitleAndIcon)    {
1734
1735
        $skipMessage = '';
1736
1737
            // Get list of configurations
1738
        $configurations = $this->getUrlsForPageRow($pageRow, $skipMessage);
1739
1740
        if (count($this->incomingConfigurationSelection) > 0) {
1741
                //     remove configuration that does not match the current selection
1742
            foreach ($configurations as $confKey => $confArray) {
1743
                if (!in_array($confKey, $this->incomingConfigurationSelection)) {
1744
                    unset($configurations[$confKey]);
1745
                }
1746
            }
1747
        }
1748
1749
            // Traverse parameter combinations:
1750
        $c = 0;
1751
        $cc = 0;
Unused Code introduced by
$cc is not used, you could remove the assignment.

This check looks for variable assignments that are either overwritten by other assignments or where the variable is not used subsequently.

$myVar = 'Value';
$higher = false;

if (rand(1, 6) > 3) {
    $higher = true;
} else {
    $higher = false;
}

Both the $myVar assignment in line 1 and the $higher assignment in line 2 are dead. The first because $myVar is never used and the second because $higher is always overwritten for every possible time line.

1752
        $content = '';
1753
        if (count($configurations)) {
1754
            foreach($configurations as $confKey => $confArray)    {
1755
1756
                    // Title column:
1757
                if (!$c) {
1758
                    $titleClm = '<td rowspan="'.count($configurations).'">'.$pageTitleAndIcon.'</td>';
1759
                } else {
1760
                    $titleClm = '';
1761
                }
1762
1763
1764
                if (!in_array($pageRow['uid'], $this->expandExcludeString($confArray['subCfg']['exclude']))) {
1765
1766
                        // URL list:
1767
                    $urlList = $this->urlListFromUrlArray(
1768
                        $confArray,
1769
                        $pageRow,
1770
                        $this->scheduledTime,
1771
                        $this->reqMinute,
1772
                        $this->submitCrawlUrls,
1773
                        $this->downloadCrawlUrls,
1774
                        $this->duplicateTrack,
1775
                        $this->downloadUrls,
1776
                        $this->incomingProcInstructions // if empty the urls won't be filtered by processing instructions
1777
                    );
1778
1779
                        // Expanded parameters:
1780
                    $paramExpanded = '';
1781
                    $calcAccu = array();
1782
                    $calcRes = 1;
1783
                    foreach($confArray['paramExpanded'] as $gVar => $gVal)    {
1784
                        $paramExpanded.= '
1785
                            <tr>
1786
                                <td class="bgColor4-20">'.htmlspecialchars('&'.$gVar.'=').'<br/>'.
1787
                                                '('.count($gVal).')'.
1788
                                                '</td>
1789
                                <td class="bgColor4" nowrap="nowrap">'.nl2br(htmlspecialchars(implode(chr(10),$gVal))).'</td>
1790
                            </tr>
1791
                        ';
1792
                        $calcRes*= count($gVal);
1793
                        $calcAccu[] = count($gVal);
1794
                    }
1795
                    $paramExpanded = '<table class="lrPadding c-list param-expanded">'.$paramExpanded.'</table>';
1796
                    $paramExpanded.= 'Comb: '.implode('*',$calcAccu).'='.$calcRes;
1797
1798
                        // Options
1799
                    $optionValues = '';
1800
                    if ($confArray['subCfg']['userGroups'])    {
1801
                        $optionValues.='User Groups: '.$confArray['subCfg']['userGroups'].'<br/>';
1802
                    }
1803
                    if ($confArray['subCfg']['baseUrl'])    {
1804
                        $optionValues.='Base Url: '.$confArray['subCfg']['baseUrl'].'<br/>';
1805
                    }
1806
                    if ($confArray['subCfg']['procInstrFilter'])    {
1807
                        $optionValues.='ProcInstr: '.$confArray['subCfg']['procInstrFilter'].'<br/>';
1808
                    }
1809
1810
                        // Compile row:
1811
                    $content .= '
1812
                        <tr class="bgColor' . ($c%2 ? '-20':'-10') . '">
1813
                            ' . $titleClm . '
1814
                            <td>' . htmlspecialchars($confKey) . '</td>
1815
                            <td>' . nl2br(htmlspecialchars(rawurldecode(trim(str_replace('&', chr(10) . '&', \TYPO3\CMS\Core\Utility\GeneralUtility::implodeArrayForUrl('', $confArray['paramParsed'])))))) . '</td>
1816
                            <td>'.$paramExpanded.'</td>
1817
                            <td nowrap="nowrap">' . $urlList . '</td>
1818
                            <td nowrap="nowrap">' . $optionValues . '</td>
1819
                            <td nowrap="nowrap">' . \TYPO3\CMS\Core\Utility\DebugUtility::viewArray($confArray['subCfg']['procInstrParams.']) . '</td>
1820
                        </tr>';
1821
                } else {
1822
1823
                    $content .= '<tr class="bgColor'.($c%2 ? '-20':'-10') . '">
1824
                            '.$titleClm.'
1825
                            <td>'.htmlspecialchars($confKey).'</td>
1826
                            <td colspan="5"><em>No entries</em> (Page is excluded in this configuration)</td>
1827
                        </tr>';
1828
1829
                }
1830
1831
1832
                $c++;
1833
            }
1834
        } else {
1835
            $message = !empty($skipMessage) ? ' ('.$skipMessage.')' : '';
1836
1837
                // Compile row:
1838
            $content.= '
1839
                <tr class="bgColor-20" style="border-bottom: 1px solid black;">
1840
                    <td>'.$pageTitleAndIcon.'</td>
1841
                    <td colspan="6"><em>No entries</em>'.$message.'</td>
1842
                </tr>';
1843
        }
1844
1845
        return $content;
1846
    }
1847
1848
    /**
1849
     *
1850
     * @return int
1851
     */
1852
    function getUnprocessedItemsCount() {
1853
        $res = $this->db->exec_SELECTquery(
1854
                    'count(*) as num',
1855
                    'tx_crawler_queue',
1856
                    'exec_time=0
1857
                    AND process_scheduled= 0
1858
                    AND scheduled<='.$this->getCurrentTime()
1859
        );
1860
1861
        $count = $this->db->sql_fetch_assoc($res);
1862
        return $count['num'];
1863
    }
1864
1865
1866
1867
1868
1869
1870
1871
1872
    /*****************************
1873
     *
1874
     * CLI functions
1875
     *
1876
     *****************************/
1877
1878
    /**
1879
     * Main function for running from Command Line PHP script (cron job)
1880
     * See ext/crawler/cli/crawler_cli.phpsh for details
1881
     *
1882
     * @return    int Status flags (bitwise combination of the CLI_STATUS_* constants)
1883
     */
1884
    function CLI_main() {
1885
        $this->setAccessMode('cli');
1886
        $result = self::CLI_STATUS_NOTHING_PROCCESSED;
1887
        $cliObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('tx_crawler_cli');
1888
1889
        if (isset($cliObj->cli_args['-h']) || isset($cliObj->cli_args['--help'])) {
1890
            $cliObj->cli_validateArgs();
1891
            $cliObj->cli_help();
1892
            exit;
Coding Style Compatibility introduced by
The method CLI_main() contains an exit expression.

An exit expression should only be used in rare cases. For example, if you write a short command line script.

In most cases however, using an exit expression makes the code untestable and often causes incompatibilities with other libraries. Thus, unless you are absolutely sure it is required here, we recommend to refactor your code to avoid its usage.
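
One way to follow this recommendation is to return a status code from the method and let the outermost script decide whether to terminate. The sketch below assumes a hypothetical `runCli()` function and hypothetical status constants; none of these names, nor the help text, come from the crawler extension.

```php
<?php

// Hypothetical status codes for illustration only.
const STATUS_OK = 0;
const STATUS_HELP_SHOWN = 1;

/**
 * Hypothetical entry function: prints help when requested and reports what it did.
 */
function runCli(array $args)
{
    if (in_array('-h', $args, true) || in_array('--help', $args, true)) {
        // Placeholder help text.
        echo "Usage: <script> [options]\n";

        // Return instead of exit so that callers and tests keep control.
        return STATUS_HELP_SHOWN;
    }

    // ... regular processing would happen here ...

    return STATUS_OK;
}
```

Only the top-level script would then terminate the process, for example with `exit(runCli($argv));`, which keeps the function itself testable.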

1893
        }
1894
1895
        if (!$this->getDisabled() && $this->CLI_checkAndAcquireNewProcess($this->CLI_buildProcessId())) {
1896
            $countInARun = $cliObj->cli_argValue('--countInARun') ? intval($cliObj->cli_argValue('--countInARun')) : $this->extensionSettings['countInARun'];
1897
                // Seconds
1898
            $sleepAfterFinish = $cliObj->cli_argValue('--sleepAfterFinish') ? intval($cliObj->cli_argValue('--sleepAfterFinish')) : $this->extensionSettings['sleepAfterFinish'];
1899
                // Milliseconds
1900
            $sleepTime = $cliObj->cli_argValue('--sleepTime') ? intval($cliObj->cli_argValue('--sleepTime')) : $this->extensionSettings['sleepTime'];
1901
1902
            try {
1903
                    // Run process:
1904
                $result = $this->CLI_run($countInARun, $sleepTime, $sleepAfterFinish);
1905
            } catch (Exception $e) {
1906
                $result = self::CLI_STATUS_ABORTED;
1907
            }
1908
1909
                // Cleanup
1910
            $this->db->exec_DELETEquery('tx_crawler_process', 'assigned_items_count = 0');
1911
1912
                //TODO can't we do that in a clean way?
1913
            $releaseStatus = $this->CLI_releaseProcesses($this->CLI_buildProcessId());
Unused Code introduced by
$releaseStatus is not used, you could remove the assignment.

1914
1915
            $this->CLI_debug("Unprocessed Items remaining:".$this->getUnprocessedItemsCount()." (".$this->CLI_buildProcessId().")");
1916
            $result |= ( $this->getUnprocessedItemsCount() > 0 ? self::CLI_STATUS_REMAIN : self::CLI_STATUS_NOTHING_PROCCESSED );
1917
        } else {
1918
            $result |= self::CLI_STATUS_ABORTED;
1919
        }
1920
1921
        return $result;
1922
    }
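
The `|=` operations in `CLI_main()` suggest that the status constants are combined as bit flags. Their values are not shown in this part of the listing, so the constants below are assumptions chosen only for illustration, and `describeStatus()` is a hypothetical helper.

```php
<?php

// Assumed values; the real CLI_STATUS_* constants are defined elsewhere in the class.
const STATUS_NOTHING_PROCESSED = 0;
const STATUS_REMAIN = 1;
const STATUS_PROCESSED = 2;
const STATUS_ABORTED = 4;

/**
 * Hypothetical helper: lists the flags that are set in a combined status value.
 */
function describeStatus($result)
{
    $labels = [];

    if ($result & STATUS_REMAIN) {
        $labels[] = 'remain';
    }
    if ($result & STATUS_PROCESSED) {
        $labels[] = 'processed';
    }
    if ($result & STATUS_ABORTED) {
        $labels[] = 'aborted';
    }

    return $labels ? implode(',', $labels) : 'nothing processed';
}

// Combining flags the same way the method does with "|=".
$result = STATUS_PROCESSED;
$result |= STATUS_REMAIN;
```

With these assumed values, OR-ing in a zero-valued "nothing processed" constant leaves the result unchanged, so only the non-zero flags carry information.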
1923
1924
    /**
1925
     * Function executed by crawler_im.php cli script.
1926
     *
1927
     * @return    void
1928
     */
1929
    function CLI_main_im()    {
1930
        $this->setAccessMode('cli_im');
1931
1932
        $cliObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('tx_crawler_cli_im');
1933
1934
            // Force user to admin state and set workspace to "Live":
1935
        $this->backendUser->user['admin'] = 1;
1936
        $this->backendUser->setWorkspace(0);
1937
1938
            // Print help
1939
        if (!isset($cliObj->cli_args['_DEFAULT'][1]))    {
1940
            $cliObj->cli_validateArgs();
1941
            $cliObj->cli_help();
1942
            exit;
Coding Style Compatibility introduced by
The method CLI_main_im() contains an exit expression.

1943
        }
1944
1945
        $cliObj->cli_validateArgs();
1946
1947
        if ($cliObj->cli_argValue('-o')==='exec')    {
1948
            $this->registerQueueEntriesInternallyOnly=TRUE;
Documentation Bug introduced by
It seems like TRUE of type boolean is incompatible with the declared type array of property $registerQueueEntriesInternallyOnly.

Our type inference engine has found an assignment to a property that is incompatible with the declared type of that property.

Either this assignment is in error or the assigned type should be added to the documentation/type hint for that property.

1949
        }
1950
1951
        if (isset($cliObj->cli_args['_DEFAULT'][2])) {
1952
            // Crawler is called over TYPO3 BE
1953
            $pageId = \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][2], 0);
1954
        } else {
1955
            // Crawler is called over cli
1956
            $pageId = \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][1], 0);
1957
        }
1958
1959
        $configurationKeys  = $this->getConfigurationKeys($cliObj);
1960
1961
        if(!is_array($configurationKeys)){
1962
            $configurations = $this->getUrlsForPageId($pageId);
1963
            if(is_array($configurations)){
1964
                $configurationKeys = array_keys($configurations);
1965
            }else{
1966
                $configurationKeys = array();
1967
            }
1968
        }
1969
1970
        if($cliObj->cli_argValue('-o')==='queue' || $cliObj->cli_argValue('-o')==='exec'){
1971
1972
            $reason = new tx_crawler_domain_reason();
1973
            $reason->setReason(tx_crawler_domain_reason::REASON_GUI_SUBMIT);
1974
            $reason->setDetailText('The cli script of the crawler added to the queue');
1975
            tx_crawler_domain_events_dispatcher::getInstance()->post(
1976
                'invokeQueueChange',
1977
                $this->setID,
1978
                array(    'reason' => $reason )
1979
            );
1980
        }
1981
1982
        if ($this->extensionSettings['cleanUpOldQueueEntries']) {
1983
            $this->cleanUpOldQueueEntries();
1984
        }
1985
1986
        $this->setID = \TYPO3\CMS\Core\Utility\GeneralUtility::md5int(microtime());
Documentation Bug introduced by
It seems like \TYPO3\CMS\Core\Utility\...ty::md5int(microtime()) can also be of type double. However, the property $setID is declared as type integer. Maybe add an additional type check?

Our type inference engine has found a suspicious assignment of a value to a property. This check raises an issue when a value that can be of a mixed type is assigned to a property that is type hinted more strictly.

For example, imagine you have a variable $accountId that can either hold an Id object or false (if there is no account id yet). Your code now assigns that value to the id property of an instance of the Account class. This class holds a proper account, so the id value must no longer be false.

Either this assignment is in error or a type check should be added for that assignment.

class Id
{
    public $id;

    public function __construct($id)
    {
        $this->id = $id;
    }

}

class Account
{
    /** @var  Id $id */
    public $id;
}

$account_id = false;

if (starsAreRight()) {
    $account_id = new Id(42);
}

$account = new Account();
if ($account_id instanceof Id)
{
    $account->id = $account_id;
}
1987
        $this->getPageTreeAndUrls(
1988
            $pageId,
1989
            \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($cliObj->cli_argValue('-d'),0,99),
1990
            $this->getCurrentTime(),
1991
            \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($cliObj->cli_isArg('-n') ? $cliObj->cli_argValue('-n') : 30,1,1000),
1992
            $cliObj->cli_argValue('-o')==='queue' || $cliObj->cli_argValue('-o')==='exec',
1993
            $cliObj->cli_argValue('-o')==='url',
1994
            \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',',$cliObj->cli_argValue('-proc'),1),
Documentation introduced by
1 is of type integer, but the function expects a boolean.

It seems like the type of the argument is not accepted by the function/method which you are calling.

In some cases, in particular if PHP’s automatic type-juggling kicks in this might be fine. In other cases, however this might be a bug.

We suggest to add an explicit type cast like in the following example:

function acceptsInteger($int) { }

$x = '123'; // string "123"

// Instead of
acceptsInteger($x);

// we recommend to use
acceptsInteger((integer) $x);
1995
            $configurationKeys
1996
        );
1997
1998
        if ($cliObj->cli_argValue('-o')==='url') {
1999
            $cliObj->cli_echo(implode(chr(10),$this->downloadUrls).chr(10),1);
Documentation introduced by
1 is of type integer, but the function expects a boolean.

2000
        } elseif ($cliObj->cli_argValue('-o')==='exec')    {
2001
            $cliObj->cli_echo("Executing ".count($this->urlList)." requests right away:\n\n");
2002
            $cliObj->cli_echo(implode(chr(10),$this->urlList).chr(10));
2003
            $cliObj->cli_echo("\nProcessing:\n");
2004
2005
            foreach($this->queueEntries as $queueRec)    {
2006
                $p = unserialize($queueRec['parameters']);
2007
                $cliObj->cli_echo($p['url'].' ('.implode(',',$p['procInstructions']).') => ');
2008
2009
                $result = $this->readUrlFromArray($queueRec);
2010
2011
                $requestResult = unserialize($result['content']);
2012
                if (is_array($requestResult))    {
2013
                    $resLog = is_array($requestResult['log']) ?  chr(10).chr(9).chr(9).implode(chr(10).chr(9).chr(9),$requestResult['log']) : '';
2014
                    $cliObj->cli_echo('OK: '.$resLog.chr(10));
2015
                } else {
2016
                    $cliObj->cli_echo('Error checking Crawler Result: '.substr(preg_replace('/\s+/',' ',strip_tags($result['content'])),0,30000).'...'.chr(10));
2017
                }
2018
            }
2019
        } elseif ($cliObj->cli_argValue('-o')==='queue')    {
2020
            $cliObj->cli_echo("Putting ".count($this->urlList)." entries in queue:\n\n");
2021
            $cliObj->cli_echo(implode(chr(10),$this->urlList).chr(10));
2022
        } else {
2023
            $cliObj->cli_echo(count($this->urlList)." entries found for processing. (Use -o to decide action):\n\n",1);
Documentation introduced by
1 is of type integer, but the function expects a boolean.

2024
            $cliObj->cli_echo(implode(chr(10),$this->urlList).chr(10),1);
Documentation introduced by
1 is of type integer, but the function expects a boolean.

2025
        }
2026
    }
2027
2028
    /**
2029
     * Function executed by crawler_im.php cli script.
2030
     *
2031
     * @return bool
2032
     */
2033
    function CLI_main_flush() {
2034
        $this->setAccessMode('cli_flush');
2035
        $cliObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('tx_crawler_cli_flush');
2036
2037
            // Force user to admin state and set workspace to "Live":
2038
        $this->backendUser->user['admin'] = 1;
2039
        $this->backendUser->setWorkspace(0);
2040
2041
            // Print help
2042
        if (!isset($cliObj->cli_args['_DEFAULT'][1])) {
2043
            $cliObj->cli_validateArgs();
2044
            $cliObj->cli_help();
2045
            exit;
Coding Style Compatibility introduced by
The method CLI_main_flush() contains an exit expression.

2046
        }
2047
2048
        $cliObj->cli_validateArgs();
2049
        $pageId = \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][1],0);
2050
        $fullFlush = ($pageId == 0);
2051
2052
        $mode = $cliObj->cli_argValue('-o');
2053
2054
        switch($mode) {
2055
            case 'all':
2056
                $result = $this->getLogEntriesForPageId($pageId, '', true, $fullFlush);
2057
                break;
2058
            case 'finished':
2059
            case 'pending':
2060
                $result = $this->getLogEntriesForPageId($pageId, $mode, true, $fullFlush);
2061
                break;
2062
            default:
2063
                $cliObj->cli_validateArgs();
2064
                $cliObj->cli_help();
2065
                $result = false;
2066
        }
2067
2068
        return $result !== false;
2069
    }
2070
2071
    /**
2072
     * Obtains configuration keys from the CLI arguments
2073
     *
2074
     * @param  tx_crawler_cli_im $cliObj    Command line object
2075
     * @return array                        Array of keys, or an empty array if no keys are found
2076
     */
2077
    protected function getConfigurationKeys(tx_crawler_cli_im &$cliObj) {
2078
        $parameter = trim($cliObj->cli_argValue('-conf'));
2079
        return ($parameter != '' ? \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $parameter) : array());
2080
    }
2081
2082
    /**
2083
     * Running the functionality of the CLI (crawling URLs from queue)
2084
     *
2085
     * @param  int $countInARun
2086
     * @param  int $sleepTime
2087
     * @param  int $sleepAfterFinish
2088
     * @return string                   Status message
2089
     */
2090
    public function CLI_run($countInARun, $sleepTime, $sleepAfterFinish) {
2091
        $result = 0;
2092
        $counter = 0;
2093
2094
            // First, run hooks:
2095
        $this->CLI_runHooks();
2096
2097
            // Clean up the queue
2098
        if (intval($this->extensionSettings['purgeQueueDays']) > 0) {
2099
            $purgeDate = $this->getCurrentTime() - 24 * 60 * 60 * intval($this->extensionSettings['purgeQueueDays']);
2100
            $del = $this->db->exec_DELETEquery(
Unused Code introduced by
$del is not used, you could remove the assignment.

                'tx_crawler_queue',
                'exec_time!=0 AND exec_time<' . $purgeDate
            );
        }

            // Select entries:
            //TODO Shouldn't this reside within the transaction?
        $rows = $this->db->exec_SELECTgetRows(
            'qid,scheduled',
            'tx_crawler_queue',
            'exec_time=0
                AND process_scheduled= 0
                AND scheduled<='.$this->getCurrentTime(),
            '',
            'scheduled, qid',
            intval($countInARun)
        );

            // exec_SELECTgetRows() may return null, so check the type before counting
        if (is_array($rows) && count($rows) > 0) {
            $quidList = array();

            foreach($rows as $r) {
                $quidList[] = $r['qid'];
            }

            $processId = $this->CLI_buildProcessId();

                // reserve queue entries for this process
            $this->db->sql_query('BEGIN');
                //TODO make sure we're not taking assigned queue entries
            $this->db->exec_UPDATEquery(
                'tx_crawler_queue',
                'qid IN ('.implode(',',$quidList).')',
                array(
                    'process_scheduled' => intval($this->getCurrentTime()),
                    'process_id' => $processId
                )
            );

                // save the number of assigned queue entries to determine how many have been processed later
            $numberOfAffectedRows = $this->db->sql_affected_rows();
            $this->db->exec_UPDATEquery(
                'tx_crawler_process',
                "process_id = '".$processId."'" ,
                array(
                    'assigned_items_count' => intval($numberOfAffectedRows)
                )
            );

            if($numberOfAffectedRows == count($quidList)) {
                $this->db->sql_query('COMMIT');
            } else  {
                $this->db->sql_query('ROLLBACK');
                $this->CLI_debug("Nothing processed due to multi-process collision (".$this->CLI_buildProcessId().")");
                return ( $result | self::CLI_STATUS_ABORTED );
            }

            foreach($rows as $r)    {
                $result |= $this->readUrl($r['qid']);

                $counter++;
                usleep(intval($sleepTime));    // Just to relax the system

                    // if the CLI has been disabled between the start of the run and the current readUrl() call,
                    // return from the function without marking the process as ended
                if ($this->getDisabled()) {
                    return ( $result | self::CLI_STATUS_ABORTED );
                }

                if (!$this->CLI_checkIfProcessIsActive($this->CLI_buildProcessId())) {
                    $this->CLI_debug("conflict / timeout (".$this->CLI_buildProcessId().")");

                        //TODO might need an additional returncode
                    $result |= self::CLI_STATUS_ABORTED;
                    break;        //possible timeout
                }
            }

            sleep(intval($sleepAfterFinish));

            $msg = 'Rows: '.$counter;
            $this->CLI_debug($msg." (".$this->CLI_buildProcessId().")");

        } else {
            $this->CLI_debug("Nothing within queue which needs to be processed (".$this->CLI_buildProcessId().")");
        }

        if($counter > 0) {
            $result |= self::CLI_STATUS_PROCESSED;
        }

        return $result;
    }

    /**
     * Activate hooks: calls crawler_init() on every hook object registered in
     * $TYPO3_CONF_VARS['EXTCONF']['crawler']['cli_hooks']
     *
     * @return    void
     */
    function CLI_runHooks()    {
        global $TYPO3_CONF_VARS;
        if (is_array($TYPO3_CONF_VARS['EXTCONF']['crawler']['cli_hooks']))    {
            foreach($TYPO3_CONF_VARS['EXTCONF']['crawler']['cli_hooks'] as $objRef)    {
                $hookObj = \TYPO3\CMS\Core\Utility\GeneralUtility::getUserObj($objRef);
                if (is_object($hookObj))    {
                    $hookObj->crawler_init($this);
                }
            }
        }
    }

    /**
     * Try to acquire a new process with the given id;
     * also performs some auto-cleanup for orphaned processes
     * @todo preemption might not be the most elegant way to clean up
     *
     * @param  string    $id  identification string for the process
     * @return boolean        determines whether the attempt to get resources was successful
     */
    function CLI_checkAndAcquireNewProcess($id) {

        $ret = true;

        $systemProcessId = getmypid();
        if ($systemProcessId < 1) {
            return FALSE;
        }

        $processCount = 0;
        $orphanProcesses = array();

        $this->db->sql_query('BEGIN');

        $res = $this->db->exec_SELECTquery(
            'process_id,ttl',
            'tx_crawler_process',
            'active=1 AND deleted=0'
        );

        $currentTime = $this->getCurrentTime();

        while($row = $this->db->sql_fetch_assoc($res))    {
            if ($row['ttl'] < $currentTime) {
                $orphanProcesses[] = $row['process_id'];
            } else {
                $processCount++;
            }
        }

            // if there are fewer active processes than allowed, add a new one
        if ($processCount < intval($this->extensionSettings['processLimit'])) {
            $this->CLI_debug("add ".$this->CLI_buildProcessId()." (".($processCount+1)."/".intval($this->extensionSettings['processLimit']).")");

                // create new process record
            $this->db->exec_INSERTquery(
                'tx_crawler_process',
                array(
                    'process_id' => $id,
                    'active'=>'1',
                    'ttl' => ($currentTime + intval($this->extensionSettings['processMaxRunTime'])),
                    'system_process_id' => $systemProcessId
                )
            );

        } else {
            $this->CLI_debug("Processlimit reached (".($processCount)."/".intval($this->extensionSettings['processLimit']).")");
            $ret = false;
        }

        $this->CLI_releaseProcesses($orphanProcesses, true); // maybe this should be somehow included into the current lock
        $this->CLI_deleteProcessesMarkedDeleted();

        $this->db->sql_query('COMMIT');

        return $ret;
    }

    /**
     * Release a process and the required resources
     *
     * @param  mixed    $releaseIds   string with a single process-id or array with multiple process-ids
     * @param  boolean  $withinLock   indicates whether the DB actions are already wrapped in an existing transaction
     * @return boolean                false if there was nothing to release, true otherwise
     */
    function CLI_releaseProcesses($releaseIds, $withinLock=false) {

        if (!is_array($releaseIds)) {
            $releaseIds = array($releaseIds);
        }

        if (count($releaseIds) === 0) {
            return false;   //nothing to release
        }

        if(!$withinLock) $this->db->sql_query('BEGIN');

            // some kind of 2nd chance algo - this way you need at least 2 processes to have a real cleanup
            // this ensures that a single process can't mess up the entire process table

            // free the queue entries that are still assigned to inactive processes
        $this->db->exec_UPDATEquery(
            'tx_crawler_queue',
            'process_id IN (SELECT process_id FROM tx_crawler_process WHERE active=0 AND deleted=0)',
            array(
                'process_scheduled' => 0,
                'process_id' => ''
            )
        );
            // mark all processes as deleted which have no "waiting" queue entries and which are not active
        $this->db->exec_UPDATEquery(
            'tx_crawler_process',
            'active=0 AND deleted=0
            AND NOT EXISTS (
                SELECT * FROM tx_crawler_queue
                WHERE tx_crawler_queue.process_id = tx_crawler_process.process_id
                AND tx_crawler_queue.exec_time = 0
            )',
            array(
                'deleted'=>'1',
                'system_process_id' => 0
            )
        );
            // mark all requested processes as non-active
        $this->db->exec_UPDATEquery(
            'tx_crawler_process',
            'process_id IN (\''.implode('\',\'',$releaseIds).'\') AND deleted=0',
            array(
                'active'=>'0'
            )
        );
            // release the unprocessed queue entries of the requested processes
        $this->db->exec_UPDATEquery(
            'tx_crawler_queue',
            'exec_time=0 AND process_id IN ("'.implode('","',$releaseIds).'")',
            array(
                'process_scheduled'=>0,
                'process_id'=>''
            )
        );

        if(!$withinLock) $this->db->sql_query('COMMIT');

        return true;
    }

    /**
     * Delete processes marked as deleted
     *
     * @return void
     */
    public function CLI_deleteProcessesMarkedDeleted() {
        $this->db->exec_DELETEquery('tx_crawler_process', 'deleted = 1');
    }

    /**
     * Check if there are still resources left for the process with the given id
     * Used to determine timeouts and to ensure a proper cleanup if there's a timeout
     *
     * @param  string  $pid  identification string for the process
     * @return boolean       determines if the process is still active / has resources
     *
     * FIXME: Please remove Transaction, not needed as only a select query.
     */
    function CLI_checkIfProcessIsActive($pid) {
        $ret = false;
        $this->db->sql_query('BEGIN');
        $res = $this->db->exec_SELECTquery(
            'process_id,active,ttl',
            'tx_crawler_process',
            'process_id = \''.$pid.'\'  AND deleted=0',
            '',
            'ttl',
            '0,1'
        );
        if($row = $this->db->sql_fetch_assoc($res))    {
            $ret = intVal($row['active'])==1;
        }
        $this->db->sql_query('COMMIT');

        return $ret;
    }

    /**
     * Create a unique id for the current process
     * (generated once from the current microtime, then reused on later calls)
     *
     * @return string  the ID
     */
    function CLI_buildProcessId() {
        if(!$this->processID) {
            $this->processID= \TYPO3\CMS\Core\Utility\GeneralUtility::shortMD5($this->microtime(true));
        }
        return $this->processID;
    }

    /**
     * Wrapper around PHP's microtime()
     *
     * @param bool $get_as_float
     *
     * @return mixed  float if $get_as_float is true, otherwise a string
     */
    protected function microtime($get_as_float = false )
    {
        return microtime($get_as_float);
    }

    /**
     * Prints a message to the stdout (only if debug-mode is enabled)
     *
     * @param  string $msg  the message
     * @return void
     */
    function CLI_debug($msg) {
        if(intval($this->extensionSettings['processDebug'])) {
            echo $msg."\n";
            flush();
        }
    }

    /**
     * Get URL content by making direct request to TYPO3.
     *
     * @param  string $url          Page URL
     * @param  int    $crawlerId    Crawler-ID
     * @return array                Array with the keys 'request', 'headers' and 'content'
     */
    protected function sendDirectRequest($url, $crawlerId) {
            // NOTE: parse_url() returns false for seriously malformed URLs; that case is not handled here
        $requestHeaders = $this->buildRequestHeaderArray(parse_url($url), $crawlerId);

        $cmd  = escapeshellcmd($this->extensionSettings['phpPath']);
        $cmd .= ' ';
        $cmd .= escapeshellarg(\TYPO3\CMS\Core\Utility\ExtensionManagementUtility::extPath('crawler') . 'cli/bootstrap.php');
        $cmd .= ' ';
        $cmd .= escapeshellarg($this->getFrontendBasePath());
        $cmd .= ' ';
        $cmd .= escapeshellarg($url);
        $cmd .= ' ';
        $cmd .= escapeshellarg(base64_encode(serialize($requestHeaders)));

        $startTime = microtime(true);
        $content = $this->executeShellCommand($cmd);
            // log the URL followed by the elapsed time in seconds
        $this->log($url . ' ' . (microtime(true) - $startTime));

        $result = array(
            'request' => implode("\r\n", $requestHeaders) . "\r\n\r\n",
            'headers' => '',
            'content' => $content
        );

        return $result;
    }

    /**
     * Cleans up entries that stayed for too long in the queue. These are:
     * - processed entries that are over 1.5 days in age
     * - scheduled entries that are over 7 days old
     * The actual age limits (in days) are read from the extension settings
     * cleanUpProcessedAge and cleanUpScheduledAge.
     *
     * @return void
     */
    protected function cleanUpOldQueueEntries() {
        $processedAgeInSeconds = $this->extensionSettings['cleanUpProcessedAge'] * 86400; // 24*60*60 Seconds in 24 hours
        $scheduledAgeInSeconds = $this->extensionSettings['cleanUpScheduledAge'] * 86400;

        $now = time();
        $condition = '(exec_time<>0 AND exec_time<' . ($now - $processedAgeInSeconds) . ') OR scheduled<=' . ($now - $scheduledAgeInSeconds);
        $this->flushQueue($condition);
    }

    /**
     * Initializes a TypoScript Frontend necessary for using TypoScript and TypoLink functions
     *
     * @param int $id
     * @param int $typeNum
     *
     * @return void
     */
    protected function initTSFE($id = 1, $typeNum = 0) {
        \TYPO3\CMS\Frontend\Utility\EidUtility::initTCA();
        if (!is_object($GLOBALS['TT'])) {
            $GLOBALS['TT'] = new \TYPO3\CMS\Core\TimeTracker\NullTimeTracker;
            $GLOBALS['TT']->start();
        }

        $GLOBALS['TSFE'] = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\\CMS\\Frontend\\Controller\\TypoScriptFrontendController',  $GLOBALS['TYPO3_CONF_VARS'], $id, $typeNum);
        $GLOBALS['TSFE']->sys_page = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\\CMS\\Frontend\\Page\\PageRepository');
        $GLOBALS['TSFE']->sys_page->init(TRUE);
        $GLOBALS['TSFE']->connectToDB();
        $GLOBALS['TSFE']->initFEuser();
        $GLOBALS['TSFE']->determineId();
        $GLOBALS['TSFE']->initTemplate();
        $GLOBALS['TSFE']->rootLine = $GLOBALS['TSFE']->sys_page->getRootLine($id, '');
        $GLOBALS['TSFE']->getConfigArray();
        \TYPO3\CMS\Frontend\Page\PageGenerator::pagegenInit();
    }
}

if (defined('TYPO3_MODE') && $TYPO3_CONF_VARS[TYPO3_MODE]['XCLASS']['ext/crawler/class.tx_crawler_lib.php'])    {
    include_once($TYPO3_CONF_VARS[TYPO3_MODE]['XCLASS']['ext/crawler/class.tx_crawler_lib.php']);
}