Completed
Push — issue/92 ( ebb8c8...de325a )
by Tomas Norre
09:56
created

CrawlerController   F

Complexity

Total Complexity 356

Size/Duplication

Total Lines 2627
Duplicated Lines 0 %

Coupling/Cohesion

Components 1
Dependencies 19

Test Coverage

Coverage 38.57%

Importance

Changes 0
Metric Value
dl 0
loc 2627
ccs 429
cts 1112
cp 0.3857
rs 0.6314
c 0
b 0
f 0
wmc 356
lcom 1
cbo 19
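
The report does not define the metric abbreviations. As a rough sketch, assuming `ccs` is the number of covered statements and `cts` the total number of statements (an interpretation, not something the report states), `cp` would be their ratio, truncated to four decimals:

```php
<?php
// Assumption: ccs = covered statements, cts = total statements.
// The report does not define these abbreviations.
$ccs = 429;
$cts = 1112;

// Truncate rather than round to four decimals; truncation is what
// reproduces the reported value of 0.3857.
$cp = floor($ccs / $cts * 10000) / 10000;

printf("cp = %.4f (%.2f%%)\n", $cp, $cp * 100);
```

Under that assumption the ratio agrees with both `cp 0.3857` and the "Coverage 38.57%" figure above.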

60 Methods

Rating   Name   Duplication   Size   Complexity  
A getAccessMode() 0 4 1
A setAccessMode() 0 4 1
A setDisabled() 0 10 3
A getDisabled() 0 8 2
A setProcessFilename() 0 4 1
A getProcessFilename() 0 4 1
A __construct() 0 22 3
A setExtensionSettings() 0 4 1
D checkIfPageShouldBeSkipped() 0 57 16
A getUrlsForPageRow() 0 15 3
A noUnprocessedQueueEntriesForPageWithConfigurationHashExist() 0 8 1
D urlListFromUrlArray() 0 115 21
A drawURLs_PIfilter() 0 12 4
B getPageTSconfigForId() 0 22 4
D getUrlsForPageId() 0 136 26
A getBaseUrlForConfigurationRecord() 0 20 4
C getConfigurationsForBranch() 0 45 11
A hasGroupAccess() 0 12 4
A parseParams() 0 15 3
F expandParameters() 0 110 24
C compileUrls() 0 25 7
B getLogEntriesForPageId() 0 29 6
B getLogEntriesForSetId() 0 29 6
B flushQueue() 0 15 5
A addQueueEntry_callBack() 0 19 3
B addUrl() 0 67 6
C getDuplicateRowsIfExist() 0 40 7
A getCurrentTime() 0 4 1
C readUrl() 0 80 13
A readUrlFromArray() 0 23 1
B readUrl_exec() 0 29 4
D requestUrl() 0 93 19
C getFrontendBasePath() 0 23 8
A executeShellCommand() 0 5 1
B getHttpResponseFromStream() 0 23 5
A log() 0 9 3
A buildRequestHeaderArray() 0 16 4
C getRequestUrlFrom302Header() 0 34 11
A fe_init() 0 17 4
C getPageTreeAndUrls() 0 87 8
D expandExcludeString() 0 45 9
D drawURLs_addRowsForPage() 0 109 15
D CLI_main() 0 41 10
F CLI_main_im() 0 98 17
B CLI_main_flush() 0 38 5
A getConfigurationKeys() 0 5 2
D CLI_run() 0 107 10
A CLI_runHooks() 0 12 4
B CLI_checkAndAcquireNewProcess() 0 56 5
B CLI_releaseProcesses() 0 62 5
A CLI_deleteProcessesMarkedDeleted() 0 4 1
A CLI_checkIfProcessIsActive() 0 19 2
A CLI_buildProcessId() 0 7 2
A microtime() 0 4 1
A CLI_debug() 0 7 2
B sendDirectRequest() 0 31 2
A cleanUpOldQueueEntries() 0 9 1
A initTSFE() 0 19 2
A getConfigurationHash() 0 5 1
A isCrawlingProtocolHttps() 0 12 4
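
The "Total Complexity" of 356 (reported as `wmc` above) matches the sum of the Complexity column in this table. The sketch below re-adds the column and counts the methods at or above a threshold of 15; that threshold is an arbitrary choice for illustration, not one the report uses.

```php
<?php
// Complexity column of the method table, in listing order.
$complexities = [
    1, 1, 3, 2, 1, 1, 3, 1, 16, 3,
    1, 21, 4, 4, 26, 4, 11, 4, 3, 24,
    7, 6, 6, 5, 3, 6, 7, 1, 13, 1,
    4, 19, 8, 1, 5, 3, 4, 11, 4, 8,
    9, 15, 10, 17, 5, 2, 10, 4, 5, 5,
    1, 2, 2, 1, 2, 2, 1, 2, 1, 4,
];

$total = array_sum($complexities);

// Hypothetical threshold, chosen only to highlight the heaviest methods.
$hotspots = array_filter($complexities, function ($complexity) {
    return $complexity >= 15;
});

printf(
    "%d methods, total complexity %d, %d methods with complexity >= 15\n",
    count($complexities),
    $total,
    count($hotspots)
);
```

A small number of methods accounts for a large share of the total, which is where a refactoring would likely pay off first.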

How to fix   Complexity   

Complex Class

Complex classes like CrawlerController often do a lot of different things. To break such a class down, identify a cohesive component within it. A common approach is to look for fields or methods that share a prefix or suffix; in the method list above, for example, the CLI_* and drawURLs_* methods form such groups. You can also look at the cohesion graph to spot unconnected or weakly connected components.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

While breaking up the class, it is a good idea to analyze how other classes use CrawlerController, and based on these observations, apply Extract Interface, too.
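
As a minimal sketch of Extract Class, the four methods `setDisabled()`, `getDisabled()`, `setProcessFilename()` and `getProcessFilename()` from the method list above all revolve around one file path, so they could move into a small class of their own. The class name `ProcessFlagFile` and its method names are hypothetical. `file_put_contents()` is used as a plain-PHP stand-in for the TYPO3 helper the controller calls.

```php
<?php
// Hypothetical extracted class; name and API are illustrative only.
final class ProcessFlagFile
{
    /** @var string */
    private $filename;

    public function __construct($filename)
    {
        $this->filename = $filename;
    }

    public function disable()
    {
        // Stand-in for GeneralUtility::writeFile() used by the controller.
        file_put_contents($this->filename, '');
    }

    public function enable()
    {
        if (is_file($this->filename)) {
            unlink($this->filename);
        }
    }

    public function isDisabled()
    {
        // The flag file's existence is the whole state.
        return is_file($this->filename);
    }
}

// Usage: a throwaway path in the system temp directory.
$flag = new ProcessFlagFile(sys_get_temp_dir() . '/tx_crawler_example_' . uniqid() . '.proc');
$flag->disable();
$wasDisabled = $flag->isDisabled();
$flag->enable();
$isDisabledAfterEnable = $flag->isDisabled();
```

The controller would then hold one `ProcessFlagFile` instance and delegate to it, which removes four methods and one field from the large class without changing behaviour.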

<?php
namespace AOE\Crawler\Controller;

/***************************************************************
 *  Copyright notice
 *
 *  (c) 2017 AOE GmbH <[email protected]>
 *
 *  All rights reserved
 *
 *  This script is part of the TYPO3 project. The TYPO3 project is
 *  free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  The GNU General Public License can be found at
 *  http://www.gnu.org/copyleft/gpl.html.
 *
 *  This script is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  This copyright notice MUST APPEAR in all copies of the script!
 ***************************************************************/

use AOE\Crawler\Command\CrawlerCommandLineController;
use AOE\Crawler\Command\FlushCommandLineController;
use AOE\Crawler\Command\QueueCommandLineController;
use AOE\Crawler\Domain\Model\Reason;
use AOE\Crawler\Domain\Repository\QueueRepository;
use AOE\Crawler\Event\EventDispatcher;
use AOE\Crawler\Utility\IconUtility;
use AOE\Crawler\Utility\SignalSlotUtility;
use TYPO3\CMS\Backend\Utility\BackendUtility;
use TYPO3\CMS\Backend\Tree\View\PageTreeView;
use TYPO3\CMS\Core\Authentication\BackendUserAuthentication;
use TYPO3\CMS\Core\Database\DatabaseConnection;
use TYPO3\CMS\Core\Log\LogLevel;
use TYPO3\CMS\Core\TimeTracker\NullTimeTracker;
use TYPO3\CMS\Core\Utility\DebugUtility;
use TYPO3\CMS\Core\Utility\ExtensionManagementUtility;
use TYPO3\CMS\Core\Utility\GeneralUtility;
use TYPO3\CMS\Core\Utility\MathUtility;
use TYPO3\CMS\Extbase\Object\ObjectManager;
use TYPO3\CMS\Frontend\Controller\TypoScriptFrontendController;
use TYPO3\CMS\Frontend\Page\PageGenerator;
use TYPO3\CMS\Frontend\Page\PageRepository;
use TYPO3\CMS\Frontend\Utility\EidUtility;

/**
 * Class CrawlerController
 *
 * @package AOE\Crawler\Controller
 */
class CrawlerController
{
    const CLI_STATUS_NOTHING_PROCCESSED = 0;
    const CLI_STATUS_REMAIN = 1; //queue not empty
    const CLI_STATUS_PROCESSED = 2; //(some) queue items were processed
    const CLI_STATUS_ABORTED = 4; //instance didn't finish
    const CLI_STATUS_POLLABLE_PROCESSED = 8;

    /**
     * @var integer
     */
    public $setID = 0;

    /**
     * @var string
     */
    public $processID = '';

    /**
     * One hour is the maximum stalled time for the CLI.
     * If the process has had the status "start" for 3600 seconds, it is regarded as stalled and a new process is started
     *
     * @var integer
     */
    public $max_CLI_exec_time = 3600;

    /**
     * @var array
     */
    public $duplicateTrack = [];

    /**
     * @var array
     */
    public $downloadUrls = [];

    /**
     * @var array
     */
    public $incomingProcInstructions = [];

    /**
     * @var array
     */
    public $incomingConfigurationSelection = [];

    /**
     * @var bool
     */
    public $registerQueueEntriesInternallyOnly = false;

    /**
     * @var array
     */
    public $queueEntries = [];

    /**
     * @var array
     */
    public $urlList = [];

    /**
     * @var boolean
     */
    public $debugMode = false;

    /**
     * @var array
     */
    public $extensionSettings = [];

    /**
     * Mount Point (false if unset, otherwise the MP string)
     *
     * @var bool|string
     */
    public $MP = false;

    /**
     * @var string
     */
    protected $processFilename;

    /**
     * Holds the internal access mode; can be 'gui', 'cli' or 'cli_im'
     *
     * @var string
     */
    protected $accessMode;

    /**
     * @var DatabaseConnection
     */
    private $db;

    /**
     * @var BackendUserAuthentication
     */
    private $backendUser;

    /**
     * @var integer
     */
    private $scheduledTime = 0;

    /**
     * @var integer
     */
    private $reqMinute = 0;

    /**
     * @var bool
     */
    private $submitCrawlUrls = false;

    /**
     * @var bool
     */
    private $downloadCrawlUrls = false;

    /**
     * @var QueueRepository
     */
    protected $queueRepository;

    /**
     * Method to get the accessMode; can be gui, cli or cli_im
     *
     * @return string
     */
    public function getAccessMode()
    {
        return $this->accessMode;
    }

    /**
     * @param string $accessMode
     */
    public function setAccessMode($accessMode)
    {
        $this->accessMode = $accessMode;
    }

    /**
     * Set disabled status to prevent processes from being processed
     *
     * @param  bool $disabled (optional, defaults to true)
     * @return void
     */
    public function setDisabled($disabled = true)
    {
        if ($disabled) {
            GeneralUtility::writeFile($this->processFilename, '');
        } else {
            if (is_file($this->processFilename)) {
                unlink($this->processFilename);
            }
        }
    }

    /**
     * Get disable status
     *
     * @return bool true if disabled
     */
    public function getDisabled()
    {
        if (is_file($this->processFilename)) {
            return true;
        } else {
            return false;
        }
    }

    /**
     * @param string $filenameWithPath
     *
     * @return void
     */
    public function setProcessFilename($filenameWithPath)
    {
        $this->processFilename = $filenameWithPath;
    }

    /**
     * @return string
     */
    public function getProcessFilename()
    {
        return $this->processFilename;
    }

    /************************************
     *
     * Getting URLs based on Page TSconfig
     *
     ************************************/

    public function __construct()
    {
        $objectManager = GeneralUtility::makeInstance(ObjectManager::class);
        $this->queueRepository = $objectManager->get(QueueRepository::class);

        $this->db = $GLOBALS['TYPO3_DB'];
        $this->backendUser = $GLOBALS['BE_USER'];
        $this->processFilename = PATH_site . 'typo3temp/tx_crawler.proc';

        $settings = unserialize($GLOBALS['TYPO3_CONF_VARS']['EXT']['extConf']['crawler']);
        $settings = is_array($settings) ? $settings : [];

        // read ext_em_conf_template settings and set
        $this->setExtensionSettings($settings);

        // set defaults:
        if (MathUtility::convertToPositiveInteger($this->extensionSettings['countInARun']) == 0) {
            $this->extensionSettings['countInARun'] = 100;
        }

        $this->extensionSettings['processLimit'] = MathUtility::forceIntegerInRange($this->extensionSettings['processLimit'], 1, 99, 1);
    }

    /**
     * Sets the extension settings (unserialized counterpart of $TYPO3_CONF_VARS['EXT']['extConf']['crawler']).
     *
     * @param array $extensionSettings
     * @return void
     */
    public function setExtensionSettings(array $extensionSettings)
    {
        $this->extensionSettings = $extensionSettings;
    }

    /**
     * Check if the given page should be crawled
     *
     * @param array $pageRow
     * @return false|string false if the page should be crawled (not excluded), the skip message if it should be skipped
     */
    public function checkIfPageShouldBeSkipped(array $pageRow)
    {
        $skipPage = false;
        $skipMessage = 'Skipped'; // message will be overwritten later

        // if page is hidden
        if (!$this->extensionSettings['crawlHiddenPages']) {
            if ($pageRow['hidden']) {
                $skipPage = true;
                $skipMessage = 'Because page is hidden';
            }
        }

        if (!$skipPage) {
            if (GeneralUtility::inList('3,4', $pageRow['doktype']) || $pageRow['doktype'] >= 199) {
                $skipPage = true;
                $skipMessage = 'Because doktype is not allowed';
            }
        }

        if (!$skipPage) {
            if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['excludeDoktype'])) {
                foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['excludeDoktype'] as $key => $doktypeList) {
                    if (GeneralUtility::inList($doktypeList, $pageRow['doktype'])) {
                        $skipPage = true;
                        $skipMessage = 'Doktype was excluded by "' . $key . '"';
                        break;
                    }
                }
            }
        }

        if (!$skipPage) {
            // veto hook
            if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pageVeto'])) {
                foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pageVeto'] as $key => $func) {
                    $params = [
                        'pageRow' => $pageRow
                    ];
                    // expects "false" if page is ok and "true" or a skipMessage if this page should _not_ be crawled
                    $veto = GeneralUtility::callUserFunction($func, $params, $this);
                    if ($veto !== false) {
                        $skipPage = true;
                        if (is_string($veto)) {
                            $skipMessage = $veto;
                        } else {
                            $skipMessage = 'Veto from hook "' . htmlspecialchars($key) . '"';
                        }
                        // no need to execute other hooks if a previous one returned a veto
                        break;
                    }
                }
            }
        }

        return $skipPage ? $skipMessage : false;
    }

    /**
     * Wrapper method for getUrlsForPageId()
     * It returns an array of configurations and no urls!
     *
     * @param array $pageRow Page record with at least dok-type and uid columns.
     * @param string $skipMessage
     * @return array
     * @see getUrlsForPageId()
     */
    public function getUrlsForPageRow(array $pageRow, &$skipMessage = '')
    {
        $message = $this->checkIfPageShouldBeSkipped($pageRow);

        if ($message === false) {
            $forceSsl = ($pageRow['url_scheme'] === 2) ? true : false;
            $res = $this->getUrlsForPageId($pageRow['uid'], $forceSsl);
            $skipMessage = '';
        } else {
            $skipMessage = $message;
            $res = [];
        }

        return $res;
    }

    /**
     * This method is used to count if there are ANY unprocessed queue entries
     * of a given page_id and the configuration which matches a given hash.
     * If there are none, we can skip an inner detail check
     *
     * @param  int $uid
     * @param  string $configurationHash
     * @return boolean
     */
    protected function noUnprocessedQueueEntriesForPageWithConfigurationHashExist($uid, $configurationHash)
    {
        $configurationHash = $this->db->fullQuoteStr($configurationHash, 'tx_crawler_queue');
        $res = $this->db->exec_SELECTquery('count(*) as anz', 'tx_crawler_queue', "page_id=" . intval($uid) . " AND configuration_hash=" . $configurationHash . " AND exec_time=0");
        $row = $this->db->sql_fetch_assoc($res);

        return ($row['anz'] == 0);
    }

    /**
     * Creates a list of URLs from input array (and submits them to queue if asked for)
     * See Web > Info module script + "indexed_search"'s crawler hook-client using this!
     *
     * @param    array      $vv                        Information about URLs from pageRow to crawl.
     * @param    array      $pageRow                   Page row
     * @param    integer    $scheduledTime             Unix time to schedule indexing to, typically time()
     * @param    integer    $reqMinute                 Number of requests per minute (creates the interleave between requests)
     * @param    boolean    $submitCrawlUrls           If set, submits the URLs to queue
     * @param    boolean    $downloadCrawlUrls         If set (and $submitCrawlUrls is false), fills $downloadUrls with entries
     * @param    array      $duplicateTrack            Passed by reference; holds one key per URL to ensure duplicates are not crawled
     * @param    array      $downloadUrls              Passed by reference; filled with URLs for download if the flag is set.
     * @param    array      $incomingProcInstructions  Array of processing instructions
     * @return    string        List of URLs (meant for display in backend module)
     *
     */
    public function urlListFromUrlArray(
        array $vv,
        array $pageRow,
        $scheduledTime,
        $reqMinute,
        $submitCrawlUrls,
        $downloadCrawlUrls,
        array &$duplicateTrack,
        array &$downloadUrls,
        array $incomingProcInstructions
    ) {
        $urlList = '';
        // realurl support (thanks to Ingo Renner)
        if (ExtensionManagementUtility::isLoaded('realurl') && $vv['subCfg']['realurl']) {

            /** @var tx_realurl $urlObj */
            $urlObj = GeneralUtility::makeInstance('tx_realurl');

            if (!empty($vv['subCfg']['baseUrl'])) {
                $urlParts = parse_url($vv['subCfg']['baseUrl']);
                $host = strtolower($urlParts['host']);
                $urlObj->host = $host;

                // First pass, finding configuration OR pointer string:
                $urlObj->extConf = isset($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->host]) ? $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->host] : $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl']['_DEFAULT'];

                // If it turned out to be a string pointer, then look up the real config:
                if (is_string($urlObj->extConf)) {
                    $urlObj->extConf = is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->extConf]) ? $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->extConf] : $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl']['_DEFAULT'];
                }
            }

            if (!$GLOBALS['TSFE']->sys_page) {
                $GLOBALS['TSFE']->sys_page = GeneralUtility::makeInstance('TYPO3\CMS\Frontend\Page\PageRepository');
            }
            if (!$GLOBALS['TSFE']->csConvObj) {
                $GLOBALS['TSFE']->csConvObj = GeneralUtility::makeInstance('TYPO3\CMS\Core\Charset\CharsetConverter');
            }
            if (!$GLOBALS['TSFE']->tmpl->rootLine[0]['uid']) {
                $GLOBALS['TSFE']->tmpl->rootLine[0]['uid'] = $urlObj->extConf['pagePath']['rootpage_id'];
            }
        }

        if (is_array($vv['URLs'])) {
            $configurationHash = $this->getConfigurationHash($vv);
            $skipInnerCheck = $this->noUnprocessedQueueEntriesForPageWithConfigurationHashExist($pageRow['uid'], $configurationHash);

            foreach ($vv['URLs'] as $urlQuery) {
                if ($this->drawURLs_PIfilter($vv['subCfg']['procInstrFilter'], $incomingProcInstructions)) {

                    // Calculate cHash:
                    if ($vv['subCfg']['cHash']) {
                        /* @var $cacheHash \TYPO3\CMS\Frontend\Page\CacheHashCalculator */
                        $cacheHash = GeneralUtility::makeInstance('TYPO3\CMS\Frontend\Page\CacheHashCalculator');
                        $urlQuery .= '&cHash=' . $cacheHash->generateForParameters($urlQuery);
                    }

                    // Create key by which to determine unique-ness:
                    $uKey = $urlQuery . '|' . $vv['subCfg']['userGroups'] . '|' . $vv['subCfg']['baseUrl'] . '|' . $vv['subCfg']['procInstrFilter'];

                    // realurl support (thanks to Ingo Renner)
                    $urlQuery = 'index.php' . $urlQuery;
                    if (ExtensionManagementUtility::isLoaded('realurl') && $vv['subCfg']['realurl']) {
                        $params = [
                            'LD' => [
                                'totalURL' => $urlQuery
                            ],
                            'TCEmainHook' => true
                        ];
                        $urlObj->encodeSpURL($params);
Bug introduced by
The variable $urlObj does not seem to be defined for all execution paths leading up to this point.

If you define a variable conditionally, it can happen that it is not defined for all execution paths.

Let’s take a look at an example:

function myFunction($a) {
    switch ($a) {
        case 'foo':
            $x = 1;
            break;

        case 'bar':
            $x = 2;
            break;
    }

    // $x is potentially undefined here.
    echo $x;
}

In the above example, the variable $x is defined if you pass “foo” or “bar” as argument for $a. However, since the switch statement has no default case statement, if you pass any other value, the variable $x would be undefined.

Available Fixes

  1. Check for existence of the variable explicitly:

    function myFunction($a) {
        switch ($a) {
            case 'foo':
                $x = 1;
                break;
    
            case 'bar':
                $x = 2;
                break;
        }
    
        if (isset($x)) { // Make sure it's always set.
            echo $x;
        }
    }
    
  2. Define a default value for the variable:

    function myFunction($a) {
        $x = ''; // Set a default which gets overridden for certain paths.
        switch ($a) {
            case 'foo':
                $x = 1;
                break;
    
            case 'bar':
                $x = 2;
                break;
        }
    
        echo $x;
    }
    
  3. Add a value for the missing path:

    function myFunction($a) {
        switch ($a) {
            case 'foo':
                $x = 1;
                break;
    
            case 'bar':
                $x = 2;
                break;
    
            // We add support for the missing case.
            default:
                $x = '';
                break;
        }
    
        echo $x;
    }
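
Applied to the flagged call, the second fix could look roughly like the sketch below. It is a standalone illustration, not the extension's code: `FakeUrlEncoder` and `buildUrl()` are hypothetical stand-ins, and the encoder's behaviour is invented for the example. The variable gets a default before the conditional assignment, and the later use is guarded on the variable itself rather than on a repeated copy of the original condition.

```php
<?php
// Hypothetical stand-in for the conditionally created URL encoder.
final class FakeUrlEncoder
{
    public function encode($url)
    {
        // Invented behaviour, just to make the effect visible.
        return str_replace('index.php?id=', 'page/', $url);
    }
}

function buildUrl($urlQuery, $useEncoder)
{
    // Default value, so the variable is defined on every execution path.
    $encoder = null;
    if ($useEncoder) {
        $encoder = new FakeUrlEncoder();
    }

    $url = 'index.php' . $urlQuery;

    // Guard on the object instead of re-evaluating the condition.
    if ($encoder !== null) {
        $url = $encoder->encode($url);
    }

    return $url;
}

$encoded = buildUrl('?id=5', true);
$plain = buildUrl('?id=5', false);
```

Guarding on the variable keeps the creation and the use in sync even if one of the two conditions is edited later.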
    
                        $urlQuery = $params['LD']['totalURL'];
                    }

                    // Scheduled time:
                    $schTime = $scheduledTime + round(count($duplicateTrack) * (60 / $reqMinute));
                    $schTime = floor($schTime / 60) * 60;

                    if (isset($duplicateTrack[$uKey])) {

                        // if the url key is already registered, just display it and do not resubmit it
                        $urlList = '<em><span class="typo3-dimmed">' . htmlspecialchars($urlQuery) . '</span></em><br/>';
                    } else {
                        $urlList = '[' . date('d.m.y H:i', $schTime) . '] ' . htmlspecialchars($urlQuery);
                        $this->urlList[] = '[' . date('d.m.y H:i', $schTime) . '] ' . $urlQuery;

                        $theUrl = ($vv['subCfg']['baseUrl'] ? $vv['subCfg']['baseUrl'] : GeneralUtility::getIndpEnv('TYPO3_SITE_URL')) . $urlQuery;

                        // Submit for crawling!
                        if ($submitCrawlUrls) {
                            $added = $this->addUrl(
                                $pageRow['uid'],
                                $theUrl,
                                $vv['subCfg'],
                                $scheduledTime,
                                $configurationHash,
                                $skipInnerCheck
                            );
                            if ($added === false) {
                                $urlList .= ' (Url already existed)';
                            }
                        } elseif ($downloadCrawlUrls) {
                            $downloadUrls[$theUrl] = $theUrl;
                        }

                        $urlList .= '<br />';
                    }
                    $duplicateTrack[$uKey] = true;
                }
            }
        } else {
            $urlList = 'ERROR - no URL generated';
        }

        return $urlList;
    }

    /**
     * Returns true if input processing instruction is among registered ones.
     *
     * @param string $piString PI to test
     * @param array $incomingProcInstructions Processing instructions
     * @return boolean
     */
    public function drawURLs_PIfilter($piString, array $incomingProcInstructions)
    {
        if (empty($incomingProcInstructions)) {
            return true;
        }

        foreach ($incomingProcInstructions as $pi) {
            if (GeneralUtility::inList($piString, $pi)) {
                return true;
            }
        }

        // No instruction matched; return an explicit boolean as documented.
        return false;
    }

    public function getPageTSconfigForId($id)
    {
        if (!$this->MP) {
            $pageTSconfig = BackendUtility::getPagesTSconfig($id);
        } else {
            list(, $mountPointId) = explode('-', $this->MP);
            $pageTSconfig = BackendUtility::getPagesTSconfig($mountPointId);
        }

        // Call a hook to alter configuration
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['getPageTSconfigForId'])) {
            $params = [
                'pageId' => $id,
                'pageTSConfig' => &$pageTSconfig
            ];
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['getPageTSconfigForId'] as $userFunc) {
                GeneralUtility::callUserFunction($userFunc, $params, $this);
            }
        }

        return $pageTSconfig;
    }

    /**
     * This method returns an array of configurations.
     * And no urls!
     *
     * @param integer $id Page ID
     * @param bool $forceSsl Use https
     * @return array
     */
    protected function getUrlsForPageId($id, $forceSsl = false)
    {
        /**
         * Get configuration from tsConfig
         */

        // Get page TSconfig for page ID:
        $pageTSconfig = $this->getPageTSconfigForId($id);

        $res = [];

        if (is_array($pageTSconfig) && is_array($pageTSconfig['tx_crawler.']['crawlerCfg.'])) {
            $crawlerCfg = $pageTSconfig['tx_crawler.']['crawlerCfg.'];

            if (is_array($crawlerCfg['paramSets.'])) {
                foreach ($crawlerCfg['paramSets.'] as $key => $values) {
                    if (is_array($values)) {
                        $key = str_replace('.', '', $key);
                        // Sub configuration for a single configuration string:
                        $subCfg = (array)$crawlerCfg['paramSets.'][$key . '.'];
                        $subCfg['key'] = $key;

                        if (strcmp($subCfg['procInstrFilter'], '')) {
                            $subCfg['procInstrFilter'] = implode(',', GeneralUtility::trimExplode(',', $subCfg['procInstrFilter']));
                        }
                        $pidOnlyList = implode(',', GeneralUtility::trimExplode(',', $subCfg['pidsOnly'], true));

                        // process configuration if it is not page-specific or if the specific page is the current page:
                        if (!strcmp($subCfg['pidsOnly'], '') || GeneralUtility::inList($pidOnlyList, $id)) {

                            // add trailing slash if not present
                            if (!empty($subCfg['baseUrl']) && substr($subCfg['baseUrl'], -1) != '/') {
                                $subCfg['baseUrl'] .= '/';
                            }

                            // Explode, process etc.:
                            $res[$key] = [];
                            $res[$key]['subCfg'] = $subCfg;
                            $res[$key]['paramParsed'] = $this->parseParams($values);
Documentation introduced by
$values is of type array, but the function expects a string.

It seems like the type of the argument is not accepted by the function/method which you are calling.

In some cases, in particular when PHP's automatic type juggling kicks in, this might be fine. In other cases, however, it might be a bug.

We suggest adding an explicit type cast, as in the following example:

function acceptsInteger($int) { }

$x = '123'; // string "123"

// Instead of
acceptsInteger($x);

// we recommend to use
acceptsInteger((integer) $x);
619 3
                            $res[$key]['paramExpanded'] = $this->expandParameters($res[$key]['paramParsed'], $id);
620 3
                            $res[$key]['origin'] = 'pagets';
621
622
                            // recognize MP value
623 3
                            if (!$this->MP) {
624 3
                                $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $id]);
625
                            } else {
626 3
                                $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $id . '&MP=' . $this->MP]);
627
                            }
628
                        }
629
                    }
630
                }
631
            }
632
        }

        /**
         * Get configuration from tx_crawler_configuration records
         */

        // get records along the rootline
        $rootLine = BackendUtility::BEgetRootLine($id);

        foreach ($rootLine as $page) {
            $configurationRecordsForCurrentPage = BackendUtility::getRecordsByField(
                'tx_crawler_configuration',
                'pid',
                intval($page['uid']),
                BackendUtility::BEenableFields('tx_crawler_configuration') . BackendUtility::deleteClause('tx_crawler_configuration')
            );

            if (is_array($configurationRecordsForCurrentPage)) {
                foreach ($configurationRecordsForCurrentPage as $configurationRecord) {

                    // check access to the configuration record
                    if (empty($configurationRecord['begroups']) || $GLOBALS['BE_USER']->isAdmin() || $this->hasGroupAccess($GLOBALS['BE_USER']->user['usergroup_cached_list'], $configurationRecord['begroups'])) {
                        $pidOnlyList = implode(',', GeneralUtility::trimExplode(',', $configurationRecord['pidsonly'], true));

                        // process configuration if it is not page-specific or if the specific page is the current page:
                        if (!strcmp($configurationRecord['pidsonly'], '') || GeneralUtility::inList($pidOnlyList, $id)) {
                            $key = $configurationRecord['name'];

                            // don't overwrite previously defined paramSets
                            if (!isset($res[$key])) {

                                /* @var $TSparserObject \TYPO3\CMS\Core\TypoScript\Parser\TypoScriptParser */
                                $TSparserObject = GeneralUtility::makeInstance('TYPO3\CMS\Core\TypoScript\Parser\TypoScriptParser');
                                $TSparserObject->parse($configurationRecord['processing_instruction_parameters_ts']);

                                $isCrawlingProtocolHttps = $this->isCrawlingProtocolHttps($configurationRecord['force_ssl'], $forceSsl);

                                $subCfg = [
                                    'procInstrFilter' => $configurationRecord['processing_instruction_filter'],
                                    'procInstrParams.' => $TSparserObject->setup,
                                    'baseUrl' => $this->getBaseUrlForConfigurationRecord(
                                        $configurationRecord['base_url'],
                                        $configurationRecord['sys_domain_base_url'],
                                        $isCrawlingProtocolHttps
                                    ),
                                    'realurl' => $configurationRecord['realurl'],
                                    'cHash' => $configurationRecord['chash'],
                                    'userGroups' => $configurationRecord['fegroups'],
                                    'exclude' => $configurationRecord['exclude'],
                                    'rootTemplatePid' => (int) $configurationRecord['root_template_pid'],
                                    'key' => $key
                                ];

                                // add trailing slash if not present
                                if (!empty($subCfg['baseUrl']) && substr($subCfg['baseUrl'], -1) != '/') {
                                    $subCfg['baseUrl'] .= '/';
                                }
                                if (!in_array($id, $this->expandExcludeString($subCfg['exclude']))) {
                                    $res[$key] = [];
                                    $res[$key]['subCfg'] = $subCfg;
                                    $res[$key]['paramParsed'] = $this->parseParams($configurationRecord['configuration']);
                                    $res[$key]['paramExpanded'] = $this->expandParameters($res[$key]['paramParsed'], $id);
                                    $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $id]);
                                    $res[$key]['origin'] = 'tx_crawler_configuration_' . $configurationRecord['uid'];
                                }
                            }
                        }
                    }
                }
            }
        }

        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['processUrls'])) {
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['processUrls'] as $func) {
                $params = [
                    'res' => &$res,
                ];
                GeneralUtility::callUserFunction($func, $params, $this);
            }
        }

        return $res;
    }

    /**
     * Checks if a domain record exists and returns the base URL based on that record. If not, the given baseUrl string is used.
     *
     * @param string $baseUrl
     * @param integer $sysDomainUid
     * @param bool $ssl
     * @return string
     */
    protected function getBaseUrlForConfigurationRecord($baseUrl, $sysDomainUid, $ssl = false)
    {
        $sysDomainUid = intval($sysDomainUid);
        $urlScheme = ($ssl === false) ? 'http' : 'https';

        if ($sysDomainUid > 0) {
            $res = $this->db->exec_SELECTquery(
                '*',
                'sys_domain',
                'uid = ' . $sysDomainUid .
                BackendUtility::BEenableFields('sys_domain') .
                BackendUtility::deleteClause('sys_domain')
            );
            $row = $this->db->sql_fetch_assoc($res);
            if ($row['domainName'] != '') {
                return $urlScheme . '://' . $row['domainName'];
            }
        }
        return $baseUrl;
    }

    /**
     * Collects the names of all crawler configurations available for a page branch:
     * the paramSets from the page TSconfig of the root page, plus the
     * tx_crawler_configuration records stored on the rootline or in the subtree.
     *
     * @param integer $rootid Page ID of the branch root
     * @param integer $depth Number of tree levels to descend
     * @return array
     */
    public function getConfigurationsForBranch($rootid, $depth)
    {
        $configurationsForBranch = [];

        $pageTSconfig = $this->getPageTSconfigForId($rootid);
        if (is_array($pageTSconfig) && is_array($pageTSconfig['tx_crawler.']['crawlerCfg.']) && is_array($pageTSconfig['tx_crawler.']['crawlerCfg.']['paramSets.'])) {
            $sets = $pageTSconfig['tx_crawler.']['crawlerCfg.']['paramSets.'];
            if (is_array($sets)) {
                foreach ($sets as $key => $value) {
                    if (!is_array($value)) {
                        continue;
                    }
                    $configurationsForBranch[] = substr($key, -1) == '.' ? substr($key, 0, -1) : $key;
                }
            }
        }
        $pids = [];
        $rootLine = BackendUtility::BEgetRootLine($rootid);
        foreach ($rootLine as $node) {
            $pids[] = $node['uid'];
        }
        /* @var PageTreeView $tree */
        $tree = GeneralUtility::makeInstance(PageTreeView::class);
        $perms_clause = $GLOBALS['BE_USER']->getPagePermsClause(1);
        $tree->init('AND ' . $perms_clause);
        $tree->getTree($rootid, $depth, '');
        foreach ($tree->tree as $node) {
            $pids[] = $node['row']['uid'];
        }

        $res = $this->db->exec_SELECTquery(
            '*',
            'tx_crawler_configuration',
            'pid IN (' . implode(',', $pids) . ') ' .
            BackendUtility::BEenableFields('tx_crawler_configuration') .
            BackendUtility::deleteClause('tx_crawler_configuration') . ' ' .
            BackendUtility::versioningPlaceholderClause('tx_crawler_configuration') . ' '
        );

        while ($row = $this->db->sql_fetch_assoc($res)) {
            $configurationsForBranch[] = $row['name'];
        }
        $this->db->sql_free_result($res);
        return $configurationsForBranch;
    }

    /**
     * Check if a user has access to an item
     * (e.g. get the group list of the currently logged-in user from $GLOBALS['TSFE']->gr_list)
     *
     * @see \TYPO3\CMS\Frontend\Page\PageRepository::getMultipleGroupsWhereClause()
     * @param  string $groupList    Comma-separated list of (fe_)group UIDs from a user
     * @param  string $accessList   Comma-separated list of (fe_)group UIDs of the item to access
     * @return bool                 TRUE if at least one of the user's group UIDs is in the access list or the access list is empty
     */
    public function hasGroupAccess($groupList, $accessList)
    {
        if (empty($accessList)) {
            return true;
        }
        foreach (GeneralUtility::intExplode(',', $groupList) as $groupUid) {
            if (GeneralUtility::inList($accessList, $groupUid)) {
                return true;
            }
        }
        return false;
    }
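
    /*
     * Usage sketch for hasGroupAccess() (illustrative only; assumes $crawler is
     * an instance of this class, and the group UIDs are made up):
     *
     *     $crawler->hasGroupAccess('1,3', '3,7'); // true: group 3 is in both lists
     *     $crawler->hasGroupAccess('1,3', '7,9'); // false: the lists do not overlap
     *     $crawler->hasGroupAccess('1,3', '');    // true: an empty access list grants access
     */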

    /**
     * Parse GET vars of input Query into array with key=>value pairs
     *
     * @param string $inputQuery Input query string
     * @return array
     */
    public function parseParams($inputQuery)
    {
        // Extract all GET parameters into an ARRAY:
        $paramKeyValues = [];
        $GETparams = explode('&', $inputQuery);

        foreach ($GETparams as $paramAndValue) {
            list($p, $v) = explode('=', $paramAndValue, 2);
            if (strlen($p)) {
                $paramKeyValues[rawurldecode($p)] = rawurldecode($v);
            }
        }

        return $paramKeyValues;
    }
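
    /*
     * Usage sketch for parseParams() (illustrative only; assumes $crawler is an
     * instance of this class, and the query string is a made-up example):
     *
     *     $crawler->parseParams('L=[0|1]&type=98');
     *     // yields ['L' => '[0|1]', 'type' => '98']
     *
     * Values are kept verbatim at this stage; bracket expressions such as
     * "[0|1]" are only resolved later by expandParameters().
     */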

    /**
     * Will expand the parameters configuration to individual values. This follows a certain syntax of the value of each parameter.
     * Syntax of values:
     * - Basically: If the value is wrapped in [...] it will be expanded according to the following syntax, otherwise the value is taken literally
     * - Configuration is split by "|" and the parts are processed individually and finally added together
     * - For each configuration part:
     *         - "[int]-[int]" = Integer range, will be expanded to all values in between, values included, starting from low to high (max. 1000). Example "1-34" or "-40--30"
     *         - "_TABLE:[TCA table name];[_PID:[optional page id, default is current page]];[_ENABLELANG:1]" = Lookup of table records from PID, filtering out deleted records. Example "_TABLE:tt_content; _PID:123"
     *           _ENABLELANG:1 picks only original records without their language overlays
     *         - Default: Literal value
     *
     * @param array $paramArray Array with key (GET var name) and values (value of GET var which is configuration for expansion)
     * @param integer $pid Current page ID
     * @return array
     */
    public function expandParameters($paramArray, $pid)
    {
        global $TCA;

        // Traverse parameter names:
        foreach ($paramArray as $p => $v) {
            $v = trim($v);

            // If value is encapsulated in square brackets it means there are some ranges of values to find, otherwise the value is literal
            if (substr($v, 0, 1) === '[' && substr($v, -1) === ']') {
                // So, find the value inside brackets and reset the paramArray value as an array.
                $v = substr($v, 1, -1);
                $paramArray[$p] = [];

                // Explode parts and traverse them:
                $parts = explode('|', $v);
                foreach ($parts as $pV) {

                    // Look for integer range (e.g. 1-34 or -40--30, which reads minus 40 to minus 30)
                    if (preg_match('/^(-?[0-9]+)\s*-\s*(-?[0-9]+)$/', trim($pV), $reg)) {

                        // Swap if first is larger than last:
                        if ($reg[1] > $reg[2]) {
                            $temp = $reg[2];
                            $reg[2] = $reg[1];
                            $reg[1] = $temp;
                        }

                        // Traverse range, add values:
                        $runAwayBrake = 1000; // Limit to size of range!
                        for ($a = $reg[1]; $a <= $reg[2]; $a++) {
                            $paramArray[$p][] = $a;
                            $runAwayBrake--;
                            if ($runAwayBrake <= 0) {
                                break;
                            }
                        }
                    } elseif (substr(trim($pV), 0, 7) == '_TABLE:') {

                        // Parse parameters:
                        $subparts = GeneralUtility::trimExplode(';', $pV);
                        $subpartParams = [];
                        foreach ($subparts as $spV) {
                            list($pKey, $pVal) = GeneralUtility::trimExplode(':', $spV);
                            $subpartParams[$pKey] = $pVal;
                        }

                        // Table exists:
                        if (isset($TCA[$subpartParams['_TABLE']])) {
                            $lookUpPid = isset($subpartParams['_PID']) ? intval($subpartParams['_PID']) : $pid;
                            $pidField = isset($subpartParams['_PIDFIELD']) ? trim($subpartParams['_PIDFIELD']) : 'pid';
                            $where = isset($subpartParams['_WHERE']) ? $subpartParams['_WHERE'] : '';
                            $addTable = isset($subpartParams['_ADDTABLE']) ? $subpartParams['_ADDTABLE'] : '';

                            $fieldName = $subpartParams['_FIELD'] ? $subpartParams['_FIELD'] : 'uid';
                            if ($fieldName === 'uid' || $TCA[$subpartParams['_TABLE']]['columns'][$fieldName]) {
                                $andWhereLanguage = '';
                                $transOrigPointerField = $TCA[$subpartParams['_TABLE']]['ctrl']['transOrigPointerField'];

                                if ($subpartParams['_ENABLELANG'] && $transOrigPointerField) {
                                    $andWhereLanguage = ' AND ' . $this->db->quoteStr($transOrigPointerField, $subpartParams['_TABLE']) . ' <= 0 ';
                                }

                                $where = $this->db->quoteStr($pidField, $subpartParams['_TABLE']) . '=' . intval($lookUpPid) . ' ' .
                                    $andWhereLanguage . $where;

                                $rows = $this->db->exec_SELECTgetRows(
                                    $fieldName,
                                    $subpartParams['_TABLE'] . $addTable,
                                    $where . BackendUtility::deleteClause($subpartParams['_TABLE']),
                                    '',
                                    '',
                                    '',
                                    $fieldName
                                );

                                if (is_array($rows)) {
                                    $paramArray[$p] = array_merge($paramArray[$p], array_keys($rows));
                                }
                            }
                        }
                    } else { // Just add value:
                        $paramArray[$p][] = $pV;
                    }
                    // Hook for processing custom expandParameters placeholders
                    if (is_array($GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['crawler/class.tx_crawler_lib.php']['expandParameters'])) {
                        $_params = [
                            'pObj' => &$this,
                            'paramArray' => &$paramArray,
                            'currentKey' => $p,
                            'currentValue' => $pV,
                            'pid' => $pid
                        ];
                        foreach ($GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['crawler/class.tx_crawler_lib.php']['expandParameters'] as $key => $_funcRef) {
                            GeneralUtility::callUserFunction($_funcRef, $_params, $this);
                        }
                    }
                }

                // Make unique set of values and sort array by key:
                $paramArray[$p] = array_unique($paramArray[$p]);
                ksort($paramArray);
            } else {
                // Set the literal value as only value in array:
                $paramArray[$p] = [$v];
            }
        }

        return $paramArray;
    }
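
    /*
     * Usage sketch for expandParameters() (illustrative only; assumes $crawler
     * is an instance of this class, that no expandParameters hook is
     * registered, and that page ID 5 is a made-up example):
     *
     *     $crawler->expandParameters(['L' => '[0|1]', 'type' => '98'], 5);
     *     // yields ['L' => ['0', '1'], 'type' => ['98']]
     *
     * "[0|1]" is split at "|" into two literal values, while the unbracketed
     * "98" is wrapped into a single-element array.
     */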

    /**
     * Compiling URLs from parameter array (output of expandParameters())
     * The number of URLs will be the product of the number of parameter values for each key
     *
     * @param array $paramArray Output of expandParameters(): Array with keys (GET var names) and for each an array of values
     * @param array $urls URLs accumulated in this array (for recursion)
     * @return array
     */
    public function compileUrls($paramArray, $urls = [])
    {
        if (count($paramArray) && is_array($urls)) {
            // shift first off stack:
            reset($paramArray);
            $varName = key($paramArray);
            $valueSet = array_shift($paramArray);

            // Traverse value set:
            $newUrls = [];
            foreach ($urls as $url) {
                foreach ($valueSet as $val) {
                    $newUrls[] = $url . (strcmp($val, '') ? '&' . rawurlencode($varName) . '=' . rawurlencode($val) : '');

                    if (count($newUrls) > MathUtility::forceIntegerInRange($this->extensionSettings['maxCompileUrls'], 1, 1000000000, 10000)) {
                        break;
                    }
                }
            }
            $urls = $newUrls;
            $urls = $this->compileUrls($paramArray, $urls);
        }

        return $urls;
    }
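
    /*
     * Usage sketch for compileUrls() (illustrative only; assumes $crawler is an
     * instance of this class, and the parameter values are made up):
     *
     *     $crawler->compileUrls(['L' => ['0', '1'], 'type' => ['98']], ['?id=5']);
     *     // yields ['?id=5&L=0&type=98', '?id=5&L=1&type=98']
     *
     * Two values for "L" times one value for "type" gives 2 x 1 = 2 URLs.
     */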

    /************************************
     *
     * Crawler log
     *
     ************************************/

    /**
     * Return array of records from crawler queue for input page ID
     *
     * @param integer $id Page ID for which to look up log entries.
     * @param string $filter Filter: "all" => all entries, "pending" => all that have not yet run, "finished" => all complete ones
     * @param boolean $doFlush If TRUE, the matching entries are DELETED(!) instead of returned!
     * @param boolean $doFullFlush If TRUE (together with $doFlush), entries of all pages are deleted, not only those of $id
     * @param integer $itemsPerPage Limit the number of entries per page, default is 10
     * @return array
     */
    public function getLogEntriesForPageId($id, $filter = '', $doFlush = false, $doFullFlush = false, $itemsPerPage = 10)
    {
        switch ($filter) {
            case 'pending':
                $addWhere = ' AND exec_time=0';
                break;
            case 'finished':
                $addWhere = ' AND exec_time>0';
                break;
            default:
                $addWhere = '';
                break;
        }

        // FIXME: Write unit test that ensures that the right records are deleted.
        if ($doFlush) {
            $this->flushQueue(($doFullFlush ? '1=1' : ('page_id=' . intval($id))) . $addWhere);
            return [];
        } else {
            return $this->db->exec_SELECTgetRows(
                '*',
                'tx_crawler_queue',
                'page_id=' . intval($id) . $addWhere,
                '',
                'scheduled DESC',
                (intval($itemsPerPage) > 0 ? intval($itemsPerPage) : '')
            );
        }
    }

    /**
     * Return array of records from crawler queue for input set ID
     *
     * @param integer $set_id Set ID for which to look up log entries.
     * @param string $filter Filter: "all" => all entries, "pending" => all that have not yet run, "finished" => all complete ones
     * @param boolean $doFlush If TRUE, the matching entries are DELETED(!) instead of returned!
     * @param boolean $doFullFlush If TRUE (together with $doFlush), the whole queue is deleted regardless of set ID and filter
     * @param integer $itemsPerPage Limit the number of entries per page, default is 10
     * @return array
     */
    public function getLogEntriesForSetId($set_id, $filter = '', $doFlush = false, $doFullFlush = false, $itemsPerPage = 10)
    {
        // FIXME: Write Unit tests for Filters
        switch ($filter) {
            case 'pending':
                $addWhere = ' AND exec_time=0';
                break;
            case 'finished':
                $addWhere = ' AND exec_time>0';
                break;
            default:
                $addWhere = '';
                break;
        }
        // FIXME: Write unit test that ensures that the right records are deleted.
        if ($doFlush) {
            $this->flushQueue($doFullFlush ? '' : ('set_id=' . intval($set_id) . $addWhere));
            return [];
        } else {
            return $this->db->exec_SELECTgetRows(
                '*',
                'tx_crawler_queue',
                'set_id=' . intval($set_id) . $addWhere,
                '',
                'scheduled DESC',
                (intval($itemsPerPage) > 0 ? intval($itemsPerPage) : '')
            );
        }
    }

    /**
     * Removes queue entries
     *
     * @param string $where SQL related filter for the entries which should be removed
     * @return void
     */
    protected function flushQueue($where = '')
    {
        $realWhere = strlen($where) > 0 ? $where : '1=1';

        if (EventDispatcher::getInstance()->hasObserver('queueEntryFlush')) {
            $groups = $this->db->exec_SELECTgetRows('DISTINCT set_id', 'tx_crawler_queue', $realWhere);
            if (is_array($groups)) {
                foreach ($groups as $group) {
                    EventDispatcher::getInstance()->post('queueEntryFlush', $group['set_id'], $this->db->exec_SELECTgetRows('uid, set_id', 'tx_crawler_queue', $realWhere . ' AND set_id="' . $group['set_id'] . '"'));
                }
            }
        }

        $this->db->exec_DELETEquery('tx_crawler_queue', $realWhere);
    }

    /**
     * Adding call back entries to log (typically called from hooks, see indexed search class "class.crawler.php")
     *
     * @param integer $setId Set ID
     * @param array $params Parameters to pass to call back function
     * @param string $callBack Call back object reference, e.g. 'EXT:indexed_search/class.crawler.php:&tx_indexedsearch_crawler'
     * @param integer $page_id Page ID to attach it to
     * @param integer $schedule Time at which to activate
     * @return void
     */
    public function addQueueEntry_callBack($setId, $params, $callBack, $page_id = 0, $schedule = 0)
    {
        if (!is_array($params)) {
            $params = [];
        }
        $params['_CALLBACKOBJ'] = $callBack;

        // Compile value array:
        $fieldArray = [
            'page_id' => intval($page_id),
            'parameters' => serialize($params),
            'scheduled' => intval($schedule) ? intval($schedule) : $this->getCurrentTime(),
            'exec_time' => 0,
            'set_id' => intval($setId),
            'result_data' => '',
        ];

        $this->db->exec_INSERTquery('tx_crawler_queue', $fieldArray);
    }

    /************************************
     *
     * URL setting
     *
     ************************************/

    /**
     * Setting a URL for crawling:
     *
     * @param integer $id Page ID
     * @param string $url Complete URL
     * @param array $subCfg Sub configuration array (from TS config)
     * @param integer $tstamp Scheduled-time
     * @param string $configurationHash (optional) configuration hash
     * @param bool $skipInnerDuplicationCheck (optional) skip inner duplication check
     * @return bool
     */
    public function addUrl(
        $id,
        $url,
        array $subCfg,
        $tstamp,
        $configurationHash = '',
        $skipInnerDuplicationCheck = false
    ) {
        $urlAdded = false;
        $rows = [];

        // Creating parameters:
        $parameters = [
            'url' => $url
        ];

        // fe user group simulation:
        $uGs = implode(',', array_unique(GeneralUtility::intExplode(',', $subCfg['userGroups'], true)));
        if ($uGs) {
            $parameters['feUserGroupList'] = $uGs;
        }

        // Setting processing instructions
        $parameters['procInstructions'] = GeneralUtility::trimExplode(',', $subCfg['procInstrFilter']);
        if (is_array($subCfg['procInstrParams.'])) {
            $parameters['procInstrParams'] = $subCfg['procInstrParams.'];
        }

        // Possible TypoScript Template Parents
        $parameters['rootTemplatePid'] = $subCfg['rootTemplatePid'];

        // Compile value array:
        $parameters_serialized = serialize($parameters);
        $fieldArray = [
            'page_id' => intval($id),
            'parameters' => $parameters_serialized,
            'parameters_hash' => GeneralUtility::shortMD5($parameters_serialized),
            'configuration_hash' => $configurationHash,
            'scheduled' => $tstamp,
            'exec_time' => 0,
            'set_id' => intval($this->setID),
            'result_data' => '',
            'configuration' => $subCfg['key'],
        ];

        if ($this->registerQueueEntriesInternallyOnly) {
            // the entries will only be registered and not stored to the database
            $this->queueEntries[] = $fieldArray;
        } else {
            if (!$skipInnerDuplicationCheck) {
                // check if there is already an equal entry
                $rows = $this->getDuplicateRowsIfExist($tstamp, $fieldArray);
            }

            if (count($rows) == 0) {
                $this->db->exec_INSERTquery('tx_crawler_queue', $fieldArray);
                $uid = $this->db->sql_insert_id();
                $rows[] = $uid;
                $urlAdded = true;
                EventDispatcher::getInstance()->post('urlAddedToQueue', $this->setID, ['uid' => $uid, 'fieldArray' => $fieldArray]);
            } else {
                EventDispatcher::getInstance()->post('duplicateUrlInQueue', $this->setID, ['rows' => $rows, 'fieldArray' => $fieldArray]);
            }
        }

        return $urlAdded;
    }

    /**
     * This method determines duplicates for a queue entry with the same parameters and this timestamp.
     * If the timestamp is in the past, it will check if there is any unprocessed queue entry in the past.
     * If the timestamp is in the future, it will check if the queued entry has exactly the same timestamp.
     *
     * @param int $tstamp
     * @param array $fieldArray
     *
     * @return array
     */
    protected function getDuplicateRowsIfExist($tstamp, $fieldArray)
    {
        $rows = [];

        $currentTime = $this->getCurrentTime();

        // if this entry is scheduled with "now"
        if ($tstamp <= $currentTime) {
            if ($this->extensionSettings['enableTimeslot']) {
                $timeBegin = $currentTime - 100;
                $timeEnd = $currentTime + 100;
                $where = ' ((scheduled BETWEEN ' . $timeBegin . ' AND ' . $timeEnd . ' ) OR scheduled <= ' . $currentTime . ') ';
            } else {
                $where = 'scheduled <= ' . $currentTime;
            }
        } elseif ($tstamp > $currentTime) {
            // entries with a timestamp in the future need to have the same schedule time
            $where = 'scheduled = ' . $tstamp;
        }

        if (!empty($where)) {
            $result = $this->db->exec_SELECTgetRows(
                'qid',
                'tx_crawler_queue',
                $where .
                ' AND NOT exec_time' .
                ' AND NOT process_id ' .
                ' AND page_id=' . intval($fieldArray['page_id']) .
                ' AND parameters_hash = ' . $this->db->fullQuoteStr($fieldArray['parameters_hash'], 'tx_crawler_queue')
            );

            if (is_array($result)) {
                foreach ($result as $value) {
                    $rows[] = $value['qid'];
                }
            }
        }

        return $rows;
    }

    /**
     * Returns the current system time
     *
     * @return int
     */
    public function getCurrentTime()
    {
        return time();
    }

    /************************************
     *
     * URL reading
     *
     ************************************/

    /**
     * Read URL for single queue entry
     *
     * @param integer $queueId
     * @param boolean $force If set, will process even if exec_time has been set!
     * @return integer|null Status bitmask, or NULL if no matching queue entry was found
     */
    public function readUrl($queueId, $force = false)
    {
        $ret = 0;
        if ($this->debugMode) {
            GeneralUtility::devlog('crawler-readurl start ' . microtime(true), __FUNCTION__);
        }
        // Get entry:
        list($queueRec) = $this->db->exec_SELECTgetRows(
            '*',
            'tx_crawler_queue',
            'qid=' . intval($queueId) . ($force ? '' : ' AND exec_time=0 AND process_scheduled > 0')
        );

        if (!is_array($queueRec)) {
            return;
        }

        $parameters = unserialize($queueRec['parameters']);
        if ($parameters['rootTemplatePid']) {
            $this->initTSFE((int)$parameters['rootTemplatePid']);
        } else {
            GeneralUtility::sysLog(
                'Page with (' . $queueRec['page_id'] . ') could not be crawled, please check your crawler configuration. Perhaps no Root Template Pid is set',
                'crawler',
                GeneralUtility::SYSLOG_SEVERITY_WARNING
            );
        }

        SignalSlotUtility::emitSignal(
            __CLASS__,
            SignalSlotUtility::SIGNNAL_QUEUEITEM_PREPROCESS,
            [$queueId, &$queueRec]
        );

        // Set exec_time to lock record:
        $field_array = ['exec_time' => $this->getCurrentTime()];

        if (isset($this->processID)) {
            // if multiprocessing is used we need to store the id of the process which has handled this entry
            $field_array['process_id_completed'] = $this->processID;
        }
        $this->db->exec_UPDATEquery('tx_crawler_queue', 'qid=' . intval($queueId), $field_array);

        $result = $this->readUrl_exec($queueRec);
        $resultData = unserialize($result['content']);

        // at the moment there's no need to point to specific pollable extensions
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pollSuccess'])) {
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pollSuccess'] as $pollable) {
                // only check the success value if the instruction is running
                // it is important to name the pollSuccess key same as the procInstructions key
                if (is_array($resultData['parameters']['procInstructions']) && in_array(
                    $pollable,
                    $resultData['parameters']['procInstructions']
                )
                ) {
                    if (!empty($resultData['success'][$pollable]) && $resultData['success'][$pollable]) {
                        $ret |= self::CLI_STATUS_POLLABLE_PROCESSED;
                    }
                }
            }
        }

        // Set result in log which also denotes the end of the processing of this entry.
        $field_array = ['result_data' => serialize($result)];

        SignalSlotUtility::emitSignal(
            __CLASS__,
            SignalSlotUtility::SIGNNAL_QUEUEITEM_POSTPROCESS,
            [$queueId, &$field_array]
        );

        $this->db->exec_UPDATEquery('tx_crawler_queue', 'qid=' . intval($queueId), $field_array);

        if ($this->debugMode) {
            GeneralUtility::devlog('crawler-readurl stop ' . microtime(true), __FUNCTION__);
        }

        return $ret;
    }
1371
1372
    /**
1373
     * Read URL for not-yet-inserted log-entry
1374
     *
1375
     * @param array $field_array Queue field array
1376
     *
1377
     * @return array|string Result array, or the string 'ERROR' if the parameters could not be decoded
1378
     */
1379
    public function readUrlFromArray($field_array)
1380
    {
1381
1382
        // Set exec_time to lock record:
1383
        $field_array['exec_time'] = $this->getCurrentTime();
1384
        $this->db->exec_INSERTquery('tx_crawler_queue', $field_array);
1385
        $queueId = $field_array['qid'] = $this->db->sql_insert_id();
1386
1387
        $result = $this->readUrl_exec($field_array);
1388
1389
        // Set result in log which also denotes the end of the processing of this entry.
1390
        $field_array = ['result_data' => serialize($result)];
1391
1392
        SignalSlotUtility::emitSignal(
1393
            __CLASS__,
1394
            SignalSlotUtility::SIGNNAL_QUEUEITEM_POSTPROCESS,
1395
            [$queueId, &$field_array]
1396
        );
1397
1398
        $this->db->exec_UPDATEquery('tx_crawler_queue', 'qid=' . intval($queueId), $field_array);
1399
1400
        return $result;
1401
    }
1402
1403
    /**
1404
     * Read URL for a queue record
1405
     *
1406
     * @param array $queueRec Queue record
1407
     * @return array|string Result array, or the string 'ERROR' if the parameters could not be decoded
1408
     */
1409
    public function readUrl_exec($queueRec)
1410
    {
1411
        // Decode parameters:
1412
        $parameters = unserialize($queueRec['parameters']);
1413
        $result = 'ERROR';
1414
        if (is_array($parameters)) {
1415
            if ($parameters['_CALLBACKOBJ']) { // Calling object:
1416
                $objRef = $parameters['_CALLBACKOBJ'];
1417
                $callBackObj = GeneralUtility::getUserObj($objRef);
1418
                if (is_object($callBackObj)) {
1419
                    unset($parameters['_CALLBACKOBJ']);
1420
                    $result = ['content' => serialize($callBackObj->crawler_execute($parameters, $this))];
1421
                } else {
1422
                    $result = ['content' => 'No object: ' . $objRef];
1423
                }
1424
            } else { // Regular FE request:
1425
1426
                // Prepare:
1427
                $crawlerId = $queueRec['qid'] . ':' . md5($queueRec['qid'] . '|' . $queueRec['set_id'] . '|' . $GLOBALS['TYPO3_CONF_VARS']['SYS']['encryptionKey']);
1428
1429
                // Get result:
1430
                $result = $this->requestUrl($parameters['url'], $crawlerId);
1431
1432
                EventDispatcher::getInstance()->post('urlCrawled', $queueRec['set_id'], ['url' => $parameters['url'], 'result' => $result]);
1433
            }
1434
        }
1435
1436
        return $result;
1437
    }
1438
1439
    /**
1440
     * Gets the content of a URL.
1441
     *
1442
     * @param string $originalUrl URL to read
1443
     * @param string $crawlerId Crawler ID string (qid + hash to verify)
1444
     * @param integer $timeout Timeout time
1445
     * @param integer $recursion Recursion limiter for 302 redirects
1446
     * @return array|bool Result array, or false if the URL could not be requested
1447
     */
1448 2
    public function requestUrl($originalUrl, $crawlerId, $timeout = 2, $recursion = 10)
1449
    {
1450 2
        if (!$recursion) {
1451
            return false;
1452
        }
1453
1454
        // Parse URL, checking for scheme:
1455 2
        $url = parse_url($originalUrl);
1456
1457 2
        if ($url === false) {
1458
            if (TYPO3_DLOG) {
1459
                GeneralUtility::devLog(sprintf('Could not parse_url() for string "%s"', $originalUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1460
            }
1461
            return false;
1462
        }
1463
1464 2
        if (!in_array(isset($url['scheme']) ? $url['scheme'] : '', ['', 'http', 'https'], true)) {
1465
            if (TYPO3_DLOG) {
1466
                GeneralUtility::devLog(sprintf('Scheme does not match for url "%s"', $originalUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1467
            }
1468
            return false;
1469
        }
1470
1471
        // direct request
1472 2
        if ($this->extensionSettings['makeDirectRequests']) {
1473 2
            $result = $this->sendDirectRequest($originalUrl, $crawlerId);
1474 2
            return $result;
1475
        }
1476
1477
        $reqHeaders = $this->buildRequestHeaderArray($url, $crawlerId);
1478
1479
        // thanks to Pierrick Caillon for adding proxy support
1480
        $rurl = $url;
1481
1482
        if ($GLOBALS['TYPO3_CONF_VARS']['SYS']['curlUse'] && $GLOBALS['TYPO3_CONF_VARS']['SYS']['curlProxyServer']) {
1483
            $rurl = parse_url($GLOBALS['TYPO3_CONF_VARS']['SYS']['curlProxyServer']);
1484
            $url['path'] = $url['scheme'] . '://' . $url['host'] . ($url['port'] > 0 ? ':' . $url['port'] : '') . $url['path'];
1485
            $reqHeaders = $this->buildRequestHeaderArray($url, $crawlerId);
1486
        }
1487
1488
        $host = $rurl['host'];
1489
1490
        if ($url['scheme'] == 'https') {
1491
            $host = 'ssl://' . $host;
1492
            $port = ($rurl['port'] > 0) ? $rurl['port'] : 443;
1493
        } else {
1494
            $port = ($rurl['port'] > 0) ? $rurl['port'] : 80;
1495
        }
1496
1497
        $startTime = microtime(true);
1498
        $fp = fsockopen($host, $port, $errno, $errstr, $timeout);
1499
1500
        if (!$fp) {
1501
            if (TYPO3_DLOG) {
1502
                GeneralUtility::devLog(sprintf('Error while opening "%s"', $originalUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1503
            }
1504
            return false;
1505
        } else {
1506
            // Request message:
1507
            $msg = implode("\r\n", $reqHeaders) . "\r\n\r\n";
1508
            fputs($fp, $msg);
1509
1510
            // Read response:
1511
            $d = $this->getHttpResponseFromStream($fp);
1512
            fclose($fp);
1513
1514
            $time = microtime(true) - $startTime;
1515
            $this->log($originalUrl . ' ' . $time);
1516
1517
            // Implode content and headers:
1518
            $result = [
1519
                'request' => $msg,
1520
                'headers' => implode('', $d['headers']),
1521
                'content' => implode('', (array)$d['content'])
1522
            ];
1523
1524
            if (($this->extensionSettings['follow30x']) && ($newUrl = $this->getRequestUrlFrom302Header($d['headers'], $url['user'], $url['pass']))) {
1525
Bug: It seems like $newUrl defined by $this->getRequestUrlFrom...['user'], $url['pass']) on line 1524 can also be of type boolean; however, AOE\Crawler\Controller\C...ontroller::requestUrl() only seems to accept string. Maybe add an additional type check?

If a method or function can return multiple different values, and unless you are sure that you can only receive a single value in this context, we recommend adding an additional type check:

/**
 * @return array|string
 */
function returnsDifferentValues($x) {
    if ($x) {
        return 'foo';
    }

    return array();
}

$x = returnsDifferentValues($y);
if (is_array($x)) {
    // $x is an array.
}
1526
                $newRequestUrl = $this->requestUrl($newUrl, $crawlerId, $timeout, --$recursion);
1527
1528
                if (is_array($newRequestUrl)) {
1529
                    $result = array_merge(['parentRequest' => $result], $newRequestUrl);
1530
                } else {
1531
                    if (TYPO3_DLOG) {
1532
                        GeneralUtility::devLog(sprintf('Error while opening "%s"', $newUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1533
                    }
1534
                    return false;
1535
                }
1536
            }
1537
1538
            return $result;
1539
        }
1540
    }
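The scheme check and default-port selection performed by requestUrl() can be tried in isolation. The sketch below is a standalone approximation, not the controller's code: the function name `resolveSocketTarget` is hypothetical, and rejecting URLs without a host is an added assumption that the original method does not make.

```php
<?php
/**
 * Hypothetical helper: returns the host and port that would be passed to
 * fsockopen(), or false if the URL is unusable.
 */
function resolveSocketTarget($originalUrl)
{
    $url = parse_url($originalUrl);
    if ($url === false) {
        return false;
    }

    // Only scheme-less, http and https URLs are accepted.
    $scheme = isset($url['scheme']) ? strtolower($url['scheme']) : '';
    if (!in_array($scheme, ['', 'http', 'https'], true)) {
        return false;
    }

    // Assumption: a URL without a host cannot be requested over a socket.
    if (empty($url['host'])) {
        return false;
    }

    $isHttps = ($scheme === 'https');

    return [
        // https connections use PHP's ssl:// socket transport.
        'host' => ($isHttps ? 'ssl://' : '') . $url['host'],
        'port' => isset($url['port']) ? (int)$url['port'] : ($isHttps ? 443 : 80),
    ];
}

var_dump(resolveSocketTarget('https://example.org/page'));
var_dump(resolveSocketTarget('ftp://example.org/file'));
```

Doing the validation before opening any socket keeps failures cheap, and a strict `in_array()` avoids loose comparisons against the empty string.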
1541
1542
    /**
1543
     * Gets the base path of the website frontend.
1544
     * (e.g. if you call http://mydomain.com/cms/index.php in
1545
     * the browser the base path is "/cms/")
1546
     *
1547
     * @return string Base path of the website frontend
1548
     */
1549
    protected function getFrontendBasePath()
1550
    {
1551
        $frontendBasePath = '/';
1552
1553
        // Get the path from the extension settings:
1554
        if (isset($this->extensionSettings['frontendBasePath']) && $this->extensionSettings['frontendBasePath']) {
1555
            $frontendBasePath = $this->extensionSettings['frontendBasePath'];
1556
            // If empty, try to use config.absRefPrefix:
1557
        } elseif (isset($GLOBALS['TSFE']->absRefPrefix) && !empty($GLOBALS['TSFE']->absRefPrefix)) {
1558
            $frontendBasePath = $GLOBALS['TSFE']->absRefPrefix;
1559
            // If not in CLI mode the base path can be determined from $_SERVER environment:
1560
        } elseif (!defined('TYPO3_REQUESTTYPE_CLI') || !TYPO3_REQUESTTYPE_CLI) {
1561
            $frontendBasePath = GeneralUtility::getIndpEnv('TYPO3_SITE_PATH');
1562
        }
1563
1564
        // Base path must be '/<pathSegments>/':
1565
        if ($frontendBasePath != '/') {
1566
            $frontendBasePath = '/' . ltrim($frontendBasePath, '/');
1567
            $frontendBasePath = rtrim($frontendBasePath, '/') . '/';
1568
        }
1569
1570
        return $frontendBasePath;
1571
    }
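The normalisation at the end of getFrontendBasePath() guarantees exactly one leading and one trailing slash. A minimal standalone sketch of that rule follows; the function name `normalizeBasePath` is hypothetical.

```php
<?php
/**
 * Hypothetical helper: forces a path into the form "/segment/.../" and maps
 * empty or slash-only input to "/".
 */
function normalizeBasePath($path)
{
    $trimmed = trim($path, '/');

    return $trimmed === '' ? '/' : '/' . $trimmed . '/';
}

echo normalizeBasePath('cms'), PHP_EOL;        // /cms/
echo normalizeBasePath('/cms/sub'), PHP_EOL;   // /cms/sub/
echo normalizeBasePath(''), PHP_EOL;           // /
```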
1572
1573
    /**
1574
     * Executes a shell command and returns the outputted result.
1575
     *
1576
     * @param string $command Shell command to be executed
1577
     * @return string Outputted result of the command execution
1578
     */
1579
    protected function executeShellCommand($command)
1580
    {
1581
        $result = shell_exec($command);
1582
        return $result;
1583
    }
1584
1585
    /**
1586
     * Reads HTTP response from the given stream.
1587
     *
1588
     * @param  resource $streamPointer  Pointer to connection stream.
1589
     * @return array                    Associative array with the following items:
1590
     *                                  headers <array> Response headers sent by server.
1591
     *                                  content <array> Content, with each line as an array item.
1592
     */
1593 1
    protected function getHttpResponseFromStream($streamPointer)
1594
    {
1595 1
        $response = ['headers' => [], 'content' => []];
1596
1597 1
        if (is_resource($streamPointer)) {
1598
            // read headers
1599 1
            while (($line = fgets($streamPointer, 2048)) !== false) {
1600 1
                $line = trim($line);
1601 1
                if ($line !== '') {
1602 1
                    $response['headers'][] = $line;
1603
                } else {
1604 1
                    break;
1605
                }
1606
            }
1607
1608
            // read content
1609 1
            while (($line = fgets($streamPointer, 2048)) !== false) {
1610 1
                $response['content'][] = $line;
1611
            }
1612
        }
1613
1614 1
        return $response;
1615
    }
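The header/body split in getHttpResponseFromStream() can be exercised without a network connection by feeding it an in-memory stream. The sketch below is a standalone re-implementation for illustration; the function name `readHttpResponse` and the sample response are hypothetical. It compares the `fgets()` result against `false`, so a final body line consisting only of "0" is not mistaken for the end of the stream.

```php
<?php
/**
 * Hypothetical helper: reads header lines up to the first empty line, then
 * collects the remaining lines as content.
 */
function readHttpResponse($stream)
{
    $response = ['headers' => [], 'content' => []];
    if (!is_resource($stream)) {
        return $response;
    }

    // Headers end at the first blank line.
    while (($line = fgets($stream, 2048)) !== false) {
        $line = trim($line);
        if ($line === '') {
            break;
        }
        $response['headers'][] = $line;
    }

    // Everything after the blank line is body content, kept line by line.
    while (($line = fgets($stream, 2048)) !== false) {
        $response['content'][] = $line;
    }

    return $response;
}

$stream = fopen('php://memory', 'r+');
fwrite($stream, "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<html>\n0");
rewind($stream);
$response = readHttpResponse($stream);
fclose($stream);

print_r($response);
```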
1616
1617
    /**
1618
     * @param string $message Message to append to the log file
1619
     */
1620 2
    protected function log($message)
1621
    {
1622 2
        if (!empty($this->extensionSettings['logFileName'])) {
1623
            $fileResult = @file_put_contents($this->extensionSettings['logFileName'], date('Ymd His') . ' ' . $message . PHP_EOL, FILE_APPEND);
1624
            if (!$fileResult) {
1625
                GeneralUtility::devLog('File "' . $this->extensionSettings['logFileName'] . '" could not be written, please check file permissions.', 'crawler', LogLevel::INFO);
1626
            }
1627
        }
1628 2
    }
1629
1630
    /**
1631
     * Builds HTTP request headers.
1632
     *
1633
     * @param array $url
1634
     * @param string $crawlerId
1635
     *
1636
     * @return array
1637
     */
1638 6
    protected function buildRequestHeaderArray(array $url, $crawlerId)
1639
    {
1640 6
        $reqHeaders = [];
1641 6
        $reqHeaders[] = 'GET ' . $url['path'] . ($url['query'] ? '?' . $url['query'] : '') . ' HTTP/1.0';
1642 6
        $reqHeaders[] = 'Host: ' . $url['host'];
1643 6
        if (stristr($url['query'], 'ADMCMD_previewWS')) {
1644 2
            $reqHeaders[] = 'Cookie: $Version="1"; be_typo_user="1"; $Path=/';
1645
        }
1646 6
        $reqHeaders[] = 'Connection: close';
1647 6
        if ($url['user'] != '') {
1648 2
            $reqHeaders[] = 'Authorization: Basic ' . base64_encode($url['user'] . ':' . $url['pass']);
1649
        }
1650 6
        $reqHeaders[] = 'X-T3crawler: ' . $crawlerId;
1651 6
        $reqHeaders[] = 'User-Agent: TYPO3 crawler';
1652 6
        return $reqHeaders;
1653
    }
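The header array returned by buildRequestHeaderArray() is later joined with CRLF and terminated by an empty line to form the raw request. The sketch below combines both steps in one standalone function. The name `buildRequestMessage` is hypothetical, the preview-workspace cookie is left out, and defaulting a missing path to "/" is an assumption that the original code does not make.

```php
<?php
/**
 * Hypothetical helper: builds a raw HTTP/1.0 GET request from parse_url() output.
 */
function buildRequestMessage(array $url, $crawlerId)
{
    // Assumption: fall back to "/" when parse_url() yields no path.
    $path = isset($url['path']) ? $url['path'] : '/';
    $query = isset($url['query']) ? $url['query'] : '';

    $headers = [
        'GET ' . $path . ($query !== '' ? '?' . $query : '') . ' HTTP/1.0',
        'Host: ' . $url['host'],
        'Connection: close',
    ];

    // Credentials embedded in the URL become a Basic auth header.
    if (isset($url['user']) && $url['user'] !== '') {
        $pass = isset($url['pass']) ? $url['pass'] : '';
        $headers[] = 'Authorization: Basic ' . base64_encode($url['user'] . ':' . $pass);
    }

    $headers[] = 'X-T3crawler: ' . $crawlerId;
    $headers[] = 'User-Agent: TYPO3 crawler';

    // Header lines are separated by CRLF; an empty line ends the header block.
    return implode("\r\n", $headers) . "\r\n\r\n";
}

$message = buildRequestMessage(parse_url('http://u:p@example.org/a?b=1'), '42:abc');
echo $message;
```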
1654
1655
    /**
1656
     * Checks whether the given HTTP headers contain a redirect location and builds the new crawler URL.
1657
     *
1658
     * @param array $headers HTTP Header
1659
     * @param string $user HTTP Auth. User
1660
     * @param string $pass HTTP Auth. Password
1661
     * @return bool|string
1662
     */
1663 12
    protected function getRequestUrlFrom302Header($headers, $user = '', $pass = '')
1664
    {
1665 12
        $header = [];
1666 12
        if (!is_array($headers)) {
1667 1
            return false;
1668
        }
1669 11
        if (!(stristr($headers[0], '301 Moved') || stristr($headers[0], '302 Found') || stristr($headers[0], '302 Moved'))) {
1670 2
            return false;
1671
        }
1672
1673 9
        foreach ($headers as $hl) {
1674 9
            $tmp = explode(': ', $hl, 2);
1675 9
            $header[trim($tmp[0])] = isset($tmp[1]) ? trim($tmp[1]) : '';
1676 9
            if (trim($tmp[0]) == 'Location') {
1677 9
                break;
1678
            }
1679
        }
1680 9
        if (!array_key_exists('Location', $header)) {
1681 3
            return false;
1682
        }
1683
1684 6
        if ($user != '') {
1685 3
            if (!($tmp = parse_url($header['Location']))) {
1686 1
                return false;
1687
            }
1688 2
            $newUrl = $tmp['scheme'] . '://' . $user . ':' . $pass . '@' . $tmp['host'] . $tmp['path'];
1689 2
            if ($tmp['query'] != '') {
1690 2
                $newUrl .= '?' . $tmp['query'];
1691
            }
1692
        } else {
1693 3
            $newUrl = $header['Location'];
1694
        }
1695 5
        return $newUrl;
1696
    }
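Extracting the redirect target from a list of response header lines can be sketched as below. This is a variation for illustration, not the method above: the function name `findRedirectLocation` is hypothetical, it matches the numeric status code (301 or 302) rather than the reason phrase, and it compares the header name case-insensitively.

```php
<?php
/**
 * Hypothetical helper: returns the Location value of a 301/302 response,
 * or false if the response is not such a redirect or has no Location header.
 */
function findRedirectLocation(array $headers)
{
    if (!isset($headers[0]) || !preg_match('~^HTTP/\d(?:\.\d)?\s+30[12]\b~i', $headers[0])) {
        return false;
    }

    foreach ($headers as $line) {
        // Split on the first colon only, so URLs containing ":" stay intact.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && strcasecmp(trim($parts[0]), 'Location') === 0) {
            return trim($parts[1]);
        }
    }

    return false;
}

var_dump(findRedirectLocation([
    'HTTP/1.1 302 Found',
    'Server: example',
    'Location: https://example.org/new?a=1',
]));
```

Limiting `explode()` to two parts matters here because the Location value itself normally contains a colon after the scheme.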
1697
1698
    /**************************
1699
     *
1700
     * tslib_fe hooks:
1701
     *
1702
     **************************/
1703
1704
    /**
1705
     * Initialization hook (called after database connection)
1706
     * Takes the "HTTP_X_T3CRAWLER" header and looks up queue record and verifies if the session comes from the system (by comparing hashes)
1707
     *
1708
     * @param array $params Parameters from frontend
1709
     * @param object $ref TSFE object (reference under PHP5)
1710
     * @return void
1711
     *
1712
     * FIXME: Looks like this is not used; in commit 9910d3f40cce15f4e9b7bcd0488bf21f31d53ebc it was added as public.
1713
     * FIXME: I think this can be removed. (TNM)
1714
     */
1715
    public function fe_init(&$params, $ref)
Unused Code: The parameter $ref is not used and could be removed.

This check looks for parameters that have been defined for a function or method, but which are not used in the method body.
1716
    {
1717
        // Authenticate crawler request:
1718
        if (isset($_SERVER['HTTP_X_T3CRAWLER'])) {
1719
            list($queueId, $hash) = explode(':', $_SERVER['HTTP_X_T3CRAWLER']);
1720
            $queueRec = $this->db->exec_SELECTgetSingleRow('*', 'tx_crawler_queue', 'qid=' . intval($queueId));
1721
1722
            // If a crawler record was found and hash was matching, set it up:
1723
            if (is_array($queueRec) && $hash === md5($queueRec['qid'] . '|' . $queueRec['set_id'] . '|' . $GLOBALS['TYPO3_CONF_VARS']['SYS']['encryptionKey'])) {
1724
                $params['pObj']->applicationData['tx_crawler']['running'] = true;
1725
                $params['pObj']->applicationData['tx_crawler']['parameters'] = unserialize($queueRec['parameters']);
1726
                $params['pObj']->applicationData['tx_crawler']['log'] = [];
1727
            } else {
1728
                die('No crawler entry found!');
Coding Style Compatibility: The method fe_init() contains an exit expression.

An exit expression should only be used in rare cases, for example in a short command-line script.

In most cases, however, using an exit expression makes the code untestable and often causes incompatibilities with other libraries. Unless you are absolutely sure it is required here, we recommend refactoring your code to avoid it.
1729
            }
1730
        }
1731
    }
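The X-T3crawler header checked by fe_init() carries the value that readUrl_exec() builds: the queue ID, a colon, and an MD5 over `qid|set_id|encryptionKey`. A minimal standalone sketch of that round trip follows. The function names `buildCrawlerId` and `verifyCrawlerId` and the sample key are hypothetical, and `hash_equals()` is swapped in for the `===` comparison as a timing-safe alternative.

```php
<?php
/**
 * Hypothetical helper: builds "<qid>:<md5(qid|set_id|key)>".
 */
function buildCrawlerId(array $queueRec, $encryptionKey)
{
    $hash = md5($queueRec['qid'] . '|' . $queueRec['set_id'] . '|' . $encryptionKey);

    return $queueRec['qid'] . ':' . $hash;
}

/**
 * Hypothetical helper: recomputes the hash for the stored queue record and
 * compares it with the hash taken from the header value.
 */
function verifyCrawlerId($crawlerId, array $queueRec, $encryptionKey)
{
    $parts = explode(':', $crawlerId, 2);
    if (count($parts) !== 2 || (int)$parts[0] !== (int)$queueRec['qid']) {
        return false;
    }

    $expected = md5($queueRec['qid'] . '|' . $queueRec['set_id'] . '|' . $encryptionKey);

    return hash_equals($expected, $parts[1]);
}

$queueRec = ['qid' => 42, 'set_id' => 7];
// Assumption: stands in for $GLOBALS['TYPO3_CONF_VARS']['SYS']['encryptionKey'].
$key = 'example-encryption-key';

$crawlerId = buildCrawlerId($queueRec, $key);
var_dump(verifyCrawlerId($crawlerId, $queueRec, $key));     // bool(true)
var_dump(verifyCrawlerId('42:tampered', $queueRec, $key));  // bool(false)
```

Because the hash includes the server-side encryption key, a client that merely guesses a queue ID cannot produce a valid header value.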
1732
1733
    /*****************************
1734
     *
1735
     * Compiling URLs to crawl - tools
1736
     *
1737
     *****************************/
1738
1739
    /**
1740
     * @param integer $id Root page id to start from.
1741
     * @param integer $depth Depth of tree, 0 = only id-page, 1 = one sublevel, 99 = infinite
1742
     * @param integer $scheduledTime Unix Time when the URL is timed to be visited when put in queue
1743
     * @param integer $reqMinute Number of requests per minute (creates the interleave between requests)
1744
     * @param boolean $submitCrawlUrls If set, submits the URLs to queue in database (real crawling)
1745
     * @param boolean $downloadCrawlUrls If set (and $submitCrawlUrls is false), fills $downloadUrls with entries
1746
     * @param array $incomingProcInstructions Array of processing instructions
1747
     * @param array $configurationSelection Array of configuration keys
1748
     * @return string
1749
     */
1750
    public function getPageTreeAndUrls(
1751
        $id,
1752
        $depth,
1753
        $scheduledTime,
1754
        $reqMinute,
1755
        $submitCrawlUrls,
1756
        $downloadCrawlUrls,
1757
        array $incomingProcInstructions,
1758
        array $configurationSelection
1759
    ) {
1760
        global $BACK_PATH;
1761
        global $LANG;
1762
        if (!is_object($LANG)) {
1763
            $LANG = GeneralUtility::makeInstance('language');
1764
            $LANG->init(0);
1765
        }
1766
        $this->scheduledTime = $scheduledTime;
1767
        $this->reqMinute = $reqMinute;
1768
        $this->submitCrawlUrls = $submitCrawlUrls;
1769
        $this->downloadCrawlUrls = $downloadCrawlUrls;
1770
        $this->incomingProcInstructions = $incomingProcInstructions;
1771
        $this->incomingConfigurationSelection = $configurationSelection;
1772
1773
        $this->duplicateTrack = [];
1774
        $this->downloadUrls = [];
1775
1776
        // Drawing tree:
1777
        /* @var PageTreeView $tree */
1778
        $tree = GeneralUtility::makeInstance(PageTreeView::class);
1779
        $perms_clause = $GLOBALS['BE_USER']->getPagePermsClause(1);
1780
        $tree->init('AND ' . $perms_clause);
1781
1782
        $pageInfo = BackendUtility::readPageAccess($id, $perms_clause);
1783
        if (is_array($pageInfo)) {
1784
            // Set root row:
1785
            $tree->tree[] = [
1786
                'row' => $pageInfo,
1787
                'HTML' => IconUtility::getIconForRecord('pages', $pageInfo)
1788
            ];
1789
        }
1790
1791
        // Get branch beneath:
1792
        if ($depth) {
1793
            $tree->getTree($id, $depth, '');
1794
        }
1795
1796
        // Traverse page tree:
1797
        $code = '';
1798
1799
        foreach ($tree->tree as $data) {
1800
            $this->MP = false;
1801
1802
            // recognize mount points
1803
            if ($data['row']['doktype'] == 7) {
1804
                $mountpage = $this->db->exec_SELECTgetRows('*', 'pages', 'uid = ' . (int)$data['row']['uid']);
1805
1806
                // fetch mounted pages
1807
                $this->MP = $mountpage[0]['mount_pid'] . '-' . $data['row']['uid'];
Documentation Bug: The property $MP was declared of type boolean, but $mountpage[0]['mount_pid...' . $data['row']['uid'] is of type string. Maybe add a type cast?

This check looks for assignments to scalar types that may be of the wrong type.

To ensure the code behaves as expected, it may be a good idea to add an explicit type cast.

$answer = 42;

$correct = false;

$correct = (bool) $answer;
1808
1809
                $mountTree = GeneralUtility::makeInstance(PageTreeView::class);
1810
                $mountTree->init('AND ' . $perms_clause);
1811
                $mountTree->getTree($mountpage[0]['mount_pid'], $depth, '');
1812
1813
                foreach ($mountTree->tree as $mountData) {
1814
                    $code .= $this->drawURLs_addRowsForPage(
1815
                        $mountData['row'],
1816
                        $mountData['HTML'] . BackendUtility::getRecordTitle('pages', $mountData['row'], true)
1817
                    );
1818
                }
1819
1820
                // replace page when mount_pid_ol is enabled
1821
                if ($mountpage[0]['mount_pid_ol']) {
1822
                    $data['row']['uid'] = $mountpage[0]['mount_pid'];
1823
                } else {
1824
                    // if the mount_pid_ol is not set the MP must not be used for the mountpoint page
1825
                    $this->MP = false;
1826
                }
1827
            }
1828
1829
            $code .= $this->drawURLs_addRowsForPage(
1830
                $data['row'],
1831
                $data['HTML'] . BackendUtility::getRecordTitle('pages', $data['row'], true)
1832
            );
1833
        }
1834
1835
        return $code;
1836
    }
1837
1838
    /**
1839
     * Expands exclude string
1840
     *
1841
     * @param string $excludeString Exclude string
1842
     * @return array
1843
     */
1844 1
    public function expandExcludeString($excludeString)
1845
    {
1846
        // Internal static caches:
1847 1
        static $expandedExcludeStringCache;
1848 1
        static $treeCache;
1849
1850 1
        if (empty($expandedExcludeStringCache[$excludeString])) {
1851 1
            $pidList = [];
1852
1853 1
            if (!empty($excludeString)) {
1854
                /** @var PageTreeView $tree */
1855
                $tree = GeneralUtility::makeInstance(PageTreeView::class);
1856
                $tree->init('AND ' . $this->backendUser->getPagePermsClause(1));
1857
1858
                $excludeParts = GeneralUtility::trimExplode(',', $excludeString);
1859
1860
                foreach ($excludeParts as $excludePart) {
1861
                    list($pid, $depth) = GeneralUtility::trimExplode('+', $excludePart);
1862
1863
                    // default is "page only" = "depth=0"
1864
                    if (empty($depth)) {
1865
                        $depth = (stristr($excludePart, '+')) ? 99 : 0;
1866
                    }
1867
1868
                    $pidList[] = $pid;
1869
1870
                    if ($depth > 0) {
1871
                        if (empty($treeCache[$pid][$depth])) {
1872
                            $tree->reset();
1873
                            $tree->getTree($pid, $depth);
1874
                            $treeCache[$pid][$depth] = $tree->tree;
1875
                        }
1876
1877
                        foreach ($treeCache[$pid][$depth] as $data) {
1878
                            $pidList[] = $data['row']['uid'];
1879
                        }
1880
                    }
1881
                }
1882
            }
1883
1884 1
            $expandedExcludeStringCache[$excludeString] = array_unique($pidList);
1885
        }
1886
1887 1
        return $expandedExcludeStringCache[$excludeString];
1888
    }
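The exclude string handled by expandExcludeString() is a comma-separated list of entries of the form `pid`, `pid+depth` or `pid+`. The sketch below shows only the parsing step, without the page tree lookup. The function name `parseExcludeString` is hypothetical; the depth defaults (0 without "+", 99 for an empty depth after "+") follow the code above.

```php
<?php
/**
 * Hypothetical helper: turns "5, 12+3, 20+" into a list of pid/depth pairs.
 */
function parseExcludeString($excludeString)
{
    $result = [];
    $parts = array_filter(
        array_map('trim', explode(',', $excludeString)),
        function ($part) {
            return $part !== '';
        }
    );

    foreach ($parts as $part) {
        $pieces = array_map('trim', explode('+', $part, 2));

        if (!isset($pieces[1])) {
            // No "+": exclude the page only.
            $depth = 0;
        } elseif (empty($pieces[1])) {
            // "+" without a usable depth: treat as "infinite".
            $depth = 99;
        } else {
            $depth = (int)$pieces[1];
        }

        $result[] = ['pid' => (int)$pieces[0], 'depth' => $depth];
    }

    return $result;
}

print_r(parseExcludeString('5, 12+3, 20+'));
```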
1889
1890
    /**
1891
     * Create the rows for display of the page tree
1892
     * For each page a number of rows are shown displaying GET variable configuration
1893
     *
1894
     * @param array $pageRow Page row
1895
     * @param string $pageTitleAndIcon Page icon and title for row
1896
     * @return string HTML <tr> content (one or more)
1897
     */
1898
    public function drawURLs_addRowsForPage(array $pageRow, $pageTitleAndIcon)
1899
    {
1900
        $skipMessage = '';
1901
1902
        // Get list of configurations
1903
        $configurations = $this->getUrlsForPageRow($pageRow, $skipMessage);
1904
1905
        if (count($this->incomingConfigurationSelection) > 0) {
1906
            // remove configuration that does not match the current selection
1907
            foreach ($configurations as $confKey => $confArray) {
1908
                if (!in_array($confKey, $this->incomingConfigurationSelection)) {
1909
                    unset($configurations[$confKey]);
1910
                }
1911
            }
1912
        }
1913
1914
        // Traverse parameter combinations:
1915
        $c = 0;
1916
        $content = '';
1917
        if (count($configurations)) {
1918
            foreach ($configurations as $confKey => $confArray) {
1919
1920
                    // Title column:
1921
                if (!$c) {
1922
                    $titleClm = '<td rowspan="' . count($configurations) . '">' . $pageTitleAndIcon . '</td>';
1923
                } else {
1924
                    $titleClm = '';
1925
                }
1926
1927
                if (!in_array($pageRow['uid'], $this->expandExcludeString($confArray['subCfg']['exclude']))) {
1928
1929
                        // URL list:
1930
                    $urlList = $this->urlListFromUrlArray(
1931
                        $confArray,
1932
                        $pageRow,
1933
                        $this->scheduledTime,
1934
                        $this->reqMinute,
1935
                        $this->submitCrawlUrls,
1936
                        $this->downloadCrawlUrls,
1937
                        $this->duplicateTrack,
1938
                        $this->downloadUrls,
1939
                        $this->incomingProcInstructions // if empty the urls won't be filtered by processing instructions
1940
                    );
1941
1942
                    // Expanded parameters:
1943
                    $paramExpanded = '';
1944
                    $calcAccu = [];
1945
                    $calcRes = 1;
1946
                    foreach ($confArray['paramExpanded'] as $gVar => $gVal) {
1947
                        $paramExpanded .= '
1948
                            <tr>
1949
                                <td class="bgColor4-20">' . htmlspecialchars('&' . $gVar . '=') . '<br/>' .
1950
                                                '(' . count($gVal) . ')' .
1951
                                                '</td>
1952
                                <td class="bgColor4" nowrap="nowrap">' . nl2br(htmlspecialchars(implode(chr(10), $gVal))) . '</td>
1953
                            </tr>
1954
                        ';
1955
                        $calcRes *= count($gVal);
1956
                        $calcAccu[] = count($gVal);
1957
                    }
1958
                    $paramExpanded = '<table class="lrPadding c-list param-expanded">' . $paramExpanded . '</table>';
1959
                    $paramExpanded .= 'Comb: ' . implode('*', $calcAccu) . '=' . $calcRes;
1960
1961
                    // Options
1962
                    $optionValues = '';
1963
                    if ($confArray['subCfg']['userGroups']) {
1964
                        $optionValues .= 'User Groups: ' . $confArray['subCfg']['userGroups'] . '<br/>';
1965
                    }
1966
                    if ($confArray['subCfg']['baseUrl']) {
1967
                        $optionValues .= 'Base Url: ' . $confArray['subCfg']['baseUrl'] . '<br/>';
1968
                    }
1969
                    if ($confArray['subCfg']['procInstrFilter']) {
1970
                        $optionValues .= 'ProcInstr: ' . $confArray['subCfg']['procInstrFilter'] . '<br/>';
1971
                    }
1972
1973
                    // Compile row:
1974
                    $content .= '
1975
                        <tr class="bgColor' . ($c % 2 ? '-20' : '-10') . '">
1976
                            ' . $titleClm . '
1977
                            <td>' . htmlspecialchars($confKey) . '</td>
1978
                            <td>' . nl2br(htmlspecialchars(rawurldecode(trim(str_replace('&', chr(10) . '&', GeneralUtility::implodeArrayForUrl('', $confArray['paramParsed'])))))) . '</td>
1979
                            <td>' . $paramExpanded . '</td>
1980
                            <td nowrap="nowrap">' . $urlList . '</td>
1981
                            <td nowrap="nowrap">' . $optionValues . '</td>
1982
                            <td nowrap="nowrap">' . DebugUtility::viewArray($confArray['subCfg']['procInstrParams.']) . '</td>
1983
                        </tr>';
1984
                } else {
1985
                    $content .= '<tr class="bgColor' . ($c % 2 ? '-20' : '-10') . '">
1986
                            ' . $titleClm . '
1987
                            <td>' . htmlspecialchars($confKey) . '</td>
1988
                            <td colspan="5"><em>No entries</em> (Page is excluded in this configuration)</td>
1989
                        </tr>';
1990
                }
1991
1992
                $c++;
1993
            }
1994
        } else {
1995
            $message = !empty($skipMessage) ? ' (' . $skipMessage . ')' : '';
1996
1997
            // Compile row:
1998
            $content .= '
1999
                <tr class="bgColor-20" style="border-bottom: 1px solid black;">
2000
                    <td>' . $pageTitleAndIcon . '</td>
2001
                    <td colspan="6"><em>No entries</em>' . $message . '</td>
2002
                </tr>';
2003
        }
2004
2005
        return $content;
2006
    }
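The "Comb:" summary rendered by drawURLs_addRowsForPage() is the product of the number of values per expanded GET parameter. A minimal sketch of that calculation follows; the function name `countCombinations` and the sample parameters are hypothetical.

```php
<?php
/**
 * Hypothetical helper: returns the per-parameter value counts and their product.
 */
function countCombinations(array $paramExpanded)
{
    $factors = array_values(array_map('count', $paramExpanded));

    return [
        'factors' => $factors,
        // array_product() of an empty list is 1, i.e. a single URL without parameters.
        'total' => (int)array_product($factors),
    ];
}

$summary = countCombinations(['L' => [0, 1, 2], 'type' => [0, 98]]);
echo 'Comb: ' . implode('*', $summary['factors']) . '=' . $summary['total'], PHP_EOL; // Comb: 3*2=6
```

The product grows quickly with every additional parameter, which is why the backend view shows it next to the expanded values.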
2007
    /*****************************
     *
     * CLI functions
     *
     *****************************/

    /**
     * Main function for running from Command Line PHP script (cron job)
     * See ext/crawler/cli/crawler_cli.phpsh for details
     *
     * @return int Bitmask of CLI_STATUS_* constants describing the outcome of the run
     */
    public function CLI_main()
    {
        $this->setAccessMode('cli');
        $result = self::CLI_STATUS_NOTHING_PROCCESSED;
        $cliObj = GeneralUtility::makeInstance(CrawlerCommandLineController::class);

        if (isset($cliObj->cli_args['-h']) || isset($cliObj->cli_args['--help'])) {
            $cliObj->cli_validateArgs();
            $cliObj->cli_help();
            // Scrutinizer (Coding Style, Compatibility): CLI_main() contains an exit expression,
            // which makes the method hard to test; consider refactoring.
            exit;
        }

        if (!$this->getDisabled() && $this->CLI_checkAndAcquireNewProcess($this->CLI_buildProcessId())) {
            $countInARun = $cliObj->cli_argValue('--countInARun') ? intval($cliObj->cli_argValue('--countInARun')) : $this->extensionSettings['countInARun'];
            // Seconds
            $sleepAfterFinish = $cliObj->cli_argValue('--sleepAfterFinish') ? intval($cliObj->cli_argValue('--sleepAfterFinish')) : $this->extensionSettings['sleepAfterFinish'];
            // Microseconds (the value is passed to usleep() in CLI_run())
            $sleepTime = $cliObj->cli_argValue('--sleepTime') ? intval($cliObj->cli_argValue('--sleepTime')) : $this->extensionSettings['sleepTime'];

            try {
                // Run process:
                $result = $this->CLI_run($countInARun, $sleepTime, $sleepAfterFinish);
            } catch (\Exception $e) {
                $this->CLI_debug(get_class($e) . ': ' . $e->getMessage());
                $result = self::CLI_STATUS_ABORTED;
            }

            // Cleanup
            $this->db->exec_DELETEquery('tx_crawler_process', 'assigned_items_count = 0');

            //TODO can't we do that in a clean way?
            $this->CLI_releaseProcesses($this->CLI_buildProcessId());

            $this->CLI_debug("Unprocessed Items remaining:" . $this->queueRepository->countUnprocessedItems() . " (" . $this->CLI_buildProcessId() . ")");
            $result |= ($this->queueRepository->countUnprocessedItems() > 0 ? self::CLI_STATUS_REMAIN : self::CLI_STATUS_NOTHING_PROCCESSED);
        } else {
            $result |= self::CLI_STATUS_ABORTED;
        }

        return $result;
    }

    /**
     * Function executed by crawler_im.php cli script.
     *
     * @return void
     */
    public function CLI_main_im()
    {
        $this->setAccessMode('cli_im');

        $cliObj = GeneralUtility::makeInstance(QueueCommandLineController::class);

        // Force user to admin state and set workspace to "Live":
        $this->backendUser->user['admin'] = 1;
        $this->backendUser->setWorkspace(0);

        // Print help
        if (!isset($cliObj->cli_args['_DEFAULT'][1])) {
            $cliObj->cli_validateArgs();
            $cliObj->cli_help();
            // Scrutinizer (Coding Style, Compatibility): CLI_main_im() contains an exit expression,
            // which makes the method hard to test; consider refactoring.
            exit;
        }

        $cliObj->cli_validateArgs();

        if ($cliObj->cli_argValue('-o') === 'exec') {
            $this->registerQueueEntriesInternallyOnly = true;
        }

        if (isset($cliObj->cli_args['_DEFAULT'][2])) {
            // Crawler is called over TYPO3 BE
            $pageId = MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][2], 0);
        } else {
            // Crawler is called over cli
            $pageId = MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][1], 0);
        }

        $configurationKeys = $this->getConfigurationKeys($cliObj);

        if (!is_array($configurationKeys)) {
            $configurations = $this->getUrlsForPageId($pageId);
            if (is_array($configurations)) {
                $configurationKeys = array_keys($configurations);
            } else {
                $configurationKeys = [];
            }
        }

        if ($cliObj->cli_argValue('-o') === 'queue' || $cliObj->cli_argValue('-o') === 'exec') {
            $reason = new Reason();
            $reason->setReason(Reason::REASON_GUI_SUBMIT);
            $reason->setDetailText('The cli script of the crawler added to the queue');
            EventDispatcher::getInstance()->post(
                'invokeQueueChange',
                $this->setID,
                ['reason' => $reason]
            );
        }

        if ($this->extensionSettings['cleanUpOldQueueEntries']) {
            $this->cleanUpOldQueueEntries();
        }

        $this->setID = (int) GeneralUtility::md5int(microtime());
        $this->getPageTreeAndUrls(
            $pageId,
            MathUtility::forceIntegerInRange($cliObj->cli_argValue('-d'), 0, 99),
            $this->getCurrentTime(),
            MathUtility::forceIntegerInRange($cliObj->cli_isArg('-n') ? $cliObj->cli_argValue('-n') : 30, 1, 1000),
            $cliObj->cli_argValue('-o') === 'queue' || $cliObj->cli_argValue('-o') === 'exec',
            $cliObj->cli_argValue('-o') === 'url',
            GeneralUtility::trimExplode(',', $cliObj->cli_argValue('-proc'), true),
            $configurationKeys
        );

        if ($cliObj->cli_argValue('-o') === 'url') {
            $cliObj->cli_echo(implode(chr(10), $this->downloadUrls) . chr(10), true);
        } elseif ($cliObj->cli_argValue('-o') === 'exec') {
            $cliObj->cli_echo("Executing " . count($this->urlList) . " requests right away:\n\n");
            $cliObj->cli_echo(implode(chr(10), $this->urlList) . chr(10));
            $cliObj->cli_echo("\nProcessing:\n");

            foreach ($this->queueEntries as $queueRec) {
                $p = unserialize($queueRec['parameters']);
                $cliObj->cli_echo($p['url'] . ' (' . implode(',', $p['procInstructions']) . ') => ');

                $result = $this->readUrlFromArray($queueRec);

                $requestResult = unserialize($result['content']);
                if (is_array($requestResult)) {
                    $resLog = is_array($requestResult['log']) ? chr(10) . chr(9) . chr(9) . implode(chr(10) . chr(9) . chr(9), $requestResult['log']) : '';
                    $cliObj->cli_echo('OK: ' . $resLog . chr(10));
                } else {
                    $cliObj->cli_echo('Error checking Crawler Result: ' . substr(preg_replace('/\s+/', ' ', strip_tags($result['content'])), 0, 30000) . '...' . chr(10));
                }
            }
        } elseif ($cliObj->cli_argValue('-o') === 'queue') {
            $cliObj->cli_echo("Putting " . count($this->urlList) . " entries in queue:\n\n");
            $cliObj->cli_echo(implode(chr(10), $this->urlList) . chr(10));
        } else {
            $cliObj->cli_echo(count($this->urlList) . " entries found for processing. (Use -o to decide action):\n\n", true);
            $cliObj->cli_echo(implode(chr(10), $this->urlList) . chr(10), true);
        }
    }

    /**
     * Function executed by the cli script that flushes the queue.
     *
     * @return bool
     */
    public function CLI_main_flush()
    {
        $this->setAccessMode('cli_flush');
        $cliObj = GeneralUtility::makeInstance(FlushCommandLineController::class);

        // Force user to admin state and set workspace to "Live":
        $this->backendUser->user['admin'] = 1;
        $this->backendUser->setWorkspace(0);

        // Print help
        if (!isset($cliObj->cli_args['_DEFAULT'][1])) {
            $cliObj->cli_validateArgs();
            $cliObj->cli_help();
            // Scrutinizer (Coding Style, Compatibility): CLI_main_flush() contains an exit expression,
            // which makes the method hard to test; consider refactoring.
            exit;
        }

        $cliObj->cli_validateArgs();
        $pageId = MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][1], 0);
        $fullFlush = ($pageId == 0);

        $mode = $cliObj->cli_argValue('-o');

        switch ($mode) {
            case 'all':
                $result = $this->getLogEntriesForPageId($pageId, '', true, $fullFlush);
                break;
            case 'finished':
            case 'pending':
                $result = $this->getLogEntriesForPageId($pageId, $mode, true, $fullFlush);
                break;
            default:
                $cliObj->cli_validateArgs();
                $cliObj->cli_help();
                $result = false;
        }

        return $result !== false;
    }

    /**
     * Obtains configuration keys from the CLI arguments
     *
     * @param  QueueCommandLineController $cliObj    Command line object
     * @return array                        Array of keys, empty if no keys were given
     */
    protected function getConfigurationKeys(QueueCommandLineController &$cliObj)
    {
        $parameter = trim($cliObj->cli_argValue('-conf'));
        return ($parameter != '' ? GeneralUtility::trimExplode(',', $parameter) : []);
    }

    /**
     * Running the functionality of the CLI (crawling URLs from queue)
     *
     * @param int $countInARun      Maximum number of queue entries to process in this run
     * @param int $sleepTime        Microseconds to sleep after each request (passed to usleep())
     * @param int $sleepAfterFinish Seconds to sleep after the run (passed to sleep())
     * @return int Bitmask of CLI_STATUS_* constants
     */
    public function CLI_run($countInARun, $sleepTime, $sleepAfterFinish)
    {
        $result = 0;
        $counter = 0;

        // First, run hooks:
        $this->CLI_runHooks();

        // Clean up the queue
        if (intval($this->extensionSettings['purgeQueueDays']) > 0) {
            $purgeDate = $this->getCurrentTime() - 24 * 60 * 60 * intval($this->extensionSettings['purgeQueueDays']);
            $del = $this->db->exec_DELETEquery(
                'tx_crawler_queue',
                'exec_time!=0 AND exec_time<' . $purgeDate
            );
            if (false == $del) {
                GeneralUtility::devLog('Records could not be deleted.', 'crawler', LogLevel::INFO);
            }
        }

        // Select entries:
        //TODO Shouldn't this reside within the transaction?
        $rows = $this->db->exec_SELECTgetRows(
            'qid,scheduled',
            'tx_crawler_queue',
            'exec_time=0
                AND process_scheduled= 0
                AND scheduled<=' . $this->getCurrentTime(),
            '',
            'scheduled, qid',
            intval($countInARun)
        );

        // exec_SELECTgetRows() may return null on error, so check the type before iterating
        if (is_array($rows) && count($rows) > 0) {
            $quidList = [];

            foreach ($rows as $r) {
                $quidList[] = $r['qid'];
            }

            $processId = $this->CLI_buildProcessId();

            //reserve queue entries for process
            $this->db->sql_query('BEGIN');
            //TODO make sure we're not taking assigned queue entries
            $this->db->exec_UPDATEquery(
                'tx_crawler_queue',
                'qid IN (' . implode(',', $quidList) . ')',
                [
                    'process_scheduled' => intval($this->getCurrentTime()),
                    'process_id' => $processId
                ]
            );

            //save the number of assigned queue entries to determine how many have been processed later
            $numberOfAffectedRows = $this->db->sql_affected_rows();
            $this->db->exec_UPDATEquery(
                'tx_crawler_process',
                "process_id = '" . $processId . "'",
                [
                    'assigned_items_count' => intval($numberOfAffectedRows)
                ]
            );

            if ($numberOfAffectedRows == count($quidList)) {
                $this->db->sql_query('COMMIT');
            } else {
                $this->db->sql_query('ROLLBACK');
                $this->CLI_debug("Nothing processed due to multi-process collision (" . $this->CLI_buildProcessId() . ")");
                return ($result | self::CLI_STATUS_ABORTED);
            }

            foreach ($rows as $r) {
                $result |= $this->readUrl($r['qid']);

                $counter++;
                usleep(intval($sleepTime)); // Just to relax the system

                // if the cli has been disabled between the start and the current read url,
                // we need to return from the function without marking the process as ended.
                if ($this->getDisabled()) {
                    return ($result | self::CLI_STATUS_ABORTED);
                }

                if (!$this->CLI_checkIfProcessIsActive($this->CLI_buildProcessId())) {
                    $this->CLI_debug("conflict / timeout (" . $this->CLI_buildProcessId() . ")");

                    //TODO might need an additional returncode
                    $result |= self::CLI_STATUS_ABORTED;
                    break; //possible timeout
                }
            }

            sleep(intval($sleepAfterFinish));

            $msg = 'Rows: ' . $counter;
            $this->CLI_debug($msg . " (" . $this->CLI_buildProcessId() . ")");
        } else {
            $this->CLI_debug("Nothing within queue which needs to be processed (" . $this->CLI_buildProcessId() . ")");
        }

        if ($counter > 0) {
            $result |= self::CLI_STATUS_PROCESSED;
        }

        return $result;
    }

    /**
     * Activate hooks
     *
     * @return void
     */
    public function CLI_runHooks()
    {
        global $TYPO3_CONF_VARS;
        if (is_array($TYPO3_CONF_VARS['EXTCONF']['crawler']['cli_hooks'])) {
            foreach ($TYPO3_CONF_VARS['EXTCONF']['crawler']['cli_hooks'] as $objRef) {
                $hookObj = &GeneralUtility::getUserObj($objRef);
                if (is_object($hookObj)) {
                    $hookObj->crawler_init($this);
                }
            }
        }
    }

    /**
     * Try to acquire a new process with the given id
     * also performs some auto-cleanup for orphan processes
     * @todo preemption might not be the most elegant way to clean up
     *
     * @param string $id identification string for the process
     * @return boolean
     */
    public function CLI_checkAndAcquireNewProcess($id)
    {
        $ret = true;

        $systemProcessId = getmypid();
        if ($systemProcessId < 1) {
            return false;
        }

        $processCount = 0;
        $orphanProcesses = [];

        $this->db->sql_query('BEGIN');

        $res = $this->db->exec_SELECTquery(
            'process_id,ttl',
            'tx_crawler_process',
            'active=1 AND deleted=0'
        );

        $currentTime = $this->getCurrentTime();

        while ($row = $this->db->sql_fetch_assoc($res)) {
            if ($row['ttl'] < $currentTime) {
                $orphanProcesses[] = $row['process_id'];
            } else {
                $processCount++;
            }
        }

        // if there are fewer active processes than allowed, add a new one
        if ($processCount < intval($this->extensionSettings['processLimit'])) {
            $this->CLI_debug("add process " . $this->CLI_buildProcessId() . " (" . ($processCount + 1) . "/" . intval($this->extensionSettings['processLimit']) . ")");

            // create new process record
            $this->db->exec_INSERTquery(
                'tx_crawler_process',
                [
                    'process_id' => $id,
                    'active' => '1',
                    'ttl' => ($currentTime + intval($this->extensionSettings['processMaxRunTime'])),
                    'system_process_id' => $systemProcessId
                ]
            );
        } else {
            $this->CLI_debug("Processlimit reached (" . ($processCount) . "/" . intval($this->extensionSettings['processLimit']) . ")");
            $ret = false;
        }

        $this->CLI_releaseProcesses($orphanProcesses, true); // maybe this should be somehow included into the current lock
        $this->CLI_deleteProcessesMarkedDeleted();

        $this->db->sql_query('COMMIT');

        return $ret;
    }

    /**
     * Release a process and the required resources
     *
     * @param  mixed    $releaseIds   string with a single process-id or array with multiple process-ids
     * @param  boolean  $withinLock   show whether the DB-actions are included within an existing lock
     * @return boolean
     */
    public function CLI_releaseProcesses($releaseIds, $withinLock = false)
    {
        if (!is_array($releaseIds)) {
            $releaseIds = [$releaseIds];
        }

        // The original "!count($releaseIds) > 0" only worked because of operator precedence
        if (count($releaseIds) === 0) {
            return false;   //nothing to release
        }

        if (!$withinLock) {
            $this->db->sql_query('BEGIN');
        }

        // some kind of 2nd chance algo - this way you need at least 2 processes to have a real cleanup
        // this ensures that a single process can't mess up the entire process table

        // free the queue entries that are still assigned to processes which are not active
        $this->db->exec_UPDATEquery(
            'tx_crawler_queue',
            'process_id IN (SELECT process_id FROM tx_crawler_process WHERE active=0 AND deleted=0)',
            [
                'process_scheduled' => 0,
                'process_id' => ''
            ]
        );
        // mark all processes as deleted which have no "waiting" queue entries and which are not active
        $this->db->exec_UPDATEquery(
            'tx_crawler_process',
            'active=0 AND deleted=0
            AND NOT EXISTS (
                SELECT * FROM tx_crawler_queue
                WHERE tx_crawler_queue.process_id = tx_crawler_process.process_id
                AND tx_crawler_queue.exec_time = 0
            )',
            [
                'deleted' => '1',
                'system_process_id' => 0
            ]
        );
        // mark all requested processes as non-active
        $this->db->exec_UPDATEquery(
            'tx_crawler_process',
            'process_id IN (\'' . implode('\',\'', $releaseIds) . '\') AND deleted=0',
            [
                'active' => '0'
            ]
        );
        // free the unprocessed queue entries of the requested processes
        // (single-quoted string literals, consistent with the query above)
        $this->db->exec_UPDATEquery(
            'tx_crawler_queue',
            'exec_time=0 AND process_id IN (\'' . implode('\',\'', $releaseIds) . '\')',
            [
                'process_scheduled' => 0,
                'process_id' => ''
            ]
        );

        if (!$withinLock) {
            $this->db->sql_query('COMMIT');
        }

        return true;
    }

    /**
     * Delete processes marked as deleted
     *
     * @return void
     */
    public function CLI_deleteProcessesMarkedDeleted()
    {
        $this->db->exec_DELETEquery('tx_crawler_process', 'deleted = 1');
    }

    /**
     * Check if there are still resources left for the process with the given id
     * Used to determine timeouts and to ensure a proper cleanup if there's a timeout
     *
     * @param  string  $pid  identification string for the process
     * @return boolean determines if the process is still active / has resources
     *
     * FIXME: Please remove Transaction, not needed as only a select query.
     */
    public function CLI_checkIfProcessIsActive($pid)
    {
        $ret = false;
        $this->db->sql_query('BEGIN');
        $res = $this->db->exec_SELECTquery(
            'process_id,active,ttl',
            'tx_crawler_process',
            'process_id = \'' . $pid . '\'  AND deleted=0',
            '',
            'ttl',
            '0,1'
        );
        if ($row = $this->db->sql_fetch_assoc($res)) {
            $ret = intval($row['active']) == 1;
        }
        $this->db->sql_query('COMMIT');

        return $ret;
    }

    /**
     * Create a unique Id for the current process
     *
     * @return string  the ID
     */
    public function CLI_buildProcessId()
    {
        if (!$this->processID) {
            $this->processID = GeneralUtility::shortMD5($this->microtime(true));
        }
        return $this->processID;
    }

    /**
     * @param bool $get_as_float
     *
     * @return mixed
     */
    protected function microtime($get_as_float = false)
    {
        return microtime($get_as_float);
    }

    /**
     * Prints a message to the stdout (only if debug-mode is enabled)
     *
     * @param  string $msg  the message
     */
    public function CLI_debug($msg)
    {
        if (intval($this->extensionSettings['processDebug'])) {
            echo $msg . "\n";
            flush();
        }
    }

    /**
     * Get URL content by making direct request to TYPO3.
     *
     * @param  string $url          Page URL
     * @param  int    $crawlerId    Crawler-ID
     * @return array
     */
    protected function sendDirectRequest($url, $crawlerId)
    {
        $parsedUrl = parse_url($url);
        if (!is_array($parsedUrl)) {
            return [];
        }

        $requestHeaders = $this->buildRequestHeaderArray($parsedUrl, $crawlerId);

        $cmd = escapeshellcmd($this->extensionSettings['phpPath']);
        $cmd .= ' ';
        $cmd .= escapeshellarg(ExtensionManagementUtility::extPath('crawler') . 'cli/bootstrap.php');
        $cmd .= ' ';
        $cmd .= escapeshellarg($this->getFrontendBasePath());
        $cmd .= ' ';
        $cmd .= escapeshellarg($url);
        $cmd .= ' ';
        $cmd .= escapeshellarg(base64_encode(serialize($requestHeaders)));

        $startTime = microtime(true);
        $content = $this->executeShellCommand($cmd);
        $this->log($url . ' ' . (microtime(true) - $startTime));

        $result = [
            'request' => implode("\r\n", $requestHeaders) . "\r\n\r\n",
            'headers' => '',
            'content' => $content
        ];

        return $result;
    }

    /**
     * Cleans up entries that stayed for too long in the queue. These are:
     * - processed entries that are over 1.5 days in age
     * - scheduled entries that are over 7 days old
     *
     * The limits actually applied are read, in days, from the extension settings
     * cleanUpProcessedAge and cleanUpScheduledAge.
     *
     * @return void
     */
    protected function cleanUpOldQueueEntries()
    {
        $processedAgeInSeconds = $this->extensionSettings['cleanUpProcessedAge'] * 86400; // 24*60*60 Seconds in 24 hours
        $scheduledAgeInSeconds = $this->extensionSettings['cleanUpScheduledAge'] * 86400;

        $now = time();
        $condition = '(exec_time<>0 AND exec_time<' . ($now - $processedAgeInSeconds) . ') OR scheduled<=' . ($now - $scheduledAgeInSeconds);
        $this->flushQueue($condition);
    }

    /**
     * Initializes a TypoScript Frontend necessary for using TypoScript and TypoLink functions
     *
     * @param int $id
     * @param int $typeNum
     *
     * @return void
     */
    protected function initTSFE($id = 1, $typeNum = 0)
    {
        EidUtility::initTCA();
        if (!is_object($GLOBALS['TT'])) {
            $GLOBALS['TT'] = new NullTimeTracker();
            $GLOBALS['TT']->start();
        }

        $GLOBALS['TSFE'] = GeneralUtility::makeInstance(TypoScriptFrontendController::class, $GLOBALS['TYPO3_CONF_VARS'], $id, $typeNum);
        $GLOBALS['TSFE']->sys_page = GeneralUtility::makeInstance(PageRepository::class);
        $GLOBALS['TSFE']->sys_page->init(true);
        $GLOBALS['TSFE']->connectToDB();
        $GLOBALS['TSFE']->initFEuser();
        $GLOBALS['TSFE']->determineId();
        $GLOBALS['TSFE']->initTemplate();
        $GLOBALS['TSFE']->rootLine = $GLOBALS['TSFE']->sys_page->getRootLine($id, '');
        $GLOBALS['TSFE']->getConfigArray();
        PageGenerator::pagegenInit();
    }

    /**
     * Returns a md5 hash generated from a serialized configuration array.
     *
     * @param array $configuration
     *
     * @return string
     */
    protected function getConfigurationHash(array $configuration)
    {
        unset($configuration['paramExpanded']);
        unset($configuration['URLs']);
        return md5(serialize($configuration));
    }

    /**
     * Check whether the Crawling Protocol should be http or https
     *
     * @param int  $crawlerConfiguration -1 forces http, 1 forces https, 0 defers to the page setting
     * @param bool $pageConfiguration    whether the page itself asks for https
     *
     * @return bool
     */
    protected function isCrawlingProtocolHttps($crawlerConfiguration, $pageConfiguration)
    {
        switch ($crawlerConfiguration) {
            case -1:
                return false;
            case 0:
                return $pageConfiguration;
            case 1:
                return true;
            default:
                return false;
        }
    }
}
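
CLI_main() and CLI_run() build their return value with bitwise OR, so a caller has to test individual bits rather than compare the result against a single constant. The following is a minimal, standalone sketch of that decoding step. The function name, the `EXAMPLE_STATUS_*` constants and their numeric values are assumptions made for illustration; the real values are the `CLI_STATUS_*` class constants, which are not part of this listing.

```php
<?php
// Hypothetical flag values, for illustration only. The real ones are the
// CLI_STATUS_* constants of CrawlerController, which are not shown here.
const EXAMPLE_STATUS_REMAIN = 1;
const EXAMPLE_STATUS_PROCESSED = 2;
const EXAMPLE_STATUS_ABORTED = 4;

/**
 * Returns the names of all flags that are set in $status.
 *
 * @param int   $status Bitmask as returned by a CLI run
 * @param array $flags  Map of flag name => bit value
 * @return array
 */
function describeCrawlerStatus($status, array $flags)
{
    $names = [];
    foreach ($flags as $name => $bit) {
        // A zero-valued flag can never be detected with a bit test, so skip it.
        if ($bit !== 0 && ($status & $bit) === $bit) {
            $names[] = $name;
        }
    }
    return $names;
}

$flags = [
    'remain' => EXAMPLE_STATUS_REMAIN,
    'processed' => EXAMPLE_STATUS_PROCESSED,
    'aborted' => EXAMPLE_STATUS_ABORTED,
];

// A run that processed some entries and was then aborted:
$status = EXAMPLE_STATUS_PROCESSED | EXAMPLE_STATUS_ABORTED;
echo implode(', ', describeCrawlerStatus($status, $flags)), "\n";
```

Passing the flag map in as a parameter keeps the sketch independent of the class; inside the controller the map would presumably be built from the `CLI_STATUS_*` constants instead.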