Completed
Push — bugfix/domain-model-configurat... ( b1f2da )
by Tomas Norre
04:36
created

CrawlerController   F

Complexity

Total Complexity 358

Size/Duplication

Total Lines 2713
Duplicated Lines 0 %

Coupling/Cohesion

Components 1
Dependencies 21

Test Coverage

Coverage 43.62%

Importance

Changes 0
Metric Value
dl 0
loc 2713
ccs 499
cts 1144
cp 0.4362
rs 0.8
c 0
b 0
f 0
wmc 358
lcom 1
cbo 21

60 Methods

Rating   Name   Duplication   Size   Complexity  
A setExtensionSettings() 0 4 1
F checkIfPageShouldBeSkipped() 0 57 16
A getUrlsForPageRow() 0 15 3
A addQueueEntry_callBack() 0 19 3
A getAccessMode() 0 4 1
A setAccessMode() 0 4 1
A setDisabled() 0 10 3
A getDisabled() 0 8 2
A setProcessFilename() 0 4 1
A getProcessFilename() 0 4 1
A __construct() 0 24 3
A noUnprocessedQueueEntriesForPageWithConfigurationHashExist() 0 8 1
F urlListFromUrlArray() 0 113 20
A drawURLs_PIfilter() 0 12 4
A getPageTSconfigForId() 0 22 4
F getUrlsForPageId() 0 130 25
A getBaseUrlForConfigurationRecord() 0 20 4
B getConfigurationsForBranch() 0 45 11
A hasGroupAccess() 0 12 4
A parseParams() 0 15 3
F expandParameters() 0 110 24
B compileUrls() 0 25 7
B getLogEntriesForPageId() 0 29 6
B getLogEntriesForSetId() 0 29 6
B flushQueue() 0 33 8
B addUrl() 0 86 6
B getDuplicateRowsIfExist() 0 40 7
A getCurrentTime() 0 4 1
C readUrl() 0 82 13
A readUrlFromArray() 0 24 1
A readUrl_exec() 0 38 4
D requestUrl() 0 93 19
B getFrontendBasePath() 0 23 8
A executeShellCommand() 0 5 1
A getHttpResponseFromStream() 0 23 5
A log() 0 9 3
A buildRequestHeaderArray() 0 16 4
B getRequestUrlFrom302Header() 0 34 11
A fe_init() 0 17 4
B getPageTreeAndUrls() 0 87 8
B expandExcludeString() 0 45 9
D drawURLs_addRowsForPage() 0 109 15
B CLI_main() 0 41 10
F CLI_main_im() 0 108 17
A CLI_main_flush() 0 38 5
A getConfigurationKeys() 0 5 2
C CLI_run() 0 108 10
A CLI_runHooks() 0 12 4
B CLI_checkAndAcquireNewProcess() 0 56 5
B CLI_releaseProcesses() 0 62 5
A CLI_deleteProcessesMarkedDeleted() 0 4 1
A CLI_checkIfProcessIsActive() 0 19 2
A CLI_buildProcessId() 0 7 2
A microtime() 0 4 1
A CLI_debug() 0 7 2
A sendDirectRequest() 0 31 2
A cleanUpOldQueueEntries() 0 9 1
A initTSFE() 0 25 3
A getConfigurationHash() 0 6 1
A isCrawlingProtocolHttps() 0 13 4

How to fix   Complexity   

Complex Class

Complex classes like CrawlerController often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes. You can also have a look at the cohesion graph to spot any un-connected, or weakly-connected components.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

While breaking up the class, it is a good idea to analyze how other classes use CrawlerController, and based on these observations, apply Extract Interface, too.
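As an illustration of Extract Class, the sketch below pulls the "disabled" marker-file handling (the state behind setDisabled(), getDisabled() and the process filename) into a small class of its own. The class name `CrawlerDisableFlag` and its method names are hypothetical and do not exist in the crawler code base; plain `file_put_contents()` stands in for `GeneralUtility::writeFile()` so that the sketch runs without TYPO3.

```php
<?php
// Hypothetical extracted class, for illustration only (not part of the crawler extension).
final class CrawlerDisableFlag
{
    /** @var string Path of the marker file */
    private $filename;

    public function __construct(string $filename)
    {
        $this->filename = $filename;
    }

    public function disable(): void
    {
        // An empty marker file signals "disabled".
        file_put_contents($this->filename, '');
    }

    public function enable(): void
    {
        // Removing the marker file re-enables processing.
        if (is_file($this->filename)) {
            unlink($this->filename);
        }
    }

    public function isDisabled(): bool
    {
        return is_file($this->filename);
    }
}

// Usage: the path below is an arbitrary example location.
$flag = new CrawlerDisableFlag(sys_get_temp_dir() . '/tx_crawler_example.proc');
$flag->disable();
var_dump($flag->isDisabled());
$flag->enable();
var_dump($flag->isDisabled());
```

CrawlerController could then hold one such object and delegate to it, which would move three methods and one field out of the class without changing its public behaviour.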

<?php
namespace AOE\Crawler\Controller;

/***************************************************************
 *  Copyright notice
 *
 *  (c) 2017 AOE GmbH <[email protected]>
 *
 *  All rights reserved
 *
 *  This script is part of the TYPO3 project. The TYPO3 project is
 *  free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  The GNU General Public License can be found at
 *  http://www.gnu.org/copyleft/gpl.html.
 *
 *  This script is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  This copyright notice MUST APPEAR in all copies of the script!
 ***************************************************************/

use AOE\Crawler\Command\CrawlerCommandLineController;
use AOE\Crawler\Command\FlushCommandLineController;
use AOE\Crawler\Command\QueueCommandLineController;
use AOE\Crawler\Domain\Model\Configuration;
use AOE\Crawler\Domain\Model\Reason;
use AOE\Crawler\Domain\Repository\ConfigurationRepository;
use AOE\Crawler\Domain\Repository\ProcessRepository;
use AOE\Crawler\Domain\Repository\QueueRepository;
use AOE\Crawler\Event\EventDispatcher;
use AOE\Crawler\Utility\IconUtility;
use AOE\Crawler\Utility\SignalSlotUtility;
use TYPO3\CMS\Backend\Tree\View\PageTreeView;
use TYPO3\CMS\Backend\Utility\BackendUtility;
use TYPO3\CMS\Core\Authentication\BackendUserAuthentication;
use TYPO3\CMS\Core\Database\DatabaseConnection;
use TYPO3\CMS\Core\Log\LogLevel;
use TYPO3\CMS\Core\TimeTracker\NullTimeTracker;
use TYPO3\CMS\Core\TimeTracker\TimeTracker;
use TYPO3\CMS\Core\Utility\DebugUtility;
use TYPO3\CMS\Core\Utility\ExtensionManagementUtility;
use TYPO3\CMS\Core\Utility\GeneralUtility;
use TYPO3\CMS\Core\Utility\MathUtility;
use TYPO3\CMS\Core\Utility\VersionNumberUtility;
use TYPO3\CMS\Extbase\Object\ObjectManager;
use TYPO3\CMS\Frontend\Controller\TypoScriptFrontendController;
use TYPO3\CMS\Frontend\Page\PageGenerator;
use TYPO3\CMS\Frontend\Page\PageRepository;
use TYPO3\CMS\Frontend\Utility\EidUtility;
use TYPO3\CMS\Lang\LanguageService;

/**
 * Class CrawlerController
 *
 * @package AOE\Crawler\Controller
 */
class CrawlerController
{
    const CLI_STATUS_NOTHING_PROCCESSED = 0;
    const CLI_STATUS_REMAIN = 1; // queue not empty
    const CLI_STATUS_PROCESSED = 2; // (some) queue items were processed
    const CLI_STATUS_ABORTED = 4; // instance didn't finish
    const CLI_STATUS_POLLABLE_PROCESSED = 8;

    /**
     * @var integer
     */
    public $setID = 0;

    /**
     * @var string
     */
    public $processID = '';

    /**
     * One hour is the maximum stalled time for the CLI.
     * If a process has had the status "start" for 3600 seconds, it is regarded as stalled and a new process is started.
     *
     * @var integer
     */
    public $max_CLI_exec_time = 3600;

    /**
     * @var array
     */
    public $duplicateTrack = [];

    /**
     * @var array
     */
    public $downloadUrls = [];

    /**
     * @var array
     */
    public $incomingProcInstructions = [];

    /**
     * @var array
     */
    public $incomingConfigurationSelection = [];

    /**
     * @var bool
     */
    public $registerQueueEntriesInternallyOnly = false;

    /**
     * @var array
     */
    public $queueEntries = [];

    /**
     * @var array
     */
    public $urlList = [];

    /**
     * @var boolean
     */
    public $debugMode = false;

    /**
     * @var array
     */
    public $extensionSettings = [];

    /**
     * Mount Point (false if none is set)
     *
     * @var string|boolean
     */
    public $MP = false;

    /**
     * @var string
     */
    protected $processFilename;

    /**
     * Holds the internal access mode; can be 'gui', 'cli' or 'cli_im'
     *
     * @var string
     */
    protected $accessMode;

    /**
     * @var DatabaseConnection
     */
    private $db;

    /**
     * @var BackendUserAuthentication
     */
    private $backendUser;

    /**
     * @var integer
     */
    private $scheduledTime = 0;

    /**
     * @var integer
     */
    private $reqMinute = 0;

    /**
     * @var bool
     */
    private $submitCrawlUrls = false;

    /**
     * @var bool
     */
    private $downloadCrawlUrls = false;

    /**
     * @var QueueRepository
     */
    protected $queueRepository;

    /**
     * @var ProcessRepository
     */
    protected $processRepository;

    /**
     * @var ConfigurationRepository
     */
    protected $configurationRepository;

    /**
     * Method to get the accessMode; can be gui, cli or cli_im
     *
     * @return string
     */
    public function getAccessMode()
    {
        return $this->accessMode;
    }

    /**
     * @param string $accessMode
     */
    public function setAccessMode($accessMode)
    {
        $this->accessMode = $accessMode;
    }

    /**
     * Set disabled status to prevent processes from being processed
     *
     * @param  bool $disabled (optional, defaults to true)
     * @return void
     */
    public function setDisabled($disabled = true)
    {
        if ($disabled) {
            GeneralUtility::writeFile($this->processFilename, '');
        } elseif (is_file($this->processFilename)) {
            unlink($this->processFilename);
        }
    }

    /**
     * Get disable status
     *
     * @return bool true if disabled
     */
    public function getDisabled()
    {
        // The crawler counts as disabled while the marker file exists
        return is_file($this->processFilename);
    }

    /**
     * @param string $filenameWithPath
     *
     * @return void
     */
    public function setProcessFilename($filenameWithPath)
    {
        $this->processFilename = $filenameWithPath;
    }

    /**
     * @return string
     */
    public function getProcessFilename()
    {
        return $this->processFilename;
    }

    /************************************
     *
     * Getting URLs based on Page TSconfig
     *
     ************************************/

    public function __construct()
    {
        $objectManager = GeneralUtility::makeInstance(ObjectManager::class);
        $this->queueRepository = $objectManager->get(QueueRepository::class);
        $this->configurationRepository = $objectManager->get(ConfigurationRepository::class);
        $this->processRepository = $objectManager->get(ProcessRepository::class);

        $this->db = $GLOBALS['TYPO3_DB'];
        $this->backendUser = $GLOBALS['BE_USER'];
        $this->processFilename = PATH_site . 'typo3temp/tx_crawler.proc';

        $settings = unserialize($GLOBALS['TYPO3_CONF_VARS']['EXT']['extConf']['crawler']);
        $settings = is_array($settings) ? $settings : [];

        // read ext_em_conf_template settings and set
        $this->setExtensionSettings($settings);

        // set defaults:
        if (MathUtility::convertToPositiveInteger($this->extensionSettings['countInARun']) == 0) {
            $this->extensionSettings['countInARun'] = 100;
        }

        $this->extensionSettings['processLimit'] = MathUtility::forceIntegerInRange($this->extensionSettings['processLimit'], 1, 99, 1);
    }

    /**
     * Sets the extension settings (unserialized counterpart of $TYPO3_CONF_VARS['EXT']['extConf']['crawler']).
     *
     * @param array $extensionSettings
     * @return void
     */
    public function setExtensionSettings(array $extensionSettings)
    {
        $this->extensionSettings = $extensionSettings;
    }

    /**
     * Check if the given page should be crawled
     *
     * @param array $pageRow
     * @return false|string false if the page should be crawled (not excluded), otherwise the skip message
     */
    public function checkIfPageShouldBeSkipped(array $pageRow)
    {
        $skipPage = false;
        $skipMessage = 'Skipped'; // message will be overwritten later

        // if page is hidden
        if (!$this->extensionSettings['crawlHiddenPages']) {
            if ($pageRow['hidden']) {
                $skipPage = true;
                $skipMessage = 'Because page is hidden';
            }
        }

        if (!$skipPage) {
            if (GeneralUtility::inList('3,4', $pageRow['doktype']) || $pageRow['doktype'] >= 199) {
                $skipPage = true;
                $skipMessage = 'Because doktype is not allowed';
            }
        }

        if (!$skipPage) {
            if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['excludeDoktype'])) {
                foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['excludeDoktype'] as $key => $doktypeList) {
                    if (GeneralUtility::inList($doktypeList, $pageRow['doktype'])) {
                        $skipPage = true;
                        $skipMessage = 'Doktype was excluded by "' . $key . '"';
                        break;
                    }
                }
            }
        }

        if (!$skipPage) {
            // veto hook
            if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pageVeto'])) {
                foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pageVeto'] as $key => $func) {
                    $params = [
                        'pageRow' => $pageRow,
                    ];
                    // expects "false" if page is ok and "true" or a skipMessage if this page should _not_ be crawled
                    $veto = GeneralUtility::callUserFunction($func, $params, $this);
                    if ($veto !== false) {
                        $skipPage = true;
                        if (is_string($veto)) {
                            $skipMessage = $veto;
                        } else {
                            $skipMessage = 'Veto from hook "' . htmlspecialchars($key) . '"';
                        }
                        // no need to execute other hooks if a previous one returned a veto
                        break;
                    }
                }
            }
        }

        return $skipPage ? $skipMessage : false;
    }

    /**
     * Wrapper method for getUrlsForPageId()
     * It returns an array of configurations and no urls!
     *
     * @param array $pageRow Page record with at least doktype and uid columns.
     * @param string $skipMessage Passed by reference; set to the skip reason, or to '' if the page is not skipped
     * @return array
     * @see getUrlsForPageId()
     */
    public function getUrlsForPageRow(array $pageRow, &$skipMessage = '')
    {
        $message = $this->checkIfPageShouldBeSkipped($pageRow);

        if ($message === false) {
            $forceSsl = ($pageRow['url_scheme'] === 2);
            $res = $this->getUrlsForPageId($pageRow['uid'], $forceSsl);
            $skipMessage = '';
        } else {
            $skipMessage = $message;
            $res = [];
        }

        return $res;
    }

    /**
     * This method checks whether there are ANY unprocessed queue entries
     * for a given page_id whose configuration matches a given hash.
     * If there are none, we can skip an inner detail check
     *
     * @param  int $uid
     * @param  string $configurationHash
     * @return boolean
     */
    protected function noUnprocessedQueueEntriesForPageWithConfigurationHashExist($uid, $configurationHash)
    {
        $configurationHash = $this->db->fullQuoteStr($configurationHash, 'tx_crawler_queue');
        $res = $this->db->exec_SELECTquery('count(*) as anz', 'tx_crawler_queue', "page_id=" . intval($uid) . " AND configuration_hash=" . $configurationHash . " AND exec_time=0");
        $row = $this->db->sql_fetch_assoc($res);

        return ($row['anz'] == 0);
    }

    /**
     * Creates a list of URLs from input array (and submits them to queue if asked for)
     * See Web > Info module script + "indexed_search"'s crawler hook-client using this!
     *
     * @param array $vv Information about URLs from pageRow to crawl.
     * @param array $pageRow Page row
     * @param integer $scheduledTime Unix time to schedule indexing to, typically time()
     * @param integer $reqMinute Number of requests per minute (creates the interleave between requests)
     * @param boolean $submitCrawlUrls If set, submits the URLs to queue
     * @param boolean $downloadCrawlUrls If set (and $submitCrawlUrls is false), fills $downloadUrls with entries
     * @param array $duplicateTrack Passed by reference; holds a key per URL to ensure duplicates are not crawled
     * @param array $downloadUrls Passed by reference; filled with URLs for download if the flag is set
     * @param array $incomingProcInstructions Array of processing instructions
     * @return string List of URLs (meant for display in backend module)
     *
     */
    public function urlListFromUrlArray(
        array $vv,
        array $pageRow,
        $scheduledTime,
        $reqMinute,
        $submitCrawlUrls,
        $downloadCrawlUrls,
        array &$duplicateTrack,
        array &$downloadUrls,
        array $incomingProcInstructions
    ) {
        $urlList = '';
        // realurl support (thanks to Ingo Renner)
        if (ExtensionManagementUtility::isLoaded('realurl') && $vv['subCfg']['realurl']) {

            /** @var tx_realurl $urlObj */
            $urlObj = GeneralUtility::makeInstance('tx_realurl');

            if (!empty($vv['subCfg']['baseUrl'])) {
                $urlParts = parse_url($vv['subCfg']['baseUrl']);
                $host = strtolower($urlParts['host']);
                $urlObj->host = $host;

                // First pass, finding configuration OR pointer string:
                $urlObj->extConf = isset($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->host]) ? $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->host] : $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl']['_DEFAULT'];

                // If it turned out to be a string pointer, then look up the real config:
                if (is_string($urlObj->extConf)) {
                    $urlObj->extConf = is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->extConf]) ? $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->extConf] : $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl']['_DEFAULT'];
                }
            }

            if (!$GLOBALS['TSFE']->sys_page) {
                $GLOBALS['TSFE']->sys_page = GeneralUtility::makeInstance('TYPO3\CMS\Frontend\Page\PageRepository');
            }

            if (!$GLOBALS['TSFE']->tmpl->rootLine[0]['uid']) {
                $GLOBALS['TSFE']->tmpl->rootLine[0]['uid'] = $urlObj->extConf['pagePath']['rootpage_id'];
            }
        }

        if (is_array($vv['URLs'])) {
            $configurationHash = $this->getConfigurationHash($vv);
            $skipInnerCheck = $this->noUnprocessedQueueEntriesForPageWithConfigurationHashExist($pageRow['uid'], $configurationHash);

            foreach ($vv['URLs'] as $urlQuery) {
                if ($this->drawURLs_PIfilter($vv['subCfg']['procInstrFilter'], $incomingProcInstructions)) {

                    // Calculate cHash:
                    if ($vv['subCfg']['cHash']) {
                        /* @var $cacheHash \TYPO3\CMS\Frontend\Page\CacheHashCalculator */
                        $cacheHash = GeneralUtility::makeInstance('TYPO3\CMS\Frontend\Page\CacheHashCalculator');
                        $urlQuery .= '&cHash=' . $cacheHash->generateForParameters($urlQuery);
                    }

                    // Create key by which to determine unique-ness:
                    $uKey = $urlQuery . '|' . $vv['subCfg']['userGroups'] . '|' . $vv['subCfg']['baseUrl'] . '|' . $vv['subCfg']['procInstrFilter'];

                    // realurl support (thanks to Ingo Renner)
                    $urlQuery = 'index.php' . $urlQuery;
                    if (ExtensionManagementUtility::isLoaded('realurl') && $vv['subCfg']['realurl']) {
                        $params = [
                            'LD' => [
                                'totalURL' => $urlQuery,
                            ],
                            'TCEmainHook' => true,
                        ];
                        $urlObj->encodeSpURL($params);
The variable $urlObj does not seem to be defined for all execution paths leading up to this point.

If you define a variable conditionally, it can happen that it is not defined for all execution paths.

Let’s take a look at an example:

function myFunction($a) {
    switch ($a) {
        case 'foo':
            $x = 1;
            break;

        case 'bar':
            $x = 2;
            break;
    }

    // $x is potentially undefined here.
    echo $x;
}

In the above example, the variable $x is defined if you pass “foo” or “bar” as argument for $a. However, since the switch statement has no default case statement, if you pass any other value, the variable $x would be undefined.

Available Fixes

  1. Check for existence of the variable explicitly:

    function myFunction($a) {
        switch ($a) {
            case 'foo':
                $x = 1;
                break;
    
            case 'bar':
                $x = 2;
                break;
        }
    
        if (isset($x)) { // Make sure it's always set.
            echo $x;
        }
    }
    
  2. Define a default value for the variable:

    function myFunction($a) {
        $x = ''; // Set a default which gets overridden for certain paths.
        switch ($a) {
            case 'foo':
                $x = 1;
                break;
    
            case 'bar':
                $x = 2;
                break;
        }
    
        echo $x;
    }
    
  3. Add a value for the missing path:

    function myFunction($a) {
        switch ($a) {
            case 'foo':
                $x = 1;
                break;
    
            case 'bar':
                $x = 2;
                break;
    
            // We add support for the missing case.
            default:
                $x = '';
                break;
        }
    
        echo $x;
    }
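Applied to the flagged line, fix 2 could look roughly like the sketch below. `FakeUrlEncoder` and `buildUrl()` are hypothetical stand-ins (for `tx_realurl` and for the relevant part of `urlListFromUrlArray()`) so that the example runs without TYPO3; only the pattern of a default value plus a guard on the object is the point.

```php
<?php
// Hypothetical stand-in for tx_realurl, only so that the sketch is self-contained.
class FakeUrlEncoder
{
    public function encodeSpURL(array &$params)
    {
        $params['LD']['totalURL'] = 'encoded/' . $params['LD']['totalURL'];
    }
}

// Hypothetical reduction of the flagged code path.
function buildUrl($urlQuery, $useRealUrl)
{
    // Default value, so the variable is defined on every execution path.
    $urlObj = null;
    if ($useRealUrl) {
        $urlObj = new FakeUrlEncoder();
    }

    $urlQuery = 'index.php' . $urlQuery;

    // Guard on the object itself instead of repeating the original condition.
    if ($urlObj !== null) {
        $params = [
            'LD' => [
                'totalURL' => $urlQuery,
            ],
            'TCEmainHook' => true,
        ];
        $urlObj->encodeSpURL($params);
        $urlQuery = $params['LD']['totalURL'];
    }

    return $urlQuery;
}

echo buildUrl('?id=1', false), PHP_EOL;
echo buildUrl('?id=1', true), PHP_EOL;
```

In the listed method both `if` statements test the same condition, so the object is in practice always set when it is used; the default value mainly makes that explicit for readers and for static analysis.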
    
                        $urlQuery = $params['LD']['totalURL'];
                    }

                    // Scheduled time:
                    $schTime = $scheduledTime + round(count($duplicateTrack) * (60 / $reqMinute));
                    $schTime = floor($schTime / 60) * 60;

                    if (isset($duplicateTrack[$uKey])) {

                        // if the url key is already registered, just display it and do not resubmit it
                        $urlList = '<em><span class="typo3-dimmed">' . htmlspecialchars($urlQuery) . '</span></em><br/>';
                    } else {
                        $urlList = '[' . date('d.m.y H:i', $schTime) . '] ' . htmlspecialchars($urlQuery);
                        $this->urlList[] = '[' . date('d.m.y H:i', $schTime) . '] ' . $urlQuery;

                        $theUrl = ($vv['subCfg']['baseUrl'] ? $vv['subCfg']['baseUrl'] : GeneralUtility::getIndpEnv('TYPO3_SITE_URL')) . $urlQuery;

                        // Submit for crawling!
                        if ($submitCrawlUrls) {
                            $added = $this->addUrl(
                                $pageRow['uid'],
                                $theUrl,
                                $vv['subCfg'],
                                $scheduledTime,
                                $configurationHash,
                                $skipInnerCheck
                            );
                            if ($added === false) {
                                $urlList .= ' (Url already existed)';
                            }
                        } elseif ($downloadCrawlUrls) {
                            $downloadUrls[$theUrl] = $theUrl;
                        }

                        $urlList .= '<br />';
                    }
                    $duplicateTrack[$uKey] = true;
                }
            }
        } else {
            $urlList = 'ERROR - no URL generated';
        }

        return $urlList;
    }

    /**
     * Returns true if input processing instruction is among registered ones.
     *
     * @param string $piString PI to test
     * @param array $incomingProcInstructions Processing instructions
     * @return boolean
     */
    public function drawURLs_PIfilter($piString, array $incomingProcInstructions)
    {
        if (empty($incomingProcInstructions)) {
            return true;
        }

        foreach ($incomingProcInstructions as $pi) {
            if (GeneralUtility::inList($piString, $pi)) {
                return true;
            }
        }

        // No match: return an explicit boolean instead of falling through with null
        return false;
    }

    /**
     * Returns the page TSconfig for the given page, or for the mount point page if $this->MP is set.
     * Hooks registered under 'getPageTSconfigForId' may alter the configuration.
     *
     * @param integer $id Page ID
     * @return array
     */
    public function getPageTSconfigForId($id)
    {
        if (!$this->MP) {
            $pageTSconfig = BackendUtility::getPagesTSconfig($id);
        } else {
            list(, $mountPointId) = explode('-', $this->MP);
            $pageTSconfig = BackendUtility::getPagesTSconfig($mountPointId);
        }

        // Call a hook to alter configuration
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['getPageTSconfigForId'])) {
            $params = [
                'pageId' => $id,
                'pageTSConfig' => &$pageTSconfig,
            ];
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['getPageTSconfigForId'] as $userFunc) {
                GeneralUtility::callUserFunction($userFunc, $params, $this);
            }
        }

        return $pageTSconfig;
    }

587
    /**
588
     * This methods returns an array of configurations.
589
     * And no urls!
590
     *
591
     * @param integer $id Page ID
592
     * @param bool $forceSsl Use https
593
     * @return array
594
     *
595
     * TODO: Should be switched back to protected - TNM 2018-11-16
596
     */
597 4
    public function getUrlsForPageId($id, $forceSsl = false)
598
    {
599
600
        /**
601
         * Get configuration from tsConfig
602
         */
603
604
        // Get page TSconfig for page ID:
605 4
        $pageTSconfig = $this->getPageTSconfigForId($id);
606
607 4
        $res = [];
608
609 4
        if (is_array($pageTSconfig) && is_array($pageTSconfig['tx_crawler.']['crawlerCfg.'])) {
610 3
            $crawlerCfg = $pageTSconfig['tx_crawler.']['crawlerCfg.'];
611
612 3
            if (is_array($crawlerCfg['paramSets.'])) {
613 3
                foreach ($crawlerCfg['paramSets.'] as $key => $values) {
614 3
                    if (is_array($values)) {
615 3
                        $key = str_replace('.', '', $key);
616
                        // Sub configuration for a single configuration string:
617 3
                        $subCfg = (array)$crawlerCfg['paramSets.'][$key . '.'];
618 3
                        $subCfg['key'] = $key;
619
620 3
                        if (strcmp($subCfg['procInstrFilter'], '')) {
621 3
                            $subCfg['procInstrFilter'] = implode(',', GeneralUtility::trimExplode(',', $subCfg['procInstrFilter']));
622
                        }
623 3
                        $pidOnlyList = implode(',', GeneralUtility::trimExplode(',', $subCfg['pidsOnly'], true));
624
625
                        // process configuration if it is not page-specific or if the specific page is the current page:
626 3
                        if (!strcmp($subCfg['pidsOnly'], '') || GeneralUtility::inList($pidOnlyList, $id)) {
627
628
                                // add trailing slash if not present
629 3
                            if (!empty($subCfg['baseUrl']) && substr($subCfg['baseUrl'], -1) != '/') {
630
                                $subCfg['baseUrl'] .= '/';
631
                            }
632
633
                            // Explode, process etc.:
634 3
                            $res[$key] = [];
635 3
                            $res[$key]['subCfg'] = $subCfg;
636 3
                            $res[$key]['paramParsed'] = $this->parseParams($crawlerCfg['paramSets.'][$key]);
637 3
                            $res[$key]['paramExpanded'] = $this->expandParameters($res[$key]['paramParsed'], $id);
638 3
                            $res[$key]['origin'] = 'pagets';
639
640
                            // recognize MP value
641 3
                            if (!$this->MP) {
642 3
                                $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $id]);
643
                            } else {
644 3
                                $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $id . '&MP=' . $this->MP]);
645
                            }
646
                        }
647
                    }
648
                }
649
            }
650
        }

        /**
         * Get configuration from tx_crawler_configuration records
         */

        // get records along the rootline
        $rootLine = BackendUtility::BEgetRootLine($id);
        foreach ($rootLine as $page) {
            $configurationRecordsForCurrentPage = $this->configurationRepository->getConfigurationRecordsPageUid($page['uid'])->toArray();

            /** @var Configuration $configurationRecord */
            foreach ($configurationRecordsForCurrentPage as $configurationRecord) {

                // check access to the configuration record
                if (empty($configurationRecord->getBeGroups()) || $GLOBALS['BE_USER']->isAdmin() || $this->hasGroupAccess($GLOBALS['BE_USER']->user['usergroup_cached_list'], $configurationRecord->getBeGroups())) {
                    $pidOnlyList = implode(',', GeneralUtility::trimExplode(',', $configurationRecord->getPidsOnly(), true));

                    // process configuration if it is not page-specific or if the specific page is the current page:
                    if (!strcmp($configurationRecord->getPidsOnly(), '') || GeneralUtility::inList($pidOnlyList, $id)) {
                        $key = $configurationRecord->getName();

                        // don't overwrite previously defined paramSets
                        if (!isset($res[$key])) {

                            /* @var $TSparserObject \TYPO3\CMS\Core\TypoScript\Parser\TypoScriptParser */
                            $TSparserObject = GeneralUtility::makeInstance('TYPO3\CMS\Core\TypoScript\Parser\TypoScriptParser');
                            // Todo: Check where the field processing_instructions_parameters_ts comes from.
                            $TSparserObject->parse($configurationRecord->getProcessingInstructionFilter()); //['processing_instruction_parameters_ts']);

                            $isCrawlingProtocolHttps = $this->isCrawlingProtocolHttps($configurationRecord->isForceSsl(), $forceSsl);

                            $subCfg = [
                                'procInstrFilter' => $configurationRecord->getProcessingInstructionFilter(),
                                'procInstrParams.' => $TSparserObject->setup,
                                'baseUrl' => $this->getBaseUrlForConfigurationRecord(
                                    $configurationRecord->getBaseUrl(),
                                    $configurationRecord->getSysDomainBaseUrl(),
                                    $isCrawlingProtocolHttps
                                ),
                                'realurl' => $configurationRecord->getRealUrl(),
                                'cHash' => $configurationRecord->getCHash(),
                                'userGroups' => $configurationRecord->getFeGroups(),
                                'exclude' => $configurationRecord->getExclude(),
                                'rootTemplatePid' => (int)$configurationRecord->getRootTemplatePid(),
                                'key' => $key,
                            ];

                            // add trailing slash if not present
                            if (!empty($subCfg['baseUrl']) && substr($subCfg['baseUrl'], -1) != '/') {
                                $subCfg['baseUrl'] .= '/';
                            }
                            if (!in_array($id, $this->expandExcludeString($subCfg['exclude']))) {
                                $res[$key] = [];
                                $res[$key]['subCfg'] = $subCfg;
                                $res[$key]['paramParsed'] = $this->parseParams($configurationRecord->getConfiguration());
                                $res[$key]['paramExpanded'] = $this->expandParameters($res[$key]['paramParsed'], $id);
                                $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $id]);
                                $res[$key]['origin'] = 'tx_crawler_configuration_' . $configurationRecord->getUid();
                            }
                        }
                    }
                }
            }
        }

        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['processUrls'])) {
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['processUrls'] as $func) {
                $params = [
                    'res' => &$res,
                ];
                GeneralUtility::callUserFunction($func, $params, $this);
            }
        }

        return $res;
    }

    /**
     * Checks whether a domain record exists and, if so, returns the base URL built from that record.
     * Otherwise the given baseUrl string is returned unchanged.
     *
     * @param string $baseUrl
     * @param integer $sysDomainUid
     * @param bool $ssl
     * @return string
     */
    protected function getBaseUrlForConfigurationRecord($baseUrl, $sysDomainUid, $ssl = false)
    {
        $sysDomainUid = intval($sysDomainUid);
        $urlScheme = ($ssl === false) ? 'http' : 'https';

        if ($sysDomainUid > 0) {
            $res = $this->db->exec_SELECTquery(
                '*',
                'sys_domain',
                'uid = ' . $sysDomainUid .
                BackendUtility::BEenableFields('sys_domain') .
                BackendUtility::deleteClause('sys_domain')
            );
            $row = $this->db->sql_fetch_assoc($res);
            if ($row['domainName'] != '') {
                return $urlScheme . '://' . $row['domainName'];
            }
        }
        return $baseUrl;
    }

    /**
     * Collects the names of all crawler configurations that apply to a page branch:
     * the paramSets from the page TSconfig of the root page plus the tx_crawler_configuration
     * records stored on the rootline or on pages within the subtree.
     *
     * @param integer $rootid Page uid of the branch root
     * @param integer $depth Number of tree levels to traverse below the root
     * @return array List of configuration names
     */
    public function getConfigurationsForBranch($rootid, $depth)
    {
        $configurationsForBranch = [];

        $pageTSconfig = $this->getPageTSconfigForId($rootid);
        if (is_array($pageTSconfig) && is_array($pageTSconfig['tx_crawler.']['crawlerCfg.']) && is_array($pageTSconfig['tx_crawler.']['crawlerCfg.']['paramSets.'])) {
            $sets = $pageTSconfig['tx_crawler.']['crawlerCfg.']['paramSets.'];
            if (is_array($sets)) {
                foreach ($sets as $key => $value) {
                    if (!is_array($value)) {
                        continue;
                    }
                    $configurationsForBranch[] = substr($key, -1) == '.' ? substr($key, 0, -1) : $key;
                }
            }
        }
        $pids = [];
        $rootLine = BackendUtility::BEgetRootLine($rootid);
        foreach ($rootLine as $node) {
            $pids[] = $node['uid'];
        }
        /* @var PageTreeView $tree */
        $tree = GeneralUtility::makeInstance(PageTreeView::class);
        $perms_clause = $GLOBALS['BE_USER']->getPagePermsClause(1);
        $tree->init('AND ' . $perms_clause);
        $tree->getTree($rootid, $depth, '');
        foreach ($tree->tree as $node) {
            $pids[] = $node['row']['uid'];
        }

        $res = $this->db->exec_SELECTquery(
            '*',
            'tx_crawler_configuration',
            'pid IN (' . implode(',', $pids) . ') ' .
            BackendUtility::BEenableFields('tx_crawler_configuration') .
            BackendUtility::deleteClause('tx_crawler_configuration') . ' ' .
            BackendUtility::versioningPlaceholderClause('tx_crawler_configuration') . ' '
        );

        while ($row = $this->db->sql_fetch_assoc($res)) {
            $configurationsForBranch[] = $row['name'];
        }
        $this->db->sql_free_result($res);
        return $configurationsForBranch;
    }

    /**
     * Check if a user has access to an item
     * (e.g. get the group list of the current logged in user from $GLOBALS['TSFE']->gr_list)
     *
     * @see \TYPO3\CMS\Frontend\Page\PageRepository::getMultipleGroupsWhereClause()
     * @param  string $groupList    Comma-separated list of (fe_)group UIDs from a user
     * @param  string $accessList   Comma-separated list of (fe_)group UIDs of the item to access
     * @return bool                 TRUE if at least one of the user's group UIDs is in the access list or the access list is empty
     */
    public function hasGroupAccess($groupList, $accessList)
    {
        if (empty($accessList)) {
            return true;
        }
        foreach (GeneralUtility::intExplode(',', $groupList) as $groupUid) {
            if (GeneralUtility::inList($accessList, $groupUid)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Parse GET vars of input Query into array with key=>value pairs
     *
     * @param string $inputQuery Input query string
     * @return array
     */
    public function parseParams($inputQuery)
    {
        // Extract all GET parameters into an ARRAY:
        $paramKeyValues = [];
        $GETparams = explode('&', $inputQuery);

        foreach ($GETparams as $paramAndValue) {
            // Pad to two elements so that a parameter without "=" does not trigger an undefined offset notice
            list($p, $v) = array_pad(explode('=', $paramAndValue, 2), 2, '');
            if (strlen($p)) {
                $paramKeyValues[rawurldecode($p)] = rawurldecode($v);
            }
        }

        return $paramKeyValues;
    }
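
    // Illustrative example for parseParams() (the parameter names are made up for this sketch):
    //   parseParams('L=[0|1]&page=[1-3]&type=98')
    // yields, following the logic above,
    //   ['L' => '[0|1]', 'page' => '[1-3]', 'type' => '98']
    // The bracket expressions stay plain strings at this point; expanding them is left to expandParameters().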

    /**
     * Will expand the parameters configuration to individual values. This follows a certain syntax of the value of each parameter.
     * Syntax of values:
     * - Basically: If the value is wrapped in [...] it will be expanded according to the following syntax, otherwise the value is taken literally
     * - Configuration is split by "|" and the parts are processed individually and finally added together
     * - For each configuration part:
     *         - "[int]-[int]" = Integer range, will be expanded to all values in between, values included, starting from low to high (max. 1000). Example "1-34" or "-40--30"
     *         - "_TABLE:[TCA table name];[_PID:[optional page id, default is current page]];[_ENABLELANG:1]" = Look up of table records from PID, filtering out deleted records. Example "_TABLE:tt_content; _PID:123"
     *           _ENABLELANG:1 picks only original records without their language overlays
     *         - Default: Literal value
     *
     * @param array $paramArray Array with key (GET var name) and values (value of GET var which is configuration for expansion)
     * @param integer $pid Current page ID
     * @return array
     */
    public function expandParameters($paramArray, $pid)
    {
        global $TCA;

        // Traverse parameter names:
        foreach ($paramArray as $p => $v) {
            $v = trim($v);

            // If value is encapsulated in square brackets it means there are some ranges of values to find, otherwise the value is literal
            if (substr($v, 0, 1) === '[' && substr($v, -1) === ']') {
                // So, find the value inside brackets and reset the paramArray value as an array.
                $v = substr($v, 1, -1);
                $paramArray[$p] = [];

                // Explode parts and traverse them:
                $parts = explode('|', $v);
                foreach ($parts as $pV) {

                    // Look for integer range (e.g. 1-34 or -40--30, which reads minus 40 to minus 30)
                    if (preg_match('/^(-?[0-9]+)\s*-\s*(-?[0-9]+)$/', trim($pV), $reg)) {

                        // Swap if first is larger than last:
                        if ($reg[1] > $reg[2]) {
                            $temp = $reg[2];
                            $reg[2] = $reg[1];
                            $reg[1] = $temp;
                        }

                        // Traverse range, add values:
                        $runAwayBrake = 1000; // Limit to size of range!
                        for ($a = $reg[1]; $a <= $reg[2]; $a++) {
                            $paramArray[$p][] = $a;
                            $runAwayBrake--;
                            if ($runAwayBrake <= 0) {
                                break;
                            }
                        }
                    } elseif (substr(trim($pV), 0, 7) == '_TABLE:') {

                        // Parse parameters:
                        $subparts = GeneralUtility::trimExplode(';', $pV);
                        $subpartParams = [];
                        foreach ($subparts as $spV) {
                            // Pad with NULL so that a key without ":value" does not trigger an undefined offset notice
                            list($pKey, $pVal) = array_pad(GeneralUtility::trimExplode(':', $spV), 2, null);
                            $subpartParams[$pKey] = $pVal;
                        }

                        // Table exists:
                        if (isset($TCA[$subpartParams['_TABLE']])) {
                            $lookUpPid = isset($subpartParams['_PID']) ? intval($subpartParams['_PID']) : $pid;
                            $pidField = isset($subpartParams['_PIDFIELD']) ? trim($subpartParams['_PIDFIELD']) : 'pid';
                            $where = isset($subpartParams['_WHERE']) ? $subpartParams['_WHERE'] : '';
                            $addTable = isset($subpartParams['_ADDTABLE']) ? $subpartParams['_ADDTABLE'] : '';

                            // !empty() avoids an undefined index notice when the optional key is not given
                            $fieldName = !empty($subpartParams['_FIELD']) ? $subpartParams['_FIELD'] : 'uid';
                            if ($fieldName === 'uid' || $TCA[$subpartParams['_TABLE']]['columns'][$fieldName]) {
                                $andWhereLanguage = '';
                                $transOrigPointerField = $TCA[$subpartParams['_TABLE']]['ctrl']['transOrigPointerField'];

                                if (!empty($subpartParams['_ENABLELANG']) && $transOrigPointerField) {
                                    $andWhereLanguage = ' AND ' . $this->db->quoteStr($transOrigPointerField, $subpartParams['_TABLE']) . ' <= 0 ';
                                }

                                $where = $this->db->quoteStr($pidField, $subpartParams['_TABLE']) . '=' . intval($lookUpPid) . ' ' .
                                    $andWhereLanguage . $where;

                                $rows = $this->db->exec_SELECTgetRows(
                                    $fieldName,
                                    $subpartParams['_TABLE'] . $addTable,
                                    $where . BackendUtility::deleteClause($subpartParams['_TABLE']),
                                    '',
                                    '',
                                    '',
                                    $fieldName
                                );

                                if (is_array($rows)) {
                                    $paramArray[$p] = array_merge($paramArray[$p], array_keys($rows));
                                }
                            }
                        }
                    } else { // Just add value:
                        $paramArray[$p][] = $pV;
                    }
                    // Hook for processing own expandParameters place holder
                    if (is_array($GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['crawler/class.tx_crawler_lib.php']['expandParameters'])) {
                        $_params = [
                            'pObj' => &$this,
                            'paramArray' => &$paramArray,
                            'currentKey' => $p,
                            'currentValue' => $pV,
                            'pid' => $pid,
                        ];
                        foreach ($GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['crawler/class.tx_crawler_lib.php']['expandParameters'] as $key => $_funcRef) {
                            GeneralUtility::callUserFunction($_funcRef, $_params, $this);
                        }
                    }
                }

                // Make unique set of values and sort array by key:
                $paramArray[$p] = array_unique($paramArray[$p]);
                ksort($paramArray);
            } else {
                // Set the literal value as only value in array:
                $paramArray[$p] = [$v];
            }
        }

        return $paramArray;
    }
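
    // Illustrative example for expandParameters() (made-up parameter names, no hook registered, no _TABLE lookup):
    //   expandParameters(['L' => '[0|1]', 'page' => '[1-3]', 'type' => '98'], 5)
    // should give one value list per parameter:
    //   'L'    => the literal values 0 and 1 (split at "|"),
    //   'page' => the values 1, 2 and 3 (integer range, both ends included),
    //   'type' => the single literal value 98 (not wrapped in [...], so taken as is).
    // The page ID (5 here) is only used as default PID for "_TABLE:" lookups, so it has no effect in this example.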

    /**
     * Compiling URLs from parameter array (output of expandParameters())
     * The number of URLs is the product of the numbers of values given for each key
     *
     * @param array $paramArray Output of expandParameters(): Array with keys (GET var names) and for each an array of values
     * @param array $urls URLs accumulated in this array (for recursion)
     * @return array
     */
    public function compileUrls($paramArray, $urls = [])
    {
        if (count($paramArray) && is_array($urls)) {
            // shift first off stack:
            reset($paramArray);
            $varName = key($paramArray);
            $valueSet = array_shift($paramArray);

            // Traverse value set:
            $newUrls = [];
            foreach ($urls as $url) {
                foreach ($valueSet as $val) {
                    $newUrls[] = $url . (strcmp($val, '') ? '&' . rawurlencode($varName) . '=' . rawurlencode($val) : '');

                    if (count($newUrls) > MathUtility::forceIntegerInRange($this->extensionSettings['maxCompileUrls'], 1, 1000000000, 10000)) {
                        break;
                    }
                }
            }
            $urls = $newUrls;
            $urls = $this->compileUrls($paramArray, $urls);
        }

        return $urls;
    }
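
    // Illustrative example for compileUrls() (made-up parameter names; assumes the maxCompileUrls limit is not reached):
    //   compileUrls(['L' => ['0', '1'], 'page' => ['1', '2']], ['?id=5'])
    // combines every value of "L" with every value of "page", i.e. 2 * 2 = 4 URLs:
    //   '?id=5&L=0&page=1', '?id=5&L=0&page=2', '?id=5&L=1&page=1', '?id=5&L=1&page=2'
    // An empty string value adds nothing to the URL, so compileUrls(['L' => ['']], ['?id=5']) returns ['?id=5'].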

    /************************************
     *
     * Crawler log
     *
     ************************************/

    /**
     * Return array of records from crawler queue for input page ID
     *
     * @param integer $id Page ID for which to look up log entries.
     * @param string $filter Filter: "all" => all entries, "pending" => all that is not yet run, "finished" => all complete ones
     * @param boolean $doFlush If TRUE, the selected entries are DELETED(!) instead of returned!
     * @param boolean $doFullFlush If TRUE (and $doFlush is set), the entries of all pages that match the filter are deleted, not only those of the given page
     * @param integer $itemsPerPage Limit the number of entries per page, default is 10
     * @return array
     */
    public function getLogEntriesForPageId($id, $filter = '', $doFlush = false, $doFullFlush = false, $itemsPerPage = 10)
    {
        switch ($filter) {
            case 'pending':
                $addWhere = ' AND exec_time=0';
                break;
            case 'finished':
                $addWhere = ' AND exec_time>0';
                break;
            default:
                $addWhere = '';
                break;
        }

        // FIXME: Write unit test that ensures that the right records are deleted.
        if ($doFlush) {
            $this->flushQueue(($doFullFlush ? '1=1' : ('page_id=' . intval($id))) . $addWhere);
            return [];
        } else {
            return $this->db->exec_SELECTgetRows(
                '*',
                'tx_crawler_queue',
                'page_id=' . intval($id) . $addWhere,
                '',
                'scheduled DESC',
                (intval($itemsPerPage) > 0 ? intval($itemsPerPage) : '')
            );
        }
    }

    /**
     * Return array of records from crawler queue for input set ID
     *
     * @param integer $set_id Set ID for which to look up log entries.
     * @param string $filter Filter: "all" => all entries, "pending" => all that is not yet run, "finished" => all complete ones
     * @param boolean $doFlush If TRUE, the selected entries are DELETED(!) instead of returned!
     * @param boolean $doFullFlush If TRUE (and $doFlush is set), the whole queue is deleted, regardless of set ID and filter
     * @param integer $itemsPerPage Limit the number of entries per page, default is 10
     * @return array
     */
    public function getLogEntriesForSetId($set_id, $filter = '', $doFlush = false, $doFullFlush = false, $itemsPerPage = 10)
    {
        // FIXME: Write Unit tests for Filters
        switch ($filter) {
            case 'pending':
                $addWhere = ' AND exec_time=0';
                break;
            case 'finished':
                $addWhere = ' AND exec_time>0';
                break;
            default:
                $addWhere = '';
                break;
        }
        // FIXME: Write unit test that ensures that the right records are deleted.
        if ($doFlush) {
            $this->flushQueue($doFullFlush ? '' : ('set_id=' . intval($set_id) . $addWhere));
            return [];
        } else {
            return $this->db->exec_SELECTgetRows(
                '*',
                'tx_crawler_queue',
                'set_id=' . intval($set_id) . $addWhere,
                '',
                'scheduled DESC',
                (intval($itemsPerPage) > 0 ? intval($itemsPerPage) : '')
            );
        }
    }

    /**
     * Removes queue entries
     *
     * @param string $where SQL related filter for the entries which should be removed
     * @return void
     */
    protected function flushQueue($where = '')
    {
        $realWhere = strlen($where) > 0 ? $where : '1=1';

        if (EventDispatcher::getInstance()->hasObserver('queueEntryFlush') || SignalSlotUtility::hasSignal(__CLASS__, SignalSlotUtility::SIGNAL_QUEUE_ENTRY_FLUSH)) {
            $groups = $this->db->exec_SELECTgetRows('DISTINCT set_id', 'tx_crawler_queue', $realWhere);
            if (is_array($groups)) {
                foreach ($groups as $group) {

                    // The event dispatcher is deprecated since crawler v6.4.0, will be removed in crawler v7.0.0.
                    // Please use the Signal instead.
                    if (EventDispatcher::getInstance()->hasObserver('queueEntryFlush')) {
                        EventDispatcher::getInstance()->post(
                            'queueEntryFlush',
                            $group['set_id'],
                            $this->db->exec_SELECTgetRows('uid, set_id', 'tx_crawler_queue', $realWhere . ' AND set_id="' . $group['set_id'] . '"')
                        );
                    }

                    if (SignalSlotUtility::hasSignal(__CLASS__, SignalSlotUtility::SIGNAL_QUEUE_ENTRY_FLUSH)) {
                        $signalInputArray = $this->db->exec_SELECTgetRows('uid, set_id', 'tx_crawler_queue', $realWhere . ' AND set_id="' . $group['set_id'] . '"');
                        // exec_SELECTgetRows() returns NULL on SQL errors, but emitSignal() expects an array
                        if (is_array($signalInputArray)) {
                            SignalSlotUtility::emitSignal(
                                __CLASS__,
                                SignalSlotUtility::SIGNAL_QUEUE_ENTRY_FLUSH,
                                $signalInputArray
                            );
                        }
                    }
                }
            }
        }

        $GLOBALS['TYPO3_DB']->exec_DELETEquery('tx_crawler_queue', $realWhere);
    }

    /**
     * Adding call back entries to log (called from hooks typically, see indexed search class "class.crawler.php")
     *
     * @param integer $setId Set ID
     * @param array $params Parameters to pass to call back function
     * @param string $callBack Call back object reference, eg. 'EXT:indexed_search/class.crawler.php:&tx_indexedsearch_crawler'
     * @param integer $page_id Page ID to attach it to
     * @param integer $schedule Time at which to activate
     * @return void
     */
    public function addQueueEntry_callBack($setId, $params, $callBack, $page_id = 0, $schedule = 0)
    {
        if (!is_array($params)) {
            $params = [];
        }
        $params['_CALLBACKOBJ'] = $callBack;

        // Compile value array:
        $fieldArray = [
            'page_id' => intval($page_id),
            'parameters' => serialize($params),
            'scheduled' => intval($schedule) ? intval($schedule) : $this->getCurrentTime(),
            'exec_time' => 0,
            'set_id' => intval($setId),
            'result_data' => '',
        ];

        $this->db->exec_INSERTquery('tx_crawler_queue', $fieldArray);
    }

    /************************************
     *
     * URL setting
     *
     ************************************/

    /**
     * Setting a URL for crawling:
     *
     * @param integer $id Page ID
     * @param string $url Complete URL
     * @param array $subCfg Sub configuration array (from TS config)
     * @param integer $tstamp Scheduled-time
     * @param string $configurationHash (optional) configuration hash
     * @param bool $skipInnerDuplicationCheck (optional) skip inner duplication check
     * @return bool
     */
    public function addUrl(
        $id,
        $url,
        array $subCfg,
        $tstamp,
        $configurationHash = '',
        $skipInnerDuplicationCheck = false
    ) {
        $urlAdded = false;
        $rows = [];

        // Creating parameters:
        $parameters = [
            'url' => $url,
        ];

        // fe user group simulation:
        $uGs = implode(',', array_unique(GeneralUtility::intExplode(',', $subCfg['userGroups'], true)));
        if ($uGs) {
            $parameters['feUserGroupList'] = $uGs;
        }

        // Setting processing instructions
        $parameters['procInstructions'] = GeneralUtility::trimExplode(',', $subCfg['procInstrFilter']);
        if (is_array($subCfg['procInstrParams.'])) {
            $parameters['procInstrParams'] = $subCfg['procInstrParams.'];
        }

        // Possible TypoScript Template Parents
        $parameters['rootTemplatePid'] = $subCfg['rootTemplatePid'];

        // Compile value array:
        $parameters_serialized = serialize($parameters);
        $fieldArray = [
            'page_id' => intval($id),
            'parameters' => $parameters_serialized,
            'parameters_hash' => GeneralUtility::shortMD5($parameters_serialized),
            'configuration_hash' => $configurationHash,
            'scheduled' => $tstamp,
            'exec_time' => 0,
            'set_id' => intval($this->setID),
            'result_data' => '',
            'configuration' => $subCfg['key'],
        ];

        if ($this->registerQueueEntriesInternallyOnly) {
            //the entries will only be registered and not stored to the database
            $this->queueEntries[] = $fieldArray;
        } else {
            if (!$skipInnerDuplicationCheck) {
                // check if there is already an equal entry
                $rows = $this->getDuplicateRowsIfExist($tstamp, $fieldArray);
            }

            if (count($rows) == 0) {
                $this->db->exec_INSERTquery('tx_crawler_queue', $fieldArray);
                $uid = $this->db->sql_insert_id();
                $rows[] = $uid;
                $urlAdded = true;

                // The event dispatcher is deprecated since crawler v6.4.0, will be removed in crawler v7.0.0.
                // Please use the Signal instead.
                EventDispatcher::getInstance()->post('urlAddedToQueue', $this->setID, ['uid' => $uid, 'fieldArray' => $fieldArray]);

                $signalPayload = ['uid' => $uid, 'fieldArray' => $fieldArray];
                SignalSlotUtility::emitSignal(
                    __CLASS__,
                    SignalSlotUtility::SIGNAL_URL_ADDED_TO_QUEUE,
                    $signalPayload
                );
            } else {
                // The event dispatcher is deprecated since crawler v6.4.0, will be removed in crawler v7.0.0.
                // Please use the Signal instead.
                EventDispatcher::getInstance()->post('duplicateUrlInQueue', $this->setID, ['rows' => $rows, 'fieldArray' => $fieldArray]);

                $signalPayload = ['rows' => $rows, 'fieldArray' => $fieldArray];
                SignalSlotUtility::emitSignal(
                    __CLASS__,
                    SignalSlotUtility::SIGNAL_DUPLICATE_URL_IN_QUEUE,
                    $signalPayload
                );
            }
        }

        return $urlAdded;
    }

    /**
     * Determines the duplicates of a queue entry, i.e. entries with the same parameters and a matching timestamp.
     * If the timestamp is in the past, it checks whether there is any unprocessed queue entry in the past.
     * If the timestamp is in the future, it checks whether a queued entry has exactly the same timestamp.
     *
     * @param int $tstamp
     * @param array $fieldArray
     *
     * @return array
     */
    protected function getDuplicateRowsIfExist($tstamp, $fieldArray)
    {
        $rows = [];

        $currentTime = $this->getCurrentTime();

        //if this entry is scheduled with "now"
        if ($tstamp <= $currentTime) {
            if ($this->extensionSettings['enableTimeslot']) {
                $timeBegin = $currentTime - 100;
                $timeEnd = $currentTime + 100;
                $where = ' ((scheduled BETWEEN ' . $timeBegin . ' AND ' . $timeEnd . ' ) OR scheduled <= ' . $currentTime . ') ';
            } else {
                $where = 'scheduled <= ' . $currentTime;
            }
        } elseif ($tstamp > $currentTime) {
            //entries with a timestamp in the future need to have the same schedule time
            $where = 'scheduled = ' . $tstamp;
        }

        if (!empty($where)) {
            $result = $this->db->exec_SELECTgetRows(
                'qid',
                'tx_crawler_queue',
                $where .
                ' AND NOT exec_time' .
                ' AND NOT process_id ' .
                ' AND page_id=' . intval($fieldArray['page_id']) .
                ' AND parameters_hash = ' . $this->db->fullQuoteStr($fieldArray['parameters_hash'], 'tx_crawler_queue')
            );

            if (is_array($result)) {
                foreach ($result as $value) {
                    $rows[] = $value['qid'];
                }
            }
        }

        return $rows;
    }

    /**
     * Returns the current system time
     *
     * @return int
     */
    public function getCurrentTime()
    {
        return time();
    }
1326
1327
    /************************************
1328
     *
1329
     * URL reading
1330
     *
1331
     ************************************/
1332
1333
    /**
1334
     * Read URL for single queue entry
1335
     *
1336
     * @param integer $queueId
1337
     * @param boolean $force If set, will process even if exec_time has been set!
1338
     * @return integer
1339
     */
1340
    public function readUrl($queueId, $force = false)
1341
    {
1342
        $ret = 0;
1343
        if ($this->debugMode) {
1344
            GeneralUtility::devlog('crawler-readurl start ' . microtime(true), __FUNCTION__);
1345
        }
1346
        // Get entry:
1347
        list($queueRec) = $this->db->exec_SELECTgetRows(
1348
            '*',
1349
            'tx_crawler_queue',
1350
            'qid=' . intval($queueId) . ($force ? '' : ' AND exec_time=0 AND process_scheduled > 0')
1351
        );
1352
1353
        if (!is_array($queueRec)) {
1354
            return;
1355
        }
1356
1357
        $parameters = unserialize($queueRec['parameters']);
1358
        if ($parameters['rootTemplatePid']) {
1359
            $this->initTSFE((int)$parameters['rootTemplatePid']);
1360
        } else {
1361
            GeneralUtility::sysLog(
1362
                'Page with (' . $queueRec['page_id'] . ') could not be crawled, please check your crawler configuration. Perhaps no Root Template Pid is set',
1363
                'crawler',
1364
                GeneralUtility::SYSLOG_SEVERITY_WARNING
1365
            );
1366
        }
1367
1368
        $signalPayload = [$queueId, $queueRec];
1369
        SignalSlotUtility::emitSignal(
1370
            __CLASS__,
1371
            SignalSlotUtility::SIGNAL_QUEUEITEM_PREPROCESS,
1372
            $signalPayload
1373
        );
1374
1375
        // Set exec_time to lock record:
1376
        $field_array = ['exec_time' => $this->getCurrentTime()];
1377
1378
        if (isset($this->processID)) {
1379
            // if multiprocessing is used, we need to store the id of the process which has handled this entry
1380
            $field_array['process_id_completed'] = $this->processID;
1381
        }
1382
        $this->db->exec_UPDATEquery('tx_crawler_queue', 'qid=' . intval($queueId), $field_array);
1383
1384
        $result = $this->readUrl_exec($queueRec);
1385
        $resultData = unserialize($result['content']);
1386
1387
        //atm there's no need to point to specific pollable extensions
1388
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pollSuccess'])) {
1389
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pollSuccess'] as $pollable) {
1390
                // only check the success value if the instruction is running
1391
                // it is important to give the pollSuccess key the same name as the procInstructions key
1392
                if (is_array($resultData['parameters']['procInstructions']) && in_array(
1393
                    $pollable,
1394
                    $resultData['parameters']['procInstructions']
1395
                )
1396
                ) {
1397
                    if (!empty($resultData['success'][$pollable]) && $resultData['success'][$pollable]) {
1398
                        $ret |= self::CLI_STATUS_POLLABLE_PROCESSED;
1399
                    }
1400
                }
1401
            }
1402
        }
1403
1404
        // Set result in log which also denotes the end of the processing of this entry.
1405
        $field_array = ['result_data' => serialize($result)];
1406
1407
        $signalPayload = [$queueId, $field_array];
1408
        SignalSlotUtility::emitSignal(
1409
            __CLASS__,
1410
            SignalSlotUtility::SIGNAL_QUEUEITEM_POSTPROCESS,
1411
            $signalPayload
1412
        );
1413
1414
        $this->db->exec_UPDATEquery('tx_crawler_queue', 'qid=' . intval($queueId), $field_array);
1415
1416
        if ($this->debugMode) {
1417
            GeneralUtility::devlog('crawler-readurl stop ' . microtime(true), __FUNCTION__);
1418
        }
1419
1420
        return $ret;
1421
    }
1422
1423
    /**
1424
     * Read URL for not-yet-inserted log-entry
1425
     *
1426
     * @param array $field_array Queue field array,
1427
     *
1428
     * @return string
1429
     */
1430
    public function readUrlFromArray($field_array)
1431
    {
1432
1433
            // Set exec_time to lock record:
1434
        $field_array['exec_time'] = $this->getCurrentTime();
1435
        $this->db->exec_INSERTquery('tx_crawler_queue', $field_array);
1436
        $queueId = $field_array['qid'] = $this->db->sql_insert_id();
1437
1438
        $result = $this->readUrl_exec($field_array);
1439
1440
        // Set result in log which also denotes the end of the processing of this entry.
1441
        $field_array = ['result_data' => serialize($result)];
1442
1443
        $signalPayload = [$queueId, $field_array];
1444
        SignalSlotUtility::emitSignal(
1445
            __CLASS__,
1446
            SignalSlotUtility::SIGNAL_QUEUEITEM_POSTPROCESS,
1447
            $signalPayload
1448
        );
1449
1450
        $this->db->exec_UPDATEquery('tx_crawler_queue', 'qid=' . intval($queueId), $field_array);
1451
1452
        return $result;
1453
    }
1454
1455
    /**
1456
     * Read URL for a queue record
1457
     *
1458
     * @param array $queueRec Queue record
1459
     * @return string
1460
     */
1461
    public function readUrl_exec($queueRec)
1462
    {
1463
        // Decode parameters:
1464
        $parameters = unserialize($queueRec['parameters']);
1465
        $result = 'ERROR';
1466
        if (is_array($parameters)) {
1467
            if ($parameters['_CALLBACKOBJ']) { // Calling object:
1468
                $objRef = $parameters['_CALLBACKOBJ'];
1469
                $callBackObj = &GeneralUtility::getUserObj($objRef);
1470
                if (is_object($callBackObj)) {
1471
                    unset($parameters['_CALLBACKOBJ']);
1472
                    $result = ['content' => serialize($callBackObj->crawler_execute($parameters, $this))];
1473
                } else {
1474
                    $result = ['content' => 'No object: ' . $objRef];
1475
                }
1476
            } else { // Regular FE request:
1477
1478
                // Prepare:
1479
                $crawlerId = $queueRec['qid'] . ':' . md5($queueRec['qid'] . '|' . $queueRec['set_id'] . '|' . $GLOBALS['TYPO3_CONF_VARS']['SYS']['encryptionKey']);
1480
1481
                // Get result:
1482
                $result = $this->requestUrl($parameters['url'], $crawlerId);
1483
1484
                // The event dispatcher is deprecated since crawler v6.4.0, will be removed in crawler v7.0.0.
1485
                // Please use the Signal instead.
1486
                EventDispatcher::getInstance()->post('urlCrawled', $queueRec['set_id'], ['url' => $parameters['url'], 'result' => $result]);
1487
1488
                $signalPayload = ['url' => $parameters['url'], 'result' => $result];
1489
                SignalSlotUtility::emitSignal(
1490
                    __CLASS__,
1491
                    SignalSlotUtility::SIGNAL_URL_CRAWLED,
1492
                    $signalPayload
1493
                );
1494
            }
1495
        }
1496
1497
        return $result;
1498
    }
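The crawler ID sent with each regular frontend request is the queue ID plus an MD5 hash over the queue ID, the set ID and the encryption key, and fe_init() later recomputes that hash to authenticate the request. The following is a minimal sketch of that round trip, not the extension's own API: the function names buildCrawlerId() and verifyCrawlerId() are hypothetical, the set ID is passed in directly instead of being looked up in tx_crawler_queue, and hash_equals() is swapped in for the plain === comparison used in the listing.

```php
<?php
// Hypothetical helpers; only the "qid:md5(qid|set_id|key)" layout comes from the listing.

function buildCrawlerId(int $qid, int $setId, string $encryptionKey): string
{
    // Same layout as in readUrl_exec(): "<qid>:<md5 of qid|set_id|encryptionKey>"
    return $qid . ':' . md5($qid . '|' . $setId . '|' . $encryptionKey);
}

function verifyCrawlerId(string $crawlerId, int $setId, string $encryptionKey): bool
{
    $parts = explode(':', $crawlerId, 2);
    if (count($parts) !== 2) {
        return false;
    }
    [$qid, $hash] = $parts;

    // Recompute the hash from the values the server already knows.
    $expected = md5((int)$qid . '|' . $setId . '|' . $encryptionKey);

    // Assumption: constant-time comparison instead of the === used in fe_init().
    return hash_equals($expected, $hash);
}

$crawlerId = buildCrawlerId(42, 7, 'example-encryption-key');
var_dump(verifyCrawlerId($crawlerId, 7, 'example-encryption-key'));
var_dump(verifyCrawlerId($crawlerId, 8, 'example-encryption-key'));
```

Because the hash depends on the encryption key, a client that does not know the key cannot forge a valid X-T3crawler header for an existing queue entry.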
1499
1500
    /**
1501
     * Gets the content of a URL.
1502
     *
1503
     * @param string $originalUrl URL to read
1504
     * @param string $crawlerId Crawler ID string (qid + hash to verify)
1505
     * @param integer $timeout Timeout time
1506
     * @param integer $recursion Recursion limiter for 302 redirects
1507
     * @return array|bool Result array, or false if the URL could not be requested
1508
     */
1509 2
    public function requestUrl($originalUrl, $crawlerId, $timeout = 2, $recursion = 10)
1510
    {
1511 2
        if (!$recursion) {
1512
            return false;
1513
        }
1514
1515
        // Parse URL, checking for scheme:
1516 2
        $url = parse_url($originalUrl);
1517
1518 2
        if ($url === false) {
1519
            if (TYPO3_DLOG) {
1520
                GeneralUtility::devLog(sprintf('Could not parse_url() for string "%s"', $originalUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1521
            }
1522
            return false;
1523
        }
1524
1525 2
        if (!in_array($url['scheme'], ['','http','https'])) {
1526
            if (TYPO3_DLOG) {
1527
                GeneralUtility::devLog(sprintf('Scheme does not match for url "%s"', $originalUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1528
            }
1529
            return false;
1530
        }
1531
1532
        // direct request
1533 2
        if ($this->extensionSettings['makeDirectRequests']) {
1534 2
            $result = $this->sendDirectRequest($originalUrl, $crawlerId);
1535 2
            return $result;
1536
        }
1537
1538
        $reqHeaders = $this->buildRequestHeaderArray($url, $crawlerId);
1539
1540
        // thanks to Pierrick Caillon for adding proxy support
1541
        $rurl = $url;
1542
1543
        if ($this->extensionSettings['curlUse'] && $this->extensionSettings['curlProxyServer']) {
1544
            $rurl = parse_url($this->extensionSettings['curlProxyServer']);
1545
            $url['path'] = $url['scheme'] . '://' . $url['host'] . ($url['port'] > 0 ? ':' . $url['port'] : '') . $url['path'];
1546
            $reqHeaders = $this->buildRequestHeaderArray($url, $crawlerId);
1547
        }
1548
1549
        $host = $rurl['host'];
1550
1551
        if ($url['scheme'] == 'https') {
1552
            $host = 'ssl://' . $host;
1553
            $port = ($rurl['port'] > 0) ? $rurl['port'] : 443;
1554
        } else {
1555
            $port = ($rurl['port'] > 0) ? $rurl['port'] : 80;
1556
        }
1557
1558
        $startTime = microtime(true);
1559
        $fp = fsockopen($host, $port, $errno, $errstr, $timeout);
1560
1561
        if (!$fp) {
1562
            if (TYPO3_DLOG) {
1563
                GeneralUtility::devLog(sprintf('Error while opening "%s"', $originalUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1564
            }
1565
            return false;
1566
        } else {
1567
            // Request message:
1568
            $msg = implode("\r\n", $reqHeaders) . "\r\n\r\n";
1569
            fputs($fp, $msg);
1570
1571
            // Read response:
1572
            $d = $this->getHttpResponseFromStream($fp);
1573
            fclose($fp);
1574
1575
            $time = microtime(true) - $startTime;
1576
            $this->log($originalUrl . ' ' . $time);
1577
1578
            // Implode content and headers:
1579
            $result = [
1580
                'request' => $msg,
1581
                'headers' => implode('', $d['headers']),
1582
                'content' => implode('', (array)$d['content']),
1583
            ];
1584
1585
            if (($this->extensionSettings['follow30x']) && ($newUrl = $this->getRequestUrlFrom302Header($d['headers'], $url['user'], $url['pass']))) {
1586
                // Follow the redirect with a single request, passing the timeout through and decrementing the recursion limit:
1587
                $newRequestUrl = $this->requestUrl($newUrl, $crawlerId, $timeout, --$recursion);
1588
1589
                if (is_array($newRequestUrl)) {
1590
                    $result = array_merge(['parentRequest' => $result], $newRequestUrl);
1591
                } else {
1592
                    if (TYPO3_DLOG) {
1593
                        GeneralUtility::devLog(sprintf('Error while opening "%s"', $newUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1594
                    }
1595
                    return false;
1596
                }
1597
            }
1598
1599
            return $result;
1600
        }
1601
    }
1602
1603
    /**
1604
     * Gets the base path of the website frontend.
1605
     * (e.g. if you call http://mydomain.com/cms/index.php in
1606
     * the browser the base path is "/cms/")
1607
     *
1608
     * @return string Base path of the website frontend
1609
     */
1610
    protected function getFrontendBasePath()
1611
    {
1612
        $frontendBasePath = '/';
1613
1614
        // Get the path from the extension settings:
1615
        if (isset($this->extensionSettings['frontendBasePath']) && $this->extensionSettings['frontendBasePath']) {
1616
            $frontendBasePath = $this->extensionSettings['frontendBasePath'];
1617
        // If empty, try to use config.absRefPrefix:
1618
        } elseif (isset($GLOBALS['TSFE']->absRefPrefix) && !empty($GLOBALS['TSFE']->absRefPrefix)) {
1619
            $frontendBasePath = $GLOBALS['TSFE']->absRefPrefix;
1620
        // If not in CLI mode the base path can be determined from $_SERVER environment:
1621
        } elseif (!defined('TYPO3_REQUESTTYPE_CLI') || !TYPO3_REQUESTTYPE_CLI) {
1622
            $frontendBasePath = GeneralUtility::getIndpEnv('TYPO3_SITE_PATH');
1623
        }
1624
1625
        // Base path must be '/<pathSegments>/':
1626
        if ($frontendBasePath != '/') {
1627
            $frontendBasePath = '/' . ltrim($frontendBasePath, '/');
1628
            $frontendBasePath = rtrim($frontendBasePath, '/') . '/';
1629
        }
1630
1631
        return $frontendBasePath;
1632
    }
1633
1634
    /**
1635
     * Executes a shell command and returns the outputted result.
1636
     *
1637
     * @param string $command Shell command to be executed
1638
     * @return string Outputted result of the command execution
1639
     */
1640
    protected function executeShellCommand($command)
1641
    {
1642
        $result = shell_exec($command);
1643
        return $result;
1644
    }
1645
1646
    /**
1647
     * Reads HTTP response from the given stream.
1648
     *
1649
     * @param  resource $streamPointer  Pointer to connection stream.
1650
     * @return array                    Associative array with the following items:
1651
     *                                  headers <array> Response headers sent by server.
1652
     *                                  content <array> Content, with each line as an array item.
1653
     */
1654 1
    protected function getHttpResponseFromStream($streamPointer)
1655
    {
1656 1
        $response = ['headers' => [], 'content' => []];
1657
1658 1
        if (is_resource($streamPointer)) {
1659
            // read headers
1660 1
            while (($line = fgets($streamPointer, 2048)) !== false) {
1661 1
                $line = trim($line);
1662 1
                if ($line !== '') {
1663 1
                    $response['headers'][] = $line;
1664
                } else {
1665 1
                    break;
1666
                }
1667
            }
1668
1669
            // read content
1670 1
            while (($line = fgets($streamPointer, 2048)) !== false) {
1671 1
                $response['content'][] = $line;
1672
            }
1673
        }
1674
1675 1
        return $response;
1676
    }
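getHttpResponseFromStream() splits a raw HTTP response at the first empty line: everything before it becomes the header list, everything after it the content lines. The following standalone sketch shows the same split on an in-memory stream. The function name readHttpResponse() and the sample response are assumptions made for illustration, and the loops compare against false explicitly so that a line consisting only of "0" does not end reading early.

```php
<?php
// Hypothetical standalone variant of the header/body split shown above.

function readHttpResponse($stream): array
{
    $response = ['headers' => [], 'content' => []];
    if (!is_resource($stream)) {
        return $response;
    }

    // Header lines run until the first empty line.
    while (($line = fgets($stream, 2048)) !== false) {
        $line = trim($line);
        if ($line === '') {
            break;
        }
        $response['headers'][] = $line;
    }

    // Everything after the empty line is content, kept line by line.
    while (($line = fgets($stream, 2048)) !== false) {
        $response['content'][] = $line;
    }

    return $response;
}

// Usage with an in-memory stream instead of a socket:
$stream = fopen('php://memory', 'r+');
fwrite($stream, "HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\nline one\nline two\n");
rewind($stream);
$parsed = readHttpResponse($stream);
fclose($stream);

print_r($parsed['headers']);
echo implode('', $parsed['content']);
```

requestUrl() joins both arrays with implode('') afterwards, which is why the content lines keep their trailing newlines while the header lines are trimmed.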
1677
1678
    /**
1679
     * @param string $message Message to append to the configured log file
1680
     */
1681 2
    protected function log($message)
1682
    {
1683 2
        if (!empty($this->extensionSettings['logFileName'])) {
1684
            $fileResult = @file_put_contents($this->extensionSettings['logFileName'], date('Ymd His') . ' ' . $message . PHP_EOL, FILE_APPEND);
1685
            if (!$fileResult) {
1686
                GeneralUtility::devLog('File "' . $this->extensionSettings['logFileName'] . '" could not be written, please check file permissions.', 'crawler', LogLevel::INFO);
1687
            }
1688
        }
1689 2
    }
1690
1691
    /**
1692
     * Builds HTTP request headers.
1693
     *
1694
     * @param array $url
1695
     * @param string $crawlerId
1696
     *
1697
     * @return array
1698
     */
1699 6
    protected function buildRequestHeaderArray(array $url, $crawlerId)
1700
    {
1701 6
        $reqHeaders = [];
1702 6
        $reqHeaders[] = 'GET ' . $url['path'] . ($url['query'] ? '?' . $url['query'] : '') . ' HTTP/1.0';
1703 6
        $reqHeaders[] = 'Host: ' . $url['host'];
1704 6
        if (stristr($url['query'], 'ADMCMD_previewWS')) {
1705 2
            $reqHeaders[] = 'Cookie: $Version="1"; be_typo_user="1"; $Path=/';
1706
        }
1707 6
        $reqHeaders[] = 'Connection: close';
1708 6
        if ($url['user'] != '') {
1709 2
            $reqHeaders[] = 'Authorization: Basic ' . base64_encode($url['user'] . ':' . $url['pass']);
1710
        }
1711 6
        $reqHeaders[] = 'X-T3crawler: ' . $crawlerId;
1712 6
        $reqHeaders[] = 'User-Agent: TYPO3 crawler';
1713 6
        return $reqHeaders;
1714
    }
1715
1716
    /**
1717
     * Checks whether the submitted HTTP header contains a redirect location and builds a new crawler URL from it
1718
     *
1719
     * @param array $headers HTTP Header
1720
     * @param string $user HTTP Auth. User
1721
     * @param string $pass HTTP Auth. Password
1722
     * @return bool|string
1723
     */
1724 12
    protected function getRequestUrlFrom302Header($headers, $user = '', $pass = '')
1725
    {
1726 12
        $header = [];
1727 12
        if (!is_array($headers)) {
1728 1
            return false;
1729
        }
1730 11
        if (!(stristr($headers[0], '301 Moved') || stristr($headers[0], '302 Found') || stristr($headers[0], '302 Moved'))) {
1731 2
            return false;
1732
        }
1733
1734 9
        foreach ($headers as $hl) {
1735 9
            $tmp = explode(": ", $hl);
1736 9
            $header[trim($tmp[0])] = trim($tmp[1]);
1737 9
            if (trim($tmp[0]) == 'Location') {
1738 9
                break;
1739
            }
1740
        }
1741 9
        if (!array_key_exists('Location', $header)) {
1742 3
            return false;
1743
        }
1744
1745 6
        if ($user != '') {
1746 3
            if (!($tmp = parse_url($header['Location']))) {
1747 1
                return false;
1748
            }
1749 2
            $newUrl = $tmp['scheme'] . '://' . $user . ':' . $pass . '@' . $tmp['host'] . $tmp['path'];
1750 2
            if ($tmp['query'] != '') {
1751 2
                $newUrl .= '?' . $tmp['query'];
1752
            }
1753
        } else {
1754 3
            $newUrl = $header['Location'];
1755
        }
1756 5
        return $newUrl;
1757
    }
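The redirect handling above boils down to three steps: check the status line for a 301 or 302, pick up the Location header, and, if HTTP auth credentials were used, re-insert them into the target URL. The sketch below illustrates this under stated assumptions: redirectTarget() is a hypothetical function name, the header name is matched case-insensitively, and the header line is split at the first colon only, which differs slightly from the explode(": ") used in the listing.

```php
<?php
// Hypothetical standalone sketch of the 301/302 handling; not the class method itself.

function redirectTarget(array $headers, string $user = '', string $pass = '')
{
    // Only the three status texts checked in the listing count as redirects.
    $status = $headers[0] ?? '';
    if (stripos($status, '301 Moved') === false
        && stripos($status, '302 Found') === false
        && stripos($status, '302 Moved') === false
    ) {
        return false;
    }

    // Find the Location header (assumption: case-insensitive name match).
    $location = null;
    foreach ($headers as $line) {
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && strcasecmp(trim($parts[0]), 'Location') === 0) {
            $location = trim($parts[1]);
            break;
        }
    }
    if ($location === null) {
        return false;
    }
    if ($user === '') {
        return $location;
    }

    // Re-insert the credentials so the follow-up request authenticates as well.
    $tmp = parse_url($location);
    if ($tmp === false || !isset($tmp['scheme'], $tmp['host'])) {
        return false;
    }
    $newUrl = $tmp['scheme'] . '://' . $user . ':' . $pass . '@' . $tmp['host'] . ($tmp['path'] ?? '');
    if (!empty($tmp['query'])) {
        $newUrl .= '?' . $tmp['query'];
    }

    return $newUrl;
}

$headers = ['HTTP/1.1 302 Found', 'Location: https://example.com/a?b=1'];
var_dump(redirectTarget($headers));
var_dump(redirectTarget($headers, 'u', 'p'));
var_dump(redirectTarget(['HTTP/1.1 200 OK']));
```

Like the original, the sketch drops any port or fragment from the redirect target when credentials are re-inserted, and a relative Location value is rejected in that case.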
1758
1759
    /**************************
1760
     *
1761
     * tslib_fe hooks:
1762
     *
1763
     **************************/
1764
1765
    /**
1766
     * Initialization hook (called after database connection)
1767
     * Takes the "HTTP_X_T3CRAWLER" header and looks up queue record and verifies if the session comes from the system (by comparing hashes)
1768
     *
1769
     * @param array $params Parameters from frontend
1770
     * @param object $ref TSFE object (reference under PHP5)
1771
     * @return void
1772
     *
1773
     * FIXME: Looks like this is not used; in commit 9910d3f40cce15f4e9b7bcd0488bf21f31d53ebc it was added as public,
1774
     * FIXME: I think this can be removed. (TNM)
1775
     */
1776
    public function fe_init(&$params, $ref)
1777
    {
1778
        // Authenticate crawler request:
1779
        if (isset($_SERVER['HTTP_X_T3CRAWLER'])) {
1780
            list($queueId, $hash) = explode(':', $_SERVER['HTTP_X_T3CRAWLER']);
1781
            $queueRec = $this->db->exec_SELECTgetSingleRow('*', 'tx_crawler_queue', 'qid=' . intval($queueId));
1782
1783
            // If a crawler record was found and hash was matching, set it up:
1784
            if (is_array($queueRec) && $hash === md5($queueRec['qid'] . '|' . $queueRec['set_id'] . '|' . $GLOBALS['TYPO3_CONF_VARS']['SYS']['encryptionKey'])) {
1785
                $params['pObj']->applicationData['tx_crawler']['running'] = true;
1786
                $params['pObj']->applicationData['tx_crawler']['parameters'] = unserialize($queueRec['parameters']);
1787
                $params['pObj']->applicationData['tx_crawler']['log'] = [];
1788
            } else {
1789
                die('No crawler entry found!');
1790
            }
1791
        }
1792
    }
1793
1794
    /*****************************
1795
     *
1796
     * Compiling URLs to crawl - tools
1797
     *
1798
     *****************************/
1799
1800
    /**
1801
     * @param integer $id Root page id to start from.
1802
     * @param integer $depth Depth of tree, 0 = only id-page, 1 = one sublevel, 99 = infinite
1803
     * @param integer $scheduledTime Unix Time when the URL is timed to be visited when put in queue
1804
     * @param integer $reqMinute Number of requests per minute (creates the interleave between requests)
1805
     * @param boolean $submitCrawlUrls If set, submits the URLs to queue in database (real crawling)
1806
     * @param boolean $downloadCrawlUrls If set (and $submitCrawlUrls is false), fills $downloadUrls with entries
1807
     * @param array $incomingProcInstructions Array of processing instructions
1808
     * @param array $configurationSelection Array of configuration keys
1809
     * @return string
1810
     */
1811
    public function getPageTreeAndUrls(
1812
        $id,
1813
        $depth,
1814
        $scheduledTime,
1815
        $reqMinute,
1816
        $submitCrawlUrls,
1817
        $downloadCrawlUrls,
1818
        array $incomingProcInstructions,
1819
        array $configurationSelection
1820
    ) {
1821
        global $BACK_PATH;
1822
        global $LANG;
1823
        if (!is_object($LANG)) {
1824
            $LANG = GeneralUtility::makeInstance(LanguageService::class);
1825
            $LANG->init(0);
1826
        }
1827
        $this->scheduledTime = $scheduledTime;
1828
        $this->reqMinute = $reqMinute;
1829
        $this->submitCrawlUrls = $submitCrawlUrls;
1830
        $this->downloadCrawlUrls = $downloadCrawlUrls;
1831
        $this->incomingProcInstructions = $incomingProcInstructions;
1832
        $this->incomingConfigurationSelection = $configurationSelection;
1833
1834
        $this->duplicateTrack = [];
1835
        $this->downloadUrls = [];
1836
1837
        // Drawing tree:
1838
        /* @var PageTreeView $tree */
1839
        $tree = GeneralUtility::makeInstance(PageTreeView::class);
1840
        $perms_clause = $GLOBALS['BE_USER']->getPagePermsClause(1);
1841
        $tree->init('AND ' . $perms_clause);
1842
1843
        $pageInfo = BackendUtility::readPageAccess($id, $perms_clause);
1844
        if (is_array($pageInfo)) {
1845
            // Set root row:
1846
            $tree->tree[] = [
1847
                'row' => $pageInfo,
1848
                'HTML' => IconUtility::getIconForRecord('pages', $pageInfo),
1849
            ];
1850
        }
1851
1852
        // Get branch beneath:
1853
        if ($depth) {
1854
            $tree->getTree($id, $depth, '');
1855
        }
1856
1857
        // Traverse page tree:
1858
        $code = '';
1859
1860
        foreach ($tree->tree as $data) {
1861
            $this->MP = false;
1862
1863
            // recognize mount points
1864
            if ($data['row']['doktype'] == 7) {
1865
                $mountpage = $this->db->exec_SELECTgetRows('*', 'pages', 'uid = ' . $data['row']['uid']);
1866
1867
                // fetch mounted pages
1868
                $this->MP = $mountpage[0]['mount_pid'] . '-' . $data['row']['uid'];
0 ignored issues
show
Documentation Bug introduced by
The property $MP was declared of type boolean, but $mountpage[0]['mount_pid...' . $data['row']['uid'] is of type string. Maybe add a type cast?

This check looks for assignments to scalar types that may be of the wrong type.

To ensure the code behaves as expected, it may be a good idea to add an explicit type cast.

$answer = 42;

$correct = false;

$correct = (bool) $answer;
Loading history...
1869
1870
                $mountTree = GeneralUtility::makeInstance(PageTreeView::class);
1871
                $mountTree->init('AND ' . $perms_clause);
1872
                $mountTree->getTree($mountpage[0]['mount_pid'], $depth, '');
1873
1874
                foreach ($mountTree->tree as $mountData) {
1875
                    $code .= $this->drawURLs_addRowsForPage(
1876
                        $mountData['row'],
1877
                        $mountData['HTML'] . BackendUtility::getRecordTitle('pages', $mountData['row'], true)
1878
                    );
1879
                }
1880
1881
                // replace page when mount_pid_ol is enabled
1882
                if ($mountpage[0]['mount_pid_ol']) {
1883
                    $data['row']['uid'] = $mountpage[0]['mount_pid'];
1884
                } else {
1885
                    // if the mount_pid_ol is not set the MP must not be used for the mountpoint page
1886
                    $this->MP = false;
1887
                }
1888
            }
1889
1890
            $code .= $this->drawURLs_addRowsForPage(
1891
                $data['row'],
1892
                $data['HTML'] . BackendUtility::getRecordTitle('pages', $data['row'], true)
1893
            );
1894
        }
1895
1896
        return $code;
1897
    }
1898
1899
    /**
1900
     * Expands exclude string
1901
     *
1902
     * @param string $excludeString Exclude string
1903
     * @return array
1904
     */
1905 1
    public function expandExcludeString($excludeString)
1906
    {
1907
        // internal static caches;
1908 1
        static $expandedExcludeStringCache;
1909 1
        static $treeCache;
1910
1911 1
        if (empty($expandedExcludeStringCache[$excludeString])) {
1912 1
            $pidList = [];
1913
1914 1
            if (!empty($excludeString)) {
1915
                /** @var PageTreeView $tree */
1916
                $tree = GeneralUtility::makeInstance(PageTreeView::class);
1917
                $tree->init('AND ' . $this->backendUser->getPagePermsClause(1));
1918
1919
                $excludeParts = GeneralUtility::trimExplode(',', $excludeString);
1920
1921
                foreach ($excludeParts as $excludePart) {
1922
                    list($pid, $depth) = GeneralUtility::trimExplode('+', $excludePart);
1923
1924
                    // default is "page only" = "depth=0"
1925
                    if (empty($depth)) {
1926
                        $depth = (stristr($excludePart, '+')) ? 99 : 0;
1927
                    }
1928
1929
                    $pidList[] = $pid;
1930
1931
                    if ($depth > 0) {
1932
                        if (empty($treeCache[$pid][$depth])) {
1933
                            $tree->reset();
1934
                            $tree->getTree($pid, $depth);
1935
                            $treeCache[$pid][$depth] = $tree->tree;
1936
                        }
1937
1938
                        foreach ($treeCache[$pid][$depth] as $data) {
1939
                            $pidList[] = $data['row']['uid'];
1940
                        }
1941
                    }
1942
                }
1943
            }
1944
1945 1
            $expandedExcludeStringCache[$excludeString] = array_unique($pidList);
1946
        }
1947
1948 1
        return $expandedExcludeStringCache[$excludeString];
1949
    }
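The exclude string handled by expandExcludeString() is a comma-separated list of "pid" or "pid+depth" parts: a bare pid excludes only that page (depth 0), and a "+" without a usable depth means "the whole branch" (depth 99). The sketch below shows only this parsing step, without the page tree lookup or the static caches. parseExcludeString() is a hypothetical name, plain PHP string functions stand in for GeneralUtility::trimExplode(), and the pid is cast to int, which the original does not do.

```php
<?php
// Hypothetical parsing-only sketch; the real method additionally expands each
// pid/depth pair into page uids via PageTreeView.

function parseExcludeString(string $excludeString): array
{
    $result = [];
    $parts = array_filter(
        array_map('trim', explode(',', $excludeString)),
        function ($part) {
            return $part !== '';
        }
    );

    foreach ($parts as $part) {
        $pieces = array_map('trim', explode('+', $part));
        $depth = $pieces[1] ?? '';

        // Mirrors the original rule: an empty depth becomes 99 if a "+" is present, else 0.
        if (empty($depth)) {
            $depth = strpos($part, '+') !== false ? 99 : 0;
        }

        $result[] = ['pid' => (int)$pieces[0], 'depth' => (int)$depth];
    }

    return $result;
}

print_r(parseExcludeString('12, 34+2, 56+'));
```

Note that because the check uses empty(), an explicit "+0" is treated like a missing depth and also expands to 99, which matches the behaviour of the listing above.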
1950
1951
    /**
1952
     * Create the rows for display of the page tree
1953
     * For each page a number of rows are shown displaying GET variable configuration
1954
     *
1955
     * @param    array        Page row
1956
     * @param    string        Page icon and title for row
1957
     * @return    string        HTML <tr> content (one or more)
1958
     */
1959
    public function drawURLs_addRowsForPage(array $pageRow, $pageTitleAndIcon)
1960
    {
1961
        $skipMessage = '';
1962
1963
        // Get list of configurations
1964
        $configurations = $this->getUrlsForPageRow($pageRow, $skipMessage);
1965
1966
        if (count($this->incomingConfigurationSelection) > 0) {
1967
            // remove configuration that does not match the current selection
1968
            foreach ($configurations as $confKey => $confArray) {
1969
                if (!in_array($confKey, $this->incomingConfigurationSelection)) {
1970
                    unset($configurations[$confKey]);
1971
                }
1972
            }
1973
        }
1974
1975
        // Traverse parameter combinations:
1976
        $c = 0;
1977
        $content = '';
1978
        if (count($configurations)) {
1979
            foreach ($configurations as $confKey => $confArray) {
1980
1981
                    // Title column:
1982
                if (!$c) {
1983
                    $titleClm = '<td rowspan="' . count($configurations) . '">' . $pageTitleAndIcon . '</td>';
1984
                } else {
1985
                    $titleClm = '';
1986
                }
1987
1988
                if (!in_array($pageRow['uid'], $this->expandExcludeString($confArray['subCfg']['exclude']))) {
1989
1990
                        // URL list:
1991
                    $urlList = $this->urlListFromUrlArray(
                        $confArray,
                        $pageRow,
                        $this->scheduledTime,
                        $this->reqMinute,
                        $this->submitCrawlUrls,
                        $this->downloadCrawlUrls,
                        $this->duplicateTrack,
                        $this->downloadUrls,
                        $this->incomingProcInstructions // if empty the urls won't be filtered by processing instructions
                    );

                    // Expanded parameters:
                    $paramExpanded = '';
                    $calcAccu = [];
                    $calcRes = 1;
                    foreach ($confArray['paramExpanded'] as $gVar => $gVal) {
                        $paramExpanded .= '
                            <tr>
                                <td class="bgColor4-20">' . htmlspecialchars('&' . $gVar . '=') . '<br/>' .
                                                '(' . count($gVal) . ')' .
                                                '</td>
                                <td class="bgColor4" nowrap="nowrap">' . nl2br(htmlspecialchars(implode(chr(10), $gVal))) . '</td>
                            </tr>
                        ';
                        $calcRes *= count($gVal);
                        $calcAccu[] = count($gVal);
                    }
                    $paramExpanded = '<table class="lrPadding c-list param-expanded">' . $paramExpanded . '</table>';
                    $paramExpanded .= 'Comb: ' . implode('*', $calcAccu) . '=' . $calcRes;

                    // Options
                    $optionValues = '';
                    if ($confArray['subCfg']['userGroups']) {
                        $optionValues .= 'User Groups: ' . $confArray['subCfg']['userGroups'] . '<br/>';
                    }
                    if ($confArray['subCfg']['baseUrl']) {
                        $optionValues .= 'Base Url: ' . $confArray['subCfg']['baseUrl'] . '<br/>';
                    }
                    if ($confArray['subCfg']['procInstrFilter']) {
                        $optionValues .= 'ProcInstr: ' . $confArray['subCfg']['procInstrFilter'] . '<br/>';
                    }

                    // Compile row:
                    $content .= '
                        <tr class="bgColor' . ($c % 2 ? '-20' : '-10') . '">
                            ' . $titleClm . '
                            <td>' . htmlspecialchars($confKey) . '</td>
                            <td>' . nl2br(htmlspecialchars(rawurldecode(trim(str_replace('&', chr(10) . '&', GeneralUtility::implodeArrayForUrl('', $confArray['paramParsed'])))))) . '</td>
                            <td>' . $paramExpanded . '</td>
                            <td nowrap="nowrap">' . $urlList . '</td>
                            <td nowrap="nowrap">' . $optionValues . '</td>
                            <td nowrap="nowrap">' . DebugUtility::viewArray($confArray['subCfg']['procInstrParams.']) . '</td>
                        </tr>';
                } else {
                    $content .= '<tr class="bgColor' . ($c % 2 ? '-20' : '-10') . '">
                            ' . $titleClm . '
                            <td>' . htmlspecialchars($confKey) . '</td>
                            <td colspan="5"><em>No entries</em> (Page is excluded in this configuration)</td>
                        </tr>';
                }

                $c++;
            }
        } else {
            $message = !empty($skipMessage) ? ' (' . $skipMessage . ')' : '';

            // Compile row:
            $content .= '
                <tr class="bgColor-20" style="border-bottom: 1px solid black;">
                    <td>' . $pageTitleAndIcon . '</td>
                    <td colspan="6"><em>No entries</em>' . $message . '</td>
                </tr>';
        }

        return $content;
    }

    /*****************************
     *
     * CLI functions
     *
     *****************************/

    /**
     * Main function for running from Command Line PHP script (cron job)
     * See ext/crawler/cli/crawler_cli.phpsh for details
     *
     * @return int bitmask of CLI_STATUS_* flags describing the outcome of the run
     */
    public function CLI_main()
    {
        $this->setAccessMode('cli');
        $result = self::CLI_STATUS_NOTHING_PROCCESSED;
        $cliObj = GeneralUtility::makeInstance(CrawlerCommandLineController::class);

        if (isset($cliObj->cli_args['-h']) || isset($cliObj->cli_args['--help'])) {
            $cliObj->cli_validateArgs();
            $cliObj->cli_help();
            exit;
        }

        if (!$this->getDisabled() && $this->CLI_checkAndAcquireNewProcess($this->CLI_buildProcessId())) {
            $countInARun = $cliObj->cli_argValue('--countInARun') ? intval($cliObj->cli_argValue('--countInARun')) : $this->extensionSettings['countInARun'];
            // Seconds
            $sleepAfterFinish = $cliObj->cli_argValue('--sleepAfterFinish') ? intval($cliObj->cli_argValue('--sleepAfterFinish')) : $this->extensionSettings['sleepAfterFinish'];
            // Microseconds (the value is passed to usleep())
            $sleepTime = $cliObj->cli_argValue('--sleepTime') ? intval($cliObj->cli_argValue('--sleepTime')) : $this->extensionSettings['sleepTime'];

            try {
                // Run process:
                $result = $this->CLI_run($countInARun, $sleepTime, $sleepAfterFinish);
            } catch (\Exception $e) {
                $this->CLI_debug(get_class($e) . ': ' . $e->getMessage());
                $result = self::CLI_STATUS_ABORTED;
            }

            // Cleanup
            $this->db->exec_DELETEquery('tx_crawler_process', 'assigned_items_count = 0');

            //TODO can't we do that in a clean way?
            $releaseStatus = $this->CLI_releaseProcesses($this->CLI_buildProcessId());
Unused Code: $releaseStatus is not used, you could remove the assignment.

This check looks for variable assignments that are either overwritten by other assignments or where the variable is not used subsequently.

    $myVar = 'Value';
    $higher = false;

    if (rand(1, 6) > 3) {
        $higher = true;
    } else {
        $higher = false;
    }

Both the $myVar assignment in line 1 and the $higher assignment in line 2 are dead. The first because $myVar is never used and the second because $higher is always overwritten on every possible execution path.

Deprecated Code: The method AOE\Crawler\Controller\C...:CLI_releaseProcesses() has been deprecated with message: since crawler v6.5.1, will be removed in crawler v9.0.0.

This method has been deprecated. The supplier of the class has supplied an explanatory message.

The explanatory message should give you some clue as to whether and when the method will be removed from the class and what other method or class to use instead.

            $this->CLI_debug("Unprocessed Items remaining:" . $this->queueRepository->countUnprocessedItems() . " (" . $this->CLI_buildProcessId() . ")");
            $result |= ($this->queueRepository->countUnprocessedItems() > 0 ? self::CLI_STATUS_REMAIN : self::CLI_STATUS_NOTHING_PROCCESSED);
        } else {
            $result |= self::CLI_STATUS_ABORTED;
        }

        return $result;
    }
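
CLI_main() builds its return value with `|=`, so the result reads as a bitmask of CLI_STATUS_* flags rather than a count. A minimal sketch of how a caller might decode such a value follows. The numeric values of the constants are not shown in this excerpt, so the ones below are assumed powers of two chosen for illustration only, and `describeStatus()` is a hypothetical helper, not part of the crawler.

```php
<?php
// Assumed values for illustration; the real constants are defined elsewhere in the class.
const STATUS_NOTHING_PROCESSED = 0;
const STATUS_REMAIN = 1;
const STATUS_PROCESSED = 2;
const STATUS_ABORTED = 4;

/**
 * Hypothetical helper: lists the flags that are set in a status bitmask.
 */
function describeStatus(int $status): array
{
    if ($status === STATUS_NOTHING_PROCESSED) {
        return ['nothing processed'];
    }
    $known = [
        STATUS_REMAIN => 'remain',
        STATUS_PROCESSED => 'processed',
        STATUS_ABORTED => 'aborted',
    ];
    $labels = [];
    foreach ($known as $flag => $label) {
        // A flag is set when all of its bits are present in the status.
        if (($status & $flag) === $flag) {
            $labels[] = $label;
        }
    }
    return $labels;
}

// Combine flags the same way CLI_main() does.
$result = STATUS_NOTHING_PROCESSED;
$result |= STATUS_PROCESSED;
$result |= STATUS_REMAIN;
print_r(describeStatus($result));
```

Because the flags are combined with a bitwise OR, a single run can report several outcomes at once, for example "processed" together with "remain".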

    /**
     * Function executed by crawler_im.php cli script.
     *
     * @return void
     */
    public function CLI_main_im()
    {
        $this->setAccessMode('cli_im');

        $cliObj = GeneralUtility::makeInstance(QueueCommandLineController::class);

        // Force user to admin state and set workspace to "Live":
        $this->backendUser->user['admin'] = 1;
        $this->backendUser->setWorkspace(0);

        // Print help
        if (!isset($cliObj->cli_args['_DEFAULT'][1])) {
            $cliObj->cli_validateArgs();
            $cliObj->cli_help();
            exit;
        }

        $cliObj->cli_validateArgs();

        if ($cliObj->cli_argValue('-o') === 'exec') {
            $this->registerQueueEntriesInternallyOnly = true;
        }

        if (isset($cliObj->cli_args['_DEFAULT'][2])) {
            // Crawler is called over TYPO3 BE
            $pageId = MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][2], 0);
        } else {
            // Crawler is called over cli
            $pageId = MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][1], 0);
        }

        $configurationKeys = $this->getConfigurationKeys($cliObj);
Deprecated Code: The method AOE\Crawler\Controller\C...:getConfigurationKeys() has been deprecated with message: since crawler v6.3.0, will be removed in crawler v7.0.0.


        if (!is_array($configurationKeys)) {
            $configurations = $this->getUrlsForPageId($pageId);
            if (is_array($configurations)) {
                $configurationKeys = array_keys($configurations);
            } else {
                $configurationKeys = [];
            }
        }

        if ($cliObj->cli_argValue('-o') === 'queue' || $cliObj->cli_argValue('-o') === 'exec') {
            $reason = new Reason();
            $reason->setReason(Reason::REASON_GUI_SUBMIT);
            $reason->setDetailText('The cli script of the crawler added to the queue');

            // The event dispatcher is deprecated since crawler v6.4.0, will be removed in crawler v7.0.0.
            // Please use the Signal instead.
            EventDispatcher::getInstance()->post(
                'invokeQueueChange',
                $this->setID,
                ['reason' => $reason]
            );

            $signalPayload = ['reason' => $reason];
            SignalSlotUtility::emitSignal(
                __CLASS__,
                SignalSlotUtility::SIGNAL_INVOKE_QUEUE_CHANGE,
                $signalPayload
            );
        }

        if ($this->extensionSettings['cleanUpOldQueueEntries']) {
            $this->cleanUpOldQueueEntries();
        }

        $this->setID = (int) GeneralUtility::md5int(microtime());
        $this->getPageTreeAndUrls(
            $pageId,
            MathUtility::forceIntegerInRange($cliObj->cli_argValue('-d'), 0, 99),
            $this->getCurrentTime(),
            MathUtility::forceIntegerInRange($cliObj->cli_isArg('-n') ? $cliObj->cli_argValue('-n') : 30, 1, 1000),
            $cliObj->cli_argValue('-o') === 'queue' || $cliObj->cli_argValue('-o') === 'exec',
            $cliObj->cli_argValue('-o') === 'url',
            GeneralUtility::trimExplode(',', $cliObj->cli_argValue('-proc'), true),
            $configurationKeys
        );

        if ($cliObj->cli_argValue('-o') === 'url') {
            $cliObj->cli_echo(implode(chr(10), $this->downloadUrls) . chr(10), true);
        } elseif ($cliObj->cli_argValue('-o') === 'exec') {
            $cliObj->cli_echo("Executing " . count($this->urlList) . " requests right away:\n\n");
            $cliObj->cli_echo(implode(chr(10), $this->urlList) . chr(10));
            $cliObj->cli_echo("\nProcessing:\n");

            foreach ($this->queueEntries as $queueRec) {
                $p = unserialize($queueRec['parameters']);
                $cliObj->cli_echo($p['url'] . ' (' . implode(',', $p['procInstructions']) . ') => ');

                $result = $this->readUrlFromArray($queueRec);

                $requestResult = unserialize($result['content']);
                if (is_array($requestResult)) {
                    $resLog = is_array($requestResult['log']) ? chr(10) . chr(9) . chr(9) . implode(chr(10) . chr(9) . chr(9), $requestResult['log']) : '';
                    $cliObj->cli_echo('OK: ' . $resLog . chr(10));
                } else {
                    $cliObj->cli_echo('Error checking Crawler Result: ' . substr(preg_replace('/\s+/', ' ', strip_tags($result['content'])), 0, 30000) . '...' . chr(10));
                }
            }
        } elseif ($cliObj->cli_argValue('-o') === 'queue') {
            $cliObj->cli_echo("Putting " . count($this->urlList) . " entries in queue:\n\n");
            $cliObj->cli_echo(implode(chr(10), $this->urlList) . chr(10));
        } else {
            $cliObj->cli_echo(count($this->urlList) . " entries found for processing. (Use -o to decide action):\n\n", true);
            $cliObj->cli_echo(implode(chr(10), $this->urlList) . chr(10), true);
        }
    }

    /**
     * Function executed by the flush CLI script (see FlushCommandLineController).
     *
     * @return bool
     */
    public function CLI_main_flush()
    {
        $this->setAccessMode('cli_flush');
        $cliObj = GeneralUtility::makeInstance(FlushCommandLineController::class);

        // Force user to admin state and set workspace to "Live":
        $this->backendUser->user['admin'] = 1;
        $this->backendUser->setWorkspace(0);

        // Print help
        if (!isset($cliObj->cli_args['_DEFAULT'][1])) {
            $cliObj->cli_validateArgs();
            $cliObj->cli_help();
            exit;
        }

        $cliObj->cli_validateArgs();
        $pageId = MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][1], 0);
        $fullFlush = ($pageId == 0);

        $mode = $cliObj->cli_argValue('-o');

        switch ($mode) {
            case 'all':
                $result = $this->getLogEntriesForPageId($pageId, '', true, $fullFlush);
                break;
            case 'finished':
            case 'pending':
                $result = $this->getLogEntriesForPageId($pageId, $mode, true, $fullFlush);
                break;
            default:
                $cliObj->cli_validateArgs();
                $cliObj->cli_help();
                $result = false;
        }

        return $result !== false;
    }

    /**
     * Obtains configuration keys from the CLI arguments
     *
     * @param QueueCommandLineController $cliObj
     * @return array
     *
     * @deprecated since crawler v6.3.0, will be removed in crawler v7.0.0.
     */
    protected function getConfigurationKeys(QueueCommandLineController $cliObj)
    {
        $parameter = trim($cliObj->cli_argValue('-conf'));
        return ($parameter != '' ? GeneralUtility::trimExplode(',', $parameter) : []);
    }

    /**
     * Running the functionality of the CLI (crawling URLs from queue)
     *
     * @param int $countInARun
     * @param int $sleepTime
     * @param int $sleepAfterFinish
     * @return int bitmask of CLI_STATUS_* flags
     */
    public function CLI_run($countInARun, $sleepTime, $sleepAfterFinish)
    {
        $result = 0;
        $counter = 0;

        // First, run hooks:
        $this->CLI_runHooks();

        // Clean up the queue
        if (intval($this->extensionSettings['purgeQueueDays']) > 0) {
            $purgeDate = $this->getCurrentTime() - 24 * 60 * 60 * intval($this->extensionSettings['purgeQueueDays']);
            $del = $this->db->exec_DELETEquery(
                'tx_crawler_queue',
                'exec_time!=0 AND exec_time<' . $purgeDate
            );
            if (false == $del) {
                GeneralUtility::devLog('Records could not be deleted.', 'crawler', LogLevel::INFO);
            }
        }

        // Select entries:
        //TODO Shouldn't this reside within the transaction?
        $rows = $this->db->exec_SELECTgetRows(
            'qid,scheduled',
            'tx_crawler_queue',
            'exec_time=0
                AND process_scheduled= 0
                AND scheduled<=' . $this->getCurrentTime(),
            '',
            'scheduled, qid',
            intval($countInARun)
        );

        if (count($rows) > 0) {
            $quidList = [];

            foreach ($rows as $r) {
Bug: The expression $rows of type null|array is not guaranteed to be traversable. How about adding an additional type check?

There are different options for fixing this problem.

  1. If you want to be on the safe side, you can add an additional type-check:

    $collection = json_decode($data, true);
    if ( ! is_array($collection)) {
        throw new \RuntimeException('$collection must be an array.');
    }

    foreach ($collection as $item) { /** ... */ }

  2. If you are sure that the expression is traversable, you might want to add a doc comment cast to improve IDE auto-completion and static analysis:

    /** @var array $collection */
    $collection = json_decode($data, true);

    foreach ($collection as $item) { /** .. */ }
                $quidList[] = $r['qid'];
            }

            $processId = $this->CLI_buildProcessId();

            //reserve queue entries for process
            $this->db->sql_query('BEGIN');
            //TODO make sure we're not taking assigned queue-entries
            $this->db->exec_UPDATEquery(
                'tx_crawler_queue',
                'qid IN (' . implode(',', $quidList) . ')',
                [
                    'process_scheduled' => intval($this->getCurrentTime()),
                    'process_id' => $processId,
                ]
            );

            //save the number of assigned queue entries to determine how many have been processed later
            $numberOfAffectedRows = $this->db->sql_affected_rows();
            $this->db->exec_UPDATEquery(
                'tx_crawler_process',
                "process_id = '" . $processId . "'",
                [
                    'assigned_items_count' => intval($numberOfAffectedRows),
                ]
            );

            if ($numberOfAffectedRows == count($quidList)) {
                $this->db->sql_query('COMMIT');
            } else {
                $this->db->sql_query('ROLLBACK');
                $this->CLI_debug("Nothing processed due to multi-process collision (" . $this->CLI_buildProcessId() . ")");
                return ($result | self::CLI_STATUS_ABORTED);
            }

            foreach ($rows as $r) {
Bug: The expression $rows of type null|array is not guaranteed to be traversable. How about adding an additional type check?

                $result |= $this->readUrl($r['qid']);

                $counter++;
                usleep(intval($sleepTime)); // Just to relax the system

                // If the CLI has been disabled between the start and the current readUrl() call,
                // return from the function and do NOT mark the process as ended.
                if ($this->getDisabled()) {
                    return ($result | self::CLI_STATUS_ABORTED);
                }

                $process = $this->processRepository->findByUid($this->CLI_buildProcessId());
                if (!$process->isActive()) {
                    $this->CLI_debug("conflict / timeout (" . $this->CLI_buildProcessId() . ")");

                    //TODO might need an additional returncode
                    $result |= self::CLI_STATUS_ABORTED;
                    break; //possible timeout
                }
            }

            sleep(intval($sleepAfterFinish));

            $msg = 'Rows: ' . $counter;
            $this->CLI_debug($msg . " (" . $this->CLI_buildProcessId() . ")");
        } else {
            $this->CLI_debug("Nothing within queue which needs to be processed (" . $this->CLI_buildProcessId() . ")");
        }

        if ($counter > 0) {
            $result |= self::CLI_STATUS_PROCESSED;
        }

        return $result;
    }

    /**
     * Activate hooks
     *
     * @return void
     */
    public function CLI_runHooks()
    {
        global $TYPO3_CONF_VARS;
        if (is_array($TYPO3_CONF_VARS['EXTCONF']['crawler']['cli_hooks'])) {
            foreach ($TYPO3_CONF_VARS['EXTCONF']['crawler']['cli_hooks'] as $objRef) {
                $hookObj = &GeneralUtility::getUserObj($objRef);
                if (is_object($hookObj)) {
                    $hookObj->crawler_init($this);
                }
            }
        }
    }

    /**
     * Try to acquire a new process with the given id
     * also performs some auto-cleanup for orphan processes
     * @todo preemption might not be the most elegant way to clean up
     *
     * @param string $id identification string for the process
     * @return boolean
     */
    public function CLI_checkAndAcquireNewProcess($id)
    {
        $ret = true;

        $systemProcessId = getmypid();
        if ($systemProcessId < 1) {
            return false;
        }

        $processCount = 0;
        $orphanProcesses = [];

        $this->db->sql_query('BEGIN');

        $res = $this->db->exec_SELECTquery(
            'process_id,ttl',
            'tx_crawler_process',
            'active=1 AND deleted=0'
        );

        $currentTime = $this->getCurrentTime();

        while ($row = $this->db->sql_fetch_assoc($res)) {
            if ($row['ttl'] < $currentTime) {
                $orphanProcesses[] = $row['process_id'];
            } else {
                $processCount++;
            }
        }

        // if there are fewer active processes than allowed, add a new one
        if ($processCount < intval($this->extensionSettings['processLimit'])) {
            $this->CLI_debug("add process " . $this->CLI_buildProcessId() . " (" . ($processCount + 1) . "/" . intval($this->extensionSettings['processLimit']) . ")");

            // create new process record
            $this->db->exec_INSERTquery(
                'tx_crawler_process',
                [
                    'process_id' => $id,
                    'active' => '1',
                    'ttl' => ($currentTime + intval($this->extensionSettings['processMaxRunTime'])),
                    'system_process_id' => $systemProcessId,
                ]
            );
        } else {
            $this->CLI_debug("Processlimit reached (" . ($processCount) . "/" . intval($this->extensionSettings['processLimit']) . ")");
            $ret = false;
        }

        $this->CLI_releaseProcesses($orphanProcesses, true); // maybe this should be somehow included into the current lock
Deprecated Code: The method AOE\Crawler\Controller\C...:CLI_releaseProcesses() has been deprecated with message: since crawler v6.5.1, will be removed in crawler v9.0.0.

        $this->CLI_deleteProcessesMarkedDeleted();
Deprecated Code: The method AOE\Crawler\Controller\C...rocessesMarkedDeleted() has been deprecated with message: since crawler v6.5.1, will be removed in crawler v9.0.0.


        $this->db->sql_query('COMMIT');

        return $ret;
    }
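
The acquisition logic above boils down to two steps: split the active process rows into expired ("orphan") and live ones by comparing `ttl` with the current time, then allow a new process only while the live count stays below the limit. The sketch below isolates that decision as a pure function. `partitionProcesses()` and the sample rows are hypothetical and exist only for illustration; the limit of 2 stands in for the `processLimit` extension setting.

```php
<?php
/**
 * Hypothetical helper: separates expired process ids from the count of live processes.
 *
 * @param array $rows        rows with the keys 'process_id' and 'ttl'
 * @param int   $currentTime Unix timestamp to compare the ttl against
 * @return array ['orphans' => string[], 'activeCount' => int]
 */
function partitionProcesses(array $rows, int $currentTime): array
{
    $orphans = [];
    $activeCount = 0;
    foreach ($rows as $row) {
        if ((int) $row['ttl'] < $currentTime) {
            // The time to live has passed: candidate for release.
            $orphans[] = $row['process_id'];
        } else {
            $activeCount++;
        }
    }
    return ['orphans' => $orphans, 'activeCount' => $activeCount];
}

$rows = [
    ['process_id' => 'a1', 'ttl' => 90],
    ['process_id' => 'b2', 'ttl' => 150],
];
$state = partitionProcesses($rows, 100);

// 2 is an assumed stand-in for the processLimit setting.
$mayStart = $state['activeCount'] < 2;
var_dump($state['orphans'], $mayStart);
```

Note that expired processes do not count against the limit, which matches the loop in the method: a crashed run cannot block new processes once its `ttl` has passed.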

    /**
     * Release a process and the required resources
     *
     * @param  mixed    $releaseIds   string with a single process-id or array with multiple process-ids
     * @param  boolean  $withinLock   whether the DB actions already run inside an existing lock (transaction)
     * @return boolean
     *
     * @deprecated since crawler v6.5.1, will be removed in crawler v9.0.0.
     */
    public function CLI_releaseProcesses($releaseIds, $withinLock = false)
    {
        if (!is_array($releaseIds)) {
            $releaseIds = [$releaseIds];
        }

        if (count($releaseIds) === 0) {
            return false;   //nothing to release
        }

        if (!$withinLock) {
            $this->db->sql_query('BEGIN');
        }

        // some kind of 2nd chance algo - this way you need at least 2 processes to have a real cleanup
        // this ensures that a single process can't mess up the entire process table

        // mark all processes as deleted which have no "waiting" queue-entries and which are not active
        $this->db->exec_UPDATEquery(
            'tx_crawler_queue',
            'process_id IN (SELECT process_id FROM tx_crawler_process WHERE active=0 AND deleted=0)',
            [
                'process_scheduled' => 0,
                'process_id' => '',
            ]
        );
        $this->db->exec_UPDATEquery(
            'tx_crawler_process',
            'active=0 AND deleted=0
            AND NOT EXISTS (
                SELECT * FROM tx_crawler_queue
                WHERE tx_crawler_queue.process_id = tx_crawler_process.process_id
                AND tx_crawler_queue.exec_time = 0
            )',
            [
                'deleted' => '1',
                'system_process_id' => 0,
            ]
        );
        // mark all requested processes as non-active
        $this->db->exec_UPDATEquery(
            'tx_crawler_process',
            'process_id IN (\'' . implode('\',\'', $releaseIds) . '\') AND deleted=0',
            [
                'active' => '0',
            ]
        );
        $this->db->exec_UPDATEquery(
            'tx_crawler_queue',
            'exec_time=0 AND process_id IN ("' . implode('","', $releaseIds) . '")',
            [
                'process_scheduled' => 0,
                'process_id' => '',
            ]
        );

        if (!$withinLock) {
            $this->db->sql_query('COMMIT');
        }

        return true;
    }
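
The two `IN (...)` conditions above are built by imploding the process ids directly into the SQL string, so they rely on the ids never containing a quote character. One common alternative, sketched here under the assumption of a database layer with positional placeholders (PDO-style `?`), is to generate one placeholder per id and pass the values separately. `buildInClause()` is a hypothetical helper and not part of the crawler or of the TYPO3 database API.

```php
<?php
/**
 * Hypothetical helper: builds an IN clause with one "?" placeholder per id.
 *
 * @param string $column column name (trusted, not user input)
 * @param array  $ids    values to bind
 * @return array [string $sql, array $params]
 */
function buildInClause(string $column, array $ids): array
{
    if ($ids === []) {
        // An empty IN () list is invalid SQL, so refuse it early.
        throw new InvalidArgumentException('At least one id is required.');
    }
    $placeholders = implode(',', array_fill(0, count($ids), '?'));
    return [$column . ' IN (' . $placeholders . ')', array_values($ids)];
}

[$sql, $params] = buildInClause('process_id', ['1a2b3c', '4d5e6f']);
echo $sql, PHP_EOL;
```

The returned parameters would then be handed to the prepared statement, so the quoting is left to the driver instead of string concatenation.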

    /**
     * Delete processes marked as deleted
     *
     * @return void
     *
     * @deprecated since crawler v6.5.1, will be removed in crawler v9.0.0.
     */
    public function CLI_deleteProcessesMarkedDeleted()
    {
        $this->db->exec_DELETEquery('tx_crawler_process', 'deleted = 1');
    }

    /**
     * Check if there are still resources left for the process with the given id
     * Used to determine timeouts and to ensure a proper cleanup if there's a timeout
     *
     * @param  string  $pid  identification string for the process
     * @return boolean determines if the process is still active / has resources
     *
     * @deprecated since crawler v6.5.1, will be removed in crawler v9.0.0.
     *
     * FIXME: Please remove Transaction, not needed as only a select query.
     */
    public function CLI_checkIfProcessIsActive($pid)
    {
        $ret = false;
        $this->db->sql_query('BEGIN');
        $res = $this->db->exec_SELECTquery(
            'process_id,active,ttl',
            'tx_crawler_process',
            'process_id = \'' . $pid . '\'  AND deleted=0',
            '',
            'ttl',
            '0,1'
        );
        if ($row = $this->db->sql_fetch_assoc($res)) {
            $ret = intval($row['active']) == 1;
        }
        $this->db->sql_query('COMMIT');

        return $ret;
    }

    /**
     * Create a unique Id for the current process
     *
     * @return string  the ID
     */
    public function CLI_buildProcessId()
    {
        if (!$this->processID) {
            $this->processID = GeneralUtility::shortMD5($this->microtime(true));
        }
        return $this->processID;
    }
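
CLI_buildProcessId() generates the id once and returns the cached value on every later call, which is why the method can be called repeatedly within one run. A standalone sketch of that pattern follows, assuming that `GeneralUtility::shortMD5()` behaves like `substr(md5($input), 0, 10)`; the class name `ProcessIdHolder` is hypothetical.

```php
<?php
/**
 * Hypothetical stand-in for the lazily created process id.
 */
final class ProcessIdHolder
{
    /** @var string */
    private $processId = '';

    public function get(): string
    {
        if ($this->processId === '') {
            // Assumption: mirrors shortMD5(), i.e. the first 10 hex characters of an MD5 hash.
            $this->processId = substr(md5((string) microtime(true)), 0, 10);
        }
        return $this->processId;
    }
}

$holder = new ProcessIdHolder();
$first = $holder->get();
$second = $holder->get();
var_dump($first === $second);
```

An id derived only from the current time is unique per run in practice, but two processes started within the same timestamp resolution could in principle collide.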

    /**
     * @param bool $get_as_float
     *
     * @return mixed
     */
    protected function microtime($get_as_float = false)
    {
        return microtime($get_as_float);
    }

    /**
     * Prints a message to the stdout (only if debug-mode is enabled)
     *
     * @param  string $msg  the message
     */
    public function CLI_debug($msg)
    {
        if (intval($this->extensionSettings['processDebug'])) {
            echo $msg . "\n";
            flush();
        }
    }

    /**
     * Get URL content by making direct request to TYPO3.
     *
     * @param  string $url          Page URL
     * @param  int    $crawlerId    Crawler-ID
     * @return array
     */
    protected function sendDirectRequest($url, $crawlerId)
    {
        $parsedUrl = parse_url($url);
        if (!is_array($parsedUrl)) {
            return [];
        }

        $requestHeaders = $this->buildRequestHeaderArray($parsedUrl, $crawlerId);

        $cmd = escapeshellcmd($this->extensionSettings['phpPath']);
        $cmd .= ' ';
        $cmd .= escapeshellarg(ExtensionManagementUtility::extPath('crawler') . 'cli/bootstrap.php');
        $cmd .= ' ';
        $cmd .= escapeshellarg($this->getFrontendBasePath());
        $cmd .= ' ';
        $cmd .= escapeshellarg($url);
        $cmd .= ' ';
        $cmd .= escapeshellarg(base64_encode(serialize($requestHeaders)));

        $startTime = microtime(true);
        $content = $this->executeShellCommand($cmd);
        $this->log($url . ' ' . (microtime(true) - $startTime));

        $result = [
            'request' => implode("\r\n", $requestHeaders) . "\r\n\r\n",
            'headers' => '',
            'content' => $content,
        ];

        return $result;
    }
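
sendDirectRequest() passes the request headers to the child PHP process as a single shell argument by serializing the array and base64-encoding the result. The receiving script (cli/bootstrap.php) is not part of this excerpt, so the decoding side below is only a guess at the mirror operation; `encodeHeaders()` and `decodeHeaders()` are hypothetical names, and the sample header values are made up.

```php
<?php
/**
 * Encodes a header list the same way sendDirectRequest() does.
 */
function encodeHeaders(array $headers): string
{
    return base64_encode(serialize($headers));
}

/**
 * Hypothetical counterpart: restores the header list from the argument.
 * Returns an empty array when the argument is not valid base64 or not an array.
 */
function decodeHeaders(string $argument): array
{
    $decoded = base64_decode($argument, true);
    if ($decoded === false) {
        return [];
    }
    // Disallow object instantiation while unserializing untrusted input.
    $headers = unserialize($decoded, ['allowed_classes' => false]);
    return is_array($headers) ? $headers : [];
}

$headers = ['GET /index.php?id=1 HTTP/1.0', 'Host: example.org'];
$argument = encodeHeaders($headers);
var_dump(decodeHeaders($argument) === $headers);
```

Base64 keeps the serialized string free of quotes, spaces, and line breaks, so it survives `escapeshellarg()` and the shell unchanged.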

    /**
     * Cleans up entries that stayed for too long in the queue. These are:
     * - processed entries that are over 1.5 days in age
     * - scheduled entries that are over 7 days old
     *
     * @return void
     *
     * TODO: Should be switched back to protected - TNM 2018-11-16
     */
    public function cleanUpOldQueueEntries()
    {
        $processedAgeInSeconds = $this->extensionSettings['cleanUpProcessedAge'] * 86400; // 24*60*60 Seconds in 24 hours
        $scheduledAgeInSeconds = $this->extensionSettings['cleanUpScheduledAge'] * 86400;

        $now = time();
        $condition = '(exec_time<>0 AND exec_time<' . ($now - $processedAgeInSeconds) . ') OR scheduled<=' . ($now - $scheduledAgeInSeconds);
        $this->flushQueue($condition);
    }

    /**
     * Initializes a TypoScript Frontend necessary for using TypoScript and TypoLink functions
     *
     * @param int $id
     * @param int $typeNum
     *
     * @throws \TYPO3\CMS\Core\Error\Http\ServiceUnavailableException
     *
     * @return void
     */
    protected function initTSFE($id = 1, $typeNum = 0)
    {
        EidUtility::initTCA();

        $isVersion7 = VersionNumberUtility::convertVersionNumberToInteger(TYPO3_version) < 8000000;
        if ($isVersion7 && !is_object($GLOBALS['TT'])) {
            // Deprecated: NullTimeTracker and NullTimeTracker::start() are deprecated since TYPO3 v8 and will be removed in v9; use the regular time tracking.
            /** @var NullTimeTracker $GLOBALS['TT'] */
            $GLOBALS['TT'] = new NullTimeTracker();
            $GLOBALS['TT']->start();
        } else {
            $timeTracker = GeneralUtility::makeInstance(TimeTracker::class);
            $timeTracker->start();
        }

        $GLOBALS['TSFE'] = GeneralUtility::makeInstance(TypoScriptFrontendController::class, $GLOBALS['TYPO3_CONF_VARS'], $id, $typeNum);
        $GLOBALS['TSFE']->sys_page = GeneralUtility::makeInstance(PageRepository::class);
        $GLOBALS['TSFE']->sys_page->init(true);
        $GLOBALS['TSFE']->connectToDB();
        $GLOBALS['TSFE']->initFEuser();
        $GLOBALS['TSFE']->determineId();
        $GLOBALS['TSFE']->initTemplate();
        $GLOBALS['TSFE']->rootLine = $GLOBALS['TSFE']->sys_page->getRootLine($id, '');
        $GLOBALS['TSFE']->getConfigArray();
        // Deprecated: PageGenerator::pagegenInit() is deprecated since TYPO3 v8 and will be removed in TYPO3 v9.
        PageGenerator::pagegenInit();
    }

    /**
     * Returns an MD5 hash generated from the serialized configuration array.
     * The 'paramExpanded' and 'URLs' keys are removed before hashing.
     *
     * @param array $configuration
     *
     * @return string
     */
    protected function getConfigurationHash(array $configuration)
    {
        unset($configuration['paramExpanded']);
        unset($configuration['URLs']);
        return md5(serialize($configuration));
    }

    /**
     * Checks whether the crawling protocol should be http or https.
     *
     * @param int  $crawlerConfiguration -1 returns false (http), 0 returns $pageConfiguration, 1 returns true (https)
     * @param bool $pageConfiguration    Value returned when $crawlerConfiguration is 0
     *
     * @return bool True if https should be used, false otherwise (also for unknown values)
     */
    protected function isCrawlingProtocolHttps($crawlerConfiguration, $pageConfiguration)
    {
        switch ($crawlerConfiguration) {
            case -1:
                return false;
            case 0:
                return $pageConfiguration;
            case 1:
                return true;
            default:
                return false;
        }
    }
}
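
The two pure helpers at the end of the class, getConfigurationHash() and isCrawlingProtocolHttps(), can be tried outside TYPO3. The following is a minimal sketch, not the extension's code: the function names `configurationHash` and `crawlingProtocolIsHttps`, the type declarations, and the sample configuration values are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical standalone copy of getConfigurationHash():
 * the derived keys are dropped before the array is serialized and hashed.
 */
function configurationHash(array $configuration): string
{
    unset($configuration['paramExpanded'], $configuration['URLs']);
    return md5(serialize($configuration));
}

/**
 * Hypothetical standalone copy of isCrawlingProtocolHttps():
 * 1 forces https, 0 defers to the page setting, anything else means http.
 */
function crawlingProtocolIsHttps(int $crawlerConfiguration, bool $pageConfiguration): bool
{
    switch ($crawlerConfiguration) {
        case 0:
            return $pageConfiguration;
        case 1:
            return true;
        default: // -1 and any unknown value
            return false;
    }
}

// Sample data (made up): the second array only adds the two ignored keys.
$base = ['baseUrl' => 'https://example.org/', 'procInstrFilter' => 'example_filter'];
$expanded = $base + ['paramExpanded' => ['L' => [0, 1]], 'URLs' => ['https://example.org/?L=0']];

var_dump(configurationHash($base) === configurationHash($expanded)); // bool(true)
var_dump(crawlingProtocolIsHttps(0, true));                          // bool(true)
var_dump(crawlingProtocolIsHttps(-1, true));                         // bool(false)
```

Because serialize() preserves key order, two configurations with the same entries in a different order would still produce different hashes in this sketch, just as in the method above.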
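
The cutoff arithmetic in cleanUpOldQueueEntries() is easier to check when the current time is passed in instead of read from time(). A small sketch under that assumption follows; the function name `buildCleanupCondition` and its parameters are hypothetical, and unlike the method above it rounds the age to whole seconds before subtracting.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper mirroring the condition built in cleanUpOldQueueEntries().
 * Ages are given in days; $now is a Unix timestamp supplied by the caller.
 */
function buildCleanupCondition(int $now, float $processedAgeDays, float $scheduledAgeDays): string
{
    $secondsPerDay = 86400; // 24 * 60 * 60

    // Assumption of this sketch: round to whole seconds so the SQL contains integers only.
    $processedCutoff = $now - (int) round($processedAgeDays * $secondsPerDay);
    $scheduledCutoff = $now - (int) round($scheduledAgeDays * $secondsPerDay);

    return sprintf(
        '(exec_time<>0 AND exec_time<%d) OR scheduled<=%d',
        $processedCutoff,
        $scheduledCutoff
    );
}

// With the ages named in the docblock (1.5 and 7 days) and a made-up timestamp:
echo buildCleanupCondition(1000000, 1.5, 7.0), PHP_EOL;
// (exec_time<>0 AND exec_time<870400) OR scheduled<=395200
```

Passing the timestamp as a parameter keeps the helper deterministic, which makes the two cutoffs straightforward to assert in a unit test.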