Completed
Push — TYPO3_7 ( 85059e...d37b91 )
by Tomas Norre
10:37
created

CrawlerController   F

Complexity

Total Complexity 360

Size/Duplication

Total Lines 2727
Duplicated Lines 0 %

Coupling/Cohesion

Components 1
Dependencies 23

Test Coverage

Coverage 16.9%

Importance

Changes 0
Metric Value
dl 0
loc 2727
ccs 194
cts 1148
cp 0.169
rs 0.8
c 0
b 0
f 0
wmc 360
lcom 1
cbo 23

60 Methods

Rating   Name   Duplication   Size   Complexity  
A setExtensionSettings() 0 4 1
F checkIfPageShouldBeSkipped() 0 57 16
A getUrlsForPageRow() 0 15 3
A getAccessMode() 0 4 1
A setAccessMode() 0 4 1
A setDisabled() 0 10 3
A getDisabled() 0 8 2
A setProcessFilename() 0 4 1
A getProcessFilename() 0 4 1
A drawURLs_PIfilter() 0 12 4
A buildRequestHeaderArray() 0 16 4
B getRequestUrlFrom302Header() 0 34 11
A hasGroupAccess() 0 12 4
A parseParams() 0 15 3
A microtime() 0 4 1
B compileUrls() 0 25 7
A addQueueEntry_callBack() 0 19 3
A getCurrentTime() 0 4 1
C readUrl() 0 82 13
A readUrlFromArray() 0 24 1
A readUrl_exec() 0 38 4
D requestUrl() 0 93 19
B getFrontendBasePath() 0 23 8
A executeShellCommand() 0 5 1
A getHttpResponseFromStream() 0 23 5
A log() 0 9 3
A fe_init() 0 17 4
B getPageTreeAndUrls() 0 87 8
C drawURLs_addRowsForPage() 0 109 13
B CLI_main() 0 41 10
F CLI_main_im() 0 108 17
A CLI_main_flush() 0 38 5
A getConfigurationKeys() 0 5 2
C CLI_run() 0 108 10
A CLI_runHooks() 0 12 4
B CLI_checkAndAcquireNewProcess() 0 56 5
B CLI_releaseProcesses() 0 62 5
A CLI_checkIfProcessIsActive() 0 19 2
A CLI_buildProcessId() 0 7 2
A CLI_debug() 0 7 2
A sendDirectRequest() 0 31 2
A cleanUpOldQueueEntries() 0 9 1
A initTSFE() 0 25 3
A getConfigurationHash() 0 6 1
A isCrawlingProtocolHttps() 0 13 4
A __construct() 0 24 3
A noUnprocessedQueueEntriesForPageWithConfigurationHashExist() 0 8 1
F urlListFromUrlArray() 0 113 20
A getPageTSconfigForId() 0 22 4
F getUrlsForPageId() 0 135 27
A getBaseUrlForConfigurationRecord() 0 20 4
B getConfigurationsForBranch() 0 45 11
F expandParameters() 0 119 26
B getLogEntriesForPageId() 0 29 6
B getLogEntriesForSetId() 0 29 6
B flushQueue() 0 33 8
B addUrl() 0 86 6
B getDuplicateRowsIfExist() 0 40 7
B expandExcludeString() 0 45 9
A CLI_deleteProcessesMarkedDeleted() 0 4 1

How to fix   Complexity   

Complex Class

Complex classes like CrawlerController often do many different things. To break such a class down, first identify a cohesive component within it. A common approach is to look for fields and methods that share the same prefix or suffix. The cohesion graph can also help you spot unconnected or weakly connected components.
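The prefix search described above can be sketched in a few lines. This is only an illustration: groupMethodsByPrefix() is a hypothetical helper, not part of the crawler extension or of this report, and the method names are a hand-picked subset of the table above.

```php
<?php
// Hypothetical helper: groups method names by the text before the first
// underscore, falling back to the leading lowercase verb (get, set, ...).
function groupMethodsByPrefix(array $methodNames)
{
    $groups = [];
    foreach ($methodNames as $name) {
        $underscore = strpos($name, '_');
        if ($underscore !== false && $underscore > 0) {
            // e.g. "CLI_main" => "CLI"
            $prefix = substr($name, 0, $underscore);
        } elseif (preg_match('/^[a-z]+/', $name, $match)) {
            // e.g. "getAccessMode" => "get"
            $prefix = $match[0];
        } else {
            $prefix = '(none)';
        }
        $groups[$prefix][] = $name;
    }

    // Largest groups first: these are the likeliest extraction candidates.
    uasort($groups, function (array $a, array $b) {
        return count($b) - count($a);
    });

    return $groups;
}

$groups = groupMethodsByPrefix([
    'CLI_main', 'CLI_main_im', 'CLI_main_flush', 'CLI_run', 'CLI_runHooks',
    'drawURLs_PIfilter', 'drawURLs_addRowsForPage',
    'getAccessMode', 'getDisabled', 'setAccessMode',
]);
```

With this input the CLI group comes out largest, which suggests the CLI_* methods as a first candidate for extraction.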

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
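As a rough sketch of Extract Class, the marker-file logic behind setDisabled(), getDisabled() and the processFilename field could move into its own class. CrawlerDisableFlag and its method names are assumptions made for this example, and plain PHP file functions stand in for GeneralUtility::writeFile() so the snippet runs on its own.

```php
<?php
// Hypothetical extracted class: owns the "disabled" marker file that
// CrawlerController currently manages via setDisabled()/getDisabled().
class CrawlerDisableFlag
{
    /** @var string */
    private $filename;

    public function __construct($filename)
    {
        $this->filename = $filename;
    }

    public function disable()
    {
        // An empty marker file means "do not process the queue".
        file_put_contents($this->filename, '');
    }

    public function enable()
    {
        if (is_file($this->filename)) {
            unlink($this->filename);
        }
    }

    public function isDisabled()
    {
        return is_file($this->filename);
    }
}

// Usage with a temporary path (the controller itself defaults to typo3temp/tx_crawler.proc).
$flag = new CrawlerDisableFlag(sys_get_temp_dir() . '/tx_crawler_example.proc');
$flag->disable();
$wasDisabled = $flag->isDisabled();
$flag->enable();
$isDisabledAfterEnable = $flag->isDisabled();
```

The controller would then hold one such object and delegate to it, which removes a field and four methods from the large class.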

While breaking up the class, it is a good idea to analyze how other classes use CrawlerController, and based on these observations, apply Extract Interface, too.
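A minimal sketch of Extract Interface, assuming a caller only needs the skip check: PageSkipCheckerInterface, HiddenPageSkipChecker and describePage() are hypothetical names introduced for illustration. Only the method signature of checkIfPageShouldBeSkipped() is taken from the listing.

```php
<?php
// Hypothetical interface: the narrow slice a caller needs from CrawlerController.
interface PageSkipCheckerInterface
{
    /**
     * @param array $pageRow
     * @return false|string false if the page should be crawled, otherwise the skip message
     */
    public function checkIfPageShouldBeSkipped(array $pageRow);
}

// Stand-in implementation for this example; the real controller would
// declare "implements PageSkipCheckerInterface" instead.
class HiddenPageSkipChecker implements PageSkipCheckerInterface
{
    public function checkIfPageShouldBeSkipped(array $pageRow)
    {
        return !empty($pageRow['hidden']) ? 'Because page is hidden' : false;
    }
}

// A caller that type-hints the interface no longer depends on the full controller.
function describePage(PageSkipCheckerInterface $checker, array $pageRow)
{
    $message = $checker->checkIfPageShouldBeSkipped($pageRow);

    return $message === false ? 'crawl' : 'skip: ' . $message;
}

$visible = describePage(new HiddenPageSkipChecker(), ['uid' => 1, 'hidden' => 0]);
$hidden = describePage(new HiddenPageSkipChecker(), ['uid' => 2, 'hidden' => 1]);
```

Callers typed against the interface can be tested with a small stand-in like the one above, without bootstrapping the whole controller.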

<?php
namespace AOE\Crawler\Controller;

/***************************************************************
 *  Copyright notice
 *
 *  (c) 2017 AOE GmbH <[email protected]>
 *
 *  All rights reserved
 *
 *  This script is part of the TYPO3 project. The TYPO3 project is
 *  free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  The GNU General Public License can be found at
 *  http://www.gnu.org/copyleft/gpl.html.
 *
 *  This script is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  This copyright notice MUST APPEAR in all copies of the script!
 ***************************************************************/

use AOE\Crawler\Command\CrawlerCommandLineController;
use AOE\Crawler\Command\FlushCommandLineController;
use AOE\Crawler\Command\QueueCommandLineController;
use AOE\Crawler\Domain\Model\Configuration;
use AOE\Crawler\Domain\Model\Reason;
use AOE\Crawler\Domain\Repository\ConfigurationRepository;
use AOE\Crawler\Domain\Repository\ProcessRepository;
use AOE\Crawler\Domain\Repository\QueueRepository;
use AOE\Crawler\Event\EventDispatcher;
use AOE\Crawler\Utility\IconUtility;
use AOE\Crawler\Utility\SignalSlotUtility;
use TYPO3\CMS\Backend\Tree\View\PageTreeView;
use TYPO3\CMS\Backend\Utility\BackendUtility;
use TYPO3\CMS\Core\Authentication\BackendUserAuthentication;
use TYPO3\CMS\Core\Database\DatabaseConnection;
use TYPO3\CMS\Core\Log\LogLevel;
use TYPO3\CMS\Core\TimeTracker\NullTimeTracker;
use TYPO3\CMS\Core\TimeTracker\TimeTracker;
use TYPO3\CMS\Core\Utility\DebugUtility;
use TYPO3\CMS\Core\Utility\ExtensionManagementUtility;
use TYPO3\CMS\Core\Utility\GeneralUtility;
use TYPO3\CMS\Core\Utility\MathUtility;
use TYPO3\CMS\Core\Utility\VersionNumberUtility;
use TYPO3\CMS\Extbase\Object\ObjectManager;
use TYPO3\CMS\Frontend\Controller\TypoScriptFrontendController;
use TYPO3\CMS\Frontend\Page\PageGenerator;
use TYPO3\CMS\Frontend\Page\PageRepository;
use TYPO3\CMS\Frontend\Utility\EidUtility;
use TYPO3\CMS\Lang\LanguageService;

/**
 * Class CrawlerController
 *
 * @package AOE\Crawler\Controller
 */
class CrawlerController
{
    const CLI_STATUS_NOTHING_PROCCESSED = 0;
    const CLI_STATUS_REMAIN = 1; //queue not empty
    const CLI_STATUS_PROCESSED = 2; //(some) queue items were processed
    const CLI_STATUS_ABORTED = 4; //instance didn't finish
    const CLI_STATUS_POLLABLE_PROCESSED = 8;

    /**
     * @var integer
     */
    public $setID = 0;

    /**
     * @var string
     */
    public $processID = '';

    /**
     * One hour is max stalled time for the CLI
     * If the process had the status "start" for 3600 seconds, it will be regarded stalled and a new process is started
     *
     * @var integer
     */
    public $max_CLI_exec_time = 3600;

    /**
     * @var array
     */
    public $duplicateTrack = [];

    /**
     * @var array
     */
    public $downloadUrls = [];

    /**
     * @var array
     */
    public $incomingProcInstructions = [];

    /**
     * @var array
     */
    public $incomingConfigurationSelection = [];

    /**
     * @var bool
     */
    public $registerQueueEntriesInternallyOnly = false;

    /**
     * @var array
     */
    public $queueEntries = [];

    /**
     * @var array
     */
    public $urlList = [];

    /**
     * @var boolean
     */
    public $debugMode = false;

    /**
     * @var array
     */
    public $extensionSettings = [];

    /**
     * Mount Point (false if unset, otherwise a string that getPageTSconfigForId() splits on "-")
     *
     * @var string|boolean
     */
    public $MP = false;

    /**
     * @var string
     */
    protected $processFilename;

    /**
     * Holds the internal access mode; can be 'gui', 'cli' or 'cli_im'
     *
     * @var string
     */
    protected $accessMode;

    /**
     * @var DatabaseConnection
     */
    private $db;

    /**
     * @var BackendUserAuthentication
     */
    private $backendUser;

    /**
     * @var integer
     */
    private $scheduledTime = 0;

    /**
     * @var integer
     */
    private $reqMinute = 0;

    /**
     * @var bool
     */
    private $submitCrawlUrls = false;

    /**
     * @var bool
     */
    private $downloadCrawlUrls = false;

    /**
     * @var QueueRepository
     */
    protected $queueRepository;

    /**
     * @var ProcessRepository
     */
    protected $processRepository;

    /**
     * @var ConfigurationRepository
     */
    protected $configurationRepository;

    /**
     * Method to get the accessMode; can be gui, cli or cli_im
     *
     * @return string
     */
    public function getAccessMode()
    {
        return $this->accessMode;
    }

    /**
     * @param string $accessMode
     */
    public function setAccessMode($accessMode)
    {
        $this->accessMode = $accessMode;
    }

    /**
     * Set disabled status to prevent processes from being processed
     *
     * @param  bool $disabled (optional, defaults to true)
     * @return void
     */
    public function setDisabled($disabled = true)
    {
        if ($disabled) {
            GeneralUtility::writeFile($this->processFilename, '');
        } else {
            if (is_file($this->processFilename)) {
                unlink($this->processFilename);
            }
        }
    }

    /**
     * Get disable status
     *
     * @return bool true if disabled
     */
    public function getDisabled()
    {
        // The marker file exists only while the crawler is disabled.
        return is_file($this->processFilename);
    }

    /**
     * @param string $filenameWithPath
     *
     * @return void
     */
    public function setProcessFilename($filenameWithPath)
    {
        $this->processFilename = $filenameWithPath;
    }

    /**
     * @return string
     */
    public function getProcessFilename()
    {
        return $this->processFilename;
    }

    /************************************
     *
     * Getting URLs based on Page TSconfig
     *
     ************************************/

    public function __construct()
    {
        $objectManager = GeneralUtility::makeInstance(ObjectManager::class);
        $this->queueRepository = $objectManager->get(QueueRepository::class);
        $this->configurationRepository = $objectManager->get(ConfigurationRepository::class);
        $this->processRepository = $objectManager->get(ProcessRepository::class);

        $this->db = $GLOBALS['TYPO3_DB'];
        $this->backendUser = $GLOBALS['BE_USER'];
        $this->processFilename = PATH_site . 'typo3temp/tx_crawler.proc';

        $settings = unserialize($GLOBALS['TYPO3_CONF_VARS']['EXT']['extConf']['crawler']);
        $settings = is_array($settings) ? $settings : [];

        // read ext_em_conf_template settings and set
        $this->setExtensionSettings($settings);

        // set defaults:
        if (MathUtility::convertToPositiveInteger($this->extensionSettings['countInARun']) == 0) {
            $this->extensionSettings['countInARun'] = 100;
        }

        $this->extensionSettings['processLimit'] = MathUtility::forceIntegerInRange($this->extensionSettings['processLimit'], 1, 99, 1);
    }

    /**
     * Sets the extensions settings (unserialized pendant of $TYPO3_CONF_VARS['EXT']['extConf']['crawler']).
     *
     * @param array $extensionSettings
     * @return void
     */
    public function setExtensionSettings(array $extensionSettings)
    {
        $this->extensionSettings = $extensionSettings;
    }

    /**
     * Check if the given page should be crawled
     *
     * @param array $pageRow
     * @return false|string false if the page should be crawled (not excluded), otherwise the skip message
     */
    public function checkIfPageShouldBeSkipped(array $pageRow)
    {
        $skipPage = false;
        $skipMessage = 'Skipped'; // message will be overwritten later

        // if page is hidden
        if (!$this->extensionSettings['crawlHiddenPages']) {
            if ($pageRow['hidden']) {
                $skipPage = true;
                $skipMessage = 'Because page is hidden';
            }
        }

        if (!$skipPage) {
            if (GeneralUtility::inList('3,4', $pageRow['doktype']) || $pageRow['doktype'] >= 199) {
                $skipPage = true;
                $skipMessage = 'Because doktype is not allowed';
            }
        }

        if (!$skipPage) {
            if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['excludeDoktype'])) {
                foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['excludeDoktype'] as $key => $doktypeList) {
                    if (GeneralUtility::inList($doktypeList, $pageRow['doktype'])) {
                        $skipPage = true;
                        $skipMessage = 'Doktype was excluded by "' . $key . '"';
                        break;
                    }
                }
            }
        }

        if (!$skipPage) {
            // veto hook
            if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pageVeto'])) {
                foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pageVeto'] as $key => $func) {
                    $params = [
                        'pageRow' => $pageRow,
                    ];
                    // expects "false" if page is ok and "true" or a skipMessage if this page should _not_ be crawled
                    $veto = GeneralUtility::callUserFunction($func, $params, $this);
                    if ($veto !== false) {
                        $skipPage = true;
                        if (is_string($veto)) {
                            $skipMessage = $veto;
                        } else {
                            $skipMessage = 'Veto from hook "' . htmlspecialchars($key) . '"';
                        }
                        // no need to execute other hooks if a previous one returned a veto
                        break;
                    }
                }
            }
        }

        return $skipPage ? $skipMessage : false;
    }

    /**
     * Wrapper method for getUrlsForPageId()
     * It returns an array of configurations and no urls!
     *
     * @param array $pageRow Page record with at least doktype and uid columns.
     * @param string $skipMessage
     * @return array
     * @see getUrlsForPageId()
     */
    public function getUrlsForPageRow(array $pageRow, &$skipMessage = '')
    {
        $message = $this->checkIfPageShouldBeSkipped($pageRow);

        if ($message === false) {
            $forceSsl = ($pageRow['url_scheme'] === 2);
            $res = $this->getUrlsForPageId($pageRow['uid'], $forceSsl);
            $skipMessage = '';
        } else {
            $skipMessage = $message;
            $res = [];
        }

        return $res;
    }

    /**
     * This method is used to count if there are ANY unprocessed queue entries
     * of a given page_id and the configuration which matches a given hash.
     * If there are none, we can skip the inner detail check
     *
     * @param  int $uid
     * @param  string $configurationHash
     * @return boolean
     */
    protected function noUnprocessedQueueEntriesForPageWithConfigurationHashExist($uid, $configurationHash)
    {
        $configurationHash = $this->db->fullQuoteStr($configurationHash, 'tx_crawler_queue');
        $res = $this->db->exec_SELECTquery('count(*) as anz', 'tx_crawler_queue', "page_id=" . intval($uid) . " AND configuration_hash=" . $configurationHash . " AND exec_time=0");
        $row = $this->db->sql_fetch_assoc($res);

        return ($row['anz'] == 0);
    }

    /**
     * Creates a list of URLs from input array (and submits them to queue if asked for)
     * See Web > Info module script + "indexed_search"'s crawler hook-client using this!
     *
     * @param array $vv Information about URLs from pageRow to crawl.
     * @param array $pageRow Page row
     * @param integer $scheduledTime Unix time to schedule indexing to, typically time()
     * @param integer $reqMinute Number of requests per minute (creates the interleave between requests)
     * @param boolean $submitCrawlUrls If set, submits the URLs to queue
     * @param boolean $downloadCrawlUrls If set (and $submitCrawlUrls is false), fills $downloadUrls with entries
     * @param array $duplicateTrack Passed by reference; holds one key per URL to ensure duplicates are not crawled
     * @param array $downloadUrls Passed by reference; filled with URLs for download if the flag is set
     * @param array $incomingProcInstructions Array of processing instructions
     * @return string List of URLs (meant for display in backend module)
     *
     */
    public function urlListFromUrlArray(
        array $vv,
        array $pageRow,
        $scheduledTime,
        $reqMinute,
        $submitCrawlUrls,
        $downloadCrawlUrls,
        array &$duplicateTrack,
        array &$downloadUrls,
        array $incomingProcInstructions
    ) {
        $urlList = '';
        // realurl support (thanks to Ingo Renner)
        if (ExtensionManagementUtility::isLoaded('realurl') && $vv['subCfg']['realurl']) {

            /** @var tx_realurl $urlObj */
            $urlObj = GeneralUtility::makeInstance('tx_realurl');

            if (!empty($vv['subCfg']['baseUrl'])) {
                $urlParts = parse_url($vv['subCfg']['baseUrl']);
                $host = strtolower($urlParts['host']);
                $urlObj->host = $host;

                // First pass, finding configuration OR pointer string:
                $urlObj->extConf = isset($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->host]) ? $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->host] : $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl']['_DEFAULT'];

                // If it turned out to be a string pointer, then look up the real config:
                if (is_string($urlObj->extConf)) {
                    $urlObj->extConf = is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->extConf]) ? $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->extConf] : $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl']['_DEFAULT'];
                }
            }

            if (!$GLOBALS['TSFE']->sys_page) {
                $GLOBALS['TSFE']->sys_page = GeneralUtility::makeInstance('TYPO3\CMS\Frontend\Page\PageRepository');
            }

            if (!$GLOBALS['TSFE']->tmpl->rootLine[0]['uid']) {
                $GLOBALS['TSFE']->tmpl->rootLine[0]['uid'] = $urlObj->extConf['pagePath']['rootpage_id'];
            }
        }

        if (is_array($vv['URLs'])) {
            $configurationHash = $this->getConfigurationHash($vv);
            $skipInnerCheck = $this->noUnprocessedQueueEntriesForPageWithConfigurationHashExist($pageRow['uid'], $configurationHash);

            foreach ($vv['URLs'] as $urlQuery) {
                if ($this->drawURLs_PIfilter($vv['subCfg']['procInstrFilter'], $incomingProcInstructions)) {

                    // Calculate cHash:
                    if ($vv['subCfg']['cHash']) {
                        /* @var $cacheHash \TYPO3\CMS\Frontend\Page\CacheHashCalculator */
                        $cacheHash = GeneralUtility::makeInstance('TYPO3\CMS\Frontend\Page\CacheHashCalculator');
                        $urlQuery .= '&cHash=' . $cacheHash->generateForParameters($urlQuery);
                    }

                    // Create key by which to determine unique-ness:
                    $uKey = $urlQuery . '|' . $vv['subCfg']['userGroups'] . '|' . $vv['subCfg']['baseUrl'] . '|' . $vv['subCfg']['procInstrFilter'];

                    // realurl support (thanks to Ingo Renner)
                    $urlQuery = 'index.php' . $urlQuery;
                    if (ExtensionManagementUtility::isLoaded('realurl') && $vv['subCfg']['realurl']) {
                        $params = [
                            'LD' => [
                                'totalURL' => $urlQuery,
                            ],
                            'TCEmainHook' => true,
                        ];
                        $urlObj->encodeSpURL($params);
Bug introduced by
The variable $urlObj does not seem to be defined for all execution paths leading up to this point.

If you define a variable conditionally, it can happen that it is not defined for all execution paths.

Let’s take a look at an example:

function myFunction($a) {
    switch ($a) {
        case 'foo':
            $x = 1;
            break;

        case 'bar':
            $x = 2;
            break;
    }

    // $x is potentially undefined here.
    echo $x;
}

In the above example, the variable $x is defined if you pass “foo” or “bar” as argument for $a. However, since the switch statement has no default case statement, if you pass any other value, the variable $x would be undefined.

Available Fixes

  1. Check for existence of the variable explicitly:

    function myFunction($a) {
        switch ($a) {
            case 'foo':
                $x = 1;
                break;
    
            case 'bar':
                $x = 2;
                break;
        }
    
        if (isset($x)) { // Make sure it's always set.
            echo $x;
        }
    }
    
  2. Define a default value for the variable:

    function myFunction($a) {
        $x = ''; // Set a default which gets overridden for certain paths.
        switch ($a) {
            case 'foo':
                $x = 1;
                break;
    
            case 'bar':
                $x = 2;
                break;
        }
    
        echo $x;
    }
    
  3. Add a value for the missing path:

    function myFunction($a) {
        switch ($a) {
            case 'foo':
                $x = 1;
                break;
    
            case 'bar':
                $x = 2;
                break;
    
            // We add support for the missing case.
            default:
                $x = '';
                break;
        }
    
        echo $x;
    }
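Applied to the flagged call, fixes 1 and 2 could be combined roughly as below. This is a sketch under assumptions, not the project's actual change: ExampleUrlEncoder and buildUrl() are hypothetical stand-ins, and only the shape of the encodeSpURL($params) call is taken from the listing.

```php
<?php
// Hypothetical stand-in for the realurl encoder; only the call shape matters here.
class ExampleUrlEncoder
{
    public function encodeSpURL(array &$params)
    {
        $params['LD']['totalURL'] = 'encoded/' . $params['LD']['totalURL'];
    }
}

function buildUrl($urlQuery, $useEncoder)
{
    // Default value, so the variable exists on every execution path (fix 2).
    $urlObj = null;
    if ($useEncoder) {
        $urlObj = new ExampleUrlEncoder();
    }

    // Explicit check before use (fix 1).
    if ($urlObj !== null) {
        $params = ['LD' => ['totalURL' => $urlQuery]];
        $urlObj->encodeSpURL($params);
        $urlQuery = $params['LD']['totalURL'];
    }

    return $urlQuery;
}
```

In urlListFromUrlArray() the two realurl conditions are identical, so the object is likely always set when it is used. Guarding on the object itself makes that dependency explicit and visible to static analysis.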
    
                        $urlQuery = $params['LD']['totalURL'];
                    }

                    // Scheduled time:
                    $schTime = $scheduledTime + round(count($duplicateTrack) * (60 / $reqMinute));
                    $schTime = floor($schTime / 60) * 60;

                    if (isset($duplicateTrack[$uKey])) {

                        // if the url key is already registered, just display it and do not resubmit it
                        $urlList = '<em><span class="typo3-dimmed">' . htmlspecialchars($urlQuery) . '</span></em><br/>';
                    } else {
                        $urlList = '[' . date('d.m.y H:i', $schTime) . '] ' . htmlspecialchars($urlQuery);
                        $this->urlList[] = '[' . date('d.m.y H:i', $schTime) . '] ' . $urlQuery;

                        $theUrl = ($vv['subCfg']['baseUrl'] ? $vv['subCfg']['baseUrl'] : GeneralUtility::getIndpEnv('TYPO3_SITE_URL')) . $urlQuery;

                        // Submit for crawling!
                        if ($submitCrawlUrls) {
                            $added = $this->addUrl(
                                $pageRow['uid'],
                                $theUrl,
                                $vv['subCfg'],
                                $scheduledTime,
                                $configurationHash,
                                $skipInnerCheck
                            );
                            if ($added === false) {
                                $urlList .= ' (Url already existed)';
                            }
                        } elseif ($downloadCrawlUrls) {
                            $downloadUrls[$theUrl] = $theUrl;
                        }

                        $urlList .= '<br />';
                    }
                    $duplicateTrack[$uKey] = true;
                }
            }
        } else {
            $urlList = 'ERROR - no URL generated';
        }

        return $urlList;
    }

    /**
     * Returns true if input processing instruction is among registered ones.
     *
     * @param string $piString PI to test
     * @param array $incomingProcInstructions Processing instructions
     * @return boolean
     */
    public function drawURLs_PIfilter($piString, array $incomingProcInstructions)
    {
        if (empty($incomingProcInstructions)) {
            return true;
        }

        foreach ($incomingProcInstructions as $pi) {
            if (GeneralUtility::inList($piString, $pi)) {
                return true;
            }
        }

        // No instruction matched: return an explicit boolean, as documented, instead of null.
        return false;
    }

    public function getPageTSconfigForId($id)
    {
        if (!$this->MP) {
            $pageTSconfig = BackendUtility::getPagesTSconfig($id);
        } else {
            list(, $mountPointId) = explode('-', $this->MP);
            $pageTSconfig = BackendUtility::getPagesTSconfig($mountPointId);
        }

        // Call a hook to alter configuration
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['getPageTSconfigForId'])) {
            $params = [
                'pageId' => $id,
                'pageTSConfig' => &$pageTSconfig,
            ];
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['getPageTSconfigForId'] as $userFunc) {
                GeneralUtility::callUserFunction($userFunc, $params, $this);
            }
        }

        return $pageTSconfig;
    }

    /**
     * This method returns an array of configurations,
     * not URLs!
     *
     * @param integer $id Page ID
     * @param bool $forceSsl Use https
     * @return array
     *
     * TODO: Should be switched back to protected - TNM 2018-11-16
     */
    public function getUrlsForPageId($id, $forceSsl = false)
    {

        /**
         * Get configuration from tsConfig
         */

        // Get page TSconfig for page ID:
        $pageTSconfig = $this->getPageTSconfigForId($id);

        $res = [];

        if (\is_array($pageTSconfig) && \is_array($pageTSconfig['tx_crawler.']['crawlerCfg.'])) {
            $crawlerCfg = $pageTSconfig['tx_crawler.']['crawlerCfg.'];

            if (\is_array($crawlerCfg['paramSets.'])) {
                foreach ($crawlerCfg['paramSets.'] as $key => $values) {

                    $key = str_replace('.', '', $key);
                    // Sub configuration for a single configuration string:
                    $subCfg = (array)$crawlerCfg['paramSets.'][$key . '.'];
                    $subCfg['key'] = $key;
                    $res[$key] = isset($res[$key]) ? $res[$key] : [];

                    if (!\is_array($values)) {
                        $res[$key]['paramParsed'] = $this->parseParams($values);
                        $res[$key]['paramExpanded'] = $this->expandParameters($res[$key]['paramParsed'], $id);
                    }

                    if (\is_array($values)) {
                        if (strcmp($subCfg['procInstrFilter'], '')) {
                            $subCfg['procInstrFilter'] = implode(',', GeneralUtility::trimExplode(',', $subCfg['procInstrFilter']));
                        }
                        $pidOnlyList = implode(',', GeneralUtility::trimExplode(',', $subCfg['pidsonly'], true));

                        // process configuration if it is not page-specific or if the specific page is the current page:
                        if (!strcmp($subCfg['pidsonly'], '') || GeneralUtility::inList($pidOnlyList, $id)) {

                            // add trailing slash if not present
                            if (!empty($subCfg['baseUrl']) && substr($subCfg['baseUrl'], -1) !== '/') {
                                $subCfg['baseUrl'] .= '/';
                            }

                            // Explode, process etc.:

                            $res[$key]['subCfg'] = $subCfg;
                            $res[$key]['origin'] = 'pagets';

                            // recognize MP value
                            if (!$this->MP) {
                                $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $id]);
                            } else {
                                $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $id . '&MP=' . $this->MP]);
                            }
                        }
                    }
                }
            }
        }

        /**
         * Get configuration from tx_crawler_configuration records
         */

        // get records along the rootline
        $rootLine = BackendUtility::BEgetRootLine($id);
        foreach ($rootLine as $page) {
            $configurationRecordsForCurrentPage = $this->configurationRepository->getConfigurationRecordsPageUid($page['uid'])->toArray();

            /** @var Configuration $configurationRecord */
            foreach ($configurationRecordsForCurrentPage as $configurationRecord) {

                // check access to the configuration record
                if (empty($configurationRecord->getBeGroups()) || $GLOBALS['BE_USER']->isAdmin() || $this->hasGroupAccess($GLOBALS['BE_USER']->user['usergroup_cached_list'], $configurationRecord->getBeGroups())) {
                    $pidOnlyList = implode(',', GeneralUtility::trimExplode(',', $configurationRecord->getPidsOnly(), true));

                    // process configuration if it is not page-specific or if the specific page is the current page:
                    if (!strcmp($configurationRecord->getPidsOnly(), '') || GeneralUtility::inList($pidOnlyList, $id)) {
                        $key = $configurationRecord->getName();

                        // don't overwrite previously defined paramSets
                        if (!isset($res[$key])) {

                            /* @var $TSparserObject \TYPO3\CMS\Core\TypoScript\Parser\TypoScriptParser */
                            $TSparserObject = GeneralUtility::makeInstance('TYPO3\CMS\Core\TypoScript\Parser\TypoScriptParser');
                            // Todo: Check where the field processing_instructions_parameters_ts comes from.
                            $TSparserObject->parse($configurationRecord->getProcessingInstructionFilter()); //['processing_instruction_parameters_ts']);

                            $isCrawlingProtocolHttps = $this->isCrawlingProtocolHttps($configurationRecord->isForceSsl(), $forceSsl);

                            $subCfg = [
                                'procInstrFilter' => $configurationRecord->getProcessingInstructionFilter(),
                                'procInstrParams.' => $TSparserObject->setup,
                                'baseUrl' => $this->getBaseUrlForConfigurationRecord(
                                    $configurationRecord->getBaseUrl(),
                                    $configurationRecord->getSysDomainBaseUrl(),
                                    $isCrawlingProtocolHttps
                                ),
                                'realurl' => $configurationRecord->getRealUrl(),
                                'cHash' => $configurationRecord->getCHash(),
                                'userGroups' => $configurationRecord->getFeGroups(),
                                'exclude' => $configurationRecord->getExclude(),
                                'rootTemplatePid' => (int)$configurationRecord->getRootTemplatePid(),
                                'key' => $key,
                            ];

                            // add trailing slash if not present
                            if (!empty($subCfg['baseUrl']) && substr($subCfg['baseUrl'], -1) != '/') {
                                $subCfg['baseUrl'] .= '/';
                            }
                            if (!in_array($id, $this->expandExcludeString($subCfg['exclude']))) {
                                $res[$key] = [];
                                $res[$key]['subCfg'] = $subCfg;
                                $res[$key]['paramParsed'] = $this->parseParams($configurationRecord->getConfiguration());
                                $res[$key]['paramExpanded'] = $this->expandParameters($res[$key]['paramParsed'], $id);
712
                                $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $id]);
713
                                $res[$key]['origin'] = 'tx_crawler_configuration_' . $configurationRecord->getUid();
714
                            }
715
                        }
716
                    }
717
                }
718
            }
719
        }
720
721
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['processUrls'])) {
722
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['processUrls'] as $func) {
723
                $params = [
724
                    'res' => &$res,
725
                ];
726
                GeneralUtility::callUserFunction($func, $params, $this);
727
            }
728
        }
729
730
        return $res;
731
    }
732
733
    /**
734
     * Checks whether a domain record exists and returns the base URL built from that record. Otherwise the given baseUrl string is returned.
735
     *
736
     * @param string $baseUrl
737
     * @param integer $sysDomainUid
738
     * @param bool $ssl
739
     * @return string
740
     */
741
    protected function getBaseUrlForConfigurationRecord($baseUrl, $sysDomainUid, $ssl = false)
742
    {
743
        $sysDomainUid = intval($sysDomainUid);
744
        $urlScheme = ($ssl === false) ? 'http' : 'https';
745
746
        if ($sysDomainUid > 0) {
747
            $res = $this->db->exec_SELECTquery(
748
                '*',
749
                'sys_domain',
750
                'uid = ' . $sysDomainUid .
751
                BackendUtility::BEenableFields('sys_domain') .
752
                BackendUtility::deleteClause('sys_domain')
753
            );
754
            $row = $this->db->sql_fetch_assoc($res);
755
            if ($row['domainName'] != '') {
756
                return $urlScheme . '://' . $row['domainName'];
757
            }
758
        }
759
        return $baseUrl;
760
    }
761
762
    public function getConfigurationsForBranch($rootid, $depth)
763
    {
764
        $configurationsForBranch = [];
765
766
        $pageTSconfig = $this->getPageTSconfigForId($rootid);
767
        if (is_array($pageTSconfig) && is_array($pageTSconfig['tx_crawler.']['crawlerCfg.']) && is_array($pageTSconfig['tx_crawler.']['crawlerCfg.']['paramSets.'])) {
768
            $sets = $pageTSconfig['tx_crawler.']['crawlerCfg.']['paramSets.'];
769
            if (is_array($sets)) {
770
                foreach ($sets as $key => $value) {
771
                    if (!is_array($value)) {
772
                        continue;
773
                    }
774
                    $configurationsForBranch[] = substr($key, -1) == '.' ? substr($key, 0, -1) : $key;
775
                }
776
            }
777
        }
778
        $pids = [];
779
        $rootLine = BackendUtility::BEgetRootLine($rootid);
780
        foreach ($rootLine as $node) {
781
            $pids[] = $node['uid'];
782
        }
783
        /* @var PageTreeView $tree */
784
        $tree = GeneralUtility::makeInstance(PageTreeView::class);
785
        $perms_clause = $GLOBALS['BE_USER']->getPagePermsClause(1);
786
        $tree->init('AND ' . $perms_clause);
787
        $tree->getTree($rootid, $depth, '');
788
        foreach ($tree->tree as $node) {
789
            $pids[] = $node['row']['uid'];
790
        }
791
792
        $res = $this->db->exec_SELECTquery(
793
            '*',
794
            'tx_crawler_configuration',
795
            'pid IN (' . implode(',', $pids) . ') ' .
796
            BackendUtility::BEenableFields('tx_crawler_configuration') .
797
            BackendUtility::deleteClause('tx_crawler_configuration') . ' ' .
798
            BackendUtility::versioningPlaceholderClause('tx_crawler_configuration') . ' '
799
        );
800
801
        while ($row = $this->db->sql_fetch_assoc($res)) {
802
            $configurationsForBranch[] = $row['name'];
803
        }
804
        $this->db->sql_free_result($res);
805
        return $configurationsForBranch;
806
    }
807
808
    /**
809
     * Check if a user has access to an item
810
     * (e.g. get the group list of the current logged in user from $GLOBALS['TSFE']->gr_list)
811
     *
812
     * @see \TYPO3\CMS\Frontend\Page\PageRepository::getMultipleGroupsWhereClause()
813
     * @param  string $groupList    Comma-separated list of (fe_)group UIDs from a user
814
     * @param  string $accessList   Comma-separated list of (fe_)group UIDs of the item to access
815
     * @return bool                 TRUE if at least one of the user's group UIDs is in the access list or the access list is empty
816
     */
817 3
    public function hasGroupAccess($groupList, $accessList)
818
    {
819 3
        if (empty($accessList)) {
820 1
            return true;
821
        }
822 2
        foreach (GeneralUtility::intExplode(',', $groupList) as $groupUid) {
823 2
            if (GeneralUtility::inList($accessList, $groupUid)) {
824 2
                return true;
825
            }
826
        }
827 1
        return false;
828
    }
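
The group check above can be tried outside TYPO3 with a small stand-in. This is only a sketch, not the class method: hasGroupAccessSketch() is a hypothetical name, and plain explode()/in_array() stand in for GeneralUtility::intExplode() and GeneralUtility::inList(), so whitespace handling may differ from the original.

```php
<?php
// Hypothetical stand-alone variant of hasGroupAccess(), for illustration only.
function hasGroupAccessSketch($groupList, $accessList)
{
    // An empty access list means the item is not restricted at all.
    if (empty($accessList)) {
        return true;
    }
    // Assumption: both lists are plain comma-separated integers.
    $accessUids = array_map('intval', explode(',', $accessList));
    foreach (array_map('intval', explode(',', $groupList)) as $groupUid) {
        // Access is granted as soon as one of the user's group UIDs matches.
        if (in_array($groupUid, $accessUids, true)) {
            return true;
        }
    }
    return false;
}

var_dump(hasGroupAccessSketch('1,5', '5,8')); // the lists share UID 5
var_dump(hasGroupAccessSketch('1,2', '5,8')); // the lists share no UID
```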
829
830
    /**
831
     * Parse GET vars of input Query into array with key=>value pairs
832
     *
833
     * @param string $inputQuery Input query string
834
     * @return array
835
     */
836 3
    public function parseParams($inputQuery)
837
    {
838
        // Extract all GET parameters into an ARRAY:
839 3
        $paramKeyValues = [];
840 3
        $GETparams = explode('&', $inputQuery);
841
842 3
        foreach ($GETparams as $paramAndValue) {
843 3
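            // Pad to two elements so a parameter without "=" yields an empty value instead of an undefined offset
            list($p, $v) = array_pad(explode('=', $paramAndValue, 2), 2, '');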
844 3
            if (strlen($p)) {
845 3
                $paramKeyValues[rawurldecode($p)] = rawurldecode($v);
846
            }
847
        }
848
849 3
        return $paramKeyValues;
850
    }
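
To see what parseParams() hands on to expandParameters(), the parsing step can be reproduced in isolation. The sketch below assumes a stand-alone function with the hypothetical name parseQueryParams(); the table name tx_demo_item in the sample query is made up and is only treated as a string here.

```php
<?php
// Hypothetical stand-alone copy of the parsing step, for illustration only.
function parseQueryParams($inputQuery)
{
    $paramKeyValues = [];
    foreach (explode('&', $inputQuery) as $paramAndValue) {
        // Split on the first "=" only; pad so a bare name gets an empty value.
        list($name, $value) = array_pad(explode('=', $paramAndValue, 2), 2, '');
        if (strlen($name)) {
            // Names and values are URL-decoded, e.g. %5B becomes "[".
            $paramKeyValues[rawurldecode($name)] = rawurldecode($value);
        }
    }
    return $paramKeyValues;
}

// The bracketed values are left untouched at this stage; expanding them is a later step.
var_export(parseQueryParams('&L=[0-2]&tx_demo%5Bitem%5D=[_TABLE:tx_demo_item]&flag'));
```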
851
852
    /**
853
     * Will expand the parameters configuration to individual values. This follows a certain syntax of the value of each parameter.
854
     * Syntax of values:
855
     * - Basically: If the value is wrapped in [...] it will be expanded according to the following syntax, otherwise the value is taken literally
856
     * - Configuration is split by "|" and the parts are processed individually and finally added together
857
     * - For each configuration part:
858
     *         - "[int]-[int]" = Integer range, will be expanded to all values in between, values included, starting from low to high (max. 1000). Example "1-34" or "-40--30"
859
     *         - "_TABLE:[TCA table name];[_PID:[optional page id, default is current page]];[_ENABLELANG:1]" = Look up of table records from PID, filtering out deleted records. Example "_TABLE:tt_content; _PID:123"
860
     *        _ENABLELANG:1 picks only original records without their language overlays
861
     *         - Default: Literal value
862
     *
863
     * @param array $paramArray Array with key (GET var name) and values (value of GET var which is configuration for expansion)
864
     * @param integer $pid Current page ID
865
     * @return array
866
     */
867
    public function expandParameters($paramArray, $pid)
868
    {
869
        global $TCA;
870
871
        // Traverse parameter names:
872
        foreach ($paramArray as $p => $v) {
873
            $v = trim($v);
874
875
            // If value is encapsulated in square brackets it means there are some ranges of values to find, otherwise the value is literal
876
            if (substr($v, 0, 1) === '[' && substr($v, -1) === ']') {
877
                // So, find the value inside brackets and reset the paramArray value as an array.
878
                $v = substr($v, 1, -1);
879
                $paramArray[$p] = [];
880
881
                // Explode parts and traverse them:
882
                $parts = explode('|', $v);
883
                foreach ($parts as $pV) {
884
885
                    // Look for integer range (e.g. 1-34, or -40--30, which reads minus 40 to minus 30)
886
                    if (preg_match('/^(-?[0-9]+)\s*-\s*(-?[0-9]+)$/', trim($pV), $reg)) {
887
888
                        // Swap if first is larger than last:
889
                        if ($reg[1] > $reg[2]) {
890
                            $temp = $reg[2];
891
                            $reg[2] = $reg[1];
892
                            $reg[1] = $temp;
893
                        }
894
895
                        // Traverse range, add values:
896
                        $runAwayBrake = 1000; // Limit to size of range!
897
                        for ($a = $reg[1]; $a <= $reg[2]; $a++) {
898
                            $paramArray[$p][] = $a;
899
                            $runAwayBrake--;
900
                            if ($runAwayBrake <= 0) {
901
                                break;
902
                            }
903
                        }
904
                    } elseif (substr(trim($pV), 0, 7) == '_TABLE:') {
905
906
                        // Parse parameters:
907
                        $subparts = GeneralUtility::trimExplode(';', $pV);
908
                        $subpartParams = [];
909
                        foreach ($subparts as $spV) {
910
                            list($pKey, $pVal) = GeneralUtility::trimExplode(':', $spV);
911
                            $subpartParams[$pKey] = $pVal;
912
                        }
913
914
                        // Table exists:
915
                        if (isset($TCA[$subpartParams['_TABLE']])) {
916
                            $lookUpPid = isset($subpartParams['_PID']) ? intval($subpartParams['_PID']) : intval($pid);
917
                            $recursiveDepth = isset($subpartParams['_RECURSIVE']) ? intval($subpartParams['_RECURSIVE']) : 0;
918
                            $pidField = isset($subpartParams['_PIDFIELD']) ? trim($subpartParams['_PIDFIELD']) : 'pid';
919
                            $where = isset($subpartParams['_WHERE']) ? $subpartParams['_WHERE'] : '';
920
                            $addTable = isset($subpartParams['_ADDTABLE']) ? $subpartParams['_ADDTABLE'] : '';
921
922
                            $fieldName = !empty($subpartParams['_FIELD']) ? $subpartParams['_FIELD'] : 'uid';
923
                            if ($fieldName === 'uid' || $TCA[$subpartParams['_TABLE']]['columns'][$fieldName]) {
924
                                $andWhereLanguage = '';
925
                                $transOrigPointerField = $TCA[$subpartParams['_TABLE']]['ctrl']['transOrigPointerField'];
926
927
                                if (!empty($subpartParams['_ENABLELANG']) && $transOrigPointerField) {
928
                                    $andWhereLanguage = ' AND ' . $this->db->quoteStr($transOrigPointerField, $subpartParams['_TABLE']) . ' <= 0 ';
929
                                }
930
								
931
                                if ($recursiveDepth > 0) {
932
                                    /** @var \TYPO3\CMS\Core\Database\QueryGenerator $queryGenerator */
933
                                    $queryGenerator = GeneralUtility::makeInstance(\TYPO3\CMS\Core\Database\QueryGenerator::class);
934
                                    $pidList = $queryGenerator->getTreeList($lookUpPid, $recursiveDepth, 0, 1);
935
                                } else {
936
                                    $pidList = (string)$lookUpPid;
937
                                }
938
939
                                $where = $this->db->quoteStr($pidField, $subpartParams['_TABLE']) . ' IN(' . $pidList . ') ' .
940
                                    $andWhereLanguage . $where;
941
942
                                $rows = $this->db->exec_SELECTgetRows(
943
                                    $fieldName,
944
                                    $subpartParams['_TABLE'] . $addTable,
945
                                    $where . BackendUtility::deleteClause($subpartParams['_TABLE']),
946
                                    '',
947
                                    '',
948
                                    '',
949
                                    $fieldName
950
                                );
951
952
                                if (is_array($rows)) {
953
                                    $paramArray[$p] = array_merge($paramArray[$p], array_keys($rows));
954
                                }
955
                            }
956
                        }
957
                    } else { // Just add value:
958
                        $paramArray[$p][] = $pV;
959
                    }
960
                    // Hook for processing own expandParameters place holder
961
                    if (is_array($GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['crawler/class.tx_crawler_lib.php']['expandParameters'])) {
962
                        $_params = [
963
                            'pObj' => &$this,
964
                            'paramArray' => &$paramArray,
965
                            'currentKey' => $p,
966
                            'currentValue' => $pV,
967
                            'pid' => $pid,
968
                        ];
969
                        foreach ($GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['crawler/class.tx_crawler_lib.php']['expandParameters'] as $key => $_funcRef) {
970
                            GeneralUtility::callUserFunction($_funcRef, $_params, $this);
971
                        }
972
                    }
973
                }
974
975
                // Make unique set of values and sort array by key:
976
                $paramArray[$p] = array_unique($paramArray[$p]);
977
                ksort($paramArray);
978
            } else {
979
                // Set the literal value as only value in array:
980
                $paramArray[$p] = [$v];
981
            }
982
        }
983
984
        return $paramArray;
985
    }
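
The bracket syntax described in the docblock of expandParameters() can be illustrated without TYPO3. The sketch below covers only integer ranges and literal parts; the _TABLE lookup and the hook are left out because they need the database. expandValueSketch() is a hypothetical name, and unlike the original it casts range values to integers and reindexes the result.

```php
<?php
// Hypothetical, reduced variant of the expansion step, for illustration only.
function expandValueSketch($value)
{
    $value = trim($value);
    // Values not wrapped in [...] are taken literally.
    if (substr($value, 0, 1) !== '[' || substr($value, -1) !== ']') {
        return [$value];
    }
    $expanded = [];
    foreach (explode('|', substr($value, 1, -1)) as $part) {
        if (preg_match('/^(-?[0-9]+)\s*-\s*(-?[0-9]+)$/', trim($part), $reg)) {
            // Order the bounds, then cap the range at 1000 values.
            $low = min((int)$reg[1], (int)$reg[2]);
            $high = max((int)$reg[1], (int)$reg[2]);
            foreach (range($low, min($high, $low + 999)) as $number) {
                $expanded[] = $number;
            }
        } else {
            // Anything else is added as a literal value.
            $expanded[] = $part;
        }
    }
    return array_values(array_unique($expanded));
}

var_export(expandValueSketch('[3-1|7|x]'));
```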
986
987
    /**
988
     * Compiling URLs from parameter array (output of expandParameters())
989
     * The number of URLs will be the product of the number of parameter values for each key
990
     *
991
     * @param array $paramArray Output of expandParameters(): Array with keys (GET var names) and for each an array of values
992
     * @param array $urls URLs accumulated in this array (for recursion)
993
     * @return array
994
     */
995 3
    public function compileUrls($paramArray, $urls = [])
996
    {
997 3
        if (count($paramArray) && is_array($urls)) {
998
            // shift first off stack:
999 2
            reset($paramArray);
1000 2
            $varName = key($paramArray);
1001 2
            $valueSet = array_shift($paramArray);
1002
1003
            // Traverse value set:
1004 2
            $newUrls = [];
1005 2
            foreach ($urls as $url) {
1006 1
                foreach ($valueSet as $val) {
1007 1
                    $newUrls[] = $url . (strcmp($val, '') ? '&' . rawurlencode($varName) . '=' . rawurlencode($val) : '');
1008
1009 1
                    if (count($newUrls) > MathUtility::forceIntegerInRange($this->extensionSettings['maxCompileUrls'], 1, 1000000000, 10000)) {
1010 1
                        break;
1011
                    }
1012
                }
1013
            }
1014 2
            $urls = $newUrls;
1015 2
            $urls = $this->compileUrls($paramArray, $urls);
1016
        }
1017
1018 3
        return $urls;
1019
    }
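
compileUrls() builds the cross product of all value sets. The following sketch shows the same idea as a loop instead of recursion. compileUrlListSketch() and its $maxUrls parameter are hypothetical; in the class the limit comes from the maxCompileUrls extension setting, and the cap is applied slightly differently there.

```php
<?php
// Hypothetical iterative variant of the URL compilation, for illustration only.
function compileUrlListSketch(array $paramArray, array $urls, $maxUrls = 10000)
{
    foreach ($paramArray as $varName => $valueSet) {
        $newUrls = [];
        foreach ($urls as $url) {
            foreach ($valueSet as $val) {
                // Simplification: stop hard once the cap is reached.
                if (count($newUrls) >= $maxUrls) {
                    break 2;
                }
                // Empty values add nothing to the URL, but still yield one entry.
                $newUrls[] = $url . ((string)$val !== ''
                    ? '&' . rawurlencode((string)$varName) . '=' . rawurlencode((string)$val)
                    : '');
            }
        }
        $urls = $newUrls;
    }
    return $urls;
}

// Two values for L times two values for type gives four URLs.
print_r(compileUrlListSketch(['L' => [0, 1], 'type' => ['', 98]], ['?id=12']));
```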
1020
1021
    /************************************
1022
     *
1023
     * Crawler log
1024
     *
1025
     ************************************/
1026
1027
    /**
1028
     * Return array of records from crawler queue for input page ID
1029
     *
1030
     * @param integer $id Page ID for which to look up log entries.
1031
     * @param string $filter Filter: "all" => all entries, "pending" => all that is not yet run, "finished" => all complete ones
1032
     * @param boolean $doFlush If TRUE, the selected entries are DELETED(!) instead of returned
1033
     * @param boolean $doFullFlush If TRUE (and $doFlush is set), entries of all pages matching the filter are deleted, not only those of the given page
1034
     * @param integer $itemsPerPage Limits the number of entries per page, default is 10
1035
     * @return array
1036
     */
1037
    public function getLogEntriesForPageId($id, $filter = '', $doFlush = false, $doFullFlush = false, $itemsPerPage = 10)
1038
    {
1039
        switch ($filter) {
1040
            case 'pending':
1041
                $addWhere = ' AND exec_time=0';
1042
                break;
1043
            case 'finished':
1044
                $addWhere = ' AND exec_time>0';
1045
                break;
1046
            default:
1047
                $addWhere = '';
1048
                break;
1049
        }
1050
1051
        // FIXME: Write unit test that ensures that the right records are deleted.
1052
        if ($doFlush) {
1053
            $this->flushQueue(($doFullFlush ? '1=1' : ('page_id=' . intval($id))) . $addWhere);
1054
            return [];
1055
        } else {
1056
            return $this->db->exec_SELECTgetRows(
1057
                '*',
1058
                'tx_crawler_queue',
1059
                'page_id=' . intval($id) . $addWhere,
1060
                '',
1061
                'scheduled DESC',
1062
                (intval($itemsPerPage) > 0 ? intval($itemsPerPage) : '')
1063
            );
1064
        }
1065
    }
1066
1067
    /**
1068
     * Return array of records from crawler queue for input set ID
1069
     *
1070
     * @param integer $set_id Set ID for which to look up log entries.
1071
     * @param string $filter Filter: "all" => all entries, "pending" => all that is not yet run, "finished" => all complete ones
1072
     * @param boolean $doFlush If TRUE, the selected entries are DELETED(!) instead of returned
1073
     * @param integer $itemsPerPage Limits the number of entries per page, default is 10
1074
     * @return array
1075
     */
1076
    public function getLogEntriesForSetId($set_id, $filter = '', $doFlush = false, $doFullFlush = false, $itemsPerPage = 10)
1077
    {
1078
        // FIXME: Write Unit tests for Filters
1079
        switch ($filter) {
1080
            case 'pending':
1081
                $addWhere = ' AND exec_time=0';
1082
                break;
1083
            case 'finished':
1084
                $addWhere = ' AND exec_time>0';
1085
                break;
1086
            default:
1087
                $addWhere = '';
1088
                break;
1089
        }
1090
        // FIXME: Write unit test that ensures that the right records are deleted.
1091
        if ($doFlush) {
1092
            $this->flushQueue($doFullFlush ? '' : ('set_id=' . intval($set_id) . $addWhere));
1093
            return [];
1094
        } else {
1095
            return $this->db->exec_SELECTgetRows(
1096
                '*',
1097
                'tx_crawler_queue',
1098
                'set_id=' . intval($set_id) . $addWhere,
1099
                '',
1100
                'scheduled DESC',
1101
                (intval($itemsPerPage) > 0 ? intval($itemsPerPage) : '')
1102
            );
1103
        }
1104
    }
1105
1106
    /**
1107
     * Removes queue entries
1108
     *
1109
     * @param string $where SQL related filter for the entries which should be removed
1110
     * @return void
1111
     */
1112
    protected function flushQueue($where = '')
1113
    {
1114
        $realWhere = strlen($where) > 0 ? $where : '1=1';
1115
1116
        if (EventDispatcher::getInstance()->hasObserver('queueEntryFlush') || SignalSlotUtility::hasSignal(__CLASS__, SignalSlotUtility::SIGNAL_QUEUE_ENTRY_FLUSH)) {
1117
            $groups = $this->db->exec_SELECTgetRows('DISTINCT set_id', 'tx_crawler_queue', $realWhere);
1118
            if (is_array($groups)) {
1119
                foreach ($groups as $group) {
1120
1121
                    // The event dispatcher is deprecated since crawler v6.4.0, will be removed in crawler v7.0.0.
1122
                    // Please use the Signal instead.
1123
                    if (EventDispatcher::getInstance()->hasObserver('queueEntryFlush')) {
1124
                        EventDispatcher::getInstance()->post(
1125
                            'queueEntryFlush',
1126
                            $group['set_id'],
1127
                            $this->db->exec_SELECTgetRows('uid, set_id', 'tx_crawler_queue', $realWhere . ' AND set_id="' . $group['set_id'] . '"')
1128
                        );
1129
                    }
1130
1131
                    if (SignalSlotUtility::hasSignal(__CLASS__, SignalSlotUtility::SIGNAL_QUEUE_ENTRY_FLUSH)) {
1132
                        $signalInputArray = $this->db->exec_SELECTgetRows('uid, set_id', 'tx_crawler_queue', $realWhere . ' AND set_id="' . $group['set_id'] . '"');
1133
                        SignalSlotUtility::emitSignal(
1134
                            __CLASS__,
1135
                            SignalSlotUtility::SIGNAL_QUEUE_ENTRY_FLUSH,
1136
                            $signalInputArray
It seems like $signalInputArray defined by $this->db->exec_SELECTge...$group['set_id'] . '"') on line 1132 can also be of type null; however, AOE\Crawler\Utility\Sign...otUtility::emitSignal() does only seem to accept array, maybe add an additional type check?

If a method or function can return multiple different values and unless you are sure that you only can receive a single value in this context, we recommend to add an additional type check:

/**
 * @return array|string
 */
function returnsDifferentValues($x) {
    if ($x) {
        return 'foo';
    }

    return array();
}

$x = returnsDifferentValues($y);
if (is_array($x)) {
    // $x is an array.
}

If this a common case that PHP Analyzer should handle natively, please let us know by opening an issue.

1137
                        );
1138
                    }
1139
                }
1140
            }
1141
        }
1142
1143
        $GLOBALS['TYPO3_DB']->exec_DELETEquery('tx_crawler_queue', $realWhere);
1144
    }
1145
1146
    /**
1147
     * Adding call back entries to log (typically called from hooks, see indexed search class "class.crawler.php")
1148
     *
1149
     * @param integer $setId Set ID
1150
     * @param array $params Parameters to pass to call back function
1151
     * @param string $callBack Call back object reference, eg. 'EXT:indexed_search/class.crawler.php:&tx_indexedsearch_crawler'
1152
     * @param integer $page_id Page ID to attach it to
1153
     * @param integer $schedule Time at which to activate
1154
     * @return void
1155
     */
1156
    public function addQueueEntry_callBack($setId, $params, $callBack, $page_id = 0, $schedule = 0)
1157
    {
1158
        if (!is_array($params)) {
1159
            $params = [];
1160
        }
1161
        $params['_CALLBACKOBJ'] = $callBack;
1162
1163
        // Compile value array:
1164
        $fieldArray = [
1165
            'page_id' => intval($page_id),
1166
            'parameters' => serialize($params),
1167
            'scheduled' => intval($schedule) ? intval($schedule) : $this->getCurrentTime(),
1168
            'exec_time' => 0,
1169
            'set_id' => intval($setId),
1170
            'result_data' => '',
1171
        ];
1172
1173
        $this->db->exec_INSERTquery('tx_crawler_queue', $fieldArray);
1174
    }
1175
1176
    /************************************
1177
     *
1178
     * URL setting
1179
     *
1180
     ************************************/
1181
1182
    /**
1183
     * Setting a URL for crawling:
1184
     *
1185
     * @param integer $id Page ID
1186
     * @param string $url Complete URL
1187
     * @param array $subCfg Sub configuration array (from TS config)
1188
     * @param integer $tstamp Scheduled-time
1189
     * @param string $configurationHash (optional) configuration hash
1190
     * @param bool $skipInnerDuplicationCheck (optional) skip inner duplication check
1191
     * @return bool
1192
     */
1193
    public function addUrl(
1194
        $id,
1195
        $url,
1196
        array $subCfg,
1197
        $tstamp,
1198
        $configurationHash = '',
1199
        $skipInnerDuplicationCheck = false
1200
    ) {
1201
        $urlAdded = false;
1202
        $rows = [];
1203
1204
        // Creating parameters:
1205
        $parameters = [
1206
            'url' => $url,
1207
        ];
1208
1209
        // fe user group simulation:
1210
        $uGs = implode(',', array_unique(GeneralUtility::intExplode(',', $subCfg['userGroups'], true)));
1211
        if ($uGs) {
1212
            $parameters['feUserGroupList'] = $uGs;
1213
        }
1214
1215
        // Setting processing instructions
1216
        $parameters['procInstructions'] = GeneralUtility::trimExplode(',', $subCfg['procInstrFilter']);
1217
        if (is_array($subCfg['procInstrParams.'])) {
1218
            $parameters['procInstrParams'] = $subCfg['procInstrParams.'];
1219
        }
1220
1221
        // Possible TypoScript Template Parents
1222
        $parameters['rootTemplatePid'] = $subCfg['rootTemplatePid'];
1223
1224
        // Compile value array:
1225
        $parameters_serialized = serialize($parameters);
1226
        $fieldArray = [
1227
            'page_id' => intval($id),
1228
            'parameters' => $parameters_serialized,
1229
            'parameters_hash' => GeneralUtility::shortMD5($parameters_serialized),
1230
            'configuration_hash' => $configurationHash,
1231
            'scheduled' => $tstamp,
1232
            'exec_time' => 0,
1233
            'set_id' => intval($this->setID),
1234
            'result_data' => '',
1235
            'configuration' => $subCfg['key'],
1236
        ];
1237
1238
        if ($this->registerQueueEntriesInternallyOnly) {
1239
            //the entries will only be registered and not stored to the database
1240
            $this->queueEntries[] = $fieldArray;
1241
        } else {
1242
            if (!$skipInnerDuplicationCheck) {
1243
                // check if there is already an equal entry
1244
                $rows = $this->getDuplicateRowsIfExist($tstamp, $fieldArray);
1245
            }
1246
1247
            if (count($rows) == 0) {
1248
                $this->db->exec_INSERTquery('tx_crawler_queue', $fieldArray);
1249
                $uid = $this->db->sql_insert_id();
1250
                $rows[] = $uid;
1251
                $urlAdded = true;
1252
1253
                // The event dispatcher is deprecated since crawler v6.4.0, will be removed in crawler v7.0.0.
1254
                // Please use the Signal instead.
1255
                EventDispatcher::getInstance()->post('urlAddedToQueue', $this->setID, ['uid' => $uid, 'fieldArray' => $fieldArray]);
1256
1257
                $signalPayload = ['uid' => $uid, 'fieldArray' => $fieldArray];
1258
                SignalSlotUtility::emitSignal(
1259
                    __CLASS__,
1260
                    SignalSlotUtility::SIGNAL_URL_ADDED_TO_QUEUE,
1261
                    $signalPayload
1262
                );
1263
            } else {
1264
                // The event dispatcher is deprecated since crawler v6.4.0, will be removed in crawler v7.0.0.
1265
                // Please use the Signal instead.
1266
                EventDispatcher::getInstance()->post('duplicateUrlInQueue', $this->setID, ['rows' => $rows, 'fieldArray' => $fieldArray]);
1267
1268
                $signalPayload = ['rows' => $rows, 'fieldArray' => $fieldArray];
1269
                SignalSlotUtility::emitSignal(
1270
                    __CLASS__,
1271
                    SignalSlotUtility::SIGNAL_DUPLICATE_URL_IN_QUEUE,
1272
                    $signalPayload
1273
                );
1274
            }
1275
        }
1276
1277
        return $urlAdded;
1278
    }
1279
1280
    /**
1281
     * This method determines duplicates for a queue entry with the same parameters and this timestamp.
1282
     * If the timestamp is in the past, it will check if there is any unprocessed queue entry in the past.
1283
     * If the timestamp is in the future, it will check whether a queued entry has exactly the same timestamp.
1284
     *
1285
     * @param int $tstamp
1286
     * @param array $fieldArray
1287
     *
1288
     * @return array
1289
     */
1290
    protected function getDuplicateRowsIfExist($tstamp, $fieldArray)
1291
    {
1292
        $rows = [];
1293
1294
        $currentTime = $this->getCurrentTime();
1295
1296
        //if this entry is scheduled with "now"
1297
        if ($tstamp <= $currentTime) {
1298
            if ($this->extensionSettings['enableTimeslot']) {
1299
                $timeBegin = $currentTime - 100;
1300
                $timeEnd = $currentTime + 100;
1301
                $where = ' ((scheduled BETWEEN ' . $timeBegin . ' AND ' . $timeEnd . ' ) OR scheduled <= ' . $currentTime . ') ';
1302
            } else {
1303
                $where = 'scheduled <= ' . $currentTime;
1304
            }
1305
        } elseif ($tstamp > $currentTime) {
1306
            // entries with a timestamp in the future need to have the same schedule time
1307
            $where = 'scheduled = ' . $tstamp;
1308
        }
1309
1310
        if (!empty($where)) {
1311
            $result = $this->db->exec_SELECTgetRows(
1312
                'qid',
1313
                'tx_crawler_queue',
1314
                $where .
1315
                ' AND NOT exec_time' .
1316
                ' AND NOT process_id ' .
1317
                ' AND page_id=' . intval($fieldArray['page_id']) .
1318
                ' AND parameters_hash = ' . $this->db->fullQuoteStr($fieldArray['parameters_hash'], 'tx_crawler_queue')
1319
            );
1320
1321
            if (is_array($result)) {
1322
                foreach ($result as $value) {
1323
                    $rows[] = $value['qid'];
1324
                }
1325
            }
1326
        }
1327
1328
        return $rows;
1329
    }
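
The scheduling part of the duplicate check depends on three inputs: the requested timestamp, the current time and the enableTimeslot setting. The sketch below isolates just that decision so the three outcomes can be compared side by side. buildScheduleConditionSketch() is a hypothetical helper, not part of the class, and the remaining conditions (exec_time, process_id, page_id, parameters_hash) are omitted.

```php
<?php
// Hypothetical helper that mirrors only the "scheduled" condition, for illustration.
function buildScheduleConditionSketch($tstamp, $currentTime, $enableTimeslot)
{
    if ($tstamp <= $currentTime) {
        if ($enableTimeslot) {
            // Entries due now also match anything scheduled within +/- 100 seconds.
            return '((scheduled BETWEEN ' . ($currentTime - 100) . ' AND ' . ($currentTime + 100)
                . ') OR scheduled <= ' . $currentTime . ')';
        }
        // Without the timeslot, any entry scheduled up to now counts.
        return 'scheduled <= ' . $currentTime;
    }
    // Future entries only collide with entries at exactly the same time.
    return 'scheduled = ' . $tstamp;
}

echo buildScheduleConditionSketch(900, 1000, false), "\n";
echo buildScheduleConditionSketch(900, 1000, true), "\n";
echo buildScheduleConditionSketch(2000, 1000, true), "\n";
```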
1330
1331
    /**
1332
     * Returns the current system time
1333
     *
1334
     * @return int
1335
     */
1336
    public function getCurrentTime()
1337
    {
1338
        return time();
1339
    }
1340
1341
    /************************************
1342
     *
1343
     * URL reading
1344
     *
1345
     ************************************/
1346
1347
    /**
1348
     * Read URL for single queue entry
1349
     *
1350
     * @param integer $queueId
1351
     * @param boolean $force If set, will process even if exec_time has been set!
1352
     * @return integer
1353
     */
1354
    public function readUrl($queueId, $force = false)
1355
    {
1356
        $ret = 0;
1357
        if ($this->debugMode) {
1358
            GeneralUtility::devlog('crawler-readurl start ' . microtime(true), __FUNCTION__);
1359
        }
1360
        // Get entry:
1361
        list($queueRec) = $this->db->exec_SELECTgetRows(
1362
            '*',
1363
            'tx_crawler_queue',
1364
            'qid=' . intval($queueId) . ($force ? '' : ' AND exec_time=0 AND process_scheduled > 0')
1365
        );
1366
1367
        if (!is_array($queueRec)) {
1368
            return $ret;
1369
        }
1370
1371
        $parameters = unserialize($queueRec['parameters']);
1372
        if ($parameters['rootTemplatePid']) {
1373
            $this->initTSFE((int)$parameters['rootTemplatePid']);
1374
        } else {
1375
            GeneralUtility::sysLog(
1376
                'Page with (' . $queueRec['page_id'] . ') could not be crawled, please check your crawler configuration. Perhaps no Root Template Pid is set',
1377
                'crawler',
1378
                GeneralUtility::SYSLOG_SEVERITY_WARNING
1379
            );
1380
        }
1381
1382
        $signalPayload = [$queueId, &$queueRec];
1383
        SignalSlotUtility::emitSignal(
1384
            __CLASS__,
1385
            SignalSlotUtility::SIGNAL_QUEUEITEM_PREPROCESS,
1386
            $signalPayload
1387
        );
1388
1389
        // Set exec_time to lock record:
1390
        $field_array = ['exec_time' => $this->getCurrentTime()];
1391
1392
        if (isset($this->processID)) {
1393
            // if multiprocessing is used, we need to store the ID of the process that handled this entry
1394
            $field_array['process_id_completed'] = $this->processID;
1395
        }
1396
        $this->db->exec_UPDATEquery('tx_crawler_queue', 'qid=' . intval($queueId), $field_array);
1397
1398
        $result = $this->readUrl_exec($queueRec);
1399
        $resultData = unserialize($result['content']);
1400
1401
        //atm there's no need to point to specific pollable extensions
1402
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pollSuccess'])) {
1403
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pollSuccess'] as $pollable) {
1404
                // only check the success value if the instruction is running
1405
                // it is important to name the pollSuccess key same as the procInstructions key
1406
                if (is_array($resultData['parameters']['procInstructions']) && in_array(
1407
                    $pollable,
1408
                    $resultData['parameters']['procInstructions']
1409
                )
1410
                ) {
1411
                    if (!empty($resultData['success'][$pollable]) && $resultData['success'][$pollable]) {
1412
                        $ret |= self::CLI_STATUS_POLLABLE_PROCESSED;
1413
                    }
1414
                }
1415
            }
1416
        }
1417
1418
        // Set result in log which also denotes the end of the processing of this entry.
1419
        $field_array = ['result_data' => serialize($result)];
1420
1421
        $signalPayload = [$queueId, &$field_array];
1422
        SignalSlotUtility::emitSignal(
1423
            __CLASS__,
1424
            SignalSlotUtility::SIGNAL_QUEUEITEM_POSTPROCESS,
1425
            $signalPayload
1426
        );
1427
1428
        $this->db->exec_UPDATEquery('tx_crawler_queue', 'qid=' . intval($queueId), $field_array);
1429
1430
        if ($this->debugMode) {
1431
            GeneralUtility::devlog('crawler-readurl stop ' . microtime(true), __FUNCTION__);
1432
        }
1433
1434
        return $ret;
1435
    }
1436
1437
    /**
1438
     * Read URL for not-yet-inserted log-entry
1439
     *
1440
     * @param array $field_array Queue field array
1441
     *
1442
     * @return string
1443
     */
1444
    public function readUrlFromArray($field_array)
1445
    {
1446
1447
            // Set exec_time to lock record:
1448
        $field_array['exec_time'] = $this->getCurrentTime();
1449
        $this->db->exec_INSERTquery('tx_crawler_queue', $field_array);
1450
        $queueId = $field_array['qid'] = $this->db->sql_insert_id();
1451
1452
        $result = $this->readUrl_exec($field_array);
1453
1454
        // Set result in log which also denotes the end of the processing of this entry.
1455
        $field_array = ['result_data' => serialize($result)];
1456
1457
        $signalPayload = [$queueId, &$field_array];
1458
        SignalSlotUtility::emitSignal(
1459
            __CLASS__,
1460
            SignalSlotUtility::SIGNAL_QUEUEITEM_POSTPROCESS,
1461
            $signalPayload
1462
        );
1463
1464
        $this->db->exec_UPDATEquery('tx_crawler_queue', 'qid=' . intval($queueId), $field_array);
1465
1466
        return $result;
1467
    }
1468
1469
    /**
1470
     * Read URL for a queue record
1471
     *
1472
     * @param array $queueRec Queue record
1473
     * @return string
1474
     */
1475
    public function readUrl_exec($queueRec)
1476
    {
1477
        // Decode parameters:
1478
        $parameters = unserialize($queueRec['parameters']);
1479
        $result = 'ERROR';
1480
        if (is_array($parameters)) {
1481
            if ($parameters['_CALLBACKOBJ']) { // Calling object:
1482
                $objRef = $parameters['_CALLBACKOBJ'];
1483
                $callBackObj = GeneralUtility::getUserObj($objRef);
1484
                if (is_object($callBackObj)) {
1485
                    unset($parameters['_CALLBACKOBJ']);
1486
                    $result = ['content' => serialize($callBackObj->crawler_execute($parameters, $this))];
1487
                } else {
1488
                    $result = ['content' => 'No object: ' . $objRef];
1489
                }
1490
            } else { // Regular FE request:
1491
1492
                // Prepare:
1493
                $crawlerId = $queueRec['qid'] . ':' . md5($queueRec['qid'] . '|' . $queueRec['set_id'] . '|' . $GLOBALS['TYPO3_CONF_VARS']['SYS']['encryptionKey']);
1494
1495
                // Get result:
1496
                $result = $this->requestUrl($parameters['url'], $crawlerId);
1497
1498
                // The event dispatcher is deprecated since crawler v6.4.0, will be removed in crawler v7.0.0.
1499
                // Please use the Signal instead.
1500
                EventDispatcher::getInstance()->post('urlCrawled', $queueRec['set_id'], ['url' => $parameters['url'], 'result' => $result]);
1501
1502
                $signalPayload = ['url' => $parameters['url'], 'result' => $result];
1503
                SignalSlotUtility::emitSignal(
1504
                    __CLASS__,
1505
                    SignalSlotUtility::SIGNAL_URL_CRAWLED,
1506
                    $signalPayload
1507
                );
1508
            }
1509
        }
1510
1511
        return $result;
1512
    }
1513
1514
    /**
1515
     * Gets the content of a URL.
1516
     *
1517
     * @param string $originalUrl URL to read
1518
     * @param string $crawlerId Crawler ID string (qid + hash to verify)
1519
     * @param integer $timeout Connection timeout in seconds
1520
     * @param integer $recursion Recursion limiter for 302 redirects
1521
     * @return array|bool Result array, or false if the URL could not be requested
1522
     */
1523 2
    public function requestUrl($originalUrl, $crawlerId, $timeout = 2, $recursion = 10)
1524
    {
1525 2
        if (!$recursion) {
1526
            return false;
1527
        }
1528
1529
        // Parse URL, checking for scheme:
1530 2
        $url = parse_url($originalUrl);
1531
1532 2
        if ($url === false) {
1533
            if (TYPO3_DLOG) {
1534
                GeneralUtility::devLog(sprintf('Could not parse_url() for string "%s"', $originalUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1535
            }
1536
            return false;
1537
        }
1538
1539 2
        if (!in_array($url['scheme'], ['','http','https'])) {
1540
            if (TYPO3_DLOG) {
1541
                GeneralUtility::devLog(sprintf('Scheme does not match for url "%s"', $originalUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1542
            }
1543
            return false;
1544
        }
1545
1546
        // direct request
1547 2
        if ($this->extensionSettings['makeDirectRequests']) {
1548 2
            $result = $this->sendDirectRequest($originalUrl, $crawlerId);
1549 2
            return $result;
1550
        }
1551
1552
        $reqHeaders = $this->buildRequestHeaderArray($url, $crawlerId);
1553
1554
        // thanks to Pierrick Caillon for adding proxy support
1555
        $rurl = $url;
1556
1557
        if ($this->extensionSettings['curlUse'] && $this->extensionSettings['curlProxyServer']) {
1558
            $rurl = parse_url($this->extensionSettings['curlProxyServer']);
1559
            $url['path'] = $url['scheme'] . '://' . $url['host'] . ($url['port'] > 0 ? ':' . $url['port'] : '') . $url['path'];
1560
            $reqHeaders = $this->buildRequestHeaderArray($url, $crawlerId);
1561
        }
1562
1563
        $host = $rurl['host'];
1564
1565
        if ($url['scheme'] == 'https') {
1566
            $host = 'ssl://' . $host;
1567
            $port = ($rurl['port'] > 0) ? $rurl['port'] : 443;
1568
        } else {
1569
            $port = ($rurl['port'] > 0) ? $rurl['port'] : 80;
1570
        }
1571
1572
        $startTime = microtime(true);
1573
        $fp = fsockopen($host, $port, $errno, $errstr, $timeout);
1574
1575
        if (!$fp) {
1576
            if (TYPO3_DLOG) {
1577
                GeneralUtility::devLog(sprintf('Error while opening "%s"', $originalUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1578
            }
1579
            return false;
1580
        } else {
1581
            // Request message:
1582
            $msg = implode("\r\n", $reqHeaders) . "\r\n\r\n";
1583
            fputs($fp, $msg);
1584
1585
            // Read response:
1586
            $d = $this->getHttpResponseFromStream($fp);
1587
            fclose($fp);
1588
1589
            $time = microtime(true) - $startTime;
1590
            $this->log($originalUrl . ' ' . $time);
1591
1592
            // Implode content and headers:
1593
            $result = [
1594
                'request' => $msg,
1595
                'headers' => implode('', $d['headers']),
1596
                'content' => implode('', (array)$d['content']),
1597
            ];
1598
1599
            if (($this->extensionSettings['follow30x']) && ($newUrl = $this->getRequestUrlFrom302Header($d['headers'], $url['user'], $url['pass']))) {
1600
Bug introduced by
It seems like $newUrl defined by $this->getRequestUrlFrom...['user'], $url['pass']) on line 1599 can also be of type boolean; however, AOE\Crawler\Controller\C...ontroller::requestUrl() does only seem to accept string, maybe add an additional type check?

If a method or function can return multiple different values and unless you are sure that you only can receive a single value in this context, we recommend to add an additional type check:

/**
 * @return array|string
 */
function returnsDifferentValues($x) {
    if ($x) {
        return 'foo';
    }

    return array();
}

$x = returnsDifferentValues($y);
if (is_array($x)) {
    // $x is an array.
}

If this is a common case that PHP Analyzer should handle natively, please let us know by opening an issue.

1601
                $newRequestUrl = $this->requestUrl($newUrl, $crawlerId, $timeout, --$recursion);
1602
1603
                if (is_array($newRequestUrl)) {
1604
                    $result = array_merge(['parentRequest' => $result], $newRequestUrl);
1605
                } else {
1606
                    if (TYPO3_DLOG) {
1607
                        GeneralUtility::devLog(sprintf('Error while opening "%s"', $originalUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1608
                    }
1609
                    return false;
1610
                }
1611
            }
1612
1613
            return $result;
1614
        }
1615
    }
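
The host and port selection inside requestUrl() can be sketched on its own. The helper below is hypothetical (`resolveSocketTarget()` is not part of the crawler). It assumes the URL and the optional proxy have already been split with `parse_url()`, and it only mirrors the rule visible above: connect to the proxy if one is configured, prefix `ssl://` for HTTPS targets, and fall back to port 443 or 80.

```php
<?php
/**
 * Hypothetical helper: derives the fsockopen() host and port from a parsed URL.
 *
 * @param array      $url   Result of parse_url() for the page to crawl
 * @param array|null $proxy Result of parse_url() for a proxy server, if any
 * @return array [host, port]
 */
function resolveSocketTarget(array $url, ?array $proxy = null): array
{
    // When a proxy is configured, the socket is opened to the proxy instead.
    $target = $proxy !== null ? $proxy : $url;
    $isHttps = isset($url['scheme']) && $url['scheme'] === 'https';

    $host = ($isHttps ? 'ssl://' : '') . $target['host'];
    $port = !empty($target['port']) ? (int)$target['port'] : ($isHttps ? 443 : 80);

    return [$host, $port];
}

var_dump(resolveSocketTarget(parse_url('https://example.org/page')));
var_dump(resolveSocketTarget(parse_url('http://example.org/'), parse_url('http://proxy.local:3128')));
```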
1616
1617
    /**
1618
     * Gets the base path of the website frontend.
1619
     * (e.g. if you call http://mydomain.com/cms/index.php in
1620
     * the browser the base path is "/cms/")
1621
     *
1622
     * @return string Base path of the website frontend
1623
     */
1624
    protected function getFrontendBasePath()
1625
    {
1626
        $frontendBasePath = '/';
1627
1628
        // Get the path from the extension settings:
1629
        if (isset($this->extensionSettings['frontendBasePath']) && $this->extensionSettings['frontendBasePath']) {
1630
            $frontendBasePath = $this->extensionSettings['frontendBasePath'];
1631
        // If empty, try to use config.absRefPrefix:
1632
        } elseif (isset($GLOBALS['TSFE']->absRefPrefix) && !empty($GLOBALS['TSFE']->absRefPrefix)) {
1633
            $frontendBasePath = $GLOBALS['TSFE']->absRefPrefix;
1634
        // If not in CLI mode the base path can be determined from $_SERVER environment:
1635
        } elseif (!defined('TYPO3_REQUESTTYPE_CLI') || !TYPO3_REQUESTTYPE_CLI) {
1636
            $frontendBasePath = GeneralUtility::getIndpEnv('TYPO3_SITE_PATH');
1637
        }
1638
1639
        // Base path must be '/<pathSegments>/':
1640
        if ($frontendBasePath != '/') {
1641
            $frontendBasePath = '/' . ltrim($frontendBasePath, '/');
1642
            $frontendBasePath = rtrim($frontendBasePath, '/') . '/';
1643
        }
1644
1645
        return $frontendBasePath;
1646
    }
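
The normalisation step at the end of getFrontendBasePath() is easy to try in isolation. The function name `normalizeBasePath()` below is an assumption made for this sketch. It repeats only the trimming logic, not the lookup of extension settings or `absRefPrefix`.

```php
<?php
/**
 * Hypothetical helper: forces a base path into the form '/<pathSegments>/'.
 */
function normalizeBasePath(string $path): string
{
    if ($path !== '/') {
        // Exactly one leading slash ...
        $path = '/' . ltrim($path, '/');
        // ... and exactly one trailing slash.
        $path = rtrim($path, '/') . '/';
    }
    return $path;
}

foreach (['cms', '/cms/sub', '//cms//', '/'] as $candidate) {
    echo $candidate, ' => ', normalizeBasePath($candidate), PHP_EOL;
}
```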
1647
1648
    /**
1649
     * Executes a shell command and returns the outputted result.
1650
     *
1651
     * @param string $command Shell command to be executed
1652
     * @return string Outputted result of the command execution
1653
     */
1654
    protected function executeShellCommand($command)
1655
    {
1656
        $result = shell_exec($command);
1657
        return $result;
1658
    }
1659
1660
    /**
1661
     * Reads HTTP response from the given stream.
1662
     *
1663
     * @param  resource $streamPointer  Pointer to connection stream.
1664
     * @return array                    Associative array with the following items:
1665
     *                                  headers <array> Response headers sent by server.
1666
     *                                  content <array> Content, with each line as an array item.
1667
     */
1668 1
    protected function getHttpResponseFromStream($streamPointer)
1669
    {
1670 1
        $response = ['headers' => [], 'content' => []];
1671
1672 1
        if (is_resource($streamPointer)) {
1673
            // read headers
1674 1
            while (($line = fgets($streamPointer, 2048)) !== false) {
1675 1
                $line = trim($line);
1676 1
                if ($line !== '') {
1677 1
                    $response['headers'][] = $line;
1678
                } else {
1679 1
                    break;
1680
                }
1681
            }
1682
1683
            // read content
1684 1
            while (($line = fgets($streamPointer, 2048)) !== false) {
1685 1
                $response['content'][] = $line;
1686
            }
1687
        }
1688
1689 1
        return $response;
1690
    }
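
To see how the header/content split behaves without opening a socket, the same loop can be run against an in-memory stream. This is a minimal sketch: `readHttpResponse()` is a hypothetical stand-alone copy of the logic above, and the response text is invented for the example.

```php
<?php
/**
 * Hypothetical stand-alone version of the stream reader shown above.
 *
 * @param resource $stream Readable stream positioned at the start of the response
 * @return array ['headers' => string[], 'content' => string[]]
 */
function readHttpResponse($stream): array
{
    $response = ['headers' => [], 'content' => []];

    // The header section ends at the first empty line.
    while (($line = fgets($stream, 2048)) !== false) {
        $line = trim($line);
        if ($line === '') {
            break;
        }
        $response['headers'][] = $line;
    }

    // Everything after the empty line is body content, kept line by line.
    while (($line = fgets($stream, 2048)) !== false) {
        $response['content'][] = $line;
    }

    return $response;
}

$stream = fopen('php://memory', 'r+');
fwrite($stream, "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<html>\n</html>\n");
rewind($stream);
$response = readHttpResponse($stream);
fclose($stream);

print_r($response);
```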
1691
1692
    /**
1693
     * @param string $message Message to append to the crawler log file
1694
     */
1695 2
    protected function log($message)
1696
    {
1697 2
        if (!empty($this->extensionSettings['logFileName'])) {
1698
            $fileResult = @file_put_contents($this->extensionSettings['logFileName'], date('Ymd His') . ' ' . $message . PHP_EOL, FILE_APPEND);
1699
            if (!$fileResult) {
1700
                GeneralUtility::devLog('File "' . $this->extensionSettings['logFileName'] . '" could not be written, please check file permissions.', 'crawler', LogLevel::INFO);
1701
            }
1702
        }
1703 2
    }
1704
1705
    /**
1706
     * Builds HTTP request headers.
1707
     *
1708
     * @param array $url
1709
     * @param string $crawlerId
1710
     *
1711
     * @return array
1712
     */
1713 6
    protected function buildRequestHeaderArray(array $url, $crawlerId)
1714
    {
1715 6
        $reqHeaders = [];
1716 6
        $reqHeaders[] = 'GET ' . $url['path'] . ($url['query'] ? '?' . $url['query'] : '') . ' HTTP/1.0';
1717 6
        $reqHeaders[] = 'Host: ' . $url['host'];
1718 6
        if (stristr($url['query'], 'ADMCMD_previewWS')) {
1719 2
            $reqHeaders[] = 'Cookie: $Version="1"; be_typo_user="1"; $Path=/';
1720
        }
1721 6
        $reqHeaders[] = 'Connection: close';
1722 6
        if ($url['user'] != '') {
1723 2
            $reqHeaders[] = 'Authorization: Basic ' . base64_encode($url['user'] . ':' . $url['pass']);
1724
        }
1725 6
        $reqHeaders[] = 'X-T3crawler: ' . $crawlerId;
1726 6
        $reqHeaders[] = 'User-Agent: TYPO3 crawler';
1727 6
        return $reqHeaders;
1728
    }
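
As an illustration of what the request message looks like on the wire, the sketch below rebuilds the header list for a sample URL. `buildRequestHeaders()` is a hypothetical stand-alone variant. It adds `isset()` guards for optional URL parts and leaves out the preview-workspace cookie, and the URL and crawler ID are made-up values.

```php
<?php
/**
 * Hypothetical stand-alone variant of the header builder shown above.
 *
 * @param array  $url       Result of parse_url()
 * @param string $crawlerId Value for the X-T3crawler header
 * @return string[]
 */
function buildRequestHeaders(array $url, string $crawlerId): array
{
    $query = isset($url['query']) ? $url['query'] : '';

    $headers = [];
    $headers[] = 'GET ' . $url['path'] . ($query !== '' ? '?' . $query : '') . ' HTTP/1.0';
    $headers[] = 'Host: ' . $url['host'];
    $headers[] = 'Connection: close';
    if (!empty($url['user'])) {
        // Credentials embedded in the URL become an HTTP Basic Authorization header.
        $pass = isset($url['pass']) ? $url['pass'] : '';
        $headers[] = 'Authorization: Basic ' . base64_encode($url['user'] . ':' . $pass);
    }
    $headers[] = 'X-T3crawler: ' . $crawlerId;
    $headers[] = 'User-Agent: TYPO3 crawler';

    return $headers;
}

$headers = buildRequestHeaders(parse_url('http://user:secret@example.org/index.php?id=1'), '42:abc');
// Header lines are joined with CRLF and terminated by an empty line.
$message = implode("\r\n", $headers) . "\r\n\r\n";
echo $message;
```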
1729
1730
    /**
1731
     * Checks whether the submitted HTTP headers contain a redirect location and builds the new crawler URL
1732
     *
1733
     * @param array $headers HTTP Header
1734
     * @param string $user HTTP Auth. User
1735
     * @param string $pass HTTP Auth. Password
1736
     * @return bool|string
1737
     */
1738 12
    protected function getRequestUrlFrom302Header($headers, $user = '', $pass = '')
1739
    {
1740 12
        $header = [];
1741 12
        if (!is_array($headers)) {
1742 1
            return false;
1743
        }
1744 11
        if (!(stristr($headers[0], '301 Moved') || stristr($headers[0], '302 Found') || stristr($headers[0], '302 Moved'))) {
1745 2
            return false;
1746
        }
1747
1748 9
        foreach ($headers as $hl) {
1749 9
            $tmp = explode(": ", $hl);
1750 9
            $header[trim($tmp[0])] = trim($tmp[1]);
1751 9
            if (trim($tmp[0]) == 'Location') {
1752 9
                break;
1753
            }
1754
        }
1755 9
        if (!array_key_exists('Location', $header)) {
1756 3
            return false;
1757
        }
1758
1759 6
        if ($user != '') {
1760 3
            if (!($tmp = parse_url($header['Location']))) {
1761 1
                return false;
1762
            }
1763 2
            $newUrl = $tmp['scheme'] . '://' . $user . ':' . $pass . '@' . $tmp['host'] . $tmp['path'];
1764 2
            if ($tmp['query'] != '') {
1765 2
                $newUrl .= '?' . $tmp['query'];
1766
            }
1767
        } else {
1768 3
            $newUrl = $header['Location'];
1769
        }
1770 5
        return $newUrl;
1771
    }
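
The credential handling for redirects can be tried out separately. `addCredentialsToUrl()` below is a hypothetical helper written for this sketch. Like the method above, it rebuilds the URL from scheme, host, path and query only, so a port or fragment in the `Location` value is dropped. Unlike the method above, it also rejects relative locations explicitly.

```php
<?php
/**
 * Hypothetical helper: re-attaches HTTP auth credentials to a redirect target.
 *
 * @return string|bool New URL, or false if the location is not an absolute URL
 */
function addCredentialsToUrl(string $location, string $user, string $pass)
{
    $parts = parse_url($location);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }

    $url = $parts['scheme'] . '://' . $user . ':' . $pass . '@' . $parts['host'];
    $url .= isset($parts['path']) ? $parts['path'] : '';
    if (isset($parts['query']) && $parts['query'] !== '') {
        $url .= '?' . $parts['query'];
    }

    return $url;
}

var_dump(addCredentialsToUrl('https://example.org/new?x=1', 'u', 'p'));
var_dump(addCredentialsToUrl('/relative/path', 'u', 'p'));
```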
1772
1773
    /**************************
1774
     *
1775
     * tslib_fe hooks:
1776
     *
1777
     **************************/
1778
1779
    /**
1780
     * Initialization hook (called after database connection)
1781
     * Takes the "HTTP_X_T3CRAWLER" header, looks up the queue record, and verifies that the request comes from the system (by comparing hashes)
1782
     *
1783
     * @param array $params Parameters from frontend
1784
     * @param object $ref TSFE object (reference under PHP5)
1785
     * @return void
1786
     *
1787
     * FIXME: Looks like this is not used; in commit 9910d3f40cce15f4e9b7bcd0488bf21f31d53ebc it was added as public.
1788
     * FIXME: I think this can be removed. (TNM)
1789
     */
1790
    public function fe_init(&$params, $ref)
Unused Code introduced by
The parameter $ref is not used and could be removed.

This check looks for parameters that have been defined for a function or method but are not used in the method body.

1791
    {
1792
        // Authenticate crawler request:
1793
        if (isset($_SERVER['HTTP_X_T3CRAWLER'])) {
1794
            list($queueId, $hash) = explode(':', $_SERVER['HTTP_X_T3CRAWLER']);
1795
            $queueRec = $this->db->exec_SELECTgetSingleRow('*', 'tx_crawler_queue', 'qid=' . intval($queueId));
1796
1797
            // If a crawler record was found and hash was matching, set it up:
1798
            if (is_array($queueRec) && $hash === md5($queueRec['qid'] . '|' . $queueRec['set_id'] . '|' . $GLOBALS['TYPO3_CONF_VARS']['SYS']['encryptionKey'])) {
1799
                $params['pObj']->applicationData['tx_crawler']['running'] = true;
1800
                $params['pObj']->applicationData['tx_crawler']['parameters'] = unserialize($queueRec['parameters']);
1801
                $params['pObj']->applicationData['tx_crawler']['log'] = [];
1802
            } else {
1803
                die('No crawler entry found!');
1804
            }
1805
        }
1806
    }
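
The token sent in the `X-T3crawler` header has the form `<qid>:<md5(qid|set_id|encryptionKey)>`: readUrl_exec() builds it and the hook above checks it. The round trip can be sketched as follows. Both function names are hypothetical, the key and IDs are example values, and `hash_equals()` is a choice made for this sketch (the hook itself compares with `===`).

```php
<?php
/**
 * Hypothetical helper: builds the crawler token for a queue record.
 */
function buildCrawlerId(int $qid, int $setId, string $encryptionKey): string
{
    return $qid . ':' . md5($qid . '|' . $setId . '|' . $encryptionKey);
}

/**
 * Hypothetical helper: checks a received token against a queue record.
 *
 * @param string $headerValue Value of the X-T3crawler header
 * @param array  $queueRec    Queue record with at least 'qid' and 'set_id'
 */
function verifyCrawlerId(string $headerValue, array $queueRec, string $encryptionKey): bool
{
    // Split into at most two parts; a missing hash becomes an empty string.
    list($queueId, $hash) = array_pad(explode(':', $headerValue, 2), 2, '');
    if ((int)$queueId !== (int)$queueRec['qid']) {
        return false;
    }

    $expected = md5($queueRec['qid'] . '|' . $queueRec['set_id'] . '|' . $encryptionKey);

    // Constant-time comparison of the expected and the received hash.
    return hash_equals($expected, $hash);
}

$token = buildCrawlerId(42, 7, 'example-key');
var_dump(verifyCrawlerId($token, ['qid' => 42, 'set_id' => 7], 'example-key'));
var_dump(verifyCrawlerId($token, ['qid' => 42, 'set_id' => 8], 'example-key'));
```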
1807
1808
    /*****************************
1809
     *
1810
     * Compiling URLs to crawl - tools
1811
     *
1812
     *****************************/
1813
1814
    /**
1815
     * @param integer $id Root page id to start from.
1816
     * @param integer $depth Depth of tree: 0 = only the id-page, 1 = one sublevel, 99 = infinite
1817
     * @param integer $scheduledTime Unix Time when the URL is timed to be visited when put in queue
1818
     * @param integer $reqMinute Number of requests per minute (creates the interleave between requests)
1819
     * @param boolean $submitCrawlUrls If set, submits the URLs to queue in database (real crawling)
1820
     * @param boolean $downloadCrawlUrls If set (and $submitCrawlUrls is false), fills $downloadUrls with entries
1821
     * @param array $incomingProcInstructions Array of processing instructions
1822
     * @param array $configurationSelection Array of configuration keys
1823
     * @return string
1824
     */
1825
    public function getPageTreeAndUrls(
1826
        $id,
1827
        $depth,
1828
        $scheduledTime,
1829
        $reqMinute,
1830
        $submitCrawlUrls,
1831
        $downloadCrawlUrls,
1832
        array $incomingProcInstructions,
1833
        array $configurationSelection
1834
    ) {
1835
        global $BACK_PATH;
1836
        global $LANG;
1837
        if (!is_object($LANG)) {
1838
            $LANG = GeneralUtility::makeInstance(LanguageService::class);
1839
            $LANG->init(0);
1840
        }
1841
        $this->scheduledTime = $scheduledTime;
1842
        $this->reqMinute = $reqMinute;
1843
        $this->submitCrawlUrls = $submitCrawlUrls;
1844
        $this->downloadCrawlUrls = $downloadCrawlUrls;
1845
        $this->incomingProcInstructions = $incomingProcInstructions;
1846
        $this->incomingConfigurationSelection = $configurationSelection;
1847
1848
        $this->duplicateTrack = [];
1849
        $this->downloadUrls = [];
1850
1851
        // Drawing tree:
1852
        /* @var PageTreeView $tree */
1853
        $tree = GeneralUtility::makeInstance(PageTreeView::class);
1854
        $perms_clause = $GLOBALS['BE_USER']->getPagePermsClause(1);
1855
        $tree->init('AND ' . $perms_clause);
1856
1857
        $pageInfo = BackendUtility::readPageAccess($id, $perms_clause);
1858
        if (is_array($pageInfo)) {
1859
            // Set root row:
1860
            $tree->tree[] = [
1861
                'row' => $pageInfo,
1862
                'HTML' => IconUtility::getIconForRecord('pages', $pageInfo),
1863
            ];
1864
        }
1865
1866
        // Get branch beneath:
1867
        if ($depth) {
1868
            $tree->getTree($id, $depth, '');
1869
        }
1870
1871
        // Traverse page tree:
1872
        $code = '';
1873
1874
        foreach ($tree->tree as $data) {
1875
            $this->MP = false;
1876
1877
            // recognize mount points
1878
            if ($data['row']['doktype'] == 7) {
1879
                $mountpage = $this->db->exec_SELECTgetRows('*', 'pages', 'uid = ' . (int)$data['row']['uid']);
1880
1881
                // fetch mounted pages
1882
                $this->MP = $mountpage[0]['mount_pid'] . '-' . $data['row']['uid'];
Documentation Bug introduced by
The property $MP was declared of type boolean, but $mountpage[0]['mount_pid...' . $data['row']['uid'] is of type string. Maybe add a type cast?

This check looks for assignments to scalar types that may be of the wrong type.

To ensure the code behaves as expected, it may be a good idea to add an explicit type cast.

$answer = 42;

$correct = false;

$correct = (bool) $answer;
1883
1884
                $mountTree = GeneralUtility::makeInstance(PageTreeView::class);
1885
                $mountTree->init('AND ' . $perms_clause);
1886
                $mountTree->getTree($mountpage[0]['mount_pid'], $depth, '');
1887
1888
                foreach ($mountTree->tree as $mountData) {
1889
                    $code .= $this->drawURLs_addRowsForPage(
1890
                        $mountData['row'],
1891
                        $mountData['HTML'] . BackendUtility::getRecordTitle('pages', $mountData['row'], true)
1892
                    );
1893
                }
1894
1895
                // replace page when mount_pid_ol is enabled
1896
                if ($mountpage[0]['mount_pid_ol']) {
1897
                    $data['row']['uid'] = $mountpage[0]['mount_pid'];
1898
                } else {
1899
                    // if the mount_pid_ol is not set the MP must not be used for the mountpoint page
1900
                    $this->MP = false;
1901
                }
1902
            }
1903
1904
            $code .= $this->drawURLs_addRowsForPage(
1905
                $data['row'],
1906
                $data['HTML'] . BackendUtility::getRecordTitle('pages', $data['row'], true)
1907
            );
1908
        }
1909
1910
        return $code;
1911
    }
1912
1913
    /**
1914
     * Expands exclude string
1915
     *
1916
     * @param string $excludeString Exclude string
1917
     * @return array
1918
     */
1919
    public function expandExcludeString($excludeString)
1920
    {
1921
        // internal static caches;
1922
        static $expandedExcludeStringCache;
1923
        static $treeCache;
1924
1925
        if (empty($expandedExcludeStringCache[$excludeString])) {
1926
            $pidList = [];
1927
1928
            if (!empty($excludeString)) {
1929
                /** @var PageTreeView $tree */
1930
                $tree = GeneralUtility::makeInstance(PageTreeView::class);
1931
                $tree->init('AND ' . $this->backendUser->getPagePermsClause(1));
1932
1933
                $excludeParts = GeneralUtility::trimExplode(',', $excludeString);
1934
1935
                foreach ($excludeParts as $excludePart) {
1936
                    list($pid, $depth) = GeneralUtility::trimExplode('+', $excludePart);
1937
1938
                    // default is "page only" = "depth=0"
1939
                    if (empty($depth)) {
1940
                        $depth = (stristr($excludePart, '+')) ? 99 : 0;
1941
                    }
1942
1943
                    $pidList[] = $pid;
1944
1945
                    if ($depth > 0) {
1946
                        if (empty($treeCache[$pid][$depth])) {
1947
                            $tree->reset();
1948
                            $tree->getTree($pid, $depth);
1949
                            $treeCache[$pid][$depth] = $tree->tree;
1950
                        }
1951
1952
                        foreach ($treeCache[$pid][$depth] as $data) {
1953
                            $pidList[] = $data['row']['uid'];
1954
                        }
1955
                    }
1956
                }
1957
            }
1958
1959
            $expandedExcludeStringCache[$excludeString] = array_unique($pidList);
1960
        }
1961
1962
        return $expandedExcludeStringCache[$excludeString];
1963
    }
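
The exclude string is a comma-separated list of `pid`, `pid+depth` or `pid+` entries. The sketch below shows only the parsing step, without the page tree lookup or the static caches. `parseExcludeString()` is a hypothetical helper and the sample string is invented.

```php
<?php
/**
 * Hypothetical helper: splits an exclude string into pid/depth pairs.
 *
 * "45"   => page only (depth 0)
 * "12+3" => page plus three levels below it
 * "67+"  => page plus the whole branch (depth 99)
 *
 * @return array[] List of ['pid' => int, 'depth' => int]
 */
function parseExcludeString(string $excludeString): array
{
    $result = [];
    $parts = array_filter(array_map('trim', explode(',', $excludeString)), 'strlen');

    foreach ($parts as $part) {
        $pieces = array_map('trim', explode('+', $part, 2));
        $pid = (int)$pieces[0];

        if (!empty($pieces[1])) {
            $depth = (int)$pieces[1];
        } else {
            // No explicit depth: a bare "+" means "whole branch", otherwise page only.
            $depth = strpos($part, '+') !== false ? 99 : 0;
        }

        $result[] = ['pid' => $pid, 'depth' => $depth];
    }

    return $result;
}

print_r(parseExcludeString('12+3, 45, 67+'));
```

As in the method above, an explicit depth of `0` after the plus sign is treated as empty and therefore expands to the whole branch.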
1964
1965
    /**
1966
     * Create the rows for display of the page tree
1967
     * For each page a number of rows are shown displaying GET variable configuration
1968
     *
1969
     * @param array $pageRow Page row
1970
     * @param string $pageTitleAndIcon Page icon and title for row
1971
     * @return    string        HTML <tr> content (one or more)
1972
     */
1973
    public function drawURLs_addRowsForPage(array $pageRow, $pageTitleAndIcon)
1974
    {
1975
        $skipMessage = '';
1976
1977
        // Get list of configurations
1978
        $configurations = $this->getUrlsForPageRow($pageRow, $skipMessage);
1979
1980
        if (count($this->incomingConfigurationSelection) > 0) {
1981
            // remove configuration that does not match the current selection
1982
            foreach ($configurations as $confKey => $confArray) {
1983
                if (!in_array($confKey, $this->incomingConfigurationSelection)) {
1984
                    unset($configurations[$confKey]);
1985
                }
1986
            }
1987
        }
1988
1989
        // Traverse parameter combinations:
1990
        $c = 0;
1991
        $content = '';
1992
        if (count($configurations)) {
1993
            foreach ($configurations as $confKey => $confArray) {
1994
1995
                    // Title column:
1996
                if (!$c) {
1997
                    $titleClm = '<td rowspan="' . count($configurations) . '">' . $pageTitleAndIcon . '</td>';
1998
                } else {
1999
                    $titleClm = '';
2000
                }
2001
2002
                if (!in_array($pageRow['uid'], $this->expandExcludeString($confArray['subCfg']['exclude']))) {
2003
2004
                        // URL list:
2005
                    $urlList = $this->urlListFromUrlArray(
2006
                        $confArray,
2007
                        $pageRow,
2008
                        $this->scheduledTime,
2009
                        $this->reqMinute,
2010
                        $this->submitCrawlUrls,
2011
                        $this->downloadCrawlUrls,
2012
                        $this->duplicateTrack,
2013
                        $this->downloadUrls,
2014
                        $this->incomingProcInstructions // if empty the urls won't be filtered by processing instructions
2015
                    );
2016
2017
                    // Expanded parameters:
2018
                    $paramExpanded = '';
2019
                    $calcAccu = [];
2020
                    $calcRes = 1;
2021
                    foreach ($confArray['paramExpanded'] as $gVar => $gVal) {
2022
                        $paramExpanded .= '
2023
                            <tr>
2024
                                <td>' . htmlspecialchars('&' . $gVar . '=') . '<br/>' .
2025
                                                '(' . count($gVal) . ')' .
2026
                                                '</td>
2027
                                <td>' . nl2br(htmlspecialchars(implode(chr(10), $gVal))) . '</td>
2028
                            </tr>
2029
                        ';
2030
                        $calcRes *= count($gVal);
2031
                        $calcAccu[] = count($gVal);
2032
                    }
2033
                    $paramExpanded = '<table class="table table-striped table-hover typo3-page-pages">' . $paramExpanded . '</table>';
2034
                    $paramExpanded .= 'Comb: ' . implode('*', $calcAccu) . '=' . $calcRes;

                    // Options
                    $optionValues = '';
                    if ($confArray['subCfg']['userGroups']) {
                        $optionValues .= 'User Groups: ' . $confArray['subCfg']['userGroups'] . '<br/>';
                    }
                    if ($confArray['subCfg']['baseUrl']) {
                        $optionValues .= 'Base Url: ' . $confArray['subCfg']['baseUrl'] . '<br/>';
                    }
                    if ($confArray['subCfg']['procInstrFilter']) {
                        $optionValues .= 'ProcInstr: ' . $confArray['subCfg']['procInstrFilter'] . '<br/>';
                    }

                    // Compile row:
                    $content .= '
                        <tr>
                            ' . $titleClm . '
                            <td>' . htmlspecialchars($confKey) . '</td>
                            <td>' . nl2br(htmlspecialchars(rawurldecode(trim(str_replace('&', chr(10) . '&', GeneralUtility::implodeArrayForUrl('', $confArray['paramParsed'])))))) . '</td>
                            <td>' . $paramExpanded . '</td>
                            <td>' . $urlList . '</td>
                            <td>' . $optionValues . '</td>
                            <td>' . DebugUtility::viewArray($confArray['subCfg']['procInstrParams.']) . '</td>
                        </tr>';
                } else {
                    $content .= '<tr>
                            ' . $titleClm . '
                            <td>' . htmlspecialchars($confKey) . '</td>
                            <td colspan="5"><em>No entries</em> (Page is excluded in this configuration)</td>
                        </tr>';
                }

                $c++;
            }
        } else {
            $message = !empty($skipMessage) ? ' (' . $skipMessage . ')' : '';

            // Compile row:
            $content .= '
                <tr style="border-bottom: 1px solid black;">
                    <td>' . $pageTitleAndIcon . '</td>
                    <td colspan="6"><em>No entries</em>' . $message . '</td>
                </tr>';
        }

        return $content;
    }

    /*****************************
     *
     * CLI functions
     *
     *****************************/

    /**
     * Main function for running from Command Line PHP script (cron job)
     * See ext/crawler/cli/crawler_cli.phpsh for details
     *
     * @return int status bitmask built from the CLI_STATUS_* constants
     */
    public function CLI_main()
    {
        $this->setAccessMode('cli');
        $result = self::CLI_STATUS_NOTHING_PROCCESSED;
        $cliObj = GeneralUtility::makeInstance(CrawlerCommandLineController::class);

        if (isset($cliObj->cli_args['-h']) || isset($cliObj->cli_args['--help'])) {
            $cliObj->cli_validateArgs();
            $cliObj->cli_help();
            exit;
        }

        if (!$this->getDisabled() && $this->CLI_checkAndAcquireNewProcess($this->CLI_buildProcessId())) {
            $countInARun = $cliObj->cli_argValue('--countInARun') ? intval($cliObj->cli_argValue('--countInARun')) : $this->extensionSettings['countInARun'];
            // Seconds
            $sleepAfterFinish = $cliObj->cli_argValue('--sleepAfterFinish') ? intval($cliObj->cli_argValue('--sleepAfterFinish')) : $this->extensionSettings['sleepAfterFinish'];
            // Microseconds (the value is passed to usleep() in CLI_run())
            $sleepTime = $cliObj->cli_argValue('--sleepTime') ? intval($cliObj->cli_argValue('--sleepTime')) : $this->extensionSettings['sleepTime'];

            try {
                // Run process:
                $result = $this->CLI_run($countInARun, $sleepTime, $sleepAfterFinish);
            } catch (\Exception $e) {
                $this->CLI_debug(get_class($e) . ': ' . $e->getMessage());
                $result = self::CLI_STATUS_ABORTED;
            }

            // Cleanup
            $this->db->exec_DELETEquery('tx_crawler_process', 'assigned_items_count = 0');

            //TODO can't we do that in a clean way?
            $this->CLI_releaseProcesses($this->CLI_buildProcessId());

            $this->CLI_debug("Unprocessed Items remaining:" . $this->queueRepository->countUnprocessedItems() . " (" . $this->CLI_buildProcessId() . ")");
            $result |= ($this->queueRepository->countUnprocessedItems() > 0 ? self::CLI_STATUS_REMAIN : self::CLI_STATUS_NOTHING_PROCCESSED);
        } else {
            $result |= self::CLI_STATUS_ABORTED;
        }

        return $result;
    }

    /**
     * Function executed by crawler_im.php cli script.
     *
     * @return void
     */
    public function CLI_main_im()
    {
        $this->setAccessMode('cli_im');

        $cliObj = GeneralUtility::makeInstance(QueueCommandLineController::class);

        // Force user to admin state and set workspace to "Live":
        $this->backendUser->user['admin'] = 1;
        $this->backendUser->setWorkspace(0);

        // Print help
        if (!isset($cliObj->cli_args['_DEFAULT'][1])) {
            $cliObj->cli_validateArgs();
            $cliObj->cli_help();
            exit;
        }

        $cliObj->cli_validateArgs();

        if ($cliObj->cli_argValue('-o') === 'exec') {
            $this->registerQueueEntriesInternallyOnly = true;
        }

        if (isset($cliObj->cli_args['_DEFAULT'][2])) {
            // Crawler is called over TYPO3 BE
            $pageId = MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][2], 0);
        } else {
            // Crawler is called over cli
            $pageId = MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][1], 0);
        }

        $configurationKeys = $this->getConfigurationKeys($cliObj);

        if (!is_array($configurationKeys)) {
            $configurations = $this->getUrlsForPageId($pageId);
            if (is_array($configurations)) {
                $configurationKeys = array_keys($configurations);
            } else {
                $configurationKeys = [];
            }
        }

        if ($cliObj->cli_argValue('-o') === 'queue' || $cliObj->cli_argValue('-o') === 'exec') {
            $reason = new Reason();
            $reason->setReason(Reason::REASON_GUI_SUBMIT);
            $reason->setDetailText('The cli script of the crawler added to the queue');

            // The event dispatcher is deprecated since crawler v6.4.0, will be removed in crawler v7.0.0.
            // Please use the Signal instead.
            EventDispatcher::getInstance()->post(
                'invokeQueueChange',
                $this->setID,
                ['reason' => $reason]
            );

            $signalPayload = ['reason' => $reason];
            SignalSlotUtility::emitSignal(
                __CLASS__,
                SignalSlotUtility::SIGNAL_INVOKE_QUEUE_CHANGE,
                $signalPayload
            );
        }

        if ($this->extensionSettings['cleanUpOldQueueEntries']) {
            $this->cleanUpOldQueueEntries();
        }

        $this->setID = (int) GeneralUtility::md5int(microtime());
        $this->getPageTreeAndUrls(
            $pageId,
            MathUtility::forceIntegerInRange($cliObj->cli_argValue('-d'), 0, 99),
            $this->getCurrentTime(),
            MathUtility::forceIntegerInRange($cliObj->cli_isArg('-n') ? $cliObj->cli_argValue('-n') : 30, 1, 1000),
            $cliObj->cli_argValue('-o') === 'queue' || $cliObj->cli_argValue('-o') === 'exec',
            $cliObj->cli_argValue('-o') === 'url',
            GeneralUtility::trimExplode(',', $cliObj->cli_argValue('-proc'), true),
            $configurationKeys
        );

        if ($cliObj->cli_argValue('-o') === 'url') {
            $cliObj->cli_echo(implode(chr(10), $this->downloadUrls) . chr(10), true);
        } elseif ($cliObj->cli_argValue('-o') === 'exec') {
            $cliObj->cli_echo("Executing " . count($this->urlList) . " requests right away:\n\n");
            $cliObj->cli_echo(implode(chr(10), $this->urlList) . chr(10));
            $cliObj->cli_echo("\nProcessing:\n");

            foreach ($this->queueEntries as $queueRec) {
                $p = unserialize($queueRec['parameters']);
                $cliObj->cli_echo($p['url'] . ' (' . implode(',', $p['procInstructions']) . ') => ');

                $result = $this->readUrlFromArray($queueRec);

                $requestResult = unserialize($result['content']);
                if (is_array($requestResult)) {
                    $resLog = is_array($requestResult['log']) ? chr(10) . chr(9) . chr(9) . implode(chr(10) . chr(9) . chr(9), $requestResult['log']) : '';
                    $cliObj->cli_echo('OK: ' . $resLog . chr(10));
                } else {
                    $cliObj->cli_echo('Error checking Crawler Result: ' . substr(preg_replace('/\s+/', ' ', strip_tags($result['content'])), 0, 30000) . '...' . chr(10));
                }
            }
        } elseif ($cliObj->cli_argValue('-o') === 'queue') {
            $cliObj->cli_echo("Putting " . count($this->urlList) . " entries in queue:\n\n");
            $cliObj->cli_echo(implode(chr(10), $this->urlList) . chr(10));
        } else {
            $cliObj->cli_echo(count($this->urlList) . " entries found for processing. (Use -o to decide action):\n\n", true);
            $cliObj->cli_echo(implode(chr(10), $this->urlList) . chr(10), true);
        }
    }

    /**
     * Function executed by the flush cli script; flushes queue entries for the given page.
     *
     * @return bool
     */
    public function CLI_main_flush()
    {
        $this->setAccessMode('cli_flush');
        $cliObj = GeneralUtility::makeInstance(FlushCommandLineController::class);

        // Force user to admin state and set workspace to "Live":
        $this->backendUser->user['admin'] = 1;
        $this->backendUser->setWorkspace(0);

        // Print help
        if (!isset($cliObj->cli_args['_DEFAULT'][1])) {
            $cliObj->cli_validateArgs();
            $cliObj->cli_help();
            exit;
        }

        $cliObj->cli_validateArgs();
        $pageId = MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][1], 0);
        $fullFlush = ($pageId == 0);

        $mode = $cliObj->cli_argValue('-o');

        switch ($mode) {
            case 'all':
                $result = $this->getLogEntriesForPageId($pageId, '', true, $fullFlush);
                break;
            case 'finished':
            case 'pending':
                $result = $this->getLogEntriesForPageId($pageId, $mode, true, $fullFlush);
                break;
            default:
                $cliObj->cli_validateArgs();
                $cliObj->cli_help();
                $result = false;
        }

        return $result !== false;
    }

    /**
     * Obtains configuration keys from the CLI arguments
     *
     * @param QueueCommandLineController $cliObj
     * @return array
     *
     * @deprecated since crawler v6.3.0, will be removed in crawler v7.0.0.
     */
    protected function getConfigurationKeys(QueueCommandLineController $cliObj)
    {
        $parameter = trim($cliObj->cli_argValue('-conf'));
        return ($parameter != '' ? GeneralUtility::trimExplode(',', $parameter) : []);
    }

    /**
     * Running the functionality of the CLI (crawling URLs from queue)
     *
     * @param int $countInARun
     * @param int $sleepTime
     * @param int $sleepAfterFinish
     * @return int status bitmask (see the CLI_STATUS_* constants)
     */
    public function CLI_run($countInARun, $sleepTime, $sleepAfterFinish)
    {
        $result = 0;
        $counter = 0;

        // First, run hooks:
        $this->CLI_runHooks();

        // Clean up the queue
        if (intval($this->extensionSettings['purgeQueueDays']) > 0) {
            $purgeDate = $this->getCurrentTime() - 24 * 60 * 60 * intval($this->extensionSettings['purgeQueueDays']);
            $del = $this->db->exec_DELETEquery(
                'tx_crawler_queue',
                'exec_time!=0 AND exec_time<' . $purgeDate
            );
            if (false == $del) {
                GeneralUtility::devLog('Records could not be deleted.', 'crawler', LogLevel::INFO);
            }
        }

        // Select entries:
        //TODO Shouldn't this reside within the transaction?
        $rows = $this->db->exec_SELECTgetRows(
            'qid,scheduled',
            'tx_crawler_queue',
            'exec_time=0
                AND process_scheduled= 0
                AND scheduled<=' . $this->getCurrentTime(),
            '',
            'scheduled, qid',
            intval($countInARun)
        );

        // exec_SELECTgetRows() may return null on error, so check the type before counting
        if (is_array($rows) && count($rows) > 0) {
            $quidList = [];

            foreach ($rows as $r) {
                $quidList[] = $r['qid'];
            }

            $processId = $this->CLI_buildProcessId();

            //reserve queue entries for process
            $this->db->sql_query('BEGIN');
            //TODO make sure we're not taking assigned queue-entries
            $this->db->exec_UPDATEquery(
                'tx_crawler_queue',
                'qid IN (' . implode(',', $quidList) . ')',
                [
                    'process_scheduled' => intval($this->getCurrentTime()),
                    'process_id' => $processId,
                ]
            );

            //save the number of assigned queue entries to determine how many have been processed later
            $numberOfAffectedRows = $this->db->sql_affected_rows();
            $this->db->exec_UPDATEquery(
                'tx_crawler_process',
                "process_id = '" . $processId . "'",
                [
                    'assigned_items_count' => intval($numberOfAffectedRows),
                ]
            );

            if ($numberOfAffectedRows == count($quidList)) {
                $this->db->sql_query('COMMIT');
            } else {
                $this->db->sql_query('ROLLBACK');
                $this->CLI_debug("Nothing processed due to multi-process collision (" . $this->CLI_buildProcessId() . ")");
                return ($result | self::CLI_STATUS_ABORTED);
            }

            foreach ($rows as $r) {
                $result |= $this->readUrl($r['qid']);

                $counter++;
                usleep(intval($sleepTime)); // Just to relax the system

                // If the CLI has been disabled between the start and the current readUrl() call,
                // return from the function WITHOUT marking the process as ended.
                if ($this->getDisabled()) {
                    return ($result | self::CLI_STATUS_ABORTED);
                }

                $process = $this->processRepository->findByProcessId($this->CLI_buildProcessId());
                if (!$process[0]['active']) {
                    $this->CLI_debug("conflict / timeout (" . $this->CLI_buildProcessId() . ")");

                    //TODO might need an additional returncode
                    $result |= self::CLI_STATUS_ABORTED;
                    break; //possible timeout
                }
            }

            sleep(intval($sleepAfterFinish));

            $msg = 'Rows: ' . $counter;
            $this->CLI_debug($msg . " (" . $this->CLI_buildProcessId() . ")");
        } else {
            $this->CLI_debug("Nothing within queue which needs to be processed (" . $this->CLI_buildProcessId() . ")");
        }

        if ($counter > 0) {
            $result |= self::CLI_STATUS_PROCESSED;
        }

        return $result;
    }

    /**
     * Activate hooks
     *
     * @return void
     */
    public function CLI_runHooks()
    {
        global $TYPO3_CONF_VARS;
        if (is_array($TYPO3_CONF_VARS['EXTCONF']['crawler']['cli_hooks'])) {
            foreach ($TYPO3_CONF_VARS['EXTCONF']['crawler']['cli_hooks'] as $objRef) {
                $hookObj = &GeneralUtility::getUserObj($objRef);
                if (is_object($hookObj)) {
                    $hookObj->crawler_init($this);
                }
            }
        }
    }

    /**
     * Try to acquire a new process with the given id
     * also performs some auto-cleanup for orphan processes
     * @todo preemption might not be the most elegant way to clean up
     *
     * @param string $id identification string for the process
     * @return boolean
     */
    public function CLI_checkAndAcquireNewProcess($id)
    {
        $ret = true;

        $systemProcessId = getmypid();
        if ($systemProcessId < 1) {
            return false;
        }

        $processCount = 0;
        $orphanProcesses = [];

        $this->db->sql_query('BEGIN');

        $res = $this->db->exec_SELECTquery(
            'process_id,ttl',
            'tx_crawler_process',
            'active=1 AND deleted=0'
        );

        $currentTime = $this->getCurrentTime();

        while ($row = $this->db->sql_fetch_assoc($res)) {
            if ($row['ttl'] < $currentTime) {
                $orphanProcesses[] = $row['process_id'];
            } else {
                $processCount++;
            }
        }

        // if there are fewer active processes than allowed, add a new one
        if ($processCount < intval($this->extensionSettings['processLimit'])) {
            $this->CLI_debug("add process " . $this->CLI_buildProcessId() . " (" . ($processCount + 1) . "/" . intval($this->extensionSettings['processLimit']) . ")");

            // create new process record
            $this->db->exec_INSERTquery(
                'tx_crawler_process',
                [
                    'process_id' => $id,
                    'active' => '1',
                    'ttl' => ($currentTime + intval($this->extensionSettings['processMaxRunTime'])),
                    'system_process_id' => $systemProcessId,
                ]
            );
        } else {
            $this->CLI_debug("Processlimit reached (" . ($processCount) . "/" . intval($this->extensionSettings['processLimit']) . ")");
            $ret = false;
        }

        $this->CLI_releaseProcesses($orphanProcesses, true); // maybe this should be somehow included into the current lock
        $this->CLI_deleteProcessesMarkedDeleted();

        $this->db->sql_query('COMMIT');

        return $ret;
    }

    /**
     * Release a process and the required resources
     *
     * @param  mixed    $releaseIds   string with a single process-id or array with multiple process-ids
     * @param  boolean  $withinLock   whether the DB-actions are included within an existing lock
     * @return boolean
     *
     * @deprecated since crawler v6.5.1, will be removed in crawler v9.0.0.
     */
    public function CLI_releaseProcesses($releaseIds, $withinLock = false)
    {
        if (!is_array($releaseIds)) {
            $releaseIds = [$releaseIds];
        }

        if (count($releaseIds) === 0) {
            return false;   //nothing to release
        }

        if (!$withinLock) {
            $this->db->sql_query('BEGIN');
        }

        // some kind of 2nd chance algo - this way you need at least 2 processes to have a real cleanup
        // this ensures that a single process can't mess up the entire process table

        // mark all processes as deleted which have no "waiting" queue-entries and which are not active
        $this->db->exec_UPDATEquery(
            'tx_crawler_queue',
            'process_id IN (SELECT process_id FROM tx_crawler_process WHERE active=0 AND deleted=0)',
            [
                'process_scheduled' => 0,
                'process_id' => '',
            ]
        );
        $this->db->exec_UPDATEquery(
            'tx_crawler_process',
            'active=0 AND deleted=0
            AND NOT EXISTS (
                SELECT * FROM tx_crawler_queue
                WHERE tx_crawler_queue.process_id = tx_crawler_process.process_id
                AND tx_crawler_queue.exec_time = 0
            )',
            [
                'deleted' => '1',
                'system_process_id' => 0,
            ]
        );
        // mark all requested processes as non-active
        $this->db->exec_UPDATEquery(
            'tx_crawler_process',
            'process_id IN (\'' . implode('\',\'', $releaseIds) . '\') AND deleted=0',
            [
                'active' => '0',
            ]
        );
        $this->db->exec_UPDATEquery(
            'tx_crawler_queue',
            'exec_time=0 AND process_id IN ("' . implode('","', $releaseIds) . '")',
            [
                'process_scheduled' => 0,
                'process_id' => '',
            ]
        );

        if (!$withinLock) {
            $this->db->sql_query('COMMIT');
        }

        return true;
    }

    /**
     * Delete processes marked as deleted
     *
     * @return void
     *
     * @deprecated since crawler v6.5.1, will be removed in crawler v9.0.0.
     */
    public function CLI_deleteProcessesMarkedDeleted()
    {
        $this->db->exec_DELETEquery('tx_crawler_process', 'deleted = 1');
    }

    /**
     * Check if there are still resources left for the process with the given id
     * Used to determine timeouts and to ensure a proper cleanup if there's a timeout
     *
     * @param  string  $pid  identification string for the process
     * @return boolean determines if the process is still active / has resources
     *
     * @deprecated since crawler v6.5.1, will be removed in crawler v9.0.0.
     *
     * FIXME: Please remove Transaction, not needed as only a select query.
     */
    public function CLI_checkIfProcessIsActive($pid)
    {
        $ret = false;
        $this->db->sql_query('BEGIN');
        $res = $this->db->exec_SELECTquery(
            'process_id,active,ttl',
            'tx_crawler_process',
            'process_id = \'' . $pid . '\'  AND deleted=0',
            '',
            'ttl',
            '0,1'
        );
        if ($row = $this->db->sql_fetch_assoc($res)) {
            $ret = intval($row['active']) == 1;
        }
        $this->db->sql_query('COMMIT');

        return $ret;
    }

    /**
     * Create a unique Id for the current process
     *
     * @return string  the ID
     */
    public function CLI_buildProcessId()
    {
        if (!$this->processID) {
            $this->processID = GeneralUtility::shortMD5($this->microtime(true));
        }
        return $this->processID;
    }

    /**
     * @param bool $get_as_float
     *
     * @return mixed
     */
    protected function microtime($get_as_float = false)
    {
        return microtime($get_as_float);
    }

    /**
     * Prints a message to the stdout (only if debug-mode is enabled)
     *
     * @param  string $msg  the message
     */
    public function CLI_debug($msg)
    {
        if (intval($this->extensionSettings['processDebug'])) {
            echo $msg . "\n";
            flush();
        }
    }

    /**
     * Get URL content by making direct request to TYPO3.
     *
     * @param  string $url          Page URL
     * @param  int    $crawlerId    Crawler-ID
     * @return array
     */
    protected function sendDirectRequest($url, $crawlerId)
    {
        $parsedUrl = parse_url($url);
        if (!is_array($parsedUrl)) {
            return [];
        }

        $requestHeaders = $this->buildRequestHeaderArray($parsedUrl, $crawlerId);

        $cmd = escapeshellcmd($this->extensionSettings['phpPath']);
        $cmd .= ' ';
        $cmd .= escapeshellarg(ExtensionManagementUtility::extPath('crawler') . 'cli/bootstrap.php');
        $cmd .= ' ';
        $cmd .= escapeshellarg($this->getFrontendBasePath());
        $cmd .= ' ';
        $cmd .= escapeshellarg($url);
        $cmd .= ' ';
        $cmd .= escapeshellarg(base64_encode(serialize($requestHeaders)));

        $startTime = microtime(true);
        $content = $this->executeShellCommand($cmd);
        $this->log($url . ' ' . (microtime(true) - $startTime));

        $result = [
            'request' => implode("\r\n", $requestHeaders) . "\r\n\r\n",
            'headers' => '',
            'content' => $content,
        ];

        return $result;
    }

    /**
     * Cleans up entries that stayed for too long in the queue. These are:
     * - processed entries that are over 1.5 days in age
     * - scheduled entries that are over 7 days old
     *
     * @return void
     *
     * TODO: Should be switched back to protected - TNM 2018-11-16
     */
    public function cleanUpOldQueueEntries()
    {
        $processedAgeInSeconds = $this->extensionSettings['cleanUpProcessedAge'] * 86400; // 24*60*60 Seconds in 24 hours
        $scheduledAgeInSeconds = $this->extensionSettings['cleanUpScheduledAge'] * 86400;

        $now = time();
        $condition = '(exec_time<>0 AND exec_time<' . ($now - $processedAgeInSeconds) . ') OR scheduled<=' . ($now - $scheduledAgeInSeconds);
        $this->flushQueue($condition);
    }
2717
    /**
     * Initializes a TypoScript Frontend necessary for using TypoScript and TypoLink functions
     *
     * @param int $id
     * @param int $typeNum
     *
     * @throws \TYPO3\CMS\Core\Error\Http\ServiceUnavailableException
     *
     * @return void
     */
    protected function initTSFE($id = 1, $typeNum = 0)
    {
        EidUtility::initTCA();

        $isVersion7 = VersionNumberUtility::convertVersionNumberToInteger(TYPO3_version) < 8000000;
        if ($isVersion7 && !is_object($GLOBALS['TT'])) {
            // NullTimeTracker and its start() method are deprecated since TYPO3 v8
            // and will be removed in v9; use the regular time tracking there.
            /** @var NullTimeTracker $GLOBALS['TT'] */
            $GLOBALS['TT'] = new NullTimeTracker();
            $GLOBALS['TT']->start();
        } else {
            $timeTracker = GeneralUtility::makeInstance(TimeTracker::class);
            $timeTracker->start();
        }

        $GLOBALS['TSFE'] = GeneralUtility::makeInstance(TypoScriptFrontendController::class, $GLOBALS['TYPO3_CONF_VARS'], $id, $typeNum);
        $GLOBALS['TSFE']->sys_page = GeneralUtility::makeInstance(PageRepository::class);
        $GLOBALS['TSFE']->sys_page->init(true);
        $GLOBALS['TSFE']->connectToDB();
        $GLOBALS['TSFE']->initFEuser();
        $GLOBALS['TSFE']->determineId();
        $GLOBALS['TSFE']->initTemplate();
        $GLOBALS['TSFE']->rootLine = $GLOBALS['TSFE']->sys_page->getRootLine($id, '');
        $GLOBALS['TSFE']->getConfigArray();
        // PageGenerator::pagegenInit() is deprecated since TYPO3 v8 and will be removed in TYPO3 v9.
        PageGenerator::pagegenInit();
    }

    /**
     * Returns an MD5 hash generated from a serialized configuration array.
     * The keys 'paramExpanded' and 'URLs' are removed before hashing.
     *
     * @param array $configuration
     *
     * @return string
     */
    protected function getConfigurationHash(array $configuration)
    {
        unset($configuration['paramExpanded']);
        unset($configuration['URLs']);
        return md5(serialize($configuration));
    }

    /**
     * Check whether the crawling protocol should be http or https.
     *
     * @param int  $crawlerConfiguration -1 forces http, 1 forces https, 0 defers to the page configuration
     * @param bool $pageConfiguration    Value returned when the crawler configuration is 0
     *
     * @return bool TRUE if https should be used, FALSE for http
     */
    protected function isCrawlingProtocolHttps($crawlerConfiguration, $pageConfiguration)
    {
        switch ($crawlerConfiguration) {
            case -1:
                return false;
            case 0:
                return $pageConfiguration;
            case 1:
                return true;
            default:
                return false;
        }
    }
}
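
The three side-effect-free pieces of logic in this part of the class (the configuration hash, the protocol switch and the clean-up condition) can be tried outside TYPO3. The following is only a minimal sketch under stated assumptions: the function names prefixed with `sketch` are hypothetical and not part of the crawler extension, the sample configuration values are made up, scalar type declarations assume PHP 7 or later, and the clean-up helper rounds the age to whole seconds, which the original method does not do.

```php
<?php
// Hypothetical standalone helpers mirroring CrawlerController logic.
// None of these names exist in the crawler extension.

/**
 * Hash a configuration while ignoring the keys that are removed
 * before hashing in getConfigurationHash().
 */
function sketchConfigurationHash(array $configuration): string
{
    unset($configuration['paramExpanded'], $configuration['URLs']);
    return md5(serialize($configuration));
}

/**
 * Resolve the protocol: 1 forces https, 0 defers to the page setting,
 * -1 and any other value fall back to http.
 */
function sketchIsHttps(int $crawlerSetting, bool $pageUsesHttps): bool
{
    switch ($crawlerSetting) {
        case 1:
            return true;
        case 0:
            return $pageUsesHttps;
        default:
            return false;
    }
}

/**
 * Build the queue clean-up condition from ages given in days.
 * Assumption: ages are rounded to whole seconds for a stable string.
 */
function sketchCleanUpCondition(float $processedAgeDays, float $scheduledAgeDays, int $now): string
{
    $processedBefore = $now - (int)round($processedAgeDays * 86400);
    $scheduledBefore = $now - (int)round($scheduledAgeDays * 86400);
    return '(exec_time<>0 AND exec_time<' . $processedBefore . ') OR scheduled<=' . $scheduledBefore;
}

// Made-up sample configuration; the keys other than 'URLs' and
// 'paramExpanded' are illustrative only.
$base = ['baseUrl' => 'https://example.org/', 'procInstrFilter' => 'example_filter'];
$withVolatileKeys = $base + [
    'URLs' => ['https://example.org/?id=1'],
    'paramExpanded' => ['L' => [0, 1]],
];

// Both hashes match because the volatile keys are dropped first.
var_dump(sketchConfigurationHash($base) === sketchConfigurationHash($withVolatileKeys));
var_dump(sketchIsHttps(0, true));
echo sketchCleanUpCondition(1.5, 7, 1000000), PHP_EOL;
```

Passing the current time in as a parameter, instead of calling `time()` inside the helper, is a choice made here so the condition string can be checked deterministically.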