Completed
Push — typo3v9 ( 6762b6...ea38b1 )
by Tomas Norre
05:52
created

CrawlerController   F

Complexity

Total Complexity 279

Size/Duplication

Total Lines 2483
Duplicated Lines 0 %

Coupling/Cohesion

Components 1
Dependencies 6

Test Coverage

Coverage 41.32%

Importance

Changes 0
Metric Value
wmc 279
lcom 1
cbo 6
dl 0
loc 2483
ccs 433
cts 1048
cp 0.4132
rs 0.8
c 0
b 0
f 0
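
The coverage figures in the table above appear to be related as cp = ccs / cts. The following is a minimal check of that reading, assuming `ccs` counts covered statements and `cts` counts all coverable statements (the report itself does not define these abbreviations):

```php
<?php
// Values copied from the metric table above.
$ccs = 433;   // assumption: number of covered statements
$cts = 1048;  // assumption: total number of coverable statements

// Ratio rounded to four decimals, the precision used for "cp" in the table.
$coverage = round($ccs / $cts, 4);

// Print the ratio as a percentage with two decimals.
printf("%.2f%%\n", $coverage * 100);
```

Under that assumption the ratio agrees with both the `cp` value and the "Coverage" percentage reported above.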

54 Methods

Rating   Name   Duplication   Size   Complexity  
A getAccessMode() 0 4 1
A setAccessMode() 0 4 1
A setDisabled() 0 10 3
A getDisabled() 0 4 1
A setProcessFilename() 0 4 1
A getProcessFilename() 0 4 1
A __construct() 0 23 3
A setExtensionSettings() 0 4 1
F checkIfPageShouldBeSkipped() 0 53 14
A getUrlsForPageRow() 0 14 2
A noUnprocessedQueueEntriesForPageWithConfigurationHashExist() 0 22 2
B urlListFromUrlArray() 0 62 9
A drawURLs_PIfilter() 0 13 4
A getPageTSconfigForId() 0 21 4
F getUrlsForPageId() 0 105 20
A getBaseUrlForConfigurationRecord() 0 19 4
B getConfigurationsForBranch() 0 44 7
A getQueryBuilder() 0 4 1
A hasGroupAccess() 0 12 4
F expandParameters() 0 121 25
B compileUrls() 0 23 6
B getLogEntriesForPageId() 0 44 6
B getLogEntriesForSetId() 0 44 6
A flushQueue() 0 34 5
A addQueueEntry_callBack() 0 20 3
B addUrl() 0 71 6
B getDuplicateRowsIfExist() 0 49 5
A getCurrentTime() 0 4 1
C readUrl() 0 92 11
A readUrlFromArray() 0 31 1
A readUrl_exec() 0 29 4
D requestUrl() 0 97 15
B getFrontendBasePath() 0 23 7
A executeShellCommand() 0 4 1
A getHttpResponseFromStream() 0 23 5
A buildRequestHeaderArray() 0 16 4
B getRequestUrlFrom302Header() 0 34 11
A fe_init() 0 26 4
B getPageTreeAndUrls() 0 91 7
B expandExcludeString() 0 45 9
C drawURLs_addRowsForPage() 0 106 14
C CLI_run() 0 118 10
A CLI_runHooks() 0 9 3
B CLI_checkAndAcquireNewProcess() 0 56 5
B CLI_releaseProcesses() 0 87 5
A CLI_checkIfProcessIsActive() 0 20 2
A CLI_buildProcessId() 0 7 2
A microtime() 0 4 1
A CLI_debug() 0 7 2
A sendDirectRequest() 0 31 2
A cleanUpOldQueueEntries() 0 9 1
A initTSFE() 0 15 1
A getConfigurationHash() 0 6 1
A getUrlFromPageAndQueryParameters() 0 34 5

How to fix   Complexity   

Complex Class

Complex classes like CrawlerController often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to finding such a component is to look for fields and methods that share the same prefixes or suffixes. You can also look at the cohesion graph to spot any unconnected or weakly connected components.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

While breaking up the class, it is a good idea to analyze how other classes use CrawlerController, and based on these observations, apply Extract Interface, too.
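
As an illustration of Extract Class combined with Extract Interface, the sketch below pulls the disable-switch handling (`setDisabled()`, `getDisabled()` and the process filename) out of the controller. The names `DisableSwitchInterface` and `FileDisableSwitch` are hypothetical and do not exist in the crawler extension, and `file_put_contents()` is used as a plain-PHP stand-in for `GeneralUtility::writeFile()` so that the example runs on its own.

```php
<?php
declare(strict_types=1);

// Hypothetical interface, derived from how callers use the disable switch.
interface DisableSwitchInterface
{
    public function setDisabled(bool $disabled = true): void;

    public function isDisabled(): bool;
}

// Hypothetical extracted class: it owns the marker file that the controller
// currently manages through setDisabled(), getDisabled() and
// setProcessFilename().
final class FileDisableSwitch implements DisableSwitchInterface
{
    /** @var string */
    private $processFilename;

    public function __construct(string $processFilename)
    {
        $this->processFilename = $processFilename;
    }

    public function setDisabled(bool $disabled = true): void
    {
        if ($disabled) {
            // Stand-in for GeneralUtility::writeFile() in this sketch.
            file_put_contents($this->processFilename, '');
        } elseif (is_file($this->processFilename)) {
            unlink($this->processFilename);
        }
    }

    public function isDisabled(): bool
    {
        return is_file($this->processFilename);
    }
}

// Usage: the controller would delegate to the switch instead of touching
// the file system itself.
$switch = new FileDisableSwitch(
    sys_get_temp_dir() . '/tx_crawler_example_' . uniqid() . '.proc'
);
$switch->setDisabled();
$wasDisabled = $switch->isDisabled();
$switch->setDisabled(false);
$isDisabledAfterReset = $switch->isDisabled();
```

A controller that receives a `DisableSwitchInterface` through its constructor could then be tested without writing to the real lock directory.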

<?php
namespace AOE\Crawler\Controller;

/***************************************************************
 *  Copyright notice
 *
 *  (c) 2019 AOE GmbH <[email protected]>
 *
 *  All rights reserved
 *
 *  This script is part of the TYPO3 project. The TYPO3 project is
 *  free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  The GNU General Public License can be found at
 *  http://www.gnu.org/copyleft/gpl.html.
 *
 *  This script is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  This copyright notice MUST APPEAR in all copies of the script!
 ***************************************************************/

use AOE\Crawler\Configuration\ExtensionConfigurationProvider;
use AOE\Crawler\Domain\Repository\ConfigurationRepository;
use AOE\Crawler\Domain\Repository\ProcessRepository;
use AOE\Crawler\Domain\Repository\QueueRepository;
use AOE\Crawler\Event\EventDispatcher;
use AOE\Crawler\Utility\IconUtility;
use AOE\Crawler\Utility\SignalSlotUtility;
use Psr\Http\Message\UriInterface;
use Psr\Log\LoggerAwareInterface;
use Psr\Log\LoggerAwareTrait;
use TYPO3\CMS\Backend\Tree\View\PageTreeView;
use TYPO3\CMS\Backend\Utility\BackendUtility;
use TYPO3\CMS\Core\Authentication\BackendUserAuthentication;
use TYPO3\CMS\Core\Core\Environment;
use TYPO3\CMS\Core\Database\Connection;
use TYPO3\CMS\Core\Database\ConnectionPool;
use TYPO3\CMS\Core\Database\Query\Restriction\DeletedRestriction;
use TYPO3\CMS\Core\Database\Query\Restriction\EndTimeRestriction;
use TYPO3\CMS\Core\Database\Query\Restriction\HiddenRestriction;
use TYPO3\CMS\Core\Database\Query\Restriction\StartTimeRestriction;
use TYPO3\CMS\Core\Http\Uri;
use TYPO3\CMS\Core\Routing\SiteMatcher;
use TYPO3\CMS\Core\Site\Entity\Site;
use TYPO3\CMS\Core\Site\SiteFinder;
use TYPO3\CMS\Core\TypoScript\Parser\TypoScriptParser;
use TYPO3\CMS\Core\Utility\DebugUtility;
use TYPO3\CMS\Core\Utility\ExtensionManagementUtility;
use TYPO3\CMS\Core\Utility\GeneralUtility;
use TYPO3\CMS\Core\Utility\MathUtility;
use TYPO3\CMS\Extbase\Object\ObjectManager;
use TYPO3\CMS\Frontend\Controller\TypoScriptFrontendController;
use TYPO3\CMS\Frontend\Page\CacheHashCalculator;
use TYPO3\CMS\Frontend\Page\PageRepository;

/**
 * Class CrawlerController
 *
 * @package AOE\Crawler\Controller
 */
class CrawlerController implements LoggerAwareInterface
{
    use LoggerAwareTrait;

    const CLI_STATUS_NOTHING_PROCCESSED = 0;
    const CLI_STATUS_REMAIN = 1; // queue not empty
    const CLI_STATUS_PROCESSED = 2; // (some) queue items were processed
    const CLI_STATUS_ABORTED = 4; // instance didn't finish
    const CLI_STATUS_POLLABLE_PROCESSED = 8;

    /**
     * @var integer
     */
    public $setID = 0;

    /**
     * @var string
     */
    public $processID = '';

    /**
     * @var array
     */
    public $duplicateTrack = [];

    /**
     * @var array
     */
    public $downloadUrls = [];

    /**
     * @var array
     */
    public $incomingProcInstructions = [];

    /**
     * @var array
     */
    public $incomingConfigurationSelection = [];

    /**
     * @var bool
     */
    public $registerQueueEntriesInternallyOnly = false;

    /**
     * @var array
     */
    public $queueEntries = [];

    /**
     * @var array
     */
    public $urlList = [];

    /**
     * @var array
     */
    public $extensionSettings = [];

    /**
     * Mount Point
     *
     * @var string|bool
     */
    public $MP = false;

    /**
     * @var string
     */
    protected $processFilename;

    /**
     * Holds the internal access mode; can be 'gui', 'cli' or 'cli_im'
     *
     * @var string
     */
    protected $accessMode;

    /**
     * @var BackendUserAuthentication
     */
    private $backendUser;

    /**
     * @var integer
     */
    private $scheduledTime = 0;

    /**
     * @var integer
     */
    private $reqMinute = 0;

    /**
     * @var bool
     */
    private $submitCrawlUrls = false;

    /**
     * @var bool
     */
    private $downloadCrawlUrls = false;

    /**
     * @var QueueRepository
     */
    protected $queueRepository;

    /**
     * @var ProcessRepository
     */
    protected $processRepository;

    /**
     * @var ConfigurationRepository
     */
    protected $configurationRepository;

    /**
     * @var string
     */
    protected $tableName = 'tx_crawler_queue';

    /**
     * @var int
     */
    protected $maximumUrlsToCompile = 10000;

    /**
     * Method to get the accessMode; can be 'gui', 'cli' or 'cli_im'
     *
     * @return string
     */
    public function getAccessMode()
    {
        return $this->accessMode;
    }

    /**
     * @param string $accessMode
     */
    public function setAccessMode($accessMode)
    {
        $this->accessMode = $accessMode;
    }

    /**
     * Set disabled status to prevent processes from being processed
     *
     * @param  bool $disabled (optional, defaults to true)
     * @return void
     */
    public function setDisabled($disabled = true)
    {
        if ($disabled) {
            GeneralUtility::writeFile($this->processFilename, '');
        } else {
            if (is_file($this->processFilename)) {
                unlink($this->processFilename);
            }
        }
    }

    /**
     * Get disable status
     *
     * @return bool true if disabled
     */
    public function getDisabled()
    {
        return is_file($this->processFilename);
    }

    /**
     * @param string $filenameWithPath
     *
     * @return void
     */
    public function setProcessFilename($filenameWithPath)
    {
        $this->processFilename = $filenameWithPath;
    }

    /**
     * @return string
     */
    public function getProcessFilename()
    {
        return $this->processFilename;
    }

    /************************************
     *
     * Getting URLs based on Page TSconfig
     *
     ************************************/

    public function __construct()
    {
        $objectManager = GeneralUtility::makeInstance(ObjectManager::class);
        $this->queueRepository = $objectManager->get(QueueRepository::class);
        $this->processRepository = $objectManager->get(ProcessRepository::class);
        $this->configurationRepository = $objectManager->get(ConfigurationRepository::class);

        $this->backendUser = $GLOBALS['BE_USER'];
        $this->processFilename = Environment::getVarPath() . '/locks/tx_crawler.proc';

        /** @var ExtensionConfigurationProvider $configurationProvider */
        $configurationProvider = GeneralUtility::makeInstance(ExtensionConfigurationProvider::class);
        $settings = $configurationProvider->getExtensionConfiguration();
        $this->extensionSettings = is_array($settings) ? $settings : [];

        // set defaults:
        if (MathUtility::convertToPositiveInteger($this->extensionSettings['countInARun']) == 0) {
            $this->extensionSettings['countInARun'] = 100;
        }

        $this->extensionSettings['processLimit'] = MathUtility::forceIntegerInRange($this->extensionSettings['processLimit'], 1, 99, 1);
        $this->maximumUrlsToCompile = MathUtility::forceIntegerInRange($this->extensionSettings['maxCompileUrls'], 1, 1000000000, 10000);
    }

    /**
     * Sets the extensions settings (unserialized pendant of $TYPO3_CONF_VARS['EXT']['extConf']['crawler']).
     *
     * @param array $extensionSettings
     * @return void
     */
    public function setExtensionSettings(array $extensionSettings)
    {
        $this->extensionSettings = $extensionSettings;
    }

    /**
     * Check if the given page should be crawled
     *
     * @param array $pageRow
     * @return false|string false if the page should be crawled (not excluded), otherwise the skip message
     */
    public function checkIfPageShouldBeSkipped(array $pageRow)
    {
        $skipPage = false;
        $skipMessage = 'Skipped'; // message will be overwritten later

        // if page is hidden
        if (!$this->extensionSettings['crawlHiddenPages']) {
            if ($pageRow['hidden']) {
                $skipPage = true;
                $skipMessage = 'Because page is hidden';
            }
        }

        if (!$skipPage) {
            if (GeneralUtility::inList('3,4', $pageRow['doktype']) || $pageRow['doktype'] >= 199) {
                $skipPage = true;
                $skipMessage = 'Because doktype is not allowed';
            }
        }

        if (!$skipPage) {
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['excludeDoktype'] ?? [] as $key => $doktypeList) {
                if (GeneralUtility::inList($doktypeList, $pageRow['doktype'])) {
                    $skipPage = true;
                    $skipMessage = 'Doktype was excluded by "' . $key . '"';
                    break;
                }
            }
        }

        if (!$skipPage) {
            // veto hook
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pageVeto'] ?? [] as $key => $func) {
                $params = [
                    'pageRow' => $pageRow
                ];
                // expects "false" if page is ok and "true" or a skipMessage if this page should _not_ be crawled
                $veto = GeneralUtility::callUserFunction($func, $params, $this);
                if ($veto !== false) {
                    $skipPage = true;
                    if (is_string($veto)) {
                        $skipMessage = $veto;
                    } else {
                        $skipMessage = 'Veto from hook "' . htmlspecialchars($key) . '"';
                    }
                    // no need to execute other hooks if a previous one returned a veto
                    break;
                }
            }
        }

        return $skipPage ? $skipMessage : false;
    }

    /**
     * Wrapper method for getUrlsForPageId()
     * It returns an array of configurations and no urls!
     *
     * @param array $pageRow Page record with at least dok-type and uid columns.
     * @param string $skipMessage
     * @return array
     * @see getUrlsForPageId()
     */
    public function getUrlsForPageRow(array $pageRow, &$skipMessage = '')
    {
        $message = $this->checkIfPageShouldBeSkipped($pageRow);

        if ($message === false) {
            $res = $this->getUrlsForPageId($pageRow['uid']);
            $skipMessage = '';
        } else {
            $skipMessage = $message;
            $res = [];
        }

        return $res;
    }

    /**
     * This method checks whether there are ANY unprocessed queue entries
     * for a given page_id and the configuration which matches a given hash.
     * If there are none, we can skip an inner detail check
     *
     * @param  int $uid
     * @param  string $configurationHash
     * @return boolean
     */
    protected function noUnprocessedQueueEntriesForPageWithConfigurationHashExist($uid, $configurationHash)
    {
        $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
        $noUnprocessedQueueEntriesFound = true;

        $result = $queryBuilder
            ->count('*')
            ->from($this->tableName)
            ->where(
                $queryBuilder->expr()->eq('page_id', intval($uid)),
                $queryBuilder->expr()->eq('configuration_hash', $queryBuilder->createNamedParameter($configurationHash)),
                $queryBuilder->expr()->eq('exec_time', 0)
            )
            ->execute()
            ->fetchColumn();

        if ($result) {
            $noUnprocessedQueueEntriesFound = false;
        }

        return $noUnprocessedQueueEntriesFound;
    }

    /**
     * Creates a list of URLs from input array (and submits them to queue if asked for)
     * See Web > Info module script + "indexed_search"'s crawler hook-client using this!
     *
     * @param    array      $vv                       Information about URLs from pageRow to crawl.
     * @param    array      $pageRow                  Page row
     * @param    integer    $scheduledTime            Unix time to schedule indexing to, typically time()
     * @param    integer    $reqMinute                Number of requests per minute (creates the interleave between requests)
     * @param    boolean    $submitCrawlUrls          If set, submits the URLs to queue
     * @param    boolean    $downloadCrawlUrls        If set (and $submitCrawlUrls is false), fills $downloadUrls with entries
     * @param    array      $duplicateTrack           Passed by reference; contains an ID per URL to ensure we do not crawl duplicates
     * @param    array      $downloadUrls             Passed by reference; filled with URLs for download if the flag is set.
     * @param    array      $incomingProcInstructions Array of processing instructions
     * @return   string     List of URLs (meant for display in backend module)
     */
    public function urlListFromUrlArray(
        array $vv,
        array $pageRow,
        $scheduledTime,
        $reqMinute,
        $submitCrawlUrls,
        $downloadCrawlUrls,
        array &$duplicateTrack,
        array &$downloadUrls,
        array $incomingProcInstructions
    ) {
        $urlList = '';

        if (is_array($vv['URLs'])) {
            $configurationHash = $this->getConfigurationHash($vv);
            $skipInnerCheck = $this->noUnprocessedQueueEntriesForPageWithConfigurationHashExist($pageRow['uid'], $configurationHash);

            foreach ($vv['URLs'] as $urlQuery) {
                if ($this->drawURLs_PIfilter($vv['subCfg']['procInstrFilter'], $incomingProcInstructions)) {
                    $url = (string)$this->getUrlFromPageAndQueryParameters((int)$pageRow['uid'], $urlQuery, $vv['subCfg']['baseUrl'] ?: null);

                    // Create key by which to determine unique-ness:
                    $uKey = $url . '|' . $vv['subCfg']['userGroups'] . '|' . $vv['subCfg']['procInstrFilter'];
                    // Scheduled time:
                    $schTime = $scheduledTime + round(count($duplicateTrack) * (60 / $reqMinute));
                    $schTime = floor($schTime / 60) * 60;

                    if (isset($duplicateTrack[$uKey])) {
                        // if the url key is registered, just display it and do not resubmit it
                        $urlList = '<em><span class="typo3-dimmed">' . htmlspecialchars($url) . '</span></em><br/>';
                    } else {
                        $urlList = '[' . date('d.m.y H:i', $schTime) . '] ' . htmlspecialchars($url);
                        $this->urlList[] = '[' . date('d.m.y H:i', $schTime) . '] ' . $url;

                        // Submit for crawling!
                        if ($submitCrawlUrls) {
                            $added = $this->addUrl(
                                $pageRow['uid'],
                                $url,
                                $vv['subCfg'],
                                $scheduledTime,
                                $configurationHash,
                                $skipInnerCheck
                            );
                            if ($added === false) {
                                $urlList .= ' (URL already existed)';
                            }
                        } elseif ($downloadCrawlUrls) {
                            $downloadUrls[$url] = $url;
                        }

                        $urlList .= '<br />';
                    }
                    $duplicateTrack[$uKey] = true;
                }
            }
        } else {
            $urlList = 'ERROR - no URL generated';
        }

        return $urlList;
    }

    /**
     * Returns true if input processing instruction is among registered ones.
     *
     * @param string $piString PI to test
     * @param array $incomingProcInstructions Processing instructions
     * @return boolean
     */
    public function drawURLs_PIfilter($piString, array $incomingProcInstructions)
    {
        if (empty($incomingProcInstructions)) {
            return true;
        }

        foreach ($incomingProcInstructions as $pi) {
            if (GeneralUtility::inList($piString, $pi)) {
                return true;
            }
        }
        return false;
    }

    public function getPageTSconfigForId($id)
    {
        if (!$this->MP) {
            $pageTSconfig = BackendUtility::getPagesTSconfig($id);
        } else {
            [, $mountPointId] = explode('-', $this->MP);
Bug: The variable $mountPointId does not exist. Did you forget to declare it?

This check marks access to variables or properties that have not been declared yet. While PHP has no explicit notion of declaring a variable, accessing it before a value is assigned to it is most likely a bug.
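
In the flagged line, `$mountPointId` is assigned by list destructuring, so it only ends up undefined when the mount point string contains no `-` separator. A minimal sketch of a defensive variant follows; the helper name `mountPointIdFrom()` is hypothetical and not part of CrawlerController:

```php
<?php
declare(strict_types=1);

// Hypothetical helper for illustration only.
function mountPointIdFrom(string $mp): ?string
{
    // The array union supplies a null default for index 1, so the
    // destructuring below always finds a value, even without a '-'.
    [, $mountPointId] = explode('-', $mp) + [1 => null];

    return $mountPointId;
}

$withSeparator = mountPointIdFrom('12-34');
$withoutSeparator = mountPointIdFrom('12');
```

With this variant the variable is always assigned before it is read, which is the condition the check above is looking for.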

            $pageTSconfig = BackendUtility::getPagesTSconfig($mountPointId);
        }

        // Call a hook to alter configuration
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['getPageTSconfigForId'] ?? null)) {
            $params = [
                'pageId' => $id,
                'pageTSConfig' => &$pageTSconfig
            ];
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['getPageTSconfigForId'] as $userFunc) {
                GeneralUtility::callUserFunction($userFunc, $params, $this);
            }
        }
        return $pageTSconfig;
    }

    /**
     * This method returns an array of configurations.
     * And no urls!
     *
     * @param integer $id Page ID
Bug: There is no parameter named $id. Was it maybe removed?

This check looks for PHPDoc comments describing methods or function parameters that do not exist on the corresponding method or function.

Consider the following example. The parameter $italy is not defined by the method finale(...).

/**
 * @param array $germany
 * @param array $island
 * @param array $italy
 */
function finale($germany, $island) {
    return "2:1";
}

The most likely cause is that the parameter was removed, but the annotation was not.

     * @return array
     */
    public function getUrlsForPageId($pageId)
    {
        // Get page TSconfig for page ID
        $pageTSconfig = $this->getPageTSconfigForId($pageId);

        $res = [];

        // Fetch Crawler Configuration from pageTSconfig
        $crawlerCfg = $pageTSconfig['tx_crawler.']['crawlerCfg.']['paramSets.'] ?? [];
        foreach ($crawlerCfg as $key => $values) {
            if (!is_array($values)) {
                continue;
            }
            $key = str_replace('.', '', $key);
            // Sub configuration for a single configuration string:
            $subCfg = (array)$crawlerCfg[$key . '.'];
            $subCfg['key'] = $key;

            if (strcmp($subCfg['procInstrFilter'], '')) {
                $subCfg['procInstrFilter'] = implode(',', GeneralUtility::trimExplode(',', $subCfg['procInstrFilter']));
            }
            $pidOnlyList = implode(',', GeneralUtility::trimExplode(',', $subCfg['pidsOnly'], true));

            // process configuration if it is not page-specific or if the specific page is the current page:
            if (!strcmp($subCfg['pidsOnly'], '') || GeneralUtility::inList($pidOnlyList, $pageId)) {
                // add trailing slash if not present
                if (!empty($subCfg['baseUrl']) && substr($subCfg['baseUrl'], -1) != '/') {
                    $subCfg['baseUrl'] .= '/';
                }

                // Explode, process etc.:
                $res[$key] = [];
                $res[$key]['subCfg'] = $subCfg;
                $res[$key]['paramParsed'] = GeneralUtility::explodeUrl2Array($crawlerCfg[$key]);
                $res[$key]['paramExpanded'] = $this->expandParameters($res[$key]['paramParsed'], $pageId);
                $res[$key]['origin'] = 'pagets';

                // recognize MP value
                if (!$this->MP) {
                    $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $pageId]);
                } else {
                    $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $pageId . '&MP=' . $this->MP]);
                }
            }
        }

        // Get configuration from tx_crawler_configuration records up the rootline
        $crawlerConfigurations = $this->configurationRepository->getCrawlerConfigurationRecordsFromRootLine($pageId);
        foreach ($crawlerConfigurations as $configurationRecord) {
            // check access to the configuration record
            if (empty($configurationRecord['begroups']) || $GLOBALS['BE_USER']->isAdmin() || $this->hasGroupAccess($GLOBALS['BE_USER']->user['usergroup_cached_list'], $configurationRecord['begroups'])) {
                $pidOnlyList = implode(',', GeneralUtility::trimExplode(',', $configurationRecord['pidsonly'], true));

                // process configuration if it is not page-specific or if the specific page is the current page:
                if (!strcmp($configurationRecord['pidsonly'], '') || GeneralUtility::inList($pidOnlyList, $pageId)) {
                    $key = $configurationRecord['name'];

                    // don't overwrite previously defined paramSets
                    if (!isset($res[$key])) {
                        /** @var TypoScriptParser $TSparserObject */
                        $TSparserObject = GeneralUtility::makeInstance(TypoScriptParser::class);
                        $TSparserObject->parse($configurationRecord['processing_instruction_parameters_ts']);

                        $subCfg = [
                            'procInstrFilter' => $configurationRecord['processing_instruction_filter'],
                            'procInstrParams.' => $TSparserObject->setup,
                            'baseUrl' => $this->getBaseUrlForConfigurationRecord(
                                $configurationRecord['base_url'],
                                (int)$configurationRecord['sys_domain_base_url'],
                                (bool)($configurationRecord['force_ssl'] > 0)
                            ),
                            'userGroups' => $configurationRecord['fegroups'],
                            'exclude' => $configurationRecord['exclude'],
                            'rootTemplatePid' => (int) $configurationRecord['root_template_pid'],
                            'key' => $key
                        ];

                        // add trailing slash if not present
                        if (!empty($subCfg['baseUrl']) && substr($subCfg['baseUrl'], -1) != '/') {
                            $subCfg['baseUrl'] .= '/';
                        }
                        if (!in_array($pageId, $this->expandExcludeString($subCfg['exclude']))) {
                            $res[$key] = [];
                            $res[$key]['subCfg'] = $subCfg;
                            $res[$key]['paramParsed'] = GeneralUtility::explodeUrl2Array($configurationRecord['configuration']);
                            $res[$key]['paramExpanded'] = $this->expandParameters($res[$key]['paramParsed'], $pageId);
                            $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $pageId]);
                            $res[$key]['origin'] = 'tx_crawler_configuration_' . $configurationRecord['uid'];
                        }
                    }
                }
            }
        }

        foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['processUrls'] ?? [] as $func) {
            $params = [
                'res' => &$res,
            ];
            GeneralUtility::callUserFunction($func, $params, $this);
        }
        return $res;
    }

    /**
     * Checks if a domain record exists and returns the base URL based on the record. If not, the given baseUrl string is used.
     *
     * @param string $baseUrl
     * @param integer $sysDomainUid
     * @param bool $ssl
     * @return string
     */
    protected function getBaseUrlForConfigurationRecord(string $baseUrl, int $sysDomainUid, bool $ssl = false): string
    {
        if ($sysDomainUid > 0) {
            $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable('sys_domain');
            $domainName = $queryBuilder
                ->select('domainName')
                ->from('sys_domain')
                ->where(
                    $queryBuilder->expr()->eq('uid', $sysDomainUid)
                )
                ->execute()
                ->fetchColumn();

            if (!empty($domainName)) {
                $baseUrl = ($ssl ? 'https' : 'http') . '://' . $domainName;
            }
        }
        return $baseUrl;
    }

    /**
     * Find all configurations of subpages of a page
     *
     * @param int $rootid
     * @param int $depth
     * @return array
     *
     * TODO: Write Functional Tests
     */
    public function getConfigurationsForBranch(int $rootid, $depth)
    {
        $configurationsForBranch = [];
        $pageTSconfig = $this->getPageTSconfigForId($rootid);
        $sets = $pageTSconfig['tx_crawler.']['crawlerCfg.']['paramSets.'] ?? [];
        foreach ($sets as $key => $value) {
            if (!is_array($value)) {
                continue;
            }
            $configurationsForBranch[] = substr($key, -1) == '.' ? substr($key, 0, -1) : $key;
        }
        $pids = [];
        $rootLine = BackendUtility::BEgetRootLine($rootid);
        foreach ($rootLine as $node) {
            $pids[] = $node['uid'];
        }
        /* @var PageTreeView $tree */
        $tree = GeneralUtility::makeInstance(PageTreeView::class);
        $perms_clause = $GLOBALS['BE_USER']->getPagePermsClause(1);
        $tree->init('AND ' . $perms_clause);
        $tree->getTree($rootid, $depth, '');
        foreach ($tree->tree as $node) {
            $pids[] = $node['row']['uid'];
        }

        $queryBuilder = $this->getQueryBuilder('tx_crawler_configuration');

        $queryBuilder->getRestrictions()
            ->removeAll()
            ->add(GeneralUtility::makeInstance(DeletedRestriction::class));

        $statement = $queryBuilder
            ->select('name')
            ->from('tx_crawler_configuration')
            ->where(
                $queryBuilder->expr()->in('pid', $queryBuilder->createNamedParameter($pids, Connection::PARAM_INT_ARRAY))
            )
            ->execute();

        while ($row = $statement->fetch()) {
            $configurationsForBranch[] = $row['name'];
        }
        return $configurationsForBranch;
    }

    /**
     * Get query builder for given table
     *
     * @param string $table
     * @return \TYPO3\CMS\Core\Database\Query\QueryBuilder
     */
    private function getQueryBuilder(string $table)
    {
        return GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($table);
    }

    /**
     * Check if a user has access to an item
     * (e.g. get the group list of the currently logged-in user from $GLOBALS['TSFE']->gr_list)
     *
     * @see \TYPO3\CMS\Frontend\Page\PageRepository::getMultipleGroupsWhereClause()
     * @param  string $groupList    Comma-separated list of (fe_)group UIDs from a user
     * @param  string $accessList   Comma-separated list of (fe_)group UIDs of the item to access
     * @return bool                 TRUE if at least one of the user's group UIDs is in the access list or the access list is empty
     */
    public function hasGroupAccess($groupList, $accessList)
    {
        if (empty($accessList)) {
            return true;
        }
        foreach (GeneralUtility::intExplode(',', $groupList) as $groupUid) {
            if (GeneralUtility::inList($accessList, $groupUid)) {
                return true;
            }
        }
        return false;
    }
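
The access rule documented for hasGroupAccess() can be tried outside TYPO3 with a minimal sketch. The function name `hasGroupAccessSketch` is hypothetical, and plain `explode()` with `array_intersect()` stands in for `GeneralUtility::intExplode()` and `GeneralUtility::inList()`, so edge cases (for example an access list of "0") may behave differently from the real helpers.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-alone version of the group check (not part of the class):
 * an empty access list grants access, otherwise one shared group UID is enough.
 */
function hasGroupAccessSketch(string $groupList, string $accessList): bool
{
    if ($accessList === '') {
        return true;
    }
    // Cast every list entry to int, roughly what intExplode() does.
    $userGroups = array_map('intval', explode(',', $groupList));
    $allowedGroups = array_map('intval', explode(',', $accessList));

    // Access is granted as soon as the two lists share at least one UID.
    return array_intersect($userGroups, $allowedGroups) !== [];
}
```

A user in groups "1,2" would pass an item restricted to "2,3" but not one restricted to "3,4".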

    /**
     * Will expand the parameters configuration to individual values. This follows a certain syntax of the value of each parameter.
     * Syntax of values:
     * - Basically: If the value is wrapped in [...] it will be expanded according to the following syntax, otherwise the value is taken literally
     * - Configuration is split by "|" and the parts are processed individually and finally added together
     * - For each configuration part:
     *         - "[int]-[int]" = Integer range, will be expanded to all values in between, values included, starting from low to high (max. 1000). Example "1-34" or "-40--30"
     *         - "_TABLE:[TCA table name];[_PID:[optional page id, default is current page]];[_ENABLELANG:1]" = Look up of table records from PID, filtering out deleted records. Example "_TABLE:tt_content; _PID:123"
     *        _ENABLELANG:1 picks only original records without their language overlays
     *         - Default: Literal value
     *
     * @param array $paramArray Array with key (GET var name) and values (value of GET var which is configuration for expansion)
     * @param integer $pid Current page ID
     * @return array
     *
     * TODO: Write Functional Tests
     */
    public function expandParameters($paramArray, $pid)
    {
        // Traverse parameter names:
        foreach ($paramArray as $p => $v) {
            $v = trim($v);

            // If value is encapsulated in square brackets it means there are some ranges of values to find, otherwise the value is literal
            if (substr($v, 0, 1) === '[' && substr($v, -1) === ']') {
                // So, find the value inside brackets and reset the paramArray value as an array.
                $v = substr($v, 1, -1);
                $paramArray[$p] = [];

                // Explode parts and traverse them:
                $parts = explode('|', $v);
                foreach ($parts as $pV) {
                    // Look for integer range (e.g. 1-34 or -40--30, which reads minus 40 to minus 30):
                    if (preg_match('/^(-?[0-9]+)\s*-\s*(-?[0-9]+)$/', trim($pV), $reg)) {
                        // Swap if first is larger than last:
                        if ($reg[1] > $reg[2]) {
                            $temp = $reg[2];
                            $reg[2] = $reg[1];
                            $reg[1] = $temp;
                        }

                        // Traverse range, add values:
                        $runAwayBrake = 1000; // Limit to size of range!
                        for ($a = $reg[1]; $a <= $reg[2]; $a++) {
                            $paramArray[$p][] = $a;
                            $runAwayBrake--;
                            if ($runAwayBrake <= 0) {
                                break;
                            }
                        }
                    } elseif (substr(trim($pV), 0, 7) == '_TABLE:') {
                        // Parse parameters:
                        $subparts = GeneralUtility::trimExplode(';', $pV);
                        $subpartParams = [];
                        foreach ($subparts as $spV) {
                            list($pKey, $pVal) = GeneralUtility::trimExplode(':', $spV);
                            $subpartParams[$pKey] = $pVal;
                        }

                        // Table exists:
                        if (isset($GLOBALS['TCA'][$subpartParams['_TABLE']])) {
                            $lookUpPid = isset($subpartParams['_PID']) ? intval($subpartParams['_PID']) : $pid;
                            $pidField = isset($subpartParams['_PIDFIELD']) ? trim($subpartParams['_PIDFIELD']) : 'pid';
                            $where = isset($subpartParams['_WHERE']) ? $subpartParams['_WHERE'] : '';
                            $addTable = isset($subpartParams['_ADDTABLE']) ? $subpartParams['_ADDTABLE'] : '';

                            $fieldName = $subpartParams['_FIELD'] ? $subpartParams['_FIELD'] : 'uid';
                            if ($fieldName === 'uid' || $GLOBALS['TCA'][$subpartParams['_TABLE']]['columns'][$fieldName]) {
                                $queryBuilder = $this->getQueryBuilder($subpartParams['_TABLE']);

                                $queryBuilder->getRestrictions()
                                    ->removeAll()
                                    ->add(GeneralUtility::makeInstance(DeletedRestriction::class));

                                $queryBuilder
                                    ->select($fieldName)
                                    ->from($subpartParams['_TABLE'])
                                    // TODO: Check if this works as intended!
                                    ->add('from', $addTable)
                                    ->where(
                                        $queryBuilder->expr()->eq($queryBuilder->quoteIdentifier($pidField), $queryBuilder->createNamedParameter($lookUpPid, \PDO::PARAM_INT)),
                                        $where
                                    );
                                $transOrigPointerField = $GLOBALS['TCA'][$subpartParams['_TABLE']]['ctrl']['transOrigPointerField'];

                                if ($subpartParams['_ENABLELANG'] && $transOrigPointerField) {
                                    $queryBuilder->andWhere(
                                        $queryBuilder->expr()->lte(
                                            $queryBuilder->quoteIdentifier($transOrigPointerField),
                                            0
                                        )
                                    );
                                }

                                $statement = $queryBuilder->execute();

                                $rows = [];
                                while ($row = $statement->fetch()) {
                                    // Index rows by the selected field's value, so that array_keys() below yields the values
                                    $rows[$row[$fieldName]] = $row;
                                }

                                if (is_array($rows)) {
                                    $paramArray[$p] = array_merge($paramArray[$p], array_keys($rows));
                                }
                            }
                        }
                    } else { // Just add value:
                        $paramArray[$p][] = $pV;
                    }
                    // Hook for processing own expandParameters place holder
                    if (is_array($GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['crawler/class.tx_crawler_lib.php']['expandParameters'])) {
                        $_params = [
                            'pObj' => &$this,
                            'paramArray' => &$paramArray,
                            'currentKey' => $p,
                            'currentValue' => $pV,
                            'pid' => $pid
                        ];
                        foreach ($GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['crawler/class.tx_crawler_lib.php']['expandParameters'] as $key => $_funcRef) {
                            GeneralUtility::callUserFunction($_funcRef, $_params, $this);
                        }
                    }
                }

                // Make unique set of values and sort array by key:
                $paramArray[$p] = array_unique($paramArray[$p]);
                ksort($paramArray);
            } else {
                // Set the literal value as only value in array:
                $paramArray[$p] = [$v];
            }
        }

        return $paramArray;
    }
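
The integer-range part of the syntax described for expandParameters() can be illustrated in isolation. This is only a sketch: `expandIntegerRange` is a hypothetical helper that does not exist in the class, and it covers just the "[int]-[int]" case, reusing the same regular expression and the limit of 1000 values.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: expands one "[int]-[int]" part into its integer values.
 * Returns null when the part is not an integer range.
 */
function expandIntegerRange(string $part, int $limit = 1000): ?array
{
    if (!preg_match('/^(-?[0-9]+)\s*-\s*(-?[0-9]+)$/', trim($part), $reg)) {
        return null;
    }
    // Order the bounds, so "3-1" expands like "1-3".
    $low = min((int)$reg[1], (int)$reg[2]);
    $high = max((int)$reg[1], (int)$reg[2]);

    // Cap the number of generated values, mirroring the "runaway brake".
    $high = min($high, $low + $limit - 1);

    return range($low, $high);
}
```

With this sketch, "1-3" and "3-1" both expand to 1, 2, 3, "-40--30" expands to the eleven values from -40 to -30, and a part such as "_TABLE:tt_content" is reported as not being a range.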

    /**
     * Compiling URLs from parameter array (output of expandParameters())
     * The number of URLs will be the product of the number of parameter values for each key
     *
     * @param array $paramArray Output of expandParameters(): Array with keys (GET var names) and for each an array of values
     * @param array $urls URLs accumulated in this array (for recursion)
     * @return array
     */
    public function compileUrls($paramArray, array $urls)
    {
        if (empty($paramArray)) {
            return $urls;
        }
        // shift first off stack:
        reset($paramArray);
        $varName = key($paramArray);
        $valueSet = array_shift($paramArray);

        // Traverse value set:
        $newUrls = [];
        foreach ($urls as $url) {
            foreach ($valueSet as $val) {
                $newUrls[] = $url . (strcmp($val, '') ? '&' . rawurlencode($varName) . '=' . rawurlencode($val) : '');

                if (count($newUrls) > $this->maximumUrlsToCompile) {
                    break;
                }
            }
        }
        return $this->compileUrls($paramArray, $newUrls);
    }
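
The "product of the number of parameter values" can be seen with a small stand-alone sketch. `compileUrlsSketch` is a hypothetical, iterative rewrite for illustration only; unlike compileUrls() it does not apply the `maximumUrlsToCompile` limit.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical iterative variant: appends every value of every parameter to
 * every URL collected so far, skipping empty values.
 */
function compileUrlsSketch(array $paramArray, array $urls): array
{
    foreach ($paramArray as $varName => $valueSet) {
        $newUrls = [];
        foreach ($urls as $url) {
            foreach ($valueSet as $val) {
                $val = (string)$val;
                // An empty value leaves the URL unchanged, as in compileUrls().
                $newUrls[] = $val === ''
                    ? $url
                    : $url . '&' . rawurlencode((string)$varName) . '=' . rawurlencode($val);
            }
        }
        $urls = $newUrls;
    }

    return $urls;
}
```

Starting from one URL, two values for `L` and three values for another parameter would give 2 × 3 = 6 URLs.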

    /************************************
     *
     * Crawler log
     *
     ************************************/

    /**
     * Return array of records from crawler queue for input page ID
     *
     * @param integer $id Page ID for which to look up log entries.
     * @param string $filter Filter: "all" => all entries, "pending" => all that have not yet run, "finished" => all complete ones
     * @param boolean $doFlush If TRUE, the selected entries are DELETED(!) instead of returned!
     * @param boolean $doFullFlush If TRUE (together with $doFlush), the whole queue is flushed, not only the entries of this page
     * @param integer $itemsPerPage Limit the number of entries per page; default is 10
     * @return array
     */
    public function getLogEntriesForPageId($id, $filter = '', $doFlush = false, $doFullFlush = false, $itemsPerPage = 10)
    {
        $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
        $queryBuilder
            ->select('*')
            ->from($this->tableName)
            ->where(
                $queryBuilder->expr()->eq('page_id', $queryBuilder->createNamedParameter($id, \PDO::PARAM_INT))
            )
            ->orderBy('scheduled', 'DESC');

        $expressionBuilder = GeneralUtility::makeInstance(ConnectionPool::class)
            ->getConnectionForTable($this->tableName)
            ->getExpressionBuilder();
        $query = $expressionBuilder->andX();
        // PHPStorm adds the highlight that the $addWhere is immediately overwritten,
        // but the $query = $expressionBuilder->andX() ensures that the $addWhere is written correctly with AND
        // between the statements, it's not a mistake in the code.
        $addWhere = '';
Unused Code
$addWhere is not used, you could remove the assignment.

This check looks for variable assignments that are either overwritten by other assignments or where the variable is not used subsequently.

$myVar = 'Value';
$higher = false;

if (rand(1, 6) > 3) {
    $higher = true;
} else {
    $higher = false;
}

Both the $myVar assignment in line 1 and the $higher assignment in line 2 are dead. The first because $myVar is never used and the second because $higher is always overwritten on every possible execution path.

        switch ($filter) {
            case 'pending':
                $queryBuilder->andWhere($queryBuilder->expr()->eq('exec_time', 0));
                $addWhere = ' AND ' . $query->add($expressionBuilder->eq('exec_time', 0));
                break;
            case 'finished':
                $queryBuilder->andWhere($queryBuilder->expr()->gt('exec_time', 0));
                $addWhere = ' AND ' . $query->add($expressionBuilder->gt('exec_time', 0));
                break;
        }

        // FIXME: Write unit test that ensures that the right records are deleted.
        if ($doFlush) {
            $addWhere = $query->add($expressionBuilder->eq('page_id', intval($id)));
            $this->flushQueue($doFullFlush ? '1=1' : $addWhere);
            return [];
        } else {
            if ($itemsPerPage > 0) {
                $queryBuilder
                    ->setMaxResults((int)$itemsPerPage);
            }

            return $queryBuilder->execute()->fetchAll();
        }
    }

    /**
     * Return array of records from crawler queue for input set ID
     *
     * @param integer $set_id Set ID for which to look up log entries.
     * @param string $filter Filter: "all" => all entries, "pending" => all that have not yet run, "finished" => all complete ones
     * @param boolean $doFlush If TRUE, the selected entries are DELETED(!) instead of returned!
     * @param boolean $doFullFlush If TRUE (together with $doFlush), the whole queue is flushed, not only the entries of this set
     * @param integer $itemsPerPage Limit the number of entries per page; default is 10
     * @return array
     */
    public function getLogEntriesForSetId($set_id, $filter = '', $doFlush = false, $doFullFlush = false, $itemsPerPage = 10)
    {
        $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
        $queryBuilder
            ->select('*')
            ->from($this->tableName)
            ->where(
                $queryBuilder->expr()->eq('set_id', $queryBuilder->createNamedParameter($set_id, \PDO::PARAM_INT))
            )
            ->orderBy('scheduled', 'DESC');

        $expressionBuilder = GeneralUtility::makeInstance(ConnectionPool::class)
            ->getConnectionForTable($this->tableName)
            ->getExpressionBuilder();
        $query = $expressionBuilder->andX();
        // FIXME: Write Unit tests for Filters
        // PHPStorm adds the highlight that the $addWhere is immediately overwritten,
        // but the $query = $expressionBuilder->andX() ensures that the $addWhere is written correctly with AND
        // between the statements, it's not a mistake in the code.
        $addWhere = '';
        switch ($filter) {
            case 'pending':
                $queryBuilder->andWhere($queryBuilder->expr()->eq('exec_time', 0));
                $addWhere = $query->add($expressionBuilder->eq('exec_time', 0));
                break;
            case 'finished':
                $queryBuilder->andWhere($queryBuilder->expr()->gt('exec_time', 0));
                $addWhere = $query->add($expressionBuilder->gt('exec_time', 0));
                break;
        }
        // FIXME: Write unit test that ensures that the right records are deleted.
        if ($doFlush) {
            $addWhere = $query->add($expressionBuilder->eq('set_id', intval($set_id)));
            $this->flushQueue($doFullFlush ? '' : $addWhere);
            return [];
        } else {
            if ($itemsPerPage > 0) {
                $queryBuilder
                    ->setMaxResults((int)$itemsPerPage);
            }

            return $queryBuilder->execute()->fetchAll();
        }
    }

    /**
     * Removes queue entries
     *
     * @param string $where SQL related filter for the entries which should be removed
     * @return void
     */
    protected function flushQueue($where = '')
    {
        $realWhere = strlen($where) > 0 ? $where : '1=1';

        $queryBuilder = $this->getQueryBuilder($this->tableName);

        if (EventDispatcher::getInstance()->hasObserver('queueEntryFlush')) {
            $groups = $queryBuilder
                ->select('DISTINCT set_id')
                ->from($this->tableName)
                ->where($realWhere)
                ->execute()
                ->fetchAll();
            if (is_array($groups)) {
                foreach ($groups as $group) {
                    $subSet = $queryBuilder
                        ->select('uid', 'set_id')
                        ->from($this->tableName)
                        ->where(
                            $realWhere,
                            $queryBuilder->expr()->eq('set_id', $group['set_id'])
                        )
                        ->execute()
                        ->fetchAll();
                    EventDispatcher::getInstance()->post('queueEntryFlush', $group['set_id'], $subSet);
                }
            }
        }

        $queryBuilder
            ->delete($this->tableName)
            ->where($realWhere)
            ->execute();
    }

    /**
     * Adding call back entries to log (typically called from hooks, see indexed search class "class.crawler.php")
     *
     * @param integer $setId Set ID
     * @param array $params Parameters to pass to call back function
     * @param string $callBack Call back object reference, e.g. 'EXT:indexed_search/class.crawler.php:&tx_indexedsearch_crawler'
     * @param integer $page_id Page ID to attach it to
     * @param integer $schedule Time at which to activate
     * @return void
     */
    public function addQueueEntry_callBack($setId, $params, $callBack, $page_id = 0, $schedule = 0)
    {
        if (!is_array($params)) {
            $params = [];
        }
        $params['_CALLBACKOBJ'] = $callBack;

        GeneralUtility::makeInstance(ConnectionPool::class)->getConnectionForTable('tx_crawler_queue')
            ->insert(
                'tx_crawler_queue',
                [
                    'page_id' => intval($page_id),
                    'parameters' => serialize($params),
                    'scheduled' => intval($schedule) ? intval($schedule) : $this->getCurrentTime(),
                    'exec_time' => 0,
                    'set_id' => intval($setId),
                    'result_data' => '',
                ]
            );
    }

    /************************************
     *
     * URL setting
     *
     ************************************/

    /**
     * Setting a URL for crawling:
     *
     * @param integer $id Page ID
     * @param string $url Complete URL
     * @param array $subCfg Sub configuration array (from TS config)
     * @param integer $tstamp Scheduled-time
     * @param string $configurationHash (optional) configuration hash
     * @param bool $skipInnerDuplicationCheck (optional) skip inner duplication check
     * @return bool
     */
    public function addUrl(
        $id,
        $url,
        array $subCfg,
        $tstamp,
        $configurationHash = '',
        $skipInnerDuplicationCheck = false
    ) {
        $urlAdded = false;
        $rows = [];

        // Creating parameters:
        $parameters = [
            'url' => $url
        ];

        // fe user group simulation:
        $uGs = implode(',', array_unique(GeneralUtility::intExplode(',', $subCfg['userGroups'], true)));
        if ($uGs) {
            $parameters['feUserGroupList'] = $uGs;
        }

        // Setting processing instructions
        $parameters['procInstructions'] = GeneralUtility::trimExplode(',', $subCfg['procInstrFilter']);
        if (is_array($subCfg['procInstrParams.'])) {
            $parameters['procInstrParams'] = $subCfg['procInstrParams.'];
        }

        // Possible TypoScript Template Parents
        $parameters['rootTemplatePid'] = $subCfg['rootTemplatePid'];

        // Compile value array:
        $parameters_serialized = serialize($parameters);
        $fieldArray = [
            'page_id' => intval($id),
            'parameters' => $parameters_serialized,
            'parameters_hash' => GeneralUtility::shortMD5($parameters_serialized),
            'configuration_hash' => $configurationHash,
            'scheduled' => $tstamp,
            'exec_time' => 0,
            'set_id' => intval($this->setID),
            'result_data' => '',
            'configuration' => $subCfg['key'],
        ];

        if ($this->registerQueueEntriesInternallyOnly) {
            // the entries will only be registered and not stored to the database
            $this->queueEntries[] = $fieldArray;
        } else {
            if (!$skipInnerDuplicationCheck) {
                // check if there is already an equal entry
                $rows = $this->getDuplicateRowsIfExist($tstamp, $fieldArray);
            }

            if (empty($rows)) {
                $connectionForCrawlerQueue = GeneralUtility::makeInstance(ConnectionPool::class)->getConnectionForTable('tx_crawler_queue');
                $connectionForCrawlerQueue->insert(
                    'tx_crawler_queue',
                    $fieldArray
                );
                $uid = $connectionForCrawlerQueue->lastInsertId('tx_crawler_queue', 'qid');
                $rows[] = $uid;
                $urlAdded = true;
                EventDispatcher::getInstance()->post('urlAddedToQueue', $this->setID, ['uid' => $uid, 'fieldArray' => $fieldArray]);
            } else {
                EventDispatcher::getInstance()->post('duplicateUrlInQueue', $this->setID, ['rows' => $rows, 'fieldArray' => $fieldArray]);
            }
        }

        return $urlAdded;
    }
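
addUrl() identifies a queue entry by its page ID and a hash of the serialized parameters, which the duplicate check relies on. The following sketch shows that idea in plain PHP. `buildQueueIdentity` is a hypothetical helper, and `substr(md5(...), 0, 10)` is an assumption about what `GeneralUtility::shortMD5()` returns with its default length.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper mirroring how addUrl() derives the fields that identify
 * a queue entry. Only a subset of the real parameters is included.
 */
function buildQueueIdentity(int $pageId, string $url, array $procInstructions): array
{
    $parameters = [
        'url' => $url,
        'procInstructions' => $procInstructions,
    ];
    $serialized = serialize($parameters);

    return [
        'page_id' => $pageId,
        'parameters' => $serialized,
        // Assumption: equivalent to GeneralUtility::shortMD5($serialized).
        'parameters_hash' => substr(md5($serialized), 0, 10),
    ];
}
```

Two calls with the same page ID, URL, and processing instructions yield the same `parameters_hash`, which is what lets a later lookup treat them as the same entry.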

    /**
     * This method determines duplicates for a queue entry with the same parameters and this timestamp.
     * If the timestamp is in the past, it will check if there is any unprocessed queue entry in the past.
     * If the timestamp is in the future, it will check if the queued entry has exactly the same timestamp.
     *
     * @param int $tstamp
     * @param array $fieldArray
     *
     * @return array
     *
     * TODO: Write Functional Tests
     */
    protected function getDuplicateRowsIfExist($tstamp, $fieldArray)
    {
        $rows = [];

        $currentTime = $this->getCurrentTime();

        $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
        $queryBuilder
            ->select('qid')
            ->from('tx_crawler_queue');
        // if this entry is scheduled with "now"
        if ($tstamp <= $currentTime) {
            if ($this->extensionSettings['enableTimeslot']) {
                $timeBegin = $currentTime - 100;
                $timeEnd = $currentTime + 100;
                $queryBuilder
                    ->where(
                        'scheduled BETWEEN ' . $timeBegin . ' AND ' . $timeEnd
                    )
                    ->orWhere(
                        $queryBuilder->expr()->lte('scheduled', $currentTime)
                    );
            } else {
                $queryBuilder
                    ->where(
                        $queryBuilder->expr()->lte('scheduled', $currentTime)
                    );
            }
        } elseif ($tstamp > $currentTime) {
            // entries with a timestamp in the future need to have the same schedule time
            $queryBuilder
                ->where(
                    $queryBuilder->expr()->eq('scheduled', $tstamp)
                );
        }

        // Only unprocessed entries (no exec_time, no process_id) count as duplicates
        $statement = $queryBuilder
            ->andWhere('NOT exec_time')
            ->andWhere('NOT process_id')
            ->andWhere($queryBuilder->expr()->eq('page_id', $queryBuilder->createNamedParameter($fieldArray['page_id'], \PDO::PARAM_INT)))
            ->andWhere($queryBuilder->expr()->eq('parameters_hash', $queryBuilder->createNamedParameter($fieldArray['parameters_hash'], \PDO::PARAM_STR)))
            ->execute();

        while ($row = $statement->fetch()) {
            $rows[] = $row['qid'];
        }

        return $rows;
    }
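
The scheduling rule described in the docblock of getDuplicateRowsIfExist() can be expressed as a pure function, which may make it easier to cover with the functional tests the TODO asks for. This is a simplified sketch: `isScheduleDuplicate` is a hypothetical name, the `enableTimeslot` window is ignored, and the page ID and parameter hash comparison are left out.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical in-memory version of the scheduling rule: decides whether an
 * already queued entry scheduled at $existing is a duplicate candidate for a
 * new entry scheduled at $tstamp, given the current time $now.
 */
function isScheduleDuplicate(int $tstamp, int $existing, int $now): bool
{
    if ($tstamp <= $now) {
        // "Now" or in the past: any entry that is already due matches.
        return $existing <= $now;
    }
    // In the future: only an entry with exactly the same schedule time matches.
    return $existing === $tstamp;
}
```

For example, with a current time of 200, a new entry scheduled at 100 matches an existing one scheduled at 50, while a new entry scheduled at 300 matches only an existing one scheduled at exactly 300.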

    /**
     * Returns the current system time
     *
     * @return int
     */
    public function getCurrentTime()
    {
        return time();
    }

    /************************************
     *
     * URL reading
     *
     ************************************/

    /**
     * Read URL for single queue entry
     *
     * @param integer $queueId
     * @param boolean $force If set, will process even if exec_time has been set!
     * @return integer
     */
    public function readUrl($queueId, $force = false)
    {
        $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
        $ret = 0;
        $this->logger->debug('crawler-readurl start ' . microtime(true));
        // Get entry:
        $queryBuilder
            ->select('*')
            ->from('tx_crawler_queue')
            ->where(
                $queryBuilder->expr()->eq('qid', $queryBuilder->createNamedParameter($queueId, \PDO::PARAM_INT))
            );
        if (!$force) {
            $queryBuilder
                ->andWhere('exec_time = 0')
                ->andWhere('process_scheduled > 0');
        }
        $queueRec = $queryBuilder->execute()->fetch();

        if (!is_array($queueRec)) {
            return;
        }

        $parameters = unserialize($queueRec['parameters']);
        if ($parameters['rootTemplatePid']) {
            $this->initTSFE((int)$parameters['rootTemplatePid']);
        } else {
            $this->logger->warning(
                'Page with (' . $queueRec['page_id'] . ') could not be crawled, please check your crawler configuration. Perhaps no Root Template Pid is set'
            );
        }

        SignalSlotUtility::emitSignal(
            __CLASS__,
            SignalSlotUtility::SIGNNAL_QUEUEITEM_PREPROCESS,
            [$queueId, &$queueRec]
        );

        // Set exec_time to lock record:
        $field_array = ['exec_time' => $this->getCurrentTime()];

        if (isset($this->processID)) {
            // if multiprocessing is used we need to store the id of the process which has handled this entry
            $field_array['process_id_completed'] = $this->processID;
        }

        GeneralUtility::makeInstance(ConnectionPool::class)->getConnectionForTable('tx_crawler_queue')
            ->update(
                'tx_crawler_queue',
                $field_array,
                [ 'qid' => (int)$queueId ]
            );

        $result = $this->readUrl_exec($queueRec);
        $resultData = unserialize($result['content']);

        // at the moment there is no need to point to specific pollable extensions
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pollSuccess'])) {
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pollSuccess'] as $pollable) {
                // only check the success value if the instruction is running
                // it is important to name the pollSuccess key same as the procInstructions key
                if (is_array($resultData['parameters']['procInstructions']) && in_array(
                    $pollable,
                    $resultData['parameters']['procInstructions']
                )
1364
                ) {
1365
                    if (!empty($resultData['success'][$pollable]) && $resultData['success'][$pollable]) {
1366
                        $ret |= self::CLI_STATUS_POLLABLE_PROCESSED;
1367
                    }
1368
                }
1369
            }
1370
        }
1371
1372
        // Set result in log which also denotes the end of the processing of this entry.
1373
        $field_array = ['result_data' => serialize($result)];
1374
1375
        SignalSlotUtility::emitSignal(
1376
            __CLASS__,
1377
            SignalSlotUtility::SIGNNAL_QUEUEITEM_POSTPROCESS,
1378
            [$queueId, &$field_array]
1379
        );
1380
1381
        GeneralUtility::makeInstance(ConnectionPool::class)->getConnectionForTable('tx_crawler_queue')
1382
            ->update(
1383
                'tx_crawler_queue',
1384
                $field_array,
1385
                [ 'qid' => (int)$queueId ]
1386
            );
1387
1388
        $this->logger->debug('crawler-readurl stop ' . microtime(true));
1389
        return $ret;
1390
    }

    /**
     * Read URL for not-yet-inserted log-entry
     *
     * @param array $field_array Queue field array
     *
     * @return array|bool|string Result as returned by readUrl_exec()
     */
    public function readUrlFromArray($field_array)
    {
        // Set exec_time to lock record:
        $field_array['exec_time'] = $this->getCurrentTime();
        $connectionForCrawlerQueue = GeneralUtility::makeInstance(ConnectionPool::class)->getConnectionForTable('tx_crawler_queue');
        $connectionForCrawlerQueue->insert(
            'tx_crawler_queue',
            $field_array
        );
        $queueId = $field_array['qid'] = $connectionForCrawlerQueue->lastInsertId('tx_crawler_queue', 'qid');

        $result = $this->readUrl_exec($field_array);

        // Set result in log which also denotes the end of the processing of this entry.
        $field_array = ['result_data' => serialize($result)];

        SignalSlotUtility::emitSignal(
            __CLASS__,
            SignalSlotUtility::SIGNNAL_QUEUEITEM_POSTPROCESS,
            [$queueId, &$field_array]
        );

        $connectionForCrawlerQueue->update(
            'tx_crawler_queue',
            $field_array,
            ['qid' => $queueId]
        );

        return $result;
    }

    /**
     * Read URL for a queue record
     *
     * @param array $queueRec Queue record
     * @return array|bool|string Result array, false if the request failed, or the string 'ERROR' if the parameters could not be decoded
     */
    public function readUrl_exec($queueRec)
    {
        // Decode parameters:
        $parameters = unserialize($queueRec['parameters']);
        $result = 'ERROR';
        if (is_array($parameters)) {
            if ($parameters['_CALLBACKOBJ']) { // Calling object:
                $objRef = $parameters['_CALLBACKOBJ'];
                $callBackObj = GeneralUtility::makeInstance($objRef);
                if (is_object($callBackObj)) {
                    unset($parameters['_CALLBACKOBJ']);
                    $result = ['content' => serialize($callBackObj->crawler_execute($parameters, $this))];
                } else {
                    $result = ['content' => 'No object: ' . $objRef];
                }
            } else { // Regular FE request:

                // Prepare:
                $crawlerId = $queueRec['qid'] . ':' . md5($queueRec['qid'] . '|' . $queueRec['set_id'] . '|' . $GLOBALS['TYPO3_CONF_VARS']['SYS']['encryptionKey']);

                // Get result:
                $result = $this->requestUrl($parameters['url'], $crawlerId);

                EventDispatcher::getInstance()->post('urlCrawled', $queueRec['set_id'], ['url' => $parameters['url'], 'result' => $result]);
            }
        }

        return $result;
    }
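
    // Illustrative note (not part of the original class): for a regular
    // frontend request, the crawler ID built in readUrl_exec() has the form
    // "<qid>:<md5 hash>". A sketch with made-up values, assuming qid = 42,
    // set_id = 7 and an encryption key of 'secret':
    //
    //     $crawlerId = 42 . ':' . md5(42 . '|' . 7 . '|' . 'secret');
    //
    // buildRequestHeaderArray() sends this value as the "X-T3crawler" header,
    // and fe_init() recomputes the same hash from the queue record to check
    // that the request originates from this system.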

    /**
     * Gets the content of a URL.
     *
     * @param string $originalUrl URL to read
     * @param string $crawlerId Crawler ID string (qid + hash to verify)
     * @param integer $timeout Connection timeout in seconds
     * @param integer $recursion Recursion limiter for 302 redirects
     * @return array|boolean
     */
    public function requestUrl($originalUrl, $crawlerId, $timeout = 2, $recursion = 10)
    {
        if (!$recursion) {
            return false;
        }

        // Parse URL, checking for scheme:
        $url = parse_url($originalUrl);

        if ($url === false) {
            // Log the original string: $url is false at this point.
            $this->logger->debug(
                sprintf('Could not parse_url() for string "%s"', $originalUrl),
                ['crawlerId' => $crawlerId]
            );
            return false;
        }

        if (!in_array($url['scheme'], ['','http','https'])) {
            // Log the original string: $url is an array at this point.
            $this->logger->debug(
                sprintf('Scheme does not match for url "%s"', $originalUrl),
                ['crawlerId' => $crawlerId]
            );
            return false;
        }

        // direct request
        if ($this->extensionSettings['makeDirectRequests']) {
            $result = $this->sendDirectRequest($originalUrl, $crawlerId);
            return $result;
        }

        $reqHeaders = $this->buildRequestHeaderArray($url, $crawlerId);

        // thanks to Pierrick Caillon for adding proxy support
        $rurl = $url;

        if ($GLOBALS['TYPO3_CONF_VARS']['SYS']['curlUse'] && $GLOBALS['TYPO3_CONF_VARS']['SYS']['curlProxyServer']) {
            $rurl = parse_url($GLOBALS['TYPO3_CONF_VARS']['SYS']['curlProxyServer']);
            $url['path'] = $url['scheme'] . '://' . $url['host'] . ($url['port'] > 0 ? ':' . $url['port'] : '') . $url['path'];
            $reqHeaders = $this->buildRequestHeaderArray($url, $crawlerId);
        }

        $host = $rurl['host'];

        if ($url['scheme'] == 'https') {
            $host = 'ssl://' . $host;
            $port = ($rurl['port'] > 0) ? $rurl['port'] : 443;
        } else {
            $port = ($rurl['port'] > 0) ? $rurl['port'] : 80;
        }

        $startTime = microtime(true);
        $fp = fsockopen($host, $port, $errno, $errstr, $timeout);

        if (!$fp) {
            $this->logger->debug(
                sprintf('Error while opening "%s"', $originalUrl),
                ['crawlerId' => $crawlerId]
            );
            return false;
        } else {
            // Request message:
            $msg = implode("\r\n", $reqHeaders) . "\r\n\r\n";
            fputs($fp, $msg);

            // Read response:
            $d = $this->getHttpResponseFromStream($fp);
            fclose($fp);

            $time = microtime(true) - $startTime;
            $this->logger->info($originalUrl . ' ' . $time);

            // Implode content and headers:
            $result = [
                'request' => $msg,
                'headers' => implode('', $d['headers']),
                'content' => implode('', (array)$d['content'])
            ];

            if (($this->extensionSettings['follow30x']) && ($newUrl = $this->getRequestUrlFrom302Header($d['headers'], $url['user'], $url['pass']))) {
                // Follow the redirect once, with the same timeout and a decremented recursion limit.
                $newRequestUrl = $this->requestUrl($newUrl, $crawlerId, $timeout, --$recursion);

                if (is_array($newRequestUrl)) {
                    $result = array_merge(['parentRequest' => $result], $newRequestUrl);
                } else {
                    $this->logger->debug(
                        sprintf('Error while opening "%s"', $newUrl),
                        ['crawlerId' => $crawlerId]
                    );
                    return false;
                }
            }

            return $result;
        }
    }

    /**
     * Gets the base path of the website frontend.
     * (e.g. if you call http://mydomain.com/cms/index.php in
     * the browser the base path is "/cms/")
     *
     * @return string Base path of the website frontend
     */
    protected function getFrontendBasePath()
    {
        $frontendBasePath = '/';

        // Get the path from the extension settings:
        if (isset($this->extensionSettings['frontendBasePath']) && $this->extensionSettings['frontendBasePath']) {
            $frontendBasePath = $this->extensionSettings['frontendBasePath'];
        // If empty, try to use config.absRefPrefix:
        } elseif (isset($GLOBALS['TSFE']->absRefPrefix) && !empty($GLOBALS['TSFE']->absRefPrefix)) {
            $frontendBasePath = $GLOBALS['TSFE']->absRefPrefix;
        // If not in CLI mode the base path can be determined from $_SERVER environment:
        } elseif (!Environment::isCli()) {
            $frontendBasePath = GeneralUtility::getIndpEnv('TYPO3_SITE_PATH');
        }

        // Base path must be '/<pathSegments>/':
        if ($frontendBasePath !== '/') {
            $frontendBasePath = '/' . ltrim($frontendBasePath, '/');
            $frontendBasePath = rtrim($frontendBasePath, '/') . '/';
        }

        return $frontendBasePath;
    }

    /**
     * Executes a shell command and returns the outputted result.
     *
     * @param string $command Shell command to be executed
     * @return string Outputted result of the command execution
     */
    protected function executeShellCommand($command)
    {
        return shell_exec($command);
    }

    /**
     * Reads HTTP response from the given stream.
     *
     * @param  resource $streamPointer  Pointer to connection stream.
     * @return array                    Associative array with the following items:
     *                                  headers <array> Response headers sent by server.
     *                                  content <array> Content, with each line as an array item.
     */
    protected function getHttpResponseFromStream($streamPointer)
    {
        $response = ['headers' => [], 'content' => []];

        if (is_resource($streamPointer)) {
            // read headers (fgets() expects the length as an integer)
            while ($line = fgets($streamPointer, 2048)) {
                $line = trim($line);
                if ($line !== '') {
                    $response['headers'][] = $line;
                } else {
                    break;
                }
            }

            // read content
            while ($line = fgets($streamPointer, 2048)) {
                $response['content'][] = $line;
            }
        }

        return $response;
    }

    /**
     * Builds HTTP request headers.
     *
     * @param array $url
     * @param string $crawlerId
     *
     * @return array
     */
    protected function buildRequestHeaderArray(array $url, $crawlerId)
    {
        $reqHeaders = [];
        $reqHeaders[] = 'GET ' . $url['path'] . ($url['query'] ? '?' . $url['query'] : '') . ' HTTP/1.0';
        $reqHeaders[] = 'Host: ' . $url['host'];
        if (stristr($url['query'], 'ADMCMD_previewWS')) {
            $reqHeaders[] = 'Cookie: $Version="1"; be_typo_user="1"; $Path=/';
        }
        $reqHeaders[] = 'Connection: close';
        if ($url['user'] != '') {
            $reqHeaders[] = 'Authorization: Basic ' . base64_encode($url['user'] . ':' . $url['pass']);
        }
        $reqHeaders[] = 'X-T3crawler: ' . $crawlerId;
        $reqHeaders[] = 'User-Agent: TYPO3 crawler';
        return $reqHeaders;
    }
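
    // Illustrative example (hypothetical values, not part of the original
    // class): for parse_url('http://example.com/index.php?id=1') and a
    // crawler ID of '42:abc', buildRequestHeaderArray() would be expected
    // to return:
    //
    //     [
    //         'GET /index.php?id=1 HTTP/1.0',
    //         'Host: example.com',
    //         'Connection: close',
    //         'X-T3crawler: 42:abc',
    //         'User-Agent: TYPO3 crawler',
    //     ]
    //
    // A Cookie header is added only when the query contains ADMCMD_previewWS,
    // and an Authorization header only when the URL carries a user name.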

    /**
     * Checks whether the given HTTP headers contain a redirect location and builds the new crawler URL
     *
     * @param array $headers HTTP headers
     * @param string $user HTTP Auth. User
     * @param string $pass HTTP Auth. Password
     * @return bool|string New URL, or false if no redirect target was found
     */
    protected function getRequestUrlFrom302Header($headers, $user = '', $pass = '')
    {
        $header = [];
        if (!is_array($headers)) {
            return false;
        }
        if (!(stristr($headers[0], '301 Moved') || stristr($headers[0], '302 Found') || stristr($headers[0], '302 Moved'))) {
            return false;
        }

        foreach ($headers as $hl) {
            // Split into name and value only; the status line has no value part.
            $tmp = explode(': ', $hl, 2);
            $header[trim($tmp[0])] = trim($tmp[1] ?? '');
            if (trim($tmp[0]) == 'Location') {
                break;
            }
        }
        if (!array_key_exists('Location', $header)) {
            return false;
        }

        if ($user != '') {
            if (!($tmp = parse_url($header['Location']))) {
                return false;
            }
            $newUrl = $tmp['scheme'] . '://' . $user . ':' . $pass . '@' . $tmp['host'] . $tmp['path'];
            if ($tmp['query'] != '') {
                $newUrl .= '?' . $tmp['query'];
            }
        } else {
            $newUrl = $header['Location'];
        }
        return $newUrl;
    }
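
    // Illustrative example (hypothetical values, not part of the original
    // class): given the headers
    //
    //     ['HTTP/1.1 302 Found', 'Location: https://example.com/new?x=1']
    //
    // getRequestUrlFrom302Header() returns the Location value unchanged when
    // $user is empty. With $user = 'u' and $pass = 'p' it rebuilds the URL as
    //
    //     'https://u:p@example.com/new?x=1'
    //
    // The rebuilt URL is assembled from scheme, host, path and query only, so
    // a port or fragment in the Location header is not carried over.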

    /**************************
     *
     * tslib_fe hooks:
     *
     **************************/

    /**
     * Initialization hook (called after database connection)
     * Takes the "HTTP_X_T3CRAWLER" header, looks up the queue record and verifies that the session comes from the system (by comparing hashes)
     *
     * @param array $params Parameters from frontend
     * @param object $ref TSFE object (reference under PHP5)
     * @return void
     *
     * FIXME: Looks like this is not used; in commit 9910d3f40cce15f4e9b7bcd0488bf21f31d53ebc it was added as public.
     * FIXME: I think this can be removed. (TNM)
     */
    public function fe_init(&$params, $ref)
    {
        // Authenticate crawler request:
        if (isset($_SERVER['HTTP_X_T3CRAWLER'])) {
            $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
            list($queueId, $hash) = explode(':', $_SERVER['HTTP_X_T3CRAWLER']);

            $queueRec = $queryBuilder
                ->select('*')
                ->from('tx_crawler_queue')
                ->where(
                    $queryBuilder->expr()->eq('qid', $queryBuilder->createNamedParameter($queueId, \PDO::PARAM_INT))
                )
                ->execute()
                ->fetch();

            // If a crawler record was found and hash was matching, set it up:
            if (is_array($queueRec) && $hash === md5($queueRec['qid'] . '|' . $queueRec['set_id'] . '|' . $GLOBALS['TYPO3_CONF_VARS']['SYS']['encryptionKey'])) {
                $params['pObj']->applicationData['tx_crawler']['running'] = true;
                $params['pObj']->applicationData['tx_crawler']['parameters'] = unserialize($queueRec['parameters']);
                $params['pObj']->applicationData['tx_crawler']['log'] = [];
            } else {
                die('No crawler entry found!');
            }
        }
    }

    /*****************************
     *
     * Compiling URLs to crawl - tools
     *
     *****************************/

    /**
     * @param integer $id Root page id to start from.
     * @param integer $depth Depth of tree, 0 = only id-page, 1 = one sublevel, 99 = infinite
     * @param integer $scheduledTime Unix Time when the URL is timed to be visited when put in queue
     * @param integer $reqMinute Number of requests per minute (creates the interleave between requests)
     * @param boolean $submitCrawlUrls If set, submits the URLs to queue in database (real crawling)
     * @param boolean $downloadCrawlUrls If set (and $submitCrawlUrls is false), will fill $downloadUrls with entries
     * @param array $incomingProcInstructions Array of processing instructions
     * @param array $configurationSelection Array of configuration keys
     * @return string
     */
    public function getPageTreeAndUrls(
        $id,
        $depth,
        $scheduledTime,
        $reqMinute,
        $submitCrawlUrls,
        $downloadCrawlUrls,
        array $incomingProcInstructions,
        array $configurationSelection
    ) {
        $this->scheduledTime = $scheduledTime;
        $this->reqMinute = $reqMinute;
        $this->submitCrawlUrls = $submitCrawlUrls;
        $this->downloadCrawlUrls = $downloadCrawlUrls;
        $this->incomingProcInstructions = $incomingProcInstructions;
        $this->incomingConfigurationSelection = $configurationSelection;

        $this->duplicateTrack = [];
        $this->downloadUrls = [];

        // Drawing tree:
        /* @var PageTreeView $tree */
        $tree = GeneralUtility::makeInstance(PageTreeView::class);
        $perms_clause = $GLOBALS['BE_USER']->getPagePermsClause(1);
        $tree->init('AND ' . $perms_clause);

        $pageInfo = BackendUtility::readPageAccess($id, $perms_clause);
        if (is_array($pageInfo)) {
            // Set root row:
            $tree->tree[] = [
                'row' => $pageInfo,
                'HTML' => IconUtility::getIconForRecord('pages', $pageInfo)
            ];
        }

        // Get branch beneath:
        if ($depth) {
            $tree->getTree($id, $depth, '');
        }

        // Traverse page tree:
        $code = '';

        foreach ($tree->tree as $data) {
            $this->MP = false;

            // recognize mount points
            if ($data['row']['doktype'] == PageRepository::DOKTYPE_MOUNTPOINT) {
                $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
                $queryBuilder->getRestrictions()->removeAll()->add(GeneralUtility::makeInstance(DeletedRestriction::class));
                $mountpage = $queryBuilder
                    ->select('*')
                    ->from('pages')
                    ->where(
                        $queryBuilder->expr()->eq('uid', $queryBuilder->createNamedParameter($data['row']['uid'], \PDO::PARAM_INT))
                    )
                    ->execute()
                    ->fetchAll();
                $queryBuilder->getRestrictions()->reset();

                // fetch mounted pages
                $this->MP = $mountpage[0]['mount_pid'] . '-' . $data['row']['uid'];

                $mountTree = GeneralUtility::makeInstance(PageTreeView::class);
                $mountTree->init('AND ' . $perms_clause);
                $mountTree->getTree($mountpage[0]['mount_pid'], $depth);

                foreach ($mountTree->tree as $mountData) {
                    $code .= $this->drawURLs_addRowsForPage(
                        $mountData['row'],
                        $mountData['HTML'] . BackendUtility::getRecordTitle('pages', $mountData['row'], true)
                    );
                }

                // replace page when mount_pid_ol is enabled
                if ($mountpage[0]['mount_pid_ol']) {
                    $data['row']['uid'] = $mountpage[0]['mount_pid'];
                } else {
                    // if the mount_pid_ol is not set the MP must not be used for the mountpoint page
                    $this->MP = false;
                }
            }

            $code .= $this->drawURLs_addRowsForPage(
                $data['row'],
                $data['HTML'] . BackendUtility::getRecordTitle('pages', $data['row'], true)
            );
        }

        return $code;
    }

    /**
     * Expands exclude string
     *
     * @param string $excludeString Exclude string
     * @return array
     */
    public function expandExcludeString($excludeString)
    {
        // internal static caches;
        static $expandedExcludeStringCache;
        static $treeCache;

        if (empty($expandedExcludeStringCache[$excludeString])) {
            $pidList = [];

            if (!empty($excludeString)) {
                /** @var PageTreeView $tree */
                $tree = GeneralUtility::makeInstance(PageTreeView::class);
                $tree->init('AND ' . $this->backendUser->getPagePermsClause(1));

                $excludeParts = GeneralUtility::trimExplode(',', $excludeString);

                foreach ($excludeParts as $excludePart) {
                    list($pid, $depth) = GeneralUtility::trimExplode('+', $excludePart);

                    // default is "page only" = "depth=0"
                    if (empty($depth)) {
                        $depth = (stristr($excludePart, '+')) ? 99 : 0;
                    }

                    $pidList[] = $pid;

                    if ($depth > 0) {
                        if (empty($treeCache[$pid][$depth])) {
                            $tree->reset();
                            $tree->getTree($pid, $depth);
                            $treeCache[$pid][$depth] = $tree->tree;
                        }

                        foreach ($treeCache[$pid][$depth] as $data) {
                            $pidList[] = $data['row']['uid'];
                        }
                    }
                }
            }

            $expandedExcludeStringCache[$excludeString] = array_unique($pidList);
        }

        return $expandedExcludeStringCache[$excludeString];
    }
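
    // Illustrative example of the exclude string syntax handled by
    // expandExcludeString() (page IDs are made up, not part of the original
    // class):
    //
    //     '6,12+3,20+'
    //
    //   6     excludes page 6 only (depth 0)
    //   12+3  excludes page 12 and its subpages up to 3 levels deep
    //   20+   excludes page 20 and its whole subtree (depth 99)
    //
    // The result is the flat, de-duplicated list of page IDs, limited to the
    // subpages the backend user is allowed to see.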

    /**
     * Create the rows for display of the page tree
     * For each page a number of rows are shown displaying GET variable configuration
     *
     * @param array $pageRow Page row
     * @param string $pageTitleAndIcon Page icon and title for row
     * @return string HTML <tr> content (one or more)
     */
    public function drawURLs_addRowsForPage(array $pageRow, $pageTitleAndIcon)
    {
        $skipMessage = '';

        // Get list of configurations
        $configurations = $this->getUrlsForPageRow($pageRow, $skipMessage);

        if (!empty($this->incomingConfigurationSelection)) {
            // remove configuration that does not match the current selection
            foreach ($configurations as $confKey => $confArray) {
                if (!in_array($confKey, $this->incomingConfigurationSelection)) {
                    unset($configurations[$confKey]);
                }
            }
        }

        // Traverse parameter combinations:
        $c = 0;
        $content = '';
        if (!empty($configurations)) {
            foreach ($configurations as $confKey => $confArray) {
                // Title column:
                if (!$c) {
                    $titleClm = '<td rowspan="' . count($configurations) . '">' . $pageTitleAndIcon . '</td>';
                } else {
                    $titleClm = '';
                }

                if (!in_array($pageRow['uid'], $this->expandExcludeString($confArray['subCfg']['exclude']))) {
                    // URL list:
                    $urlList = $this->urlListFromUrlArray(
                        $confArray,
                        $pageRow,
                        $this->scheduledTime,
                        $this->reqMinute,
                        $this->submitCrawlUrls,
                        $this->downloadCrawlUrls,
                        $this->duplicateTrack,
                        $this->downloadUrls,
                        $this->incomingProcInstructions // if empty the urls won't be filtered by processing instructions
                    );

                    // Expanded parameters:
                    $paramExpanded = '';
                    $calcAccu = [];
                    $calcRes = 1;
                    foreach ($confArray['paramExpanded'] as $gVar => $gVal) {
                        $paramExpanded .= '
                            <tr>
                                <td class="bgColor4-20">' . htmlspecialchars('&' . $gVar . '=') . '<br/>' .
                                                '(' . count($gVal) . ')' .
                                                '</td>
                                <td class="bgColor4" nowrap="nowrap">' . nl2br(htmlspecialchars(implode(chr(10), $gVal))) . '</td>
                            </tr>
                        ';
                        $calcRes *= count($gVal);
                        $calcAccu[] = count($gVal);
                    }
                    $paramExpanded = '<table class="lrPadding c-list param-expanded">' . $paramExpanded . '</table>';
                    $paramExpanded .= 'Comb: ' . implode('*', $calcAccu) . '=' . $calcRes;

                    // Options
                    $optionValues = '';
                    if ($confArray['subCfg']['userGroups']) {
                        $optionValues .= 'User Groups: ' . $confArray['subCfg']['userGroups'] . '<br/>';
                    }
                    if ($confArray['subCfg']['procInstrFilter']) {
                        $optionValues .= 'ProcInstr: ' . $confArray['subCfg']['procInstrFilter'] . '<br/>';
                    }

                    // Compile row:
                    $content .= '
                        <tr class="bgColor' . ($c % 2 ? '-20' : '-10') . '">
                            ' . $titleClm . '
                            <td>' . htmlspecialchars($confKey) . '</td>
                            <td>' . nl2br(htmlspecialchars(rawurldecode(trim(str_replace('&', chr(10) . '&', GeneralUtility::implodeArrayForUrl('', $confArray['paramParsed'])))))) . '</td>
                            <td>' . $paramExpanded . '</td>
                            <td nowrap="nowrap">' . $urlList . '</td>
                            <td nowrap="nowrap">' . $optionValues . '</td>
                            <td nowrap="nowrap">' . DebugUtility::viewArray($confArray['subCfg']['procInstrParams.']) . '</td>
                        </tr>';
                } else {
                    $content .= '<tr class="bgColor' . ($c % 2 ? '-20' : '-10') . '">
                            ' . $titleClm . '
                            <td>' . htmlspecialchars($confKey) . '</td>
                            <td colspan="5"><em>No entries</em> (Page is excluded in this configuration)</td>
                        </tr>';
                }

                $c++;
            }
        } else {
            $message = !empty($skipMessage) ? ' (' . $skipMessage . ')' : '';

            // Compile row:
            $content .= '
                <tr class="bgColor-20" style="border-bottom: 1px solid black;">
                    <td>' . $pageTitleAndIcon . '</td>
                    <td colspan="6"><em>No entries</em>' . $message . '</td>
                </tr>';
        }

        return $content;
    }
2035
2036
    /*****************************
2037
     *
2038
     * CLI functions
2039
     *
2040
     *****************************/
2041
    /**
     * Running the functionality of the CLI (crawling URLs from queue)
     *
     * @param int $countInARun Maximum number of queue entries to process in this run
     * @param int $sleepTime Pause between two requests, in microseconds (passed to usleep())
     * @param int $sleepAfterFinish Pause after the run has finished, in seconds (passed to sleep())
     * @return int Bitmask of self::CLI_STATUS_* flags
     */
    public function CLI_run($countInARun, $sleepTime, $sleepAfterFinish)
    {
        $result = 0;
        $counter = 0;

        // First, run hooks:
        $this->CLI_runHooks();

        // Clean up the queue
        if (intval($this->extensionSettings['purgeQueueDays']) > 0) {
            $purgeDate = $this->getCurrentTime() - 24 * 60 * 60 * intval($this->extensionSettings['purgeQueueDays']);

            $queryBuilderDelete = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
            $del = $queryBuilderDelete
                ->delete($this->tableName)
                ->where(
                    'exec_time != 0 AND exec_time < ' . $purgeDate
                )->execute();

            if (false === $del) {
                $this->logger->info(
                    'Records could not be deleted.'
                );
            }
        }

        // Select entries:
        //TODO Shouldn't this reside within the transaction?
        $queryBuilderSelect = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
        $rows = $queryBuilderSelect
            ->select('qid', 'scheduled')
            ->from('tx_crawler_queue')
            ->where(
                $queryBuilderSelect->expr()->eq('exec_time', 0),
                $queryBuilderSelect->expr()->eq('process_scheduled', 0),
                $queryBuilderSelect->expr()->lte('scheduled', $this->getCurrentTime())
            )
            ->orderBy('scheduled')
            ->addOrderBy('qid')
            ->setMaxResults($countInARun)
            ->execute()
            ->fetchAll();

        if (!empty($rows)) {
            $quidList = [];

            foreach ($rows as $r) {
                $quidList[] = $r['qid'];
            }

            $processId = $this->CLI_buildProcessId();

            // Reserve queue entries for this process

            //$this->queryBuilder->getConnection()->executeQuery('BEGIN');
            //TODO make sure we're not taking assigned queue entries

            // Save the number of assigned queue entries to determine how many have been processed later
            $queryBuilderUpdate = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
            $numberOfAffectedRows = $queryBuilderUpdate
                ->update('tx_crawler_queue')
                ->where(
                    $queryBuilderUpdate->expr()->in('qid', $quidList)
                )
                ->set('process_scheduled', $this->getCurrentTime())
                ->set('process_id', $processId)
                ->execute();

            GeneralUtility::makeInstance(ConnectionPool::class)->getConnectionForTable('tx_crawler_process')
                ->update(
                    'tx_crawler_process',
                    [ 'assigned_items_count' => (int)$numberOfAffectedRows ],
                    [ 'process_id' => $processId ]
                );

            if ($numberOfAffectedRows == count($quidList)) {
                //$this->queryBuilder->getConnection()->executeQuery('COMMIT');
            } else {
                //$this->queryBuilder->getConnection()->executeQuery('ROLLBACK');
                $this->CLI_debug("Nothing processed due to multi-process collision (" . $this->CLI_buildProcessId() . ")");
                return ($result | self::CLI_STATUS_ABORTED);
            }

            foreach ($rows as $r) {
                $result |= $this->readUrl($r['qid']);

                $counter++;
                usleep(intval($sleepTime)); // Just to relax the system

                // If the CLI has been disabled since this run started, return from the function
                // without marking the process as ended.
                if ($this->getDisabled()) {
                    return ($result | self::CLI_STATUS_ABORTED);
                }

                if (!$this->CLI_checkIfProcessIsActive($this->CLI_buildProcessId())) {
                    $this->CLI_debug("conflict / timeout (" . $this->CLI_buildProcessId() . ")");

                    //TODO might need an additional return code
                    $result |= self::CLI_STATUS_ABORTED;
                    break; //possible timeout
                }
            }

            sleep(intval($sleepAfterFinish));

            $msg = 'Rows: ' . $counter;
            $this->CLI_debug($msg . " (" . $this->CLI_buildProcessId() . ")");
        } else {
            $this->CLI_debug("Nothing within queue which needs to be processed (" . $this->CLI_buildProcessId() . ")");
        }

        if ($counter > 0) {
            $result |= self::CLI_STATUS_PROCESSED;
        }

        return $result;
    }

    /**
     * Activate hooks
     *
     * @return void
     */
    public function CLI_runHooks()
    {
        foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['cli_hooks'] ?? [] as $objRef) {
            $hookObj = GeneralUtility::makeInstance($objRef);
            if (is_object($hookObj)) {
                $hookObj->crawler_init($this);
            }
        }
    }

    /**
     * Try to acquire a new process with the given id.
     * Also performs some auto-cleanup of orphaned processes.
     * @todo preemption might not be the most elegant way to clean up
     *
     * @param string $id identification string for the process
     * @return boolean
     */
    public function CLI_checkAndAcquireNewProcess($id)
    {
        $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
        $ret = true;

        $systemProcessId = getmypid();
        if ($systemProcessId < 1) {
            return false;
        }

        $processCount = 0;
        $orphanProcesses = [];

        //$this->queryBuilder->getConnection()->executeQuery('BEGIN');

        $statement = $queryBuilder
            ->select('process_id', 'ttl')
            ->from('tx_crawler_process')
            ->where(
                'active = 1 AND deleted = 0'
            )
            ->execute();

        $currentTime = $this->getCurrentTime();

        while ($row = $statement->fetch()) {
            if ($row['ttl'] < $currentTime) {
                $orphanProcesses[] = $row['process_id'];
            } else {
                $processCount++;
            }
        }

        // If fewer processes are active than the limit allows, add a new one
        if ($processCount < intval($this->extensionSettings['processLimit'])) {
            $this->CLI_debug("add process " . $this->CLI_buildProcessId() . " (" . ($processCount + 1) . "/" . intval($this->extensionSettings['processLimit']) . ")");

            GeneralUtility::makeInstance(ConnectionPool::class)->getConnectionForTable('tx_crawler_process')->insert(
                'tx_crawler_process',
                [
                    'process_id' => $id,
                    'active' => 1,
                    'ttl' => $currentTime + (int)$this->extensionSettings['processMaxRunTime'],
                    'system_process_id' => $systemProcessId
                ]
            );
        } else {
            $this->CLI_debug("Processlimit reached (" . ($processCount) . "/" . intval($this->extensionSettings['processLimit']) . ")");
            $ret = false;
        }

        $this->processRepository->deleteProcessesMarkedAsDeleted();
        $this->CLI_releaseProcesses($orphanProcesses, true); // maybe this should be somehow included into the current lock

        return $ret;
    }

    /**
     * Release a process and the required resources
     *
     * @param  mixed    $releaseIds   string with a single process-id or array with multiple process-ids
     * @param  boolean  $withinLock   whether the DB actions are executed within an existing lock
     * @return boolean
     */
    public function CLI_releaseProcesses($releaseIds, $withinLock = false)
    {
        $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);

        if (!is_array($releaseIds)) {
            $releaseIds = [$releaseIds];
        }

        if (empty($releaseIds)) {
            return false;   //nothing to release
        }

        if (!$withinLock) {
            //$this->queryBuilder->getConnection()->executeQuery('BEGIN');
        }

        // some kind of 2nd chance algo - this way you need at least 2 processes to have a real cleanup
        // this ensures that a single process can't mess up the entire process table

        // mark all processes as deleted which have no "waiting" queue entries and which are not active

        $queryBuilder
            ->update('tx_crawler_queue', 'q')
            ->where(
                'q.process_id IN(SELECT p.process_id FROM tx_crawler_process as p WHERE p.active = 0)'
            )
            ->set('q.process_scheduled', 0)
            ->set('q.process_id', '')
            ->execute();

        // FIXME: Not entirely sure that this is equivalent to the previous version
        $queryBuilder->resetQueryPart('set');

        $queryBuilder
            ->update('tx_crawler_process')
            ->where(
                $queryBuilder->expr()->eq('active', 0),
                'process_id IN(SELECT q.process_id FROM tx_crawler_queue as q WHERE q.exec_time = 0)'
            )
            ->set('system_process_id', 0)
            ->execute();
        // previous version for reference
        /*
        $GLOBALS['TYPO3_DB']->exec_UPDATEquery(
            'tx_crawler_process',
            'active=0 AND deleted=0
            AND NOT EXISTS (
                SELECT * FROM tx_crawler_queue
                WHERE tx_crawler_queue.process_id = tx_crawler_process.process_id
                AND tx_crawler_queue.exec_time = 0
            )',
            [
                'deleted' => '1',
                'system_process_id' => 0
            ]
        );*/
        // mark all requested processes as non-active
        $queryBuilder
            ->update('tx_crawler_process')
            ->where(
                'NOT EXISTS (
                SELECT * FROM tx_crawler_queue
                    WHERE tx_crawler_queue.process_id = tx_crawler_process.process_id
                    AND tx_crawler_queue.exec_time = 0
                )',
                $queryBuilder->expr()->in('process_id', $queryBuilder->createNamedParameter($releaseIds, Connection::PARAM_STR_ARRAY)),
                $queryBuilder->expr()->eq('deleted', 0)
            )
            ->set('active', 0)
            ->execute();
        $queryBuilder->resetQueryPart('set');
        $queryBuilder
            ->update('tx_crawler_queue')
            ->where(
                $queryBuilder->expr()->eq('exec_time', 0),
                $queryBuilder->expr()->in('process_id', $queryBuilder->createNamedParameter($releaseIds, Connection::PARAM_STR_ARRAY))
            )
            ->set('process_scheduled', 0)
            ->set('process_id', '')
            ->execute();

        if (!$withinLock) {
            //$this->queryBuilder->getConnection()->executeQuery('COMMIT');
        }

        return true;
    }

    /**
     * Check if there are still resources left for the process with the given id
     * Used to determine timeouts and to ensure a proper cleanup if there's a timeout
     *
     * @param  string  $pid  identification string for the process
     * @return boolean determines if the process is still active / has resources
     *
     * TODO: Please consider moving this to Domain Model for Process or in ProcessRepository
     */
    public function CLI_checkIfProcessIsActive($pid)
    {
        $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
        $ret = false;

        $statement = $queryBuilder
            ->from('tx_crawler_process')
            ->select('active')
            ->where(
                // process_id is a string (see CLI_buildProcessId()), so bind it as a string instead of casting it to int
                $queryBuilder->expr()->eq('process_id', $queryBuilder->createNamedParameter($pid))
            )
            ->orderBy('ttl')
            ->execute();

        if ($row = $statement->fetch()) {
            $ret = (int)$row['active'] === 1;
        }

        return $ret;
    }

    /**
     * Create a unique Id for the current process
     *
     * @return string  the ID
     */
    public function CLI_buildProcessId()
    {
        if (!$this->processID) {
            $this->processID = GeneralUtility::shortMD5($this->microtime(true));
        }
        return $this->processID;
    }

    /**
     * @param bool $get_as_float
     *
     * @return mixed
     */
    protected function microtime($get_as_float = false)
    {
        return microtime($get_as_float);
    }

    /**
     * Prints a message to the stdout (only if debug-mode is enabled)
     *
     * @param  string $msg  the message
     */
    public function CLI_debug($msg)
    {
        if (intval($this->extensionSettings['processDebug'])) {
            echo $msg . "\n";
            flush();
        }
    }

    /**
     * Get URL content by making a direct request to TYPO3.
     *
     * @param  string $url          Page URL
     * @param  int    $crawlerId    Crawler-ID
     * @return array
     */
    protected function sendDirectRequest($url, $crawlerId)
    {
        $parsedUrl = parse_url($url);
        if (!is_array($parsedUrl)) {
            return [];
        }

        $requestHeaders = $this->buildRequestHeaderArray($parsedUrl, $crawlerId);

        $cmd = escapeshellcmd($this->extensionSettings['phpPath']);
        $cmd .= ' ';
        $cmd .= escapeshellarg(ExtensionManagementUtility::extPath('crawler') . 'cli/bootstrap.php');
        $cmd .= ' ';
        $cmd .= escapeshellarg($this->getFrontendBasePath());
        $cmd .= ' ';
        $cmd .= escapeshellarg($url);
        $cmd .= ' ';
        $cmd .= escapeshellarg(base64_encode(serialize($requestHeaders)));

        $startTime = microtime(true);
        $content = $this->executeShellCommand($cmd);
        $this->logger->info($url . ' ' . (microtime(true) - $startTime));

        $result = [
            'request' => implode("\r\n", $requestHeaders) . "\r\n\r\n",
            'headers' => '',
            'content' => $content
        ];

        return $result;
    }

    /**
     * Cleans up entries that stayed for too long in the queue. These are:
     * - processed entries that are over 1.5 days in age
     * - scheduled entries that are over 7 days old
     *
     * @return void
     */
    public function cleanUpOldQueueEntries()
    {
        $processedAgeInSeconds = $this->extensionSettings['cleanUpProcessedAge'] * 86400; // 24*60*60 Seconds in 24 hours
        $scheduledAgeInSeconds = $this->extensionSettings['cleanUpScheduledAge'] * 86400;

        $now = time();
        $condition = '(exec_time<>0 AND exec_time<' . ($now - $processedAgeInSeconds) . ') OR scheduled<=' . ($now - $scheduledAgeInSeconds);
        $this->flushQueue($condition);
    }

    /**
     * Initializes a TypoScript Frontend necessary for using TypoScript and TypoLink functions
     *
     * @param int $pageId
     * @return void
     * @throws \TYPO3\CMS\Core\Error\Http\ServiceUnavailableException
     * @throws \TYPO3\CMS\Core\Http\ImmediateResponseException
     */
    protected function initTSFE(int $pageId): void
    {
        $GLOBALS['TSFE'] = GeneralUtility::makeInstance(
            TypoScriptFrontendController::class,
            null,
            $pageId,
            0
        );
        $GLOBALS['TSFE']->initFEuser();
        $GLOBALS['TSFE']->determineId();
        $GLOBALS['TSFE']->getConfigArray();
        $GLOBALS['TSFE']->settingLanguage();
        $GLOBALS['TSFE']->settingLocale();
        $GLOBALS['TSFE']->newCObj();
    }

    /**
     * Returns an MD5 hash generated from the serialized configuration array.
     * The keys "paramExpanded" and "URLs" are removed before hashing.
     *
     * @param array $configuration
     *
     * @return string
     */
    protected function getConfigurationHash(array $configuration)
    {
        unset($configuration['paramExpanded']);
        unset($configuration['URLs']);
        return md5(serialize($configuration));
    }

    /**
     * Build a URL from a Page and the Query String. If the page has a Site configuration, it can be built by using
     * the Site instance.
     *
     * @param int $pageId
     * @param string $queryString
     * @param string|null $alternativeBaseUrl
     * @return UriInterface
     * @throws \TYPO3\CMS\Core\Exception\SiteNotFoundException
     * @throws \TYPO3\CMS\Core\Routing\InvalidRouteArgumentsException
     */
    protected function getUrlFromPageAndQueryParameters(int $pageId, string $queryString, ?string $alternativeBaseUrl): UriInterface
    {
        $site = GeneralUtility::makeInstance(SiteMatcher::class)->matchByPageId((int)$pageId);
        if ($site instanceof Site) {
            // Static analysis note (Scrutinizer): the class TYPO3\CMS\Core\Site\Entity\Site could not be resolved;
            // check for a missing use statement or an unlisted composer dependency.
            $queryString = ltrim($queryString, '?&');
            $queryParts = [];
            parse_str($queryString, $queryParts);
            unset($queryParts['id']);
            // workaround as long as we don't have native language support in crawler configurations
            if (isset($queryParts['L'])) {
                $queryParts['_language'] = $queryParts['L'];
                unset($queryParts['L']);
                $siteLanguage = $site->getLanguageById((int)$queryParts['_language']);
                // Static analysis note (Scrutinizer): $siteLanguage is assigned here but never used.
            } else {
                $siteLanguage = $site->getDefaultLanguage();
                // Static analysis note (Scrutinizer): $siteLanguage is assigned here but never used.
            }
            $url = $site->getRouter()->generateUri($pageId, $queryParts);
            if (!empty($alternativeBaseUrl)) {
                $alternativeBaseUrl = new Uri($alternativeBaseUrl);
                $url = $url->withHost($alternativeBaseUrl->getHost());
                $url = $url->withScheme($alternativeBaseUrl->getScheme());
                $url = $url->withPort($alternativeBaseUrl->getPort());
            }
        } else {
            // Technically this is not possible with site handling, but kept for backwards-compatibility reasons
            // Once EXT:crawler is v10-only compatible, this should be removed completely
            $baseUrl = ($alternativeBaseUrl ?: GeneralUtility::getIndpEnv('TYPO3_SITE_URL'));
            $cacheHashCalculator = GeneralUtility::makeInstance(CacheHashCalculator::class);
            $queryString .= '&cHash=' . $cacheHashCalculator->generateForParameters($queryString);
            $url = rtrim($baseUrl, '/') . '/index.php' . $queryString;
            $url = new Uri($url);
        }
        return $url;
    }
}
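
Usage sketch (not part of CrawlerController): the query-string preparation at the start of getUrlFromPageAndQueryParameters() can be tried in isolation. The helper name normalizeQueryParts() is hypothetical. It only mirrors the ltrim(), parse_str(), unset($queryParts['id']) and "L" to "_language" steps of that method and does not call the TYPO3 router, so the resulting array is what would be handed to generateUri() under these assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of EXT:crawler. It reproduces the
 * query-string preparation done before the router builds the URI.
 */
function normalizeQueryParts(string $queryString): array
{
    // Drop a leading "?" or "&" so parse_str() sees plain key=value pairs
    $queryString = ltrim($queryString, '?&');
    $queryParts = [];
    parse_str($queryString, $queryParts);

    // The page id is passed to the router as a separate argument
    unset($queryParts['id']);

    // Rename the legacy "L" parameter to "_language"
    if (isset($queryParts['L'])) {
        $queryParts['_language'] = $queryParts['L'];
        unset($queryParts['L']);
    }

    return $queryParts;
}

var_export(normalizeQueryParts('&id=5&L=1&foo=bar'));
```

Values stay strings because parse_str() does not cast them; the original method casts the language id to int only when it looks up the site language.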
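
Sketch of how the integer returned by CLI_run() can be read. CLI_run() ORs self::CLI_STATUS_* flags into $result, so a caller tests individual bits with the & operator. The class constants are defined outside this excerpt; the numeric values and the helper describeStatus() below are placeholders chosen for illustration, not the values or API of EXT:crawler.

```php
<?php
declare(strict_types=1);

// Placeholder flag values (assumption); the real constants live on CrawlerController
const STATUS_PROCESSED = 2;
const STATUS_ABORTED = 4;

/**
 * Hypothetical helper that turns a CLI_run()-style bitmask into readable labels.
 */
function describeStatus(int $result): array
{
    $labels = [];
    if ($result & STATUS_PROCESSED) {
        $labels[] = 'processed';
    }
    if ($result & STATUS_ABORTED) {
        $labels[] = 'aborted';
    }

    // No bit set means the run did not touch any queue entry
    return $labels ?: ['nothing processed'];
}

// A run that handled some rows and then hit a timeout sets both bits
$result = 0;
$result |= STATUS_PROCESSED;
$result |= STATUS_ABORTED;

echo implode(', ', describeStatus($result)), PHP_EOL;
```

Because the flags are combined with |=, one return value can report both that rows were processed and that the run was aborted, which a plain enum-style return code could not express.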