Completed
Push — typo3v9 ( cd95b9...9e3523 )
by Tomas Norre
05:42
created

CrawlerController   F

Complexity

Total Complexity 218

Size/Duplication

Total Lines 2037
Duplicated Lines 0 %

Coupling/Cohesion

Components 1
Dependencies 6

Test Coverage

Coverage 49.47%

Importance

Changes 0
Metric Value
dl 0
loc 2037
ccs 421
cts 851
cp 0.4947
rs 0.8
c 0
b 0
f 0
wmc 218
lcom 1
cbo 6

42 Methods

Rating   Name   Duplication   Size   Complexity  
A setProcessFilename() 0 4 1
A getProcessFilename() 0 4 1
A setExtensionSettings() 0 4 1
A getAccessMode() 0 4 1
A setAccessMode() 0 4 1
A setDisabled() 0 10 3
A getDisabled() 0 4 1
A __construct() 0 24 3
B urlListFromUrlArray() 0 67 8
A CLI_debug() 0 7 2
A getBackendUser() 0 6 2
F checkIfPageShouldBeSkipped() 0 53 14
A getUrlsForPageRow() 0 14 2
A noUnprocessedQueueEntriesForPageWithConfigurationHashExist() 0 22 2
A drawURLs_PIfilter() 0 13 4
A getPageTSconfigForId() 0 21 4
D getUrlsForPageId() 0 92 16
B getConfigurationsForBranch() 0 39 8
A getQueryBuilder() 0 4 1
A hasGroupAccess() 0 12 4
F expandParameters() 0 133 28
B compileUrls() 0 23 6
B getLogEntriesForPageId() 0 44 6
B getLogEntriesForSetId() 0 44 6
A flushQueue() 0 34 5
A addQueueEntry_callBack() 0 20 3
B addUrl() 0 68 6
B getDuplicateRowsIfExist() 0 51 5
A getCurrentTime() 0 4 1
C readUrl() 0 83 10
A readUrlFromArray() 0 30 1
B getPageTreeAndUrls() 0 91 7
B expandExcludeString() 0 45 9
C drawURLs_addRowsForPage() 0 106 12
C CLI_run() 0 118 10
A CLI_runHooks() 0 9 3
B CLI_checkAndAcquireNewProcess() 0 56 5
B CLI_releaseProcesses() 0 87 5
A CLI_buildProcessId() 0 7 2
A cleanUpOldQueueEntries() 0 9 1
A getConfigurationHash() 0 6 1
B getUrlFromPageAndQueryParameters() 0 41 7

How to fix   Complexity   

Complex Class

Complex classes like CrawlerController often do many different things. To break such a class down, first identify a cohesive component within it. A common approach is to look for fields and methods that share a prefix or suffix; in CrawlerController, for example, methods such as CLI_run(), CLI_runHooks() and CLI_releaseProcesses() share the CLI_ prefix. You can also look at the cohesion graph to spot unconnected or weakly connected components.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

While breaking up the class, it is a good idea to analyze how other classes use CrawlerController, and based on these observations, apply Extract Interface, too.
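
As a rough illustration of Extract Class, the sketch below pulls the "disabled" flag handling (setDisabled(), getDisabled() and the $processFilename field in the listing) into its own small class. The class name CrawlerDisableFlag and its method names are hypothetical, and file_put_contents() stands in for GeneralUtility::writeFile() so the example runs without TYPO3; it is a sketch under those assumptions, not the project's actual design.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical extracted class: owns the flag file that CrawlerController
 * currently manages through $processFilename.
 */
final class CrawlerDisableFlag
{
    /** @var string */
    private $filename;

    public function __construct(string $filename)
    {
        $this->filename = $filename;
    }

    public function disable(): void
    {
        // Plain PHP stand-in (assumption) for GeneralUtility::writeFile($filename, '')
        file_put_contents($this->filename, '');
    }

    public function enable(): void
    {
        // Removing the flag file re-enables processing
        if (is_file($this->filename)) {
            unlink($this->filename);
        }
    }

    public function isDisabled(): bool
    {
        // The flag is simply "does the file exist"
        return is_file($this->filename);
    }
}

// Usage with a throwaway file in the system temp directory
$flag = new CrawlerDisableFlag(sys_get_temp_dir() . '/tx_crawler_example_' . uniqid() . '.proc');
$flag->disable();
var_dump($flag->isDisabled()); // bool(true)
$flag->enable();
var_dump($flag->isDisabled()); // bool(false)
```

CrawlerController could then hold one such object and delegate to it, which would take four methods and one field out of the class without changing its public behaviour.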

1
<?php
2
namespace AOE\Crawler\Controller;
3
4
/***************************************************************
5
 *  Copyright notice
6
 *
7
 *  (c) 2019 AOE GmbH <[email protected]>
8
 *
9
 *  All rights reserved
10
 *
11
 *  This script is part of the TYPO3 project. The TYPO3 project is
12
 *  free software; you can redistribute it and/or modify
13
 *  it under the terms of the GNU General Public License as published by
14
 *  the Free Software Foundation; either version 3 of the License, or
15
 *  (at your option) any later version.
16
 *
17
 *  The GNU General Public License can be found at
18
 *  http://www.gnu.org/copyleft/gpl.html.
19
 *
20
 *  This script is distributed in the hope that it will be useful,
21
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
22
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
23
 *  GNU General Public License for more details.
24
 *
25
 *  This copyright notice MUST APPEAR in all copies of the script!
26
 ***************************************************************/
27
28
use AOE\Crawler\Configuration\ExtensionConfigurationProvider;
29
use AOE\Crawler\Domain\Repository\ConfigurationRepository;
30
use AOE\Crawler\Domain\Repository\ProcessRepository;
31
use AOE\Crawler\Domain\Repository\QueueRepository;
32
use AOE\Crawler\Event\EventDispatcher;
33
use AOE\Crawler\QueueExecutor;
34
use AOE\Crawler\Utility\SignalSlotUtility;
35
use Psr\Http\Message\UriInterface;
36
use Psr\Log\LoggerAwareInterface;
37
use Psr\Log\LoggerAwareTrait;
38
use TYPO3\CMS\Backend\Tree\View\PageTreeView;
39
use TYPO3\CMS\Backend\Utility\BackendUtility;
40
use TYPO3\CMS\Core\Authentication\BackendUserAuthentication;
41
use TYPO3\CMS\Core\Core\Environment;
42
use TYPO3\CMS\Core\Database\Connection;
43
use TYPO3\CMS\Core\Database\ConnectionPool;
44
use TYPO3\CMS\Core\Database\Query\Restriction\DeletedRestriction;
45
use TYPO3\CMS\Core\Http\Uri;
46
use TYPO3\CMS\Core\Imaging\Icon;
47
use TYPO3\CMS\Core\Imaging\IconFactory;
48
use TYPO3\CMS\Core\Routing\SiteMatcher;
49
use TYPO3\CMS\Core\Site\Entity\Site;
50
use TYPO3\CMS\Core\Type\Bitmask\Permission;
51
use TYPO3\CMS\Core\TypoScript\Parser\TypoScriptParser;
52
use TYPO3\CMS\Core\Utility\DebugUtility;
53
use TYPO3\CMS\Core\Utility\GeneralUtility;
54
use TYPO3\CMS\Core\Utility\MathUtility;
55
use TYPO3\CMS\Extbase\Object\ObjectManager;
56
use TYPO3\CMS\Frontend\Page\CacheHashCalculator;
57
use TYPO3\CMS\Frontend\Page\PageRepository;
58
59
/**
60
 * Class CrawlerController
61
 *
62
 * @package AOE\Crawler\Controller
63
 */
64
class CrawlerController implements LoggerAwareInterface
65
{
66
    use LoggerAwareTrait;
67
68
    const CLI_STATUS_NOTHING_PROCCESSED = 0;
69
    const CLI_STATUS_REMAIN = 1; //queue not empty
70
    const CLI_STATUS_PROCESSED = 2; //(some) queue items were processed
71
    const CLI_STATUS_ABORTED = 4; //instance didn't finish
72
    const CLI_STATUS_POLLABLE_PROCESSED = 8;
73
74
    /**
75
     * @var integer
76
     */
77
    public $setID = 0;
78
79
    /**
80
     * @var string
81
     */
82
    public $processID = '';
83
84
    /**
85
     * @var array
86
     */
87
    public $duplicateTrack = [];
88
89
    /**
90
     * @var array
91
     */
92
    public $downloadUrls = [];
93
94
    /**
95
     * @var array
96
     */
97
    public $incomingProcInstructions = [];
98
99
    /**
100
     * @var array
101
     */
102
    public $incomingConfigurationSelection = [];
103
104
    /**
105
     * @var bool
106
     */
107
    public $registerQueueEntriesInternallyOnly = false;
108
109
    /**
110
     * @var array
111
     */
112
    public $queueEntries = [];
113
114
    /**
115
     * @var array
116
     */
117
    public $urlList = [];
118
119
    /**
120
     * @var array
121
     */
122
    public $extensionSettings = [];
123
124
    /**
125
     * Mount Point
126
     *
127
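     * @var string|bool false if no mount point is set, otherwise the MP value (split on "-" in getPageTSconfigForId())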
128
     */
129
    public $MP = false;
130
131
    /**
132
     * @var string
133
     */
134
    protected $processFilename;
135
136
    /**
137
     * Holds the internal access mode; can be 'gui', 'cli' or 'cli_im'
138
     *
139
     * @var string
140
     */
141
    protected $accessMode;
142
143
    /**
144
     * @var BackendUserAuthentication|null
145
     */
146
    private $backendUser = null;
147
148
    /**
149
     * @var integer
150
     */
151
    private $scheduledTime = 0;
152
153
    /**
154
     * @var integer
155
     */
156
    private $reqMinute = 0;
157
158
    /**
159
     * @var bool
160
     */
161
    private $submitCrawlUrls = false;
162
163
    /**
164
     * @var bool
165
     */
166
    private $downloadCrawlUrls = false;
167
168
    /**
169
     * @var QueueRepository
170
     */
171
    protected $queueRepository;
172
173
    /**
174
     * @var ProcessRepository
175
     */
176
    protected $processRepository;
177
178
    /**
179
     * @var ConfigurationRepository
180
     */
181
    protected $configurationRepository;
182
183
    /**
184
     * @var string
185
     */
186
    protected $tableName = 'tx_crawler_queue';
187
188
    /**
189
     * @var QueueExecutor
190
     */
191
    protected $queueExecutor;
192
193
    /**
194
     * @var int
195
     */
196
    protected $maximumUrlsToCompile = 10000;
197
198
    /**
199
     * @var IconFactory
200
     */
201
    protected $iconFactory;
202
203
    /**
204
     * Returns the access mode; can be 'gui', 'cli' or 'cli_im'
205
     *
206
     * @return string
207
     */
208 1
    public function getAccessMode()
209
    {
210 1
        return $this->accessMode;
211
    }
212
213
    /**
214
     * @param string $accessMode
215
     */
216 1
    public function setAccessMode($accessMode)
217
    {
218 1
        $this->accessMode = $accessMode;
219 1
    }
220
221
    /**
222
     * Set disabled status to prevent processes from being processed
223
     *
224
     * @param  bool $disabled (optional, defaults to true)
225
     * @return void
226
     */
227 3
    public function setDisabled($disabled = true)
228
    {
229 3
        if ($disabled) {
230 2
            GeneralUtility::writeFile($this->processFilename, '');
231
        } else {
232 1
            if (is_file($this->processFilename)) {
233 1
                unlink($this->processFilename);
234
            }
235
        }
236 3
    }
237
238
    /**
239
     * Get disable status
240
     *
241
     * @return bool true if disabled
242
     */
243 3
    public function getDisabled()
244
    {
245 3
        return is_file($this->processFilename);
246
    }
247
248
    /**
249
     * @param string $filenameWithPath
250
     *
251
     * @return void
252
     */
253 4
    public function setProcessFilename($filenameWithPath)
254
    {
255 4
        $this->processFilename = $filenameWithPath;
256 4
    }
257
258
    /**
259
     * @return string
260
     */
261 1
    public function getProcessFilename()
262
    {
263 1
        return $this->processFilename;
264
    }
265
266
    /************************************
267
     *
268
     * Getting URLs based on Page TSconfig
269
     *
270
     ************************************/
271
272 44
    public function __construct()
273
    {
274 44
        $objectManager = GeneralUtility::makeInstance(ObjectManager::class);
275 44
        $this->queueRepository = $objectManager->get(QueueRepository::class);
276 44
        $this->processRepository = $objectManager->get(ProcessRepository::class);
277 44
        $this->configurationRepository = $objectManager->get(ConfigurationRepository::class);
278 44
        $this->queueExecutor = $objectManager->get(QueueExecutor::class);
279 44
        $this->iconFactory = GeneralUtility::makeInstance(IconFactory::class);
280
281 44
        $this->processFilename = Environment::getVarPath() . '/locks/tx_crawler.proc';
282
283
        /** @var ExtensionConfigurationProvider $configurationProvider */
284 44
        $configurationProvider = GeneralUtility::makeInstance(ExtensionConfigurationProvider::class);
285 44
        $settings = $configurationProvider->getExtensionConfiguration();
286 44
        $this->extensionSettings = is_array($settings) ? $settings : [];
287
288
        // set defaults:
289 44
        if (MathUtility::convertToPositiveInteger($this->extensionSettings['countInARun']) == 0) {
290
            $this->extensionSettings['countInARun'] = 100;
291
        }
292
293 44
        $this->extensionSettings['processLimit'] = MathUtility::forceIntegerInRange($this->extensionSettings['processLimit'], 1, 99, 1);
294 44
        $this->maximumUrlsToCompile = MathUtility::forceIntegerInRange($this->extensionSettings['maxCompileUrls'], 1, 1000000000, 10000);
295 44
    }
296
    
297
    /**
298
     * @return BackendUserAuthentication
299
     */
300 1
    private function getBackendUser() {
301 1
        if ($this->backendUser === null) {
302 1
            $this->backendUser = $GLOBALS['BE_USER'];
303
        }
304 1
        return $this->backendUser;
305
    }
306
307
    /**
308
     * Sets the extension settings (unserialized counterpart of $TYPO3_CONF_VARS['EXT']['extConf']['crawler']).
309
     *
310
     * @param array $extensionSettings
311
     * @return void
312
     */
313 12
    public function setExtensionSettings(array $extensionSettings)
314
    {
315 12
        $this->extensionSettings = $extensionSettings;
316 12
    }
317
318
    /**
319
     * Check if the given page should be crawled
320
     *
321
     * @param array $pageRow
322
     * @return false|string false if the page should be crawled (not excluded), otherwise the skip message
323
     */
324 8
    public function checkIfPageShouldBeSkipped(array $pageRow)
325
    {
326 8
        $skipPage = false;
327 8
        $skipMessage = 'Skipped'; // message will be overwritten later
328
329
        // if page is hidden
330 8
        if (!$this->extensionSettings['crawlHiddenPages']) {
331 8
            if ($pageRow['hidden']) {
332 1
                $skipPage = true;
333 1
                $skipMessage = 'Because page is hidden';
334
            }
335
        }
336
337 8
        if (!$skipPage) {
338 7
            if (GeneralUtility::inList('3,4', $pageRow['doktype']) || $pageRow['doktype'] >= 199) {
339 3
                $skipPage = true;
340 3
                $skipMessage = 'Because doktype is not allowed';
341
            }
342
        }
343
344 8
        if (!$skipPage) {
345 4
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['excludeDoktype'] ?? [] as $key => $doktypeList) {
346 1
                if (GeneralUtility::inList($doktypeList, $pageRow['doktype'])) {
347 1
                    $skipPage = true;
348 1
                    $skipMessage = 'Doktype was excluded by "' . $key . '"';
349 1
                    break;
350
                }
351
            }
352
        }
353
354 8
        if (!$skipPage) {
355
            // veto hook
356 3
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pageVeto'] ?? [] as $key => $func) {
357
                $params = [
358
                    'pageRow' => $pageRow
359
                ];
360
                // expects "false" if page is ok and "true" or a skipMessage if this page should _not_ be crawled
361
                $veto = GeneralUtility::callUserFunction($func, $params, $this);
362
                if ($veto !== false) {
363
                    $skipPage = true;
364
                    if (is_string($veto)) {
365
                        $skipMessage = $veto;
366
                    } else {
367
                        $skipMessage = 'Veto from hook "' . htmlspecialchars($key) . '"';
368
                    }
369
                    // no need to execute other hooks if a previous one returned a veto
370
                    break;
371
                }
372
            }
373
        }
374
375 8
        return $skipPage ? $skipMessage : false;
376
    }
377
378
    /**
379
     * Wrapper method for getUrlsForPageId()
380
     * It returns an array of configurations and no urls!
381
     *
382
     * @param array $pageRow Page record with at least doktype and uid columns.
383
     * @param string $skipMessage
384
     * @return array
385
     * @see getUrlsForPageId()
386
     */
387 4
    public function getUrlsForPageRow(array $pageRow, &$skipMessage = '')
388
    {
389 4
        $message = $this->checkIfPageShouldBeSkipped($pageRow);
390
391 4
        if ($message === false) {
392 3
            $res = $this->getUrlsForPageId($pageRow['uid']);
393 3
            $skipMessage = '';
394
        } else {
395 1
            $skipMessage = $message;
396 1
            $res = [];
397
        }
398
399 4
        return $res;
400
    }
401
402
    /**
403
     * This method is used to count if there are ANY unprocessed queue entries
404
     * of a given page_id and the configuration which matches a given hash.
405
     * If there are none, we can skip an inner detail check
406
     *
407
     * @param  int $uid
408
     * @param  string $configurationHash
409
     * @return boolean
410
     */
411 5
    protected function noUnprocessedQueueEntriesForPageWithConfigurationHashExist($uid, $configurationHash)
412
    {
413 5
        $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
414 5
        $noUnprocessedQueueEntriesFound = true;
415
416
        $result = $queryBuilder
417 5
            ->count('*')
418 5
            ->from($this->tableName)
419 5
            ->where(
420 5
                $queryBuilder->expr()->eq('page_id', (int)$uid),
421 5
                $queryBuilder->expr()->eq('configuration_hash', $queryBuilder->createNamedParameter($configurationHash)),
422 5
                $queryBuilder->expr()->eq('exec_time', 0)
423
            )
424 5
            ->execute()
425 5
            ->fetchColumn();
426
427 5
        if ($result) {
428 3
            $noUnprocessedQueueEntriesFound = false;
429
        }
430
431 5
        return $noUnprocessedQueueEntriesFound;
432
    }
433
434
    /**
435
     * Creates a list of URLs from input array (and submits them to queue if asked for)
436
     * See Web > Info module script + "indexed_search"'s crawler hook-client using this!
437
     *
438
     * @param array $vv Information about URLs from pageRow to crawl.
439
     * @param array $pageRow Page row
440
     * @param integer $scheduledTime Unix time to schedule indexing to, typically time()
441
     * @param integer $reqMinute Number of requests per minute (creates the interleave between requests)
442
     * @param boolean $submitCrawlUrls If set, submits the URLs to queue
443
     * @param boolean $downloadCrawlUrls If set (and $submitCrawlUrls is false), fills $downloadUrls with entries
444
     * @param array $duplicateTrack Passed by reference; holds one key per URL to ensure duplicates are not crawled
445
     * @param array $downloadUrls Passed by reference; filled with URLs for download if the flag is set
446
     * @param array $incomingProcInstructions Array of processing instructions
447
     * @return string List of URLs (meant for display in backend module)
448
     *
449
     */
450 2
    public function urlListFromUrlArray(
451
        array $vv,
452
        array $pageRow,
453
        $scheduledTime,
454
        $reqMinute,
455
        $submitCrawlUrls,
456
        $downloadCrawlUrls,
457
        array &$duplicateTrack,
458
        array &$downloadUrls,
459
        array $incomingProcInstructions
460
    ) {
461 2
        if (!is_array($vv['URLs'])) {
462
            return 'ERROR - no URL generated';
463
        }
464 2
        $urlLog = [];
465 2
        $pageId = (int)$pageRow['uid'];
466 2
        $configurationHash = $this->getConfigurationHash($vv);
467 2
        $skipInnerCheck = $this->noUnprocessedQueueEntriesForPageWithConfigurationHashExist($pageId, $configurationHash);
468
469 2
        foreach ($vv['URLs'] as $urlQuery) {
470 2
            if (!$this->drawURLs_PIfilter($vv['subCfg']['procInstrFilter'], $incomingProcInstructions)) {
471
                continue;
472
            }
473 2
            $url = (string)$this->getUrlFromPageAndQueryParameters(
474 2
                $pageId,
475 2
                $urlQuery,
476 2
                $vv['subCfg']['baseUrl'] ?? null,
477 2
                $vv['subCfg']['force_ssl'] ?? 0
478
            );
479
480
            // Create key by which to determine unique-ness:
481 2
            $uKey = $url . '|' . $vv['subCfg']['userGroups'] . '|' . $vv['subCfg']['procInstrFilter'];
482
483 2
            if (isset($duplicateTrack[$uKey])) {
484
                // if the url key is registered, just display it and do not resubmit it
485
                $urlLog[] = '<em><span class="text-muted">' . htmlspecialchars($url) . '</span></em>';
486
            } else {
487
                // Scheduled time:
488 2
                $schTime = $scheduledTime + round(count($duplicateTrack) * (60 / $reqMinute));
489 2
                $schTime = floor($schTime / 60) * 60;
490 2
                $formattedDate = BackendUtility::datetime($schTime);
491 2
                $this->urlList[] = '[' . $formattedDate . '] ' . $url;
492 2
                $urlList = '[' . $formattedDate . '] ' . htmlspecialchars($url);
493
494
                // Submit for crawling!
495 2
                if ($submitCrawlUrls) {
496 2
                    $added = $this->addUrl(
497 2
                        $pageId,
498 2
                        $url,
499 2
                        $vv['subCfg'],
500 2
                        $scheduledTime,
501 2
                        $configurationHash,
502 2
                        $skipInnerCheck
503
                    );
504 2
                    if ($added === false) {
505 2
                        $urlList .= ' (URL already existed)';
506
                    }
507
                } elseif ($downloadCrawlUrls) {
508
                    $downloadUrls[$url] = $url;
509
                }
510 2
                $urlLog[] = $urlList;
511
            }
512 2
            $duplicateTrack[$uKey] = true;
513
        }
514
515 2
        return implode('<br>', $urlLog);
516
    }
517
518
    /**
519
     * Returns true if input processing instruction is among registered ones.
520
     *
521
     * @param string $piString PI to test
522
     * @param array $incomingProcInstructions Processing instructions
523
     * @return boolean
524
     */
525 5
    public function drawURLs_PIfilter($piString, array $incomingProcInstructions)
526
    {
527 5
        if (empty($incomingProcInstructions)) {
528 1
            return true;
529
        }
530
531 4
        foreach ($incomingProcInstructions as $pi) {
532 4
            if (GeneralUtility::inList($piString, $pi)) {
533 2
                return true;
534
            }
535
        }
536 2
        return false;
537
    }
538
539 3
    public function getPageTSconfigForId($id)
540
    {
541 3
        if (!$this->MP) {
542 3
            $pageTSconfig = BackendUtility::getPagesTSconfig($id);
543
        } else {
544
            [, $mountPointId] = explode('-', $this->MP);
The variable $mountPointId does not exist. Did you forget to declare it?

This check marks access to variables or properties that have not been declared yet. While PHP has no explicit notion of declaring a variable, accessing it before a value is assigned to it is most likely a bug.
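
For context, here is a minimal sketch of what the flagged line does, assuming $this->MP holds a string such as "12-34" (the exact format is an assumption, and mountPointIdFromMp() is a hypothetical helper, not part of the class). In PHP 7.1 and later the short list syntax with an empty first slot assigns $mountPointId from the second element, so the finding may reflect how the analyzer handles this syntax rather than a runtime problem; that reading is worth confirming against the analyzer's configuration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper mirroring the else-branch of getPageTSconfigForId():
 * returns the part after the first "-" of an MP value.
 * Assumes the input contains at least one "-".
 */
function mountPointIdFromMp(string $mp): string
{
    // Short list syntax (PHP 7.1+): the empty first slot skips index 0,
    // and $mountPointId is assigned from index 1.
    [, $mountPointId] = explode('-', $mp);

    return $mountPointId;
}

echo mountPointIdFromMp('12-34'), PHP_EOL; // prints 34
```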

545
            $pageTSconfig = BackendUtility::getPagesTSconfig($mountPointId);
546
        }
547
548
        // Call a hook to alter configuration
549 3
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['getPageTSconfigForId'])) {
550
            $params = [
551
                'pageId' => $id,
552
                'pageTSConfig' => &$pageTSconfig
553
            ];
554
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['getPageTSconfigForId'] as $userFunc) {
555
                GeneralUtility::callUserFunction($userFunc, $params, $this);
556
            }
557
        }
558 3
        return $pageTSconfig;
559
    }
560
561
    /**
562
     * This method returns an array of configurations.
563
     * And no urls!
564
     *
565
     * @param integer $id Page ID
There is no parameter named $id. Was it maybe removed?

This check looks for PHPDoc comments describing methods or function parameters that do not exist on the corresponding method or function.

Consider the following example. The parameter $italy is not defined by the method finale(...).

/**
 * @param array $germany
 * @param array $island
 * @param array $italy
 */
function finale($germany, $island) {
    return "2:1";
}

The most likely cause is that the parameter was removed, but the annotation was not.

566
     * @return array
567
     */
568 2
    public function getUrlsForPageId($pageId)
569
    {
570
        // Get page TSconfig for page ID
571 2
        $pageTSconfig = $this->getPageTSconfigForId($pageId);
572
573 2
        $res = [];
574
575
        // Fetch Crawler Configuration from pageTSconfig
576 2
        $crawlerCfg = $pageTSconfig['tx_crawler.']['crawlerCfg.']['paramSets.'] ?? [];
577 2
        foreach ($crawlerCfg as $key => $values) {
578 1
            if (!is_array($values)) {
579 1
                continue;
580
            }
581 1
            $key = str_replace('.', '', $key);
582
            // Sub configuration for a single configuration string:
583 1
            $subCfg = (array)$crawlerCfg[$key . '.'];
584 1
            $subCfg['key'] = $key;
585
586 1
            if (strcmp($subCfg['procInstrFilter'], '')) {
587 1
                $subCfg['procInstrFilter'] = implode(',', GeneralUtility::trimExplode(',', $subCfg['procInstrFilter']));
588
            }
589 1
            $pidOnlyList = implode(',', GeneralUtility::trimExplode(',', $subCfg['pidsOnly'], true));
590
591
            // process configuration if it is not page-specific or if the specific page is the current page:
592 1
            if (!strcmp($subCfg['pidsOnly'], '') || GeneralUtility::inList($pidOnlyList, $pageId)) {
593
594
                // Explode, process etc.:
595 1
                $res[$key] = [];
596 1
                $res[$key]['subCfg'] = $subCfg;
597 1
                $res[$key]['paramParsed'] = GeneralUtility::explodeUrl2Array($crawlerCfg[$key]);
598 1
                $res[$key]['paramExpanded'] = $this->expandParameters($res[$key]['paramParsed'], $pageId);
599 1
                $res[$key]['origin'] = 'pagets';
600
601
                // recognize MP value
602 1
                if (!$this->MP) {
603 1
                    $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $pageId]);
604
                } else {
605
                    $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $pageId . '&MP=' . $this->MP]);
606
                }
607
            }
608
        }
609
610
        // Get configuration from tx_crawler_configuration records up the rootline
611 2
        $crawlerConfigurations = $this->configurationRepository->getCrawlerConfigurationRecordsFromRootLine($pageId);
612 2
        foreach ($crawlerConfigurations as $configurationRecord) {
613
614
            // check access to the configuration record
615 1
            if (empty($configurationRecord['begroups']) || $this->getBackendUser()->isAdmin() || $this->hasGroupAccess($this->getBackendUser()->user['usergroup_cached_list'], $configurationRecord['begroups'])) {
616 1
                $pidOnlyList = implode(',', GeneralUtility::trimExplode(',', $configurationRecord['pidsonly'], true));
617
618
                // process configuration if it is not page-specific or if the specific page is the current page:
619 1
                if (!strcmp($configurationRecord['pidsonly'], '') || GeneralUtility::inList($pidOnlyList, $pageId)) {
620 1
                    $key = $configurationRecord['name'];
621
622
                    // don't overwrite previously defined paramSets
623 1
                    if (!isset($res[$key])) {
624
625
                        /* @var $TSparserObject \TYPO3\CMS\Core\TypoScript\Parser\TypoScriptParser */
626 1
                        $TSparserObject = GeneralUtility::makeInstance(TypoScriptParser::class);
627 1
                        $TSparserObject->parse($configurationRecord['processing_instruction_parameters_ts']);
628
629
                        $subCfg = [
630 1
                            'procInstrFilter' => $configurationRecord['processing_instruction_filter'],
631 1
                            'procInstrParams.' => $TSparserObject->setup,
632 1
                            'baseUrl' => $configurationRecord['base_url'],
633 1
                            'force_ssl' => (int)$configurationRecord['force_ssl'],
634 1
                            'userGroups' => $configurationRecord['fegroups'],
635 1
                            'exclude' => $configurationRecord['exclude'],
636 1
                            'key' => $key
637
                        ];
638
639 1
                        if (!in_array($pageId, $this->expandExcludeString($subCfg['exclude']))) {
640 1
                            $res[$key] = [];
641 1
                            $res[$key]['subCfg'] = $subCfg;
642 1
                            $res[$key]['paramParsed'] = GeneralUtility::explodeUrl2Array($configurationRecord['configuration']);
643 1
                            $res[$key]['paramExpanded'] = $this->expandParameters($res[$key]['paramParsed'], $pageId);
644 1
                            $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $pageId]);
645 1
                            $res[$key]['origin'] = 'tx_crawler_configuration_' . $configurationRecord['uid'];
646
                        }
647
                    }
648
                }
649
            }
650
        }
651
652 2
        foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['processUrls'] ?? [] as $func) {
653
            $params = [
654
                'res' => &$res,
655
            ];
656
            GeneralUtility::callUserFunction($func, $params, $this);
657
        }
658 2
        return $res;
659
    }
660
661
    /**
662
     * Find all configurations of subpages of a page
663
     *
664
     * @param int $rootid
665
     * @param $depth
666
     * @return array
667
     *
668
     * TODO: Write Functional Tests
669
     */
670 1
    public function getConfigurationsForBranch(int $rootid, $depth)
671
    {
672 1
        $configurationsForBranch = [];
673 1
        $pageTSconfig = $this->getPageTSconfigForId($rootid);
674 1
        $sets = $pageTSconfig['tx_crawler.']['crawlerCfg.']['paramSets.'] ?? [];
675 1
        foreach ($sets as $key => $value) {
676
            if (!is_array($value)) {
677
                continue;
678
            }
679
            $configurationsForBranch[] = substr($key, -1) == '.' ? substr($key, 0, -1) : $key;
680
        }
681 1
        $pids = [];
682 1
        $rootLine = BackendUtility::BEgetRootLine($rootid);
683 1
        foreach ($rootLine as $node) {
684 1
            $pids[] = $node['uid'];
685
        }
686
        /* @var PageTreeView $tree */
687 1
        $tree = GeneralUtility::makeInstance(PageTreeView::class);
688 1
        $perms_clause = $this->getBackendUser()->getPagePermsClause(Permission::PAGE_SHOW);
689 1
        $tree->init( empty($perms_clause) ? '' : ('AND ' . $perms_clause) );
690 1
        $tree->getTree($rootid, $depth, '');
691 1
        foreach ($tree->tree as $node) {
692
            $pids[] = $node['row']['uid'];
693
        }
694
695 1
        $queryBuilder = $this->getQueryBuilder('tx_crawler_configuration');
696
        $statement = $queryBuilder
697 1
            ->select('name')
698 1
            ->from('tx_crawler_configuration')
699 1
            ->where(
700 1
                $queryBuilder->expr()->in('pid', $queryBuilder->createNamedParameter($pids, Connection::PARAM_INT_ARRAY))
701
            )
702 1
            ->execute();
703
704 1
        while ($row = $statement->fetch()) {
705 1
            $configurationsForBranch[] = $row['name'];
706
        }
707 1
        return $configurationsForBranch;
708
    }
709
710
    /**
711
     * Get querybuilder for given table
712
     *
713
     * @param string $table
714
     * @return \TYPO3\CMS\Core\Database\Query\QueryBuilder
715
     */
716 17
    private function getQueryBuilder(string $table)
717
    {
718 17
        return GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($table);
719
    }
720
721
    /**
722
     * Check if a user has access to an item
723
     * (e.g. get the group list of the current logged in user from $GLOBALS['TSFE']->gr_list)
724
     *
725
     * @see \TYPO3\CMS\Frontend\Page\PageRepository::getMultipleGroupsWhereClause()
726
     * @param  string $groupList    Comma-separated list of (fe_)group UIDs from a user
727
     * @param  string $accessList   Comma-separated list of (fe_)group UIDs of the item to access
728
     * @return bool                 TRUE if at least one of the user's group UIDs is in the access list or the access list is empty
729
     */
730 3
    public function hasGroupAccess($groupList, $accessList)
731
    {
732 3
        if (empty($accessList)) {
733 1
            return true;
734
        }
735 2
        foreach (GeneralUtility::intExplode(',', $groupList) as $groupUid) {
736 2
            if (GeneralUtility::inList($accessList, $groupUid)) {
737 1
                return true;
738
            }
739
        }
740 1
        return false;
741
    }
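
The access rule documented for hasGroupAccess() can be tried outside TYPO3 with a minimal sketch. The function name `hasGroupAccessSketch` is hypothetical, and plain `explode()`/`in_array()` stand in for `GeneralUtility::intExplode()` and `GeneralUtility::inList()`, so this is an approximation of the method above rather than the extension's own code.

```php
<?php
// Hypothetical standalone helper (not part of the crawler extension).
// It approximates hasGroupAccess() with plain PHP functions.
function hasGroupAccessSketch(string $groupList, string $accessList): bool
{
    // An empty access list means the item is not restricted at all.
    if (empty($accessList)) {
        return true;
    }

    // Normalise both comma-separated lists before comparing them.
    $userGroups = array_map('intval', explode(',', $groupList));
    $allowedGroups = array_map('trim', explode(',', $accessList));

    foreach ($userGroups as $groupUid) {
        // One shared group UID is enough to grant access.
        if (in_array((string)$groupUid, $allowedGroups, true)) {
            return true;
        }
    }

    return false;
}

var_dump(hasGroupAccessSketch('1,2', '2,3'));
var_dump(hasGroupAccessSketch('1', '2,3'));
```

The first call shares group 2 with the access list, the second shares none.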
742
743
    /**
744
     * Will expand the parameters configuration to individual values. This follows a certain syntax of the value of each parameter.
745
     * Syntax of values:
746
     * - Basically: If the value is wrapped in [...] it will be expanded according to the following syntax, otherwise the value is taken literally
747
     * - Configuration is split by "|" and the parts are processed individually and finally added together
748
     * - For each configuration part:
749
     *         - "[int]-[int]" = Integer range, will be expanded to all values in between, values included, starting from low to high (max. 1000). Example "1-34" or "-40--30"
750
     *         - "_TABLE:[TCA table name];[_PID:[optional page id, default is current page]];[_ENABLELANG:1]" = Look up of table records from PID, filtering out deleted records. Example "_TABLE:tt_content; _PID:123"
751
     *        _ENABLELANG:1 picks only original records without their language overlays
752
     *         - Default: Literal value
753
     *
754
     * @param array $paramArray Array with key (GET var name) and values (value of GET var which is configuration for expansion)
755
     * @param integer $pid Current page ID
756
     * @return array
757
     *
758
     * TODO: Write Functional Tests
759
     */
760 9
    public function expandParameters($paramArray, $pid)
761
    {
762
        // Traverse parameter names:
763 9
        foreach ($paramArray as $p => $v) {
764 9
            $v = trim($v);
765
766
            // If value is encapsulated in square brackets it means there are some ranges of values to find, otherwise the value is literal
767 9
            if (substr($v, 0, 1) === '[' && substr($v, -1) === ']') {
768
                // So, find the value inside brackets and reset the paramArray value as an array.
769 9
                $v = substr($v, 1, -1);
770 9
                $paramArray[$p] = [];
771
772
                // Explode parts and traverse them:
773 9
                $parts = explode('|', $v);
774 9
                foreach ($parts as $pV) {
775
776
                    // Look for integer range: (e.g. 1-34 or -40--30 // reads minus 40 to minus 30)
777 9
                    if (preg_match('/^(-?[0-9]+)\s*-\s*(-?[0-9]+)$/', trim($pV), $reg)) {
778
779
                        // Swap if first is larger than last:
780 1
                        if ($reg[1] > $reg[2]) {
781
                            $temp = $reg[2];
782
                            $reg[2] = $reg[1];
783
                            $reg[1] = $temp;
784
                        }
785
786
                        // Traverse range, add values:
787 1
                        $runAwayBrake = 1000; // Limit to size of range!
788 1
                        for ($a = $reg[1]; $a <= $reg[2];$a++) {
789 1
                            $paramArray[$p][] = $a;
790 1
                            $runAwayBrake--;
791 1
                            if ($runAwayBrake <= 0) {
792
                                break;
793
                            }
794
                        }
795 8
                    } elseif (substr(trim($pV), 0, 7) == '_TABLE:') {
796
797
                        // Parse parameters:
798 6
                        $subparts = GeneralUtility::trimExplode(';', $pV);
799 6
                        $subpartParams = [];
800 6
                        foreach ($subparts as $spV) {
801 6
                            list($pKey, $pVal) = GeneralUtility::trimExplode(':', $spV);
802 6
                            $subpartParams[$pKey] = $pVal;
803
                        }
804
805
                        // Table exists:
806 6
                        if (isset($GLOBALS['TCA'][$subpartParams['_TABLE']])) {
807 6
                            $lookUpPid = isset($subpartParams['_PID']) ? intval($subpartParams['_PID']) : intval($pid);
808 6
                            $recursiveDepth = isset($subpartParams['_RECURSIVE']) ? intval($subpartParams['_RECURSIVE']) : 0;
809 6
                            $pidField = isset($subpartParams['_PIDFIELD']) ? trim($subpartParams['_PIDFIELD']) : 'pid';
810 6
                            $where = isset($subpartParams['_WHERE']) ? $subpartParams['_WHERE'] : '';
811 6
                            $addTable = isset($subpartParams['_ADDTABLE']) ? $subpartParams['_ADDTABLE'] : '';
812
813 6
                            $fieldName = !empty($subpartParams['_FIELD']) ? $subpartParams['_FIELD'] : 'uid';
814 6
                            if ($fieldName === 'uid' || $GLOBALS['TCA'][$subpartParams['_TABLE']]['columns'][$fieldName]) {
815 6
                                $queryBuilder = $this->getQueryBuilder($subpartParams['_TABLE']);
816
817 6
                                if ($recursiveDepth > 0) {
818
                                    /** @var \TYPO3\CMS\Core\Database\QueryGenerator $queryGenerator */
819 2
                                    $queryGenerator = GeneralUtility::makeInstance(\TYPO3\CMS\Core\Database\QueryGenerator::class);
820 2
                                    $pidList = $queryGenerator->getTreeList($lookUpPid, $recursiveDepth, 0, 1);
821 2
                                    $pidArray = GeneralUtility::intExplode(',', $pidList);
822
                                } else {
823 4
                                    $pidArray = [(string)$lookUpPid];
824
                                }
825
                                
826 6
                                $queryBuilder->getRestrictions()
827 6
                                    ->removeAll()
828 6
                                    ->add(GeneralUtility::makeInstance(DeletedRestriction::class));
829
830
                                $queryBuilder
831 6
                                    ->select($fieldName)
832 6
                                    ->from($subpartParams['_TABLE'])
833 6
                                    ->where(
834 6
                                        $queryBuilder->expr()->in($pidField, $queryBuilder->createNamedParameter($pidArray, Connection::PARAM_INT_ARRAY)),
835 6
                                        $where
836
                                    );
837 6
                                if (!empty($addTable)) {
838
                                    // TODO: Check if this works as intended!
839
                                    $queryBuilder->add('from', $addTable);
840
                                }
841 6
                                $transOrigPointerField = $GLOBALS['TCA'][$subpartParams['_TABLE']]['ctrl']['transOrigPointerField'];
842
843 6
                                if (!empty($subpartParams['_ENABLELANG']) && $transOrigPointerField) {
844
                                    $queryBuilder->andWhere(
845
                                        $queryBuilder->expr()->lte(
846
                                            $transOrigPointerField,
847
                                            0
848
                                        )
849
                                    );
850
                                }
851
852 6
                                $statement = $queryBuilder->execute();
853
854 6
                                $rows = [];
855 6
                                while ($row = $statement->fetch()) {
856 6
                                    $rows[$row[$fieldName]] = $row;
857
                                }
858
859 6
                                if (is_array($rows)) {
860 6
                                    $paramArray[$p] = array_merge($paramArray[$p], array_keys($rows));
861
                                }
862
                            }
863
                        }
864
                    } else { // Just add value:
865 2
                        $paramArray[$p][] = $pV;
866
                    }
867
                    // Hook for processing own expandParameters place holder
868 9
                    if (is_array($GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['crawler/class.tx_crawler_lib.php']['expandParameters'])) {
869
                        $_params = [
870
                            'pObj' => &$this,
871
                            'paramArray' => &$paramArray,
872
                            'currentKey' => $p,
873
                            'currentValue' => $pV,
874
                            'pid' => $pid
875
                        ];
876
                        foreach ($GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['crawler/class.tx_crawler_lib.php']['expandParameters'] as $key => $_funcRef) {
877
                            GeneralUtility::callUserFunction($_funcRef, $_params, $this);
878
                        }
879
                    }
880
                }
881
882
                // Make unique set of values and sort array by key:
883 9
                $paramArray[$p] = array_unique($paramArray[$p]);
884 9
                ksort($paramArray);
885
            } else {
886
                // Set the literal value as only value in array:
887 2
                $paramArray[$p] = [$v];
888
            }
889
        }
890
891 9
        return $paramArray;
892
    }
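
The value syntax described in the docblock of expandParameters() is easier to follow with a small standalone sketch. The helper `expandValueSketch` is hypothetical: it covers only literal values, the "|" split and integer ranges, and leaves out the `_TABLE:` lookup and the hook because both need a TYPO3 database and configuration. Unlike the method above, it also reindexes the result.

```php
<?php
// Hypothetical helper (not part of the crawler extension): expands a single
// parameter value using the literal and integer-range rules only.
function expandValueSketch(string $value): array
{
    $value = trim($value);

    // Values not wrapped in [...] are taken literally.
    if (substr($value, 0, 1) !== '[' || substr($value, -1) !== ']') {
        return [$value];
    }

    $result = [];
    foreach (explode('|', substr($value, 1, -1)) as $part) {
        if (preg_match('/^(-?[0-9]+)\s*-\s*(-?[0-9]+)$/', trim($part), $reg)) {
            // Order the bounds so the range always runs from low to high.
            $low = min((int)$reg[1], (int)$reg[2]);
            $high = max((int)$reg[1], (int)$reg[2]);
            // Cap each range at 1000 values, like the runaway brake above.
            $high = min($high, $low + 999);
            foreach (range($low, $high) as $number) {
                $result[] = $number;
            }
        } else {
            // Anything else is added as a literal value.
            $result[] = $part;
        }
    }

    // Drop duplicates and reindex (the original keeps the original keys).
    return array_values(array_unique($result));
}

var_export(expandValueSketch('[1-3|5|abc]'));
```

A part such as "5" has no dash and is therefore kept as a literal string, while "1-3" becomes the integers 1, 2 and 3.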
893
894
    /**
895
     * Compiling URLs from parameter array (output of expandParameters())
896
     * The number of URLs will be the multiplication of the number of parameter values for each key
897
     *
898
     * @param array $paramArray Output of expandParameters(): Array with keys (GET var names) and for each an array of values
899
     * @param array $urls URLs accumulated in this array (for recursion)
900
     * @return array
901
     */
902 5
    public function compileUrls($paramArray, array $urls)
903
    {
904 5
        if (empty($paramArray)) {
905 5
            return $urls;
906
        }
907
        // shift first off stack:
908 4
        reset($paramArray);
909 4
        $varName = key($paramArray);
910 4
        $valueSet = array_shift($paramArray);
911
912
        // Traverse value set:
913 4
        $newUrls = [];
914 4
        foreach ($urls as $url) {
915 3
            foreach ($valueSet as $val) {
916 3
                $newUrls[] = $url . (strcmp($val, '') ? '&' . rawurlencode($varName) . '=' . rawurlencode($val) : '');
917
918 3
                if (count($newUrls) > $this->maximumUrlsToCompile) {
919
                    break;
920
                }
921
            }
922
        }
923 4
        return $this->compileUrls($paramArray, $newUrls);
924
    }
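
The "multiplication" mentioned in the docblock of compileUrls() is a Cartesian product over the value sets. The sketch below restates the recursion as a free function. `compileUrlsSketch` and its `$limit` parameter are hypothetical; `$limit` stands in for `$this->maximumUrlsToCompile`, whose actual value is not visible in this excerpt.

```php
<?php
// Hypothetical standalone version of the recursion in compileUrls().
function compileUrlsSketch(array $paramArray, array $urls, int $limit = 10000): array
{
    // Nothing left to combine: the accumulated URLs are the result.
    if (empty($paramArray)) {
        return $urls;
    }

    // Take the first GET variable and its value set off the array.
    $varName = array_key_first($paramArray);
    $valueSet = array_shift($paramArray);

    $newUrls = [];
    foreach ($urls as $url) {
        foreach ($valueSet as $val) {
            // Empty values add nothing to the URL, as in the original.
            $suffix = (string)$val !== ''
                ? '&' . rawurlencode((string)$varName) . '=' . rawurlencode((string)$val)
                : '';
            $newUrls[] = $url . $suffix;

            // Mirrors the original: this only leaves the inner loop.
            if (count($newUrls) > $limit) {
                break;
            }
        }
    }

    return compileUrlsSketch($paramArray, $newUrls, $limit);
}

var_export(compileUrlsSketch(['L' => [0, 1], 'tx_news' => ['a b']], ['?id=5']));
```

Two values for `L` and one for `tx_news` give 2 x 1 = 2 URLs. Because the `break` leaves only the inner loop, the limit acts as a soft cap rather than an exact one.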
925
926
    /************************************
927
     *
928
     * Crawler log
929
     *
930
     ************************************/
931
932
    /**
933
     * Return array of records from crawler queue for input page ID
934
     *
935
     * @param integer $id Page ID for which to look up log entries.
936
     * @param string $filter Filter: "all" => all entries, "pending" => all that is not yet run, "finished" => all complete ones
937
     * @param boolean $doFlush If TRUE, the selected entries are DELETED(!) instead of returned!
938
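     * @param boolean $doFullFlush If TRUE (together with $doFlush), the whole queue is flushed instead of only the entries of this page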
939
     * @param integer $itemsPerPage Limit the number of entries per page; default is 10
940
     * @return array
941
     */
942 4
    public function getLogEntriesForPageId($id, $filter = '', $doFlush = false, $doFullFlush = false, $itemsPerPage = 10)
943
    {
944 4
        $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
945
        $queryBuilder
946 4
            ->select('*')
947 4
            ->from($this->tableName)
948 4
            ->where(
949 4
                $queryBuilder->expr()->eq('page_id', $queryBuilder->createNamedParameter($id, \PDO::PARAM_INT))
950
            )
951 4
            ->orderBy('scheduled', 'DESC');
952
953 4
        $expressionBuilder = GeneralUtility::makeInstance(ConnectionPool::class)
954 4
            ->getConnectionForTable($this->tableName)
955 4
            ->getExpressionBuilder();
956 4
        $query = $expressionBuilder->andX();
957
        // PHPStorm adds the highlight that the $addWhere is immediately overwritten,
958
        // but the $query = $expressionBuilder->andX() ensures that the $addWhere is written correctly with AND
959
        // between the statements, it's not a mistake in the code.
960 4
        $addWhere = '';
Unused Code
$addWhere is not used, you could remove the assignment.

This check looks for variable assignments that are either overwritten by other assignments or where the variable is not used subsequently.

$myVar = 'Value';
$higher = false;

if (rand(1, 6) > 3) {
    $higher = true;
} else {
    $higher = false;
}

Both the $myVar assignment in line 1 and the $higher assignment in line 2 are dead. The first because $myVar is never used and the second because $higher is always overwritten for every possible time line.
961 4
        switch ($filter) {
962 4
            case 'pending':
963
                $queryBuilder->andWhere($queryBuilder->expr()->eq('exec_time', 0));
964
                $addWhere = ' AND ' . $query->add($expressionBuilder->eq('exec_time', 0));
Unused Code
$addWhere is not used, you could remove the assignment.
965
                break;
966 4
            case 'finished':
967
                $queryBuilder->andWhere($queryBuilder->expr()->gt('exec_time', 0));
968
                $addWhere = ' AND ' . $query->add($expressionBuilder->gt('exec_time', 0));
Unused Code
$addWhere is not used, you could remove the assignment.
969
                break;
970
        }
971
972
        // FIXME: Write unit test that ensures that the right records are deleted.
973 4
        if ($doFlush) {
974 2
            $addWhere = $query->add($expressionBuilder->eq('page_id', (int)$id));
975 2
            $this->flushQueue($doFullFlush ? '1=1' : $addWhere);
976 2
            return [];
977
        } else {
978 2
            if ($itemsPerPage > 0) {
979
                $queryBuilder
980 2
                    ->setMaxResults((int)$itemsPerPage);
981
            }
982
983 2
            return $queryBuilder->execute()->fetchAll();
984
        }
985
    }
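
The `$filter` values accepted by getLogEntriesForPageId() map onto `exec_time` as follows. This is a sketch over an in-memory array. `filterLogEntriesSketch` is a hypothetical name, and the sample rows only contain the columns needed for the illustration.

```php
<?php
// Hypothetical helper: applies the "pending"/"finished" rule to plain arrays
// instead of building a database query.
function filterLogEntriesSketch(array $rows, string $filter): array
{
    switch ($filter) {
        case 'pending':
            // Not yet run: exec_time is still 0.
            $keep = static function (array $row): bool {
                return (int)$row['exec_time'] === 0;
            };
            break;
        case 'finished':
            // Already run: exec_time holds the execution timestamp.
            $keep = static function (array $row): bool {
                return (int)$row['exec_time'] > 0;
            };
            break;
        default:
            // Any other value (including "all" and "") keeps every entry.
            $keep = static function (array $row): bool {
                return true;
            };
    }

    return array_values(array_filter($rows, $keep));
}

$rows = [
    ['qid' => 1, 'exec_time' => 0],
    ['qid' => 2, 'exec_time' => 1580000000],
];
var_export(filterLogEntriesSketch($rows, 'pending'));
```

As in the method above, any filter value other than "pending" or "finished" leaves the selection unrestricted.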
986
987
    /**
988
     * Return array of records from crawler queue for input set ID
989
     *
990
     * @param integer $set_id Set ID for which to look up log entries.
991
     * @param string $filter Filter: "all" => all entries, "pending" => all that is not yet run, "finished" => all complete ones
992
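     * @param boolean $doFlush If TRUE, the selected entries are DELETED(!) instead of returned!
     * @param boolean $doFullFlush If TRUE (together with $doFlush), the whole queue is flushed instead of only the entries of this set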
993
     * @param integer $itemsPerPage Limit the number of entries per page; default is 10
994
     * @return array
995
     */
996 6
    public function getLogEntriesForSetId($set_id, $filter = '', $doFlush = false, $doFullFlush = false, $itemsPerPage = 10)
997
    {
998 6
        $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
999
        $queryBuilder
1000 6
            ->select('*')
1001 6
            ->from($this->tableName)
1002 6
            ->where(
1003 6
                $queryBuilder->expr()->eq('set_id', $queryBuilder->createNamedParameter($set_id, \PDO::PARAM_INT))
1004
            )
1005 6
            ->orderBy('scheduled', 'DESC');
1006
1007 6
        $expressionBuilder = GeneralUtility::makeInstance(ConnectionPool::class)
1008 6
            ->getConnectionForTable($this->tableName)
1009 6
            ->getExpressionBuilder();
1010 6
        $query = $expressionBuilder->andX();
1011
        // FIXME: Write Unit tests for Filters
1012
        // PHPStorm adds the highlight that the $addWhere is immediately overwritten,
1013
        // but the $query = $expressionBuilder->andX() ensures that the $addWhere is written correctly with AND
1014
        // between the statements, it's not a mistake in the code.
1015 6
        $addWhere = '';
Unused Code
$addWhere is not used, you could remove the assignment.
1016 6
        switch ($filter) {
1017 6
            case 'pending':
1018 1
                $queryBuilder->andWhere($queryBuilder->expr()->eq('exec_time', 0));
1019 1
                $addWhere = $query->add($expressionBuilder->eq('exec_time', 0));
Unused Code
$addWhere is not used, you could remove the assignment.
1020 1
                break;
1021 5
            case 'finished':
1022 1
                $queryBuilder->andWhere($queryBuilder->expr()->gt('exec_time', 0));
1023 1
                $addWhere = $query->add($expressionBuilder->gt('exec_time', 0));
Unused Code
$addWhere is not used, you could remove the assignment.
1024 1
                break;
1025
        }
1026
        // FIXME: Write unit test that ensures that the right records are deleted.
1027 6
        if ($doFlush) {
1028 4
            $addWhere = $query->add($expressionBuilder->eq('set_id', (int)$set_id));
1029 4
            $this->flushQueue($doFullFlush ? '' : $addWhere);
1030 4
            return [];
1031
        } else {
1032 2
            if ($itemsPerPage > 0) {
1033
                $queryBuilder
1034 2
                    ->setMaxResults((int)$itemsPerPage);
1035
            }
1036
1037 2
            return $queryBuilder->execute()->fetchAll();
1038
        }
1039
    }
1040
1041
    /**
1042
     * Removes queue entries
1043
     *
1044
     * @param string $where SQL related filter for the entries which should be removed
1045
     * @return void
1046
     */
1047 10
    protected function flushQueue($where = '')
1048
    {
1049 10
        $realWhere = strlen($where) > 0 ? $where : '1=1';
1050
1051 10
        $queryBuilder = $this->getQueryBuilder($this->tableName);
1052
1053 10
        if (EventDispatcher::getInstance()->hasObserver('queueEntryFlush')) {
1054
            $groups = $queryBuilder
1055
                ->select('DISTINCT set_id')
1056
                ->from($this->tableName)
1057
                ->where($realWhere)
1058
                ->execute()
1059
                ->fetchAll();
1060
            if (is_array($groups)) {
1061
                foreach ($groups as $group) {
1062
                    $subSet = $queryBuilder
1063
                        ->select('uid', 'set_id')
1064
                        ->from($this->tableName)
1065
                        ->where(
1066
                            $realWhere,
1067
                            $queryBuilder->expr()->eq('set_id', $group['set_id'])
1068
                        )
1069
                        ->execute()
1070
                        ->fetchAll();
1071
                    EventDispatcher::getInstance()->post('queueEntryFlush', $group['set_id'], $subSet);
1072
                }
1073
            }
1074
        }
1075
1076
        $queryBuilder
1077 10
            ->delete($this->tableName)
1078 10
            ->where($realWhere)
1079 10
            ->execute();
1080 10
    }
1081
1082
    /**
1083
     * Adding call back entries to log (typically called from hooks, see indexed search class "class.crawler.php")
1084
     *
1085
     * @param integer $setId Set ID
1086
     * @param array $params Parameters to pass to call back function
1087
     * @param string $callBack Call back object reference, eg. 'EXT:indexed_search/class.crawler.php:&tx_indexedsearch_crawler'
1088
     * @param integer $page_id Page ID to attach it to
1089
     * @param integer $schedule Time at which to activate
1090
     * @return void
1091
     */
1092
    public function addQueueEntry_callBack($setId, $params, $callBack, $page_id = 0, $schedule = 0)
1093
    {
1094
        if (!is_array($params)) {
1095
            $params = [];
1096
        }
1097
        $params['_CALLBACKOBJ'] = $callBack;
1098
1099
        GeneralUtility::makeInstance(ConnectionPool::class)->getConnectionForTable('tx_crawler_queue')
1100
            ->insert(
1101
                'tx_crawler_queue',
1102
                [
1103
                    'page_id' => (int)$page_id,
1104
                    'parameters' => serialize($params),
1105
                    'scheduled' => (int)$schedule ?: $this->getCurrentTime(),
1106
                    'exec_time' => 0,
1107
                    'set_id' => (int)$setId,
1108
                    'result_data' => '',
1109
                ]
1110
            );
1111
    }
1112
1113
    /************************************
1114
     *
1115
     * URL setting
1116
     *
1117
     ************************************/
1118
1119
    /**
1120
     * Setting a URL for crawling:
1121
     *
1122
     * @param integer $id Page ID
1123
     * @param string $url Complete URL
1124
     * @param array $subCfg Sub configuration array (from TS config)
1125
     * @param integer $tstamp Scheduled-time
1126
     * @param string $configurationHash (optional) configuration hash
1127
     * @param bool $skipInnerDuplicationCheck (optional) skip inner duplication check
1128
     * @return bool
1129
     */
1130 6
    public function addUrl(
1131
        $id,
1132
        $url,
1133
        array $subCfg,
1134
        $tstamp,
1135
        $configurationHash = '',
1136
        $skipInnerDuplicationCheck = false
1137
    ) {
1138 6
        $urlAdded = false;
1139 6
        $rows = [];
1140
1141
        // Creating parameters:
1142
        $parameters = [
1143 6
            'url' => $url
1144
        ];
1145
1146
        // fe user group simulation:
1147 6
        $uGs = implode(',', array_unique(GeneralUtility::intExplode(',', $subCfg['userGroups'], true)));
1148 6
        if ($uGs) {
1149 1
            $parameters['feUserGroupList'] = $uGs;
1150
        }
1151
1152
        // Setting processing instructions
1153 6
        $parameters['procInstructions'] = GeneralUtility::trimExplode(',', $subCfg['procInstrFilter']);
1154 6
        if (is_array($subCfg['procInstrParams.'])) {
1155 3
            $parameters['procInstrParams'] = $subCfg['procInstrParams.'];
1156
        }
1157
1158
        // Compile value array:
1159 6
        $parameters_serialized = serialize($parameters);
1160
        $fieldArray = [
1161 6
            'page_id' => (int)$id,
1162 6
            'parameters' => $parameters_serialized,
1163 6
            'parameters_hash' => GeneralUtility::shortMD5($parameters_serialized),
1164 6
            'configuration_hash' => $configurationHash,
1165 6
            'scheduled' => $tstamp,
1166 6
            'exec_time' => 0,
1167 6
            'set_id' => (int)$this->setID,
1168 6
            'result_data' => '',
1169 6
            'configuration' => $subCfg['key'],
1170
        ];
1171
1172 6
        if ($this->registerQueueEntriesInternallyOnly) {
1173
            //the entries will only be registered and not stored to the database
1174 1
            $this->queueEntries[] = $fieldArray;
1175
        } else {
1176 5
            if (!$skipInnerDuplicationCheck) {
1177
                // check if there is already an equal entry
1178 4
                $rows = $this->getDuplicateRowsIfExist($tstamp, $fieldArray);
1179
            }
1180
1181 5
            if (empty($rows)) {
1182 4
                $connectionForCrawlerQueue = GeneralUtility::makeInstance(ConnectionPool::class)->getConnectionForTable('tx_crawler_queue');
1183 4
                $connectionForCrawlerQueue->insert(
1184 4
                    'tx_crawler_queue',
1185 4
                    $fieldArray
1186
                );
1187 4
                $uid = $connectionForCrawlerQueue->lastInsertId('tx_crawler_queue', 'qid');
1188 4
                $rows[] = $uid;
1189 4
                $urlAdded = true;
1190 4
                EventDispatcher::getInstance()->post('urlAddedToQueue', $this->setID, ['uid' => $uid, 'fieldArray' => $fieldArray]);
1191
            } else {
1192 1
                EventDispatcher::getInstance()->post('duplicateUrlInQueue', $this->setID, ['rows' => $rows, 'fieldArray' => $fieldArray]);
1193
            }
1194
        }
1195
1196 6
        return $urlAdded;
1197
    }
1198
1199
    /**
1200
     * This method determines duplicates for a queue entry with the same parameters and this timestamp.
1201
     * If the timestamp is in the past, it will check if there is any unprocessed queue entry in the past.
1202
     * If the timestamp is in the future, it will check whether the queued entry has exactly the same timestamp.
1203
     *
1204
     * @param int $tstamp
1205
     * @param array $fieldArray
1206
     *
1207
     * @return array
1208
     */
1209 7
    protected function getDuplicateRowsIfExist($tstamp, $fieldArray)
1210
    {
1211 7
        $rows = [];
1212
1213 7
        $currentTime = $this->getCurrentTime();
1214
1215 7
        $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
1216
        $queryBuilder
1217 7
            ->select('qid')
1218 7
            ->from('tx_crawler_queue');
1219
        //if this entry is scheduled with "now"
1220 7
        if ($tstamp <= $currentTime) {
1221 2
            if ($this->extensionSettings['enableTimeslot']) {
1222 1
                $timeBegin = $currentTime - 100;
1223 1
                $timeEnd = $currentTime + 100;
1224
                $queryBuilder
1225 1
                    ->where(
1226 1
                        'scheduled BETWEEN ' . $timeBegin . ' AND ' . $timeEnd . ''
1227
                    )
1228 1
                    ->orWhere(
1229 1
                        $queryBuilder->expr()->lte('scheduled', $currentTime)
1230
                    );
1231
            } else {
1232
                $queryBuilder
1233 1
                    ->where(
1234 2
                        $queryBuilder->expr()->lte('scheduled', $currentTime)
1235
                    );
1236
            }
1237 5
        } elseif ($tstamp > $currentTime) {
1238
            //entries with a timestamp in the future need to have the same schedule time
1239
            $queryBuilder
1240 5
                ->where(
1241 5
                    $queryBuilder->expr()->eq('scheduled', $tstamp)
1242
                );
1243
        }
1244
1245
        $queryBuilder
1246 7
            ->andWhere('NOT exec_time')
1247 7
            ->andWhere('NOT process_id')
1248 7
            ->andWhere($queryBuilder->expr()->eq('page_id', $queryBuilder->createNamedParameter($fieldArray['page_id'], \PDO::PARAM_INT)))
1249 7
            ->andWhere($queryBuilder->expr()->eq('parameters_hash', $queryBuilder->createNamedParameter($fieldArray['parameters_hash'], \PDO::PARAM_STR)))
1250
            ;
1251
1252 7
        $statement = $queryBuilder->execute();
1253
        
1254 7
        while ($row = $statement->fetch()) {
1255 5
            $rows[] = $row['qid'];
1256
        }
1257
1258 7
        return $rows;
1259
    }
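
The duplicate rule from the docblock of getDuplicateRowsIfExist() can be restated over an in-memory queue. This is a sketch under assumptions: `findDuplicatesSketch` is a hypothetical name, the rows are plain arrays with integer columns, and the `enableTimeslot` branch (which widens the window around the current time) is deliberately left out.

```php
<?php
// Hypothetical helper: mirrors the non-timeslot branch of the duplicate check
// with plain arrays instead of a database query.
function findDuplicatesSketch(array $queue, array $candidate, int $tstamp, int $now): array
{
    $duplicates = [];
    foreach ($queue as $row) {
        // Only unprocessed, unassigned entries for the same page and the same
        // parameters can be duplicates.
        if ($row['exec_time']
            || $row['process_id']
            || $row['page_id'] !== $candidate['page_id']
            || $row['parameters_hash'] !== $candidate['parameters_hash']
        ) {
            continue;
        }

        // Due now or in the past: every entry that is already due matches.
        // Scheduled in the future: only the identical schedule time matches.
        $matches = $tstamp <= $now
            ? $row['scheduled'] <= $now
            : $row['scheduled'] === $tstamp;

        if ($matches) {
            $duplicates[] = $row['qid'];
        }
    }

    return $duplicates;
}

$queue = [
    ['qid' => 1, 'page_id' => 5, 'parameters_hash' => 'abc', 'scheduled' => 900, 'exec_time' => 0, 'process_id' => ''],
    ['qid' => 2, 'page_id' => 5, 'parameters_hash' => 'abc', 'scheduled' => 2000, 'exec_time' => 0, 'process_id' => ''],
    ['qid' => 3, 'page_id' => 5, 'parameters_hash' => 'abc', 'scheduled' => 800, 'exec_time' => 850, 'process_id' => ''],
];
$candidate = ['page_id' => 5, 'parameters_hash' => 'abc'];
var_export(findDuplicatesSketch($queue, $candidate, 1000, 1000));
```

With the current time set to 1000, an entry scheduled for "now" collides with the pending entry 1, while an entry scheduled for 2000 collides only with entry 2. Entry 3 never counts because it has already been executed.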
1260
1261
    /**
1262
     * Returns the current system time
1263
     *
1264
     * @return int
1265
     */
1266
    public function getCurrentTime()
1267
    {
1268
        return time();
1269
    }
1270
1271
    /************************************
1272
     *
1273
     * URL reading
1274
     *
1275
     ************************************/
1276
1277
    /**
1278
     * Read URL for single queue entry
1279
     *
1280
     * @param integer $queueId
1281
     * @param boolean $force If set, will process even if exec_time has been set!
1282
     * @return integer
1283
     */
1284
    public function readUrl($queueId, $force = false)
1285
    {
1286
        $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
1287
        $ret = 0;
1288
        $this->logger->debug('crawler-readurl start ' . microtime(true));
1289
        // Get entry:
1290
        $queryBuilder
1291
            ->select('*')
1292
            ->from('tx_crawler_queue')
1293
            ->where(
1294
                $queryBuilder->expr()->eq('qid', $queryBuilder->createNamedParameter($queueId, \PDO::PARAM_INT))
1295
            );
1296
        if (!$force) {
1297
            $queryBuilder
1298
                ->andWhere('exec_time = 0')
1299
                ->andWhere('process_scheduled > 0');
1300
        }
1301
        $queueRec = $queryBuilder->execute()->fetch();
1302
1303
        if (!is_array($queueRec)) {
1304
            return $ret;
1305
        }
1306
1307
        SignalSlotUtility::emitSignal(
1308
            __CLASS__,
1309
            SignalSlotUtility::SIGNNAL_QUEUEITEM_PREPROCESS,
1310
            [$queueId, &$queueRec]
1311
        );
1312
1313
        // Set exec_time to lock record:
1314
        $field_array = ['exec_time' => $this->getCurrentTime()];
1315
1316
        if (isset($this->processID)) {
1317
            //if multiprocessing is used we need to store the id of the process which has handled this entry
1318
            $field_array['process_id_completed'] = $this->processID;
1319
        }
1320
1321
        GeneralUtility::makeInstance(ConnectionPool::class)->getConnectionForTable('tx_crawler_queue')
1322
            ->update(
1323
                'tx_crawler_queue',
1324
                $field_array,
1325
                [ 'qid' => (int)$queueId ]
1326
            );
1327
1328
        $result = $this->queueExecutor->executeQueueItem($queueRec, $this);
1329
        $resultData = unserialize($result['content']);
1330
1331
        //atm there's no need to point to specific pollable extensions
1332
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pollSuccess'])) {
1333
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pollSuccess'] as $pollable) {
1334
                // only check the success value if the instruction is running
1335
                // it is important to name the pollSuccess key same as the procInstructions key
1336
                if (is_array($resultData['parameters']['procInstructions']) && in_array(
1337
                    $pollable,
1338
                    $resultData['parameters']['procInstructions']
1339
                )
1340
                ) {
1341
                    if (!empty($resultData['success'][$pollable]) && $resultData['success'][$pollable]) {
1342
                        $ret |= self::CLI_STATUS_POLLABLE_PROCESSED;
1343
                    }
1344
                }
1345
            }
1346
        }
1347
1348
        // Set result in log which also denotes the end of the processing of this entry.
1349
        $field_array = ['result_data' => serialize($result)];
1350
1351
        SignalSlotUtility::emitSignal(
1352
            __CLASS__,
1353
            SignalSlotUtility::SIGNNAL_QUEUEITEM_POSTPROCESS,
1354
            [$queueId, &$field_array]
1355
        );
1356
1357
        GeneralUtility::makeInstance(ConnectionPool::class)->getConnectionForTable('tx_crawler_queue')
1358
            ->update(
1359
                'tx_crawler_queue',
1360
                $field_array,
1361
                [ 'qid' => (int)$queueId ]
1362
            );
1363
1364
        $this->logger->debug('crawler-readurl stop ' . microtime(true));
1365
        return $ret;
1366
    }
1367
    /**
     * Read URL for not-yet-inserted log-entry
     *
     * @param array $field_array Queue field array
     *
     * @return string
     */
    public function readUrlFromArray($field_array)
    {
        // Set exec_time to lock record:
        $field_array['exec_time'] = $this->getCurrentTime();
        $connectionForCrawlerQueue = GeneralUtility::makeInstance(ConnectionPool::class)->getConnectionForTable($this->tableName);
        $connectionForCrawlerQueue->insert(
            $this->tableName,
            $field_array
        );
        $queueId = $field_array['qid'] = $connectionForCrawlerQueue->lastInsertId($this->tableName, 'qid');

        $result = $this->queueExecutor->executeQueueItem($field_array, $this);

        // Set result in log which also denotes the end of the processing of this entry.
        $field_array = ['result_data' => serialize($result)];

        SignalSlotUtility::emitSignal(
            __CLASS__,
            SignalSlotUtility::SIGNNAL_QUEUEITEM_POSTPROCESS,
            [$queueId, &$field_array]
        );

        $connectionForCrawlerQueue->update(
            $this->tableName,
            $field_array,
            ['qid' => $queueId]
        );

        return $result;
    }

    /*****************************
     *
     * Compiling URLs to crawl - tools
     *
     *****************************/

    /**
     * @param integer $id Root page id to start from.
     * @param integer $depth Depth of tree, 0 = only id-page, 1 = one sublevel, 99 = infinite
     * @param integer $scheduledTime Unix Time when the URL is timed to be visited when put in queue
     * @param integer $reqMinute Number of requests per minute (creates the interleave between requests)
     * @param boolean $submitCrawlUrls If set, submits the URLs to queue in database (real crawling)
     * @param boolean $downloadCrawlUrls If set (and $submitCrawlUrls is false), fills $downloadUrls with entries
     * @param array $incomingProcInstructions Array of processing instructions
     * @param array $configurationSelection Array of configuration keys
     * @return string
     */
    public function getPageTreeAndUrls(
        $id,
        $depth,
        $scheduledTime,
        $reqMinute,
        $submitCrawlUrls,
        $downloadCrawlUrls,
        array $incomingProcInstructions,
        array $configurationSelection
    ) {
        $this->scheduledTime = $scheduledTime;
        $this->reqMinute = $reqMinute;
        $this->submitCrawlUrls = $submitCrawlUrls;
        $this->downloadCrawlUrls = $downloadCrawlUrls;
        $this->incomingProcInstructions = $incomingProcInstructions;
        $this->incomingConfigurationSelection = $configurationSelection;

        $this->duplicateTrack = [];
        $this->downloadUrls = [];

        // Drawing tree:
        /* @var PageTreeView $tree */
        $tree = GeneralUtility::makeInstance(PageTreeView::class);
        $perms_clause = $this->getBackendUser()->getPagePermsClause(Permission::PAGE_SHOW);
        $tree->init('AND ' . $perms_clause);

        $pageInfo = BackendUtility::readPageAccess($id, $perms_clause);
        if (is_array($pageInfo)) {
            // Set root row:
            $tree->tree[] = [
                'row' => $pageInfo,
                'HTML' => $this->iconFactory->getIconForRecord('pages', $pageInfo, Icon::SIZE_SMALL)
            ];
        }

        // Get branch beneath:
        if ($depth) {
            $tree->getTree($id, $depth, '');
        }

        // Traverse page tree:
        $code = '';

        foreach ($tree->tree as $data) {
            $this->MP = false;

            // recognize mount points
            if ($data['row']['doktype'] == PageRepository::DOKTYPE_MOUNTPOINT) {
                $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
                $queryBuilder->getRestrictions()->removeAll()->add(GeneralUtility::makeInstance(DeletedRestriction::class));
                $mountpage = $queryBuilder
                    ->select('*')
                    ->from('pages')
                    ->where(
                        $queryBuilder->expr()->eq('uid', $queryBuilder->createNamedParameter($data['row']['uid'], \PDO::PARAM_INT))
                    )
                    ->execute()
                    ->fetchAll();
                $queryBuilder->resetRestrictions();

                // fetch mounted pages
                $this->MP = $mountpage[0]['mount_pid'] . '-' . $data['row']['uid'];
Documentation Bug
The property $MP was declared of type boolean, but $mountpage[0]['mount_pid...' . $data['row']['uid'] is of type string. Maybe add a type cast?

This check looks for assignments to scalar types that may be of the wrong type.

To ensure the code behaves as expected, it may be a good idea to add an explicit type cast.

$answer = 42;

$correct = false;

$correct = (bool) $answer;
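One possible way to address the mismatch flagged above is to keep the mount-point value a string in every case, with an empty string standing in for "no mount point". The following is only a sketch: `buildMountPointParameter()` is a hypothetical helper that does not exist in CrawlerController, and only the row keys `mount_pid` and `uid` are taken from the listing.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the crawler extension): builds the
 * mount-point value as "<mount_pid>-<uid>" and always returns a string.
 */
function buildMountPointParameter(array $mountPageRow, array $pageRow): string
{
    if (empty($mountPageRow['mount_pid'])) {
        // No mount target: an empty string replaces the former `false`.
        return '';
    }

    return (int)$mountPageRow['mount_pid'] . '-' . (int)$pageRow['uid'];
}

$mp = buildMountPointParameter(['mount_pid' => 12], ['uid' => 34]);
$none = buildMountPointParameter([], ['uid' => 34]);
```

Alternatively, the declared type of `$MP` could be widened in its docblock so that it documents both the boolean and the string case; which option fits better depends on how the property is read elsewhere in the class.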

                $mountTree = GeneralUtility::makeInstance(PageTreeView::class);
                $mountTree->init('AND ' . $perms_clause);
                $mountTree->getTree($mountpage[0]['mount_pid'], $depth);

                foreach ($mountTree->tree as $mountData) {
                    $code .= $this->drawURLs_addRowsForPage(
                        $mountData['row'],
                        $mountData['HTML'] . BackendUtility::getRecordTitle('pages', $mountData['row'], true)
                    );
                }

                // replace page when mount_pid_ol is enabled
                if ($mountpage[0]['mount_pid_ol']) {
                    $data['row']['uid'] = $mountpage[0]['mount_pid'];
                } else {
                    // if the mount_pid_ol is not set the MP must not be used for the mountpoint page
                    $this->MP = false;
                }
            }

            $code .= $this->drawURLs_addRowsForPage(
                $data['row'],
                $data['HTML'] . BackendUtility::getRecordTitle('pages', $data['row'], true)
            );
        }

        return $code;
    }

    /**
     * Expands an exclude string into a list of page ids.
     *
     * The string is a comma-separated list of "pid" or "pid+depth" entries.
     * A plain "pid" excludes only that page (depth 0); "pid+" without a depth
     * excludes the page and its subtree down to depth 99.
     *
     * @param string $excludeString Exclude string
     * @return array
     */
    public function expandExcludeString($excludeString)
    {
        // internal static caches;
        static $expandedExcludeStringCache;
        static $treeCache;

        if (empty($expandedExcludeStringCache[$excludeString])) {
            $pidList = [];

            if (!empty($excludeString)) {
                /** @var PageTreeView $tree */
                $tree = GeneralUtility::makeInstance(PageTreeView::class);
                $tree->init('AND ' . $this->getBackendUser()->getPagePermsClause(Permission::PAGE_SHOW));

                $excludeParts = GeneralUtility::trimExplode(',', $excludeString);

                foreach ($excludeParts as $excludePart) {
                    list($pid, $depth) = GeneralUtility::trimExplode('+', $excludePart);

                    // default is "page only" = "depth=0"
                    if (empty($depth)) {
                        $depth = (stristr($excludePart, '+')) ? 99 : 0;
                    }

                    $pidList[] = $pid;

                    if ($depth > 0) {
                        if (empty($treeCache[$pid][$depth])) {
                            $tree->reset();
                            $tree->getTree($pid, $depth);
                            $treeCache[$pid][$depth] = $tree->tree;
                        }

                        foreach ($treeCache[$pid][$depth] as $data) {
                            $pidList[] = $data['row']['uid'];
                        }
                    }
                }
            }

            $expandedExcludeStringCache[$excludeString] = array_unique($pidList);
        }

        return $expandedExcludeStringCache[$excludeString];
    }

    /**
     * Create the rows for display of the page tree
     * For each page a number of rows are shown displaying GET variable configuration
     *
     * @param array $pageRow Page row
     * @param string $pageTitleAndIcon Page icon and title for row
     * @return string HTML <tr> content (one or more)
     */
    public function drawURLs_addRowsForPage(array $pageRow, $pageTitleAndIcon)
    {
        $skipMessage = '';

        // Get list of configurations
        $configurations = $this->getUrlsForPageRow($pageRow, $skipMessage);

        if (!empty($this->incomingConfigurationSelection)) {
            // remove configuration that does not match the current selection
            foreach ($configurations as $confKey => $confArray) {
                if (!in_array($confKey, $this->incomingConfigurationSelection)) {
                    unset($configurations[$confKey]);
                }
            }
        }

        // Traverse parameter combinations:
        $c = 0;
        $content = '';
        if (!empty($configurations)) {
            foreach ($configurations as $confKey => $confArray) {
                // Title column:
                if (!$c) {
                    $titleClm = '<td rowspan="' . count($configurations) . '">' . $pageTitleAndIcon . '</td>';
                } else {
                    $titleClm = '';
                }

                if (!in_array($pageRow['uid'], $this->expandExcludeString($confArray['subCfg']['exclude']))) {
                    // URL list:
                    $urlList = $this->urlListFromUrlArray(
                        $confArray,
                        $pageRow,
                        $this->scheduledTime,
                        $this->reqMinute,
                        $this->submitCrawlUrls,
                        $this->downloadCrawlUrls,
                        $this->duplicateTrack,
                        $this->downloadUrls,
                        $this->incomingProcInstructions // if empty the urls won't be filtered by processing instructions
                    );

                    // Expanded parameters:
                    $paramExpanded = '';
                    $calcAccu = [];
                    $calcRes = 1;
                    foreach ($confArray['paramExpanded'] as $gVar => $gVal) {
                        $paramExpanded .= '
                            <tr>
                                <td>' . htmlspecialchars('&' . $gVar . '=') . '<br/>' .
                                '(' . count($gVal) . ')' .
                                '</td>
                                <td nowrap="nowrap">' . nl2br(htmlspecialchars(implode(chr(10), $gVal))) . '</td>
                            </tr>
                        ';
                        $calcRes *= count($gVal);
                        $calcAccu[] = count($gVal);
                    }
                    $paramExpanded = '<table>' . $paramExpanded . '</table>';
                    $paramExpanded .= 'Comb: ' . implode('*', $calcAccu) . '=' . $calcRes;

                    // Options
                    $optionValues = '';
                    if ($confArray['subCfg']['userGroups']) {
                        $optionValues .= 'User Groups: ' . $confArray['subCfg']['userGroups'] . '<br/>';
                    }
                    if ($confArray['subCfg']['procInstrFilter']) {
                        $optionValues .= 'ProcInstr: ' . $confArray['subCfg']['procInstrFilter'] . '<br/>';
                    }

                    // Compile row:
                    $content .= '
                        <tr>
                            ' . $titleClm . '
                            <td>' . htmlspecialchars($confKey) . '</td>
                            <td>' . nl2br(htmlspecialchars(rawurldecode(trim(str_replace('&', chr(10) . '&', GeneralUtility::implodeArrayForUrl('', $confArray['paramParsed'])))))) . '</td>
                            <td>' . $paramExpanded . '</td>
                            <td nowrap="nowrap">' . $urlList . '</td>
                            <td nowrap="nowrap">' . $optionValues . '</td>
                            <td nowrap="nowrap">' . DebugUtility::viewArray($confArray['subCfg']['procInstrParams.']) . '</td>
                        </tr>';
                } else {
                    $content .= '<tr>
                            ' . $titleClm . '
                            <td>' . htmlspecialchars($confKey) . '</td>
                            <td colspan="5"><em>No entries</em> (Page is excluded in this configuration)</td>
                        </tr>';
                }

                $c++;
            }
        } else {
            $message = !empty($skipMessage) ? ' (' . $skipMessage . ')' : '';

            // Compile row:
            $content .= '
                <tr>
                    <td>' . $pageTitleAndIcon . '</td>
                    <td colspan="6"><em>No entries</em>' . $message . '</td>
                </tr>';
        }

        return $content;
    }

    /*****************************
     *
     * CLI functions
     *
     *****************************/

    /**
     * Running the functionality of the CLI (crawling URLs from queue)
     *
     * @param int $countInARun Maximum number of queue entries to process in this run
     * @param int $sleepTime Pause after each processed entry, in microseconds (passed to usleep())
     * @param int $sleepAfterFinish Pause after the run has finished, in seconds (passed to sleep())
     * @return int Bitmask of CLI_STATUS_* flags
     */
    public function CLI_run($countInARun, $sleepTime, $sleepAfterFinish)
    {
        $result = 0;
        $counter = 0;

        // First, run hooks:
        $this->CLI_runHooks();

        // Clean up the queue
        if ((int)$this->extensionSettings['purgeQueueDays'] > 0) {
            $purgeDate = $this->getCurrentTime() - 24 * 60 * 60 * (int)$this->extensionSettings['purgeQueueDays'];

            $queryBuilderDelete = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
            $del = $queryBuilderDelete
                ->delete($this->tableName)
                ->where(
                    'exec_time != 0 AND exec_time < ' . $purgeDate
                )->execute();

            if (false === $del) {
                $this->logger->info(
                    'Records could not be deleted.'
                );
            }
        }

        // Select entries:
        // TODO: Shouldn't this reside within the transaction?
        $queryBuilderSelect = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
        $rows = $queryBuilderSelect
            ->select('qid', 'scheduled')
            ->from($this->tableName)
            ->where(
                $queryBuilderSelect->expr()->eq('exec_time', 0),
                $queryBuilderSelect->expr()->eq('process_scheduled', 0),
                $queryBuilderSelect->expr()->lte('scheduled', $this->getCurrentTime())
            )
            ->orderBy('scheduled')
            ->addOrderBy('qid')
            ->setMaxResults($countInARun)
            ->execute()
            ->fetchAll();

        if (!empty($rows)) {
            $quidList = [];

            foreach ($rows as $r) {
                $quidList[] = $r['qid'];
            }

            $processId = $this->CLI_buildProcessId();

            // Reserve queue entries for process

            //$this->queryBuilder->getConnection()->executeQuery('BEGIN');
            // TODO: make sure we're not taking assigned queue entries

            // Save the number of assigned queue entries to determine how many have been processed later
            $queryBuilderUpdate = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
            $numberOfAffectedRows = $queryBuilderUpdate
                ->update($this->tableName)
                ->where(
                    $queryBuilderUpdate->expr()->in('qid', $quidList)
                )
                ->set('process_scheduled', $this->getCurrentTime())
                ->set('process_id', $processId)
                ->execute();

            GeneralUtility::makeInstance(ConnectionPool::class)->getConnectionForTable('tx_crawler_process')
                ->update(
                    'tx_crawler_process',
                    [ 'assigned_items_count' => (int)$numberOfAffectedRows ],
                    [ 'process_id' => $processId ]
                );

            if ($numberOfAffectedRows == count($quidList)) {
                //$this->queryBuilder->getConnection()->executeQuery('COMMIT');
            } else {
                //$this->queryBuilder->getConnection()->executeQuery('ROLLBACK');
                $this->CLI_debug("Nothing processed due to multi-process collision (" . $this->CLI_buildProcessId() . ")");
                return ($result | self::CLI_STATUS_ABORTED);
            }

            foreach ($rows as $r) {
                $result |= $this->readUrl($r['qid']);

                $counter++;
                usleep((int)$sleepTime); // Just to relax the system

                // If the CLI has been disabled between the start and the current read URL,
                // we need to return from the function and NOT mark the process as ended.
                if ($this->getDisabled()) {
                    return ($result | self::CLI_STATUS_ABORTED);
                }

                if (!$this->processRepository->isProcessActive($this->CLI_buildProcessId())) {
                    $this->CLI_debug("conflict / timeout (" . $this->CLI_buildProcessId() . ")");

                    // TODO: might need an additional return code
                    $result |= self::CLI_STATUS_ABORTED;
                    break; //possible timeout
                }
            }

            sleep((int)$sleepAfterFinish);

            $msg = 'Rows: ' . $counter;
            $this->CLI_debug($msg . " (" . $this->CLI_buildProcessId() . ")");
        } else {
            $this->CLI_debug("Nothing within queue which needs to be processed (" . $this->CLI_buildProcessId() . ")");
        }

        if ($counter > 0) {
            $result |= self::CLI_STATUS_PROCESSED;
        }

        return $result;
    }

    /**
     * Activate hooks
     *
     * @return void
     */
    public function CLI_runHooks()
    {
        foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['cli_hooks'] ?? [] as $objRef) {
            $hookObj = GeneralUtility::makeInstance($objRef);
            if (is_object($hookObj)) {
                $hookObj->crawler_init($this);
            }
        }
    }

    /**
     * Try to acquire a new process with the given id;
     * also performs some auto-cleanup for orphan processes
     * @todo preemption might not be the most elegant way to clean up
     *
     * @param string $id identification string for the process
     * @return boolean
     */
    public function CLI_checkAndAcquireNewProcess($id)
    {
        $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
        $ret = true;

        $systemProcessId = getmypid();
        if ($systemProcessId < 1) {
            return false;
        }

        $processCount = 0;
        $orphanProcesses = [];

        //$this->queryBuilder->getConnection()->executeQuery('BEGIN');

        $statement = $queryBuilder
            ->select('process_id', 'ttl')
            ->from('tx_crawler_process')
            ->where(
                'active = 1 AND deleted = 0'
            )
            ->execute();

        $currentTime = $this->getCurrentTime();

        while ($row = $statement->fetch()) {
            if ($row['ttl'] < $currentTime) {
                $orphanProcesses[] = $row['process_id'];
            } else {
                $processCount++;
            }
        }

        // If there are fewer active processes than allowed, add a new one
        if ($processCount < (int)$this->extensionSettings['processLimit']) {
            $this->CLI_debug("add process " . $this->CLI_buildProcessId() . " (" . ($processCount + 1) . "/" . (int)$this->extensionSettings['processLimit'] . ")");

            GeneralUtility::makeInstance(ConnectionPool::class)->getConnectionForTable('tx_crawler_process')->insert(
                'tx_crawler_process',
                [
                    'process_id' => $id,
                    'active' => 1,
                    'ttl' => $currentTime + (int)$this->extensionSettings['processMaxRunTime'],
                    'system_process_id' => $systemProcessId
                ]
            );
        } else {
            $this->CLI_debug("Processlimit reached (" . ($processCount) . "/" . (int)$this->extensionSettings['processLimit'] . ")");
            $ret = false;
        }

        $this->processRepository->deleteProcessesMarkedAsDeleted();
        $this->CLI_releaseProcesses($orphanProcesses, true); // maybe this should be somehow included into the current lock

        return $ret;
    }

    /**
     * Release a process and the required resources
     *
     * @param  mixed    $releaseIds   string with a single process-id or array with multiple process-ids
     * @param  boolean  $withinLock   show whether the DB-actions are included within an existing lock
     * @return boolean
     */
    public function CLI_releaseProcesses($releaseIds, $withinLock = false)
    {
        $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);

        if (!is_array($releaseIds)) {
            $releaseIds = [$releaseIds];
        }

        if (empty($releaseIds)) {
            return false;   //nothing to release
        }

        if (!$withinLock) {
            //$this->queryBuilder->getConnection()->executeQuery('BEGIN');
        }

        // Some kind of second-chance algorithm: this way you need at least 2 processes to have a real cleanup.
        // This ensures that a single process can't mess up the entire process table.

        // mark all processes as deleted which have no "waiting" queue entries and which are not active

        $queryBuilder
            ->update($this->tableName, 'q')
            ->where(
                'q.process_id IN(SELECT p.process_id FROM tx_crawler_process as p WHERE p.active = 0)'
            )
            ->set('q.process_scheduled', 0)
            ->set('q.process_id', '')
            ->execute();

        // FIXME: Not entirely sure that this is equivalent to the previous version
        $queryBuilder->resetQueryPart('set');

        $queryBuilder
            ->update('tx_crawler_process')
            ->where(
                $queryBuilder->expr()->eq('active', 0),
                'process_id IN(SELECT q.process_id FROM tx_crawler_queue as q WHERE q.exec_time = 0)'
            )
            ->set('system_process_id', 0)
            ->execute();
        // previous version for reference
        /*
        $GLOBALS['TYPO3_DB']->exec_UPDATEquery(
            'tx_crawler_process',
            'active=0 AND deleted=0
            AND NOT EXISTS (
                SELECT * FROM tx_crawler_queue
                WHERE tx_crawler_queue.process_id = tx_crawler_process.process_id
                AND tx_crawler_queue.exec_time = 0
            )',
            [
                'deleted' => '1',
                'system_process_id' => 0
            ]
        );*/
        // mark all requested processes as non-active
        $queryBuilder
            ->update('tx_crawler_process')
            ->where(
                'NOT EXISTS (
                SELECT * FROM tx_crawler_queue
                    WHERE tx_crawler_queue.process_id = tx_crawler_process.process_id
                    AND tx_crawler_queue.exec_time = 0
                )',
                $queryBuilder->expr()->in('process_id', $queryBuilder->createNamedParameter($releaseIds, Connection::PARAM_STR_ARRAY)),
                $queryBuilder->expr()->eq('deleted', 0)
            )
            ->set('active', 0)
            ->execute();
        $queryBuilder->resetQueryPart('set');
        $queryBuilder
            ->update($this->tableName)
            ->where(
                $queryBuilder->expr()->eq('exec_time', 0),
                $queryBuilder->expr()->in('process_id', $queryBuilder->createNamedParameter($releaseIds, Connection::PARAM_STR_ARRAY))
            )
            ->set('process_scheduled', 0)
            ->set('process_id', '')
            ->execute();

        if (!$withinLock) {
            //$this->queryBuilder->getConnection()->executeQuery('COMMIT');
        }

        return true;
    }

    /**
     * Create a unique Id for the current process
     *
     * @return string  the ID
     */
    public function CLI_buildProcessId()
    {
        if (!$this->processID) {
            $this->processID = GeneralUtility::shortMD5(microtime(true));
        }
        return $this->processID;
    }

    /**
     * Prints a message to the stdout (only if debug-mode is enabled)
     *
     * @param  string $msg  the message
     */
    public function CLI_debug($msg)
    {
        if ((int)$this->extensionSettings['processDebug']) {
            echo $msg . "\n";
            flush();
        }
    }

    /**
     * Cleans up entries that stayed for too long in the queue. These are:
     * - processed entries that are over 1.5 days in age
     * - scheduled entries that are over 7 days old
     *
     * @return void
     */
    public function cleanUpOldQueueEntries()
    {
        $processedAgeInSeconds = $this->extensionSettings['cleanUpProcessedAge'] * 86400; // 24*60*60 Seconds in 24 hours
        $scheduledAgeInSeconds = $this->extensionSettings['cleanUpScheduledAge'] * 86400;

        $now = time();
        $condition = '(exec_time<>0 AND exec_time<' . ($now - $processedAgeInSeconds) . ') OR scheduled<=' . ($now - $scheduledAgeInSeconds);
        $this->flushQueue($condition);
    }

    /**
     * Returns an MD5 hash generated from a serialized configuration array.
     *
     * @param array $configuration
     *
     * @return string
     */
    protected function getConfigurationHash(array $configuration)
    {
        unset($configuration['paramExpanded']);
        unset($configuration['URLs']);
        return md5(serialize($configuration));
    }

    /**
     * Build a URL from a Page and the Query String. If the page has a Site configuration, it can be built by using
     * the Site instance.
     *
     * @param int $pageId
     * @param string $queryString
     * @param string|null $alternativeBaseUrl
     * @param int $httpsOrHttp see tx_crawler_configuration.force_ssl
     * @return UriInterface
     * @throws \TYPO3\CMS\Core\Exception\SiteNotFoundException
     * @throws \TYPO3\CMS\Core\Routing\InvalidRouteArgumentsException
     */
    protected function getUrlFromPageAndQueryParameters(int $pageId, string $queryString, ?string $alternativeBaseUrl, int $httpsOrHttp): UriInterface
    {
        $site = GeneralUtility::makeInstance(SiteMatcher::class)->matchByPageId((int)$pageId);
        if ($site instanceof Site) {
Bug
The class TYPO3\CMS\Core\Site\Entity\Site does not exist. Did you forget a USE statement, or did you not list all dependencies?

This error could be the result of:

1. Missing dependencies

PHP Analyzer uses your composer.json file (if available) to determine the dependencies of your project and to determine all the available classes and functions. It expects the composer.json to be in the root folder of your repository.

Are you sure this class is defined by one of your dependencies, or did you maybe not list a dependency in either the require or require-dev section?

2. Missing use statement

PHP does not complain about undefined classes in instanceof checks. For example, the following PHP code will work perfectly fine:

if ($x instanceof DoesNotExist) {
    // Do something.
}

If you have not tested against this specific condition, such errors might go unnoticed.
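A runnable sketch of this behavior follows. The class name `\Acme\Missing\DoesNotExist` and the function name are hypothetical; the class is deliberately never defined. The sketch assumes no autoloader is registered and shows that the instanceof check quietly evaluates to false, while class_exists() makes the missing class visible.

```php
<?php
declare(strict_types=1);

// Hypothetical class name used only for illustration; it is intentionally never defined.
function isMissingType(object $value): bool
{
    // instanceof does not trigger autoloading and raises no error for unknown classes.
    return $value instanceof \Acme\Missing\DoesNotExist;
}

$checked = isMissingType(new \stdClass());

// class_exists() makes the silent failure visible.
$defined = class_exists(\Acme\Missing\DoesNotExist::class);
```

A guard such as class_exists() in a test is one way to catch a missing use statement or dependency before the branch is silently skipped in production.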

            $queryString = ltrim($queryString, '?&');
            $queryParts = [];
            parse_str($queryString, $queryParts);
            unset($queryParts['id']);
            // workaround as long as we don't have native language support in crawler configurations
            if (isset($queryParts['L'])) {
                $queryParts['_language'] = $queryParts['L'];
                unset($queryParts['L']);
                $siteLanguage = $site->getLanguageById((int)$queryParts['_language']);
Unused Code: $siteLanguage is not used; you could remove the assignment.

This check looks for variable assignments that are either overwritten by other assignments or where the variable is not used subsequently.

$myVar = 'Value';
$higher = false;

if (rand(1, 6) > 3) {
    $higher = true;
} else {
    $higher = false;
}

Both the $myVar assignment in line 1 and the $higher assignment in line 2 are dead: the first because $myVar is never used, and the second because $higher is always overwritten on every possible execution path.

            } else {
                $siteLanguage = $site->getDefaultLanguage();
Unused Code: $siteLanguage is not used; you could remove the assignment.
            }
            $url = $site->getRouter()->generateUri($pageId, $queryParts);
            if (!empty($alternativeBaseUrl)) {
                $alternativeBaseUrl = new Uri($alternativeBaseUrl);
                $url = $url->withHost($alternativeBaseUrl->getHost());
                $url = $url->withScheme($alternativeBaseUrl->getScheme());
                $url = $url->withPort($alternativeBaseUrl->getPort());
            }
        } else {
            // Technically this is not possible with site handling, but kept for backwards-compatibility reasons
            // Once EXT:crawler is v10-only compatible, this should be removed completely
            $baseUrl = ($alternativeBaseUrl ?: GeneralUtility::getIndpEnv('TYPO3_SITE_URL'));
            $cacheHashCalculator = GeneralUtility::makeInstance(CacheHashCalculator::class);
            $queryString .= '&cHash=' . $cacheHashCalculator->generateForParameters($queryString);
            $url = rtrim($baseUrl, '/') . '/index.php' . $queryString;
            $url = new Uri($url);
        }

        if ($httpsOrHttp === -1) {
            $url = $url->withScheme('http');
        } elseif ($httpsOrHttp === 1) {
            $url = $url->withScheme('https');
        }

        return $url;
    }
}
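The query-string handling and the scheme override in getUrlFromPageAndQueryParameters() can be tried out without TYPO3. The sketch below is not the extension's code: `normalizeQueryParts` and `forceScheme` are hypothetical helpers, the URLs are made up, and the scheme is swapped with a regular expression on a plain string rather than with a PSR-7 Uri object. It only mirrors the steps visible in the listing: strip a leading `?` or `&`, parse the string, drop `id`, rename `L` to `_language`, and force the scheme when `$httpsOrHttp` is -1 or 1.

```php
<?php
declare(strict_types=1);

// Hypothetical helper mirroring the query-string handling in the Site branch.
function normalizeQueryParts(string $queryString): array
{
    $queryParts = [];
    parse_str(ltrim($queryString, '?&'), $queryParts);
    // The page id is passed to the router separately, so it is removed here.
    unset($queryParts['id']);
    if (isset($queryParts['L'])) {
        // The legacy L parameter is renamed to the _language argument.
        $queryParts['_language'] = $queryParts['L'];
        unset($queryParts['L']);
    }
    return $queryParts;
}

// Hypothetical helper mirroring the force_ssl handling:
// -1 forces http, 1 forces https, any other value keeps the scheme.
function forceScheme(string $url, int $httpsOrHttp): string
{
    if ($httpsOrHttp === -1) {
        return (string)preg_replace('#^https?://#', 'http://', $url);
    }
    if ($httpsOrHttp === 1) {
        return (string)preg_replace('#^https?://#', 'https://', $url);
    }
    return $url;
}

$parts = normalizeQueryParts('?id=42&L=1&tx_news[item]=7');
$forcedHttps = forceScheme('http://example.org/page', 1);
$forcedHttp = forceScheme('https://example.org/page', -1);
$unchanged = forceScheme('https://example.org/page', 0);
```

Note that parse_str() returns all scalar values as strings, which is why the listing casts `$queryParts['_language']` to int before calling getLanguageById().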