Test Failed
Push — 6-0 ( cfb4d5...b26d37 )
by unknown
04:59
created

tx_crawler_lib::urlListFromUrlArray()   D

Complexity
    Conditions: 21
    Paths: 114

Size
    Total Lines: 114
    Code Lines: 57

Duplication
    Lines: 0
    Ratio: 0 %

Importance
    Changes: 0

Metric  Value
cc      21
eloc    57
nc      114
nop     9
dl      0
loc     114
rs      4.4991
c       0
b       0
f       0

How to fix: Long Method, Complexity, Many Parameters

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include:
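Extract Method is one such refactoring. The following is a minimal sketch, assuming the two statements under the "Scheduled time:" comment in urlListFromUrlArray() were moved into a function of their own. The name calculateScheduledTime() and its signature are hypothetical and not part of the crawler extension; the comment simply became the starting point for the name.

```php
<?php

/**
 * Hypothetical helper extracted from the "Scheduled time:" block.
 * Spreads requests over time according to the requests-per-minute rate
 * and rounds the result down to a full minute.
 */
function calculateScheduledTime(int $scheduledTime, int $alreadyTracked, int $requestsPerMinute): int
{
    // Offset in seconds caused by the URLs that were tracked before this one.
    $offset = (int) round($alreadyTracked * (60 / $requestsPerMinute));

    // Round down to the start of the minute and return an integer timestamp.
    return (int) (floor(($scheduledTime + $offset) / 60) * 60);
}

// 45 tracked URLs at 30 requests per minute give a 90 second offset.
echo calculateScheduledTime(1200, 45, 30), PHP_EOL; // prints 1260
```

Returning an integer here would also sidestep the "type double is incompatible with the type integer" report on the date() call, although that is a side effect of this sketch rather than something the original code does.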

Many Parameters

Methods with many parameters are not only hard to understand, but their parameters also often become inconsistent when you need more, or different data.

There are several approaches to avoid long parameter lists:
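One well-known approach is to introduce a parameter object. The sketch below is illustrative only: the class UrlSubmissionOptions, its property names, and the function describeSubmission() are hypothetical and do not exist in the crawler extension. It shows how four of the scalar arguments of urlListFromUrlArray() could travel together as one value.

```php
<?php

/**
 * Hypothetical parameter object bundling the scheduling and submission flags.
 */
final class UrlSubmissionOptions
{
    /** @var int Unix time to schedule indexing to */
    public $scheduledTime;

    /** @var int Number of requests per minute */
    public $requestsPerMinute;

    /** @var bool Submit the URLs to the queue */
    public $submitCrawlUrls;

    /** @var bool Collect the URLs for download (only used if not submitting) */
    public $downloadCrawlUrls;

    public function __construct(
        int $scheduledTime,
        int $requestsPerMinute,
        bool $submitCrawlUrls = false,
        bool $downloadCrawlUrls = false
    ) {
        $this->scheduledTime = $scheduledTime;
        $this->requestsPerMinute = $requestsPerMinute;
        $this->submitCrawlUrls = $submitCrawlUrls;
        $this->downloadCrawlUrls = $downloadCrawlUrls;
    }
}

/**
 * Hypothetical consumer: one options argument replaces four scalars.
 */
function describeSubmission(array $urls, UrlSubmissionOptions $options): string
{
    // Same precedence as the original if/elseif: submitting wins over downloading.
    $mode = $options->submitCrawlUrls ? 'submit' : ($options->downloadCrawlUrls ? 'download' : 'display');

    return sprintf('%d URL(s), mode=%s, %d req/min', count($urls), $mode, $options->requestsPerMinute);
}

$options = new UrlSubmissionOptions(time(), 30, true);
echo describeSubmission(['index.php?id=1', 'index.php?id=2'], $options), PHP_EOL;
// prints "2 URL(s), mode=submit, 30 req/min"
```

Grouping values that always travel together also gives them a single place for validation and defaults.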

<?php

/***************************************************************
 *  Copyright notice
 *
 *  (c) 2016 AOE GmbH <[email protected]>
 *
 *  All rights reserved
 *
 *  This script is part of the TYPO3 project. The TYPO3 project is
 *  free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  The GNU General Public License can be found at
 *  http://www.gnu.org/copyleft/gpl.html.
 *
 *  This script is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  This copyright notice MUST APPEAR in all copies of the script!
 ***************************************************************/

/**
 * Class tx_crawler_lib
 */
class tx_crawler_lib
{
    /**
     * @var integer
     */
    public $setID = 0;

    /**
     * @var string
     */
    public $processID = '';

    /**
     * One hour is max stalled time for the CLI
     * If the process had the status "start" for 3600 seconds, it will be regarded stalled and a new process is started
     *
     * @var integer
     */
    public $max_CLI_exec_time = 3600;

    /**
     * @var array
     */
    public $duplicateTrack = [];

    /**
     * @var array
     */
    public $downloadUrls = [];

    /**
     * @var array
     */
    public $incomingProcInstructions = [];

    /**
     * @var array
     */
    public $incomingConfigurationSelection = [];

    /**
     * @var array
     */
    public $registerQueueEntriesInternallyOnly = [];

    /**
     * @var array
     */
    public $queueEntries = [];

    /**
     * @var array
     */
    public $urlList = [];

    /**
     * @var boolean
     */
    public $debugMode = false;

    /**
     * @var array
     */
    public $extensionSettings = [];

    /**
     * Mount Point
     *
     * @var boolean
     */
    public $MP = false;

    /**
     * @var string
     */
    protected $processFilename;

    /**
     * Holds the internal access mode; can be 'gui', 'cli' or 'cli_im'
     *
     * @var string
     */
    protected $accessMode;

    /**
     * @var \TYPO3\CMS\Core\Database\DatabaseConnection
Bug: The type TYPO3\CMS\Core\Database\DatabaseConnection was not found. Maybe you did not declare it correctly or list all dependencies?

The issue could also be caused by a filter entry in the build configuration. If the path has been excluded in your configuration, e.g. excluded_paths: ["lib/*"], you can move it to the dependency path list as follows:

filter:
    dependency_paths: ["lib/*"]

For further information see https://scrutinizer-ci.com/docs/tools/php/php-scrutinizer/#list-dependency-paths
     */
    private $db;

    /**
     * @var TYPO3\CMS\Core\Authentication\BackendUserAuthentication
Bug: The type TYPO3\CMS\Core\Authentic...ckendUserAuthentication was not found. Maybe you did not declare it correctly or list all dependencies?
     */
    private $backendUser;

    const CLI_STATUS_NOTHING_PROCCESSED = 0;
    const CLI_STATUS_REMAIN = 1; //queue not empty
    const CLI_STATUS_PROCESSED = 2; //(some) queue items were processed
    const CLI_STATUS_ABORTED = 4; //instance didn't finish
    const CLI_STATUS_POLLABLE_PROCESSED = 8;

    /**
     * Returns the access mode; can be gui, cli or cli_im
     *
     * @return string
     */
    public function getAccessMode()
    {
        return $this->accessMode;
    }

    /**
     * @param string $accessMode
     */
    public function setAccessMode($accessMode)
    {
        $this->accessMode = $accessMode;
    }

    /**
     * Set disabled status to prevent processes from being processed
     *
     * @param  bool $disabled (optional, defaults to true)
     * @return void
     */
    public function setDisabled($disabled = true)
    {
        if ($disabled) {
            \TYPO3\CMS\Core\Utility\GeneralUtility::writeFile($this->processFilename, '');
Bug: The type TYPO3\CMS\Core\Utility\GeneralUtility was not found. Maybe you did not declare it correctly or list all dependencies?
        } else {
            if (is_file($this->processFilename)) {
                unlink($this->processFilename);
            }
        }
    }

    /**
     * Get disable status
     *
     * @return bool true if disabled
     */
    public function getDisabled()
    {
        if (is_file($this->processFilename)) {
            return true;
        } else {
            return false;
        }
    }

    /**
     * @param string $filenameWithPath
     *
     * @return void
     */
    public function setProcessFilename($filenameWithPath)
    {
        $this->processFilename = $filenameWithPath;
    }

    /**
     * @return string
     */
    public function getProcessFilename()
    {
        return $this->processFilename;
    }

    /************************************
     *
     * Getting URLs based on Page TSconfig
     *
     ************************************/

    public function __construct()
    {
        $this->db = $GLOBALS['TYPO3_DB'];
        $this->backendUser = $GLOBALS['BE_USER'];
        $this->processFilename = PATH_site . 'typo3temp/tx_crawler.proc';
Bug: The constant PATH_site was not found. Maybe you did not declare it correctly or list all dependencies?

        $settings = unserialize($GLOBALS['TYPO3_CONF_VARS']['EXT']['extConf']['crawler']);
        $settings = is_array($settings) ? $settings : [];

        // read ext_em_conf_template settings and set
        $this->setExtensionSettings($settings);

        // set defaults:
        if (\TYPO3\CMS\Core\Utility\MathUtility::convertToPositiveInteger($this->extensionSettings['countInARun']) == 0) {
Bug: The type TYPO3\CMS\Core\Utility\MathUtility was not found. Maybe you did not declare it correctly or list all dependencies?
            $this->extensionSettings['countInARun'] = 100;
        }

        $this->extensionSettings['processLimit'] = \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($this->extensionSettings['processLimit'], 1, 99, 1);
    }

    /**
     * Sets the extension settings (unserialized counterpart of $TYPO3_CONF_VARS['EXT']['extConf']['crawler']).
     *
     * @param array $extensionSettings
     * @return void
     */
    public function setExtensionSettings(array $extensionSettings)
    {
        $this->extensionSettings = $extensionSettings;
    }

    /**
     * Check if the given page should be crawled
     *
     * @param array $pageRow
     * @return false|string false if the page should be crawled (not excluded), otherwise the skip message
     */
    public function checkIfPageShouldBeSkipped(array $pageRow)
    {
        $skipPage = false;
        $skipMessage = 'Skipped'; // message will be overwritten later

        // if page is hidden
        if (!$this->extensionSettings['crawlHiddenPages']) {
            if ($pageRow['hidden']) {
                $skipPage = true;
                $skipMessage = 'Because page is hidden';
            }
        }

        if (!$skipPage) {
            if (\TYPO3\CMS\Core\Utility\GeneralUtility::inList('3,4', $pageRow['doktype']) || $pageRow['doktype'] >= 199) {
                $skipPage = true;
                $skipMessage = 'Because doktype is not allowed';
            }
        }

        if (!$skipPage) {
            if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['excludeDoktype'])) {
                foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['excludeDoktype'] as $key => $doktypeList) {
                    if (\TYPO3\CMS\Core\Utility\GeneralUtility::inList($doktypeList, $pageRow['doktype'])) {
                        $skipPage = true;
                        $skipMessage = 'Doktype was excluded by "' . $key . '"';
                        break;
                    }
                }
            }
        }

        if (!$skipPage) {
            // veto hook
            if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pageVeto'])) {
                foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pageVeto'] as $key => $func) {
                    $params = [
                        'pageRow' => $pageRow
                    ];
                    // expects "false" if page is ok and "true" or a skipMessage if this page should _not_ be crawled
                    $veto = \TYPO3\CMS\Core\Utility\GeneralUtility::callUserFunction($func, $params, $this);
                    if ($veto !== false) {
                        $skipPage = true;
                        if (is_string($veto)) {
                            $skipMessage = $veto;
                        } else {
                            $skipMessage = 'Veto from hook "' . htmlspecialchars($key) . '"';
                        }
                        // no need to execute other hooks if a previous one returned a veto
                        break;
                    }
                }
            }
        }

        return $skipPage ? $skipMessage : false;
    }

    /**
     * Wrapper method for getUrlsForPageId()
     * It returns an array of configurations and no urls!
     *
     * @param array $pageRow Page record with at least dok-type and uid columns.
     * @param string $skipMessage
     * @return array
     * @see getUrlsForPageId()
     */
    public function getUrlsForPageRow(array $pageRow, &$skipMessage = '')
    {
        $message = $this->checkIfPageShouldBeSkipped($pageRow);

        if ($message === false) {
            $forceSsl = ($pageRow['url_scheme'] === 2) ? true : false;
            $res = $this->getUrlsForPageId($pageRow['uid'], $forceSsl);
            $skipMessage = '';
        } else {
            $skipMessage = $message;
            $res = [];
        }

        return $res;
    }

    /**
     * This method checks whether there are ANY unprocessed queue entries
     * of a given page_id and the configuration which matches a given hash.
     * If there are none, we can skip an inner detail check
     *
     * @param  int $uid
     * @param  string $configurationHash
     * @return boolean
     */
    protected function noUnprocessedQueueEntriesForPageWithConfigurationHashExist($uid, $configurationHash)
    {
        $configurationHash = $this->db->fullQuoteStr($configurationHash, 'tx_crawler_queue');
        $res = $this->db->exec_SELECTquery('count(*) as anz', 'tx_crawler_queue', "page_id=" . intval($uid) . " AND configuration_hash=" . $configurationHash . " AND exec_time=0");
        $row = $this->db->sql_fetch_assoc($res);

        return ($row['anz'] == 0);
    }

    /**
     * Creates a list of URLs from input array (and submits them to queue if asked for)
     * See Web > Info module script + "indexed_search"'s crawler hook-client using this!
     *
     * @param    array        Information about URLs from pageRow to crawl.
     * @param    array        Page row
     * @param    integer        Unix time to schedule indexing to, typically time()
     * @param    integer        Number of requests per minute (creates the interleave between requests)
     * @param    boolean        If set, submits the URLs to queue
Bug: The type If was not found. Maybe you did not declare it correctly or list all dependencies?
     * @param    boolean        If set (and submitCrawlUrls is false) will fill $downloadUrls with entries
     * @param    array        Array which is passed by reference and contains an id per url to ensure we do not crawl duplicates
     * @param    array        Array which will be filled with URLs for download if flag is set.
     * @param    array        Array of processing instructions
     * @return    string        List of URLs (meant for display in backend module)
     *
     */
    public function urlListFromUrlArray(
        array $vv,
        array $pageRow,
        $scheduledTime,
        $reqMinute,
        $submitCrawlUrls,
        $downloadCrawlUrls,
        array &$duplicateTrack,
        array &$downloadUrls,
        array $incomingProcInstructions
    ) {

        // realurl support (thanks to Ingo Renner)
        if (\TYPO3\CMS\Core\Utility\ExtensionManagementUtility::isLoaded('realurl') && $vv['subCfg']['realurl']) {
Bug: The type TYPO3\CMS\Core\Utility\ExtensionManagementUtility was not found. Maybe you did not declare it correctly or list all dependencies?

            /** @var tx_realurl $urlObj */
            $urlObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('tx_realurl');

            if (!empty($vv['subCfg']['baseUrl'])) {
                $urlParts = parse_url($vv['subCfg']['baseUrl']);
                $host = strtolower($urlParts['host']);
                $urlObj->host = $host;

                // First pass, finding configuration OR pointer string:
                $urlObj->extConf = isset($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->host]) ? $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->host] : $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl']['_DEFAULT'];

                // If it turned out to be a string pointer, then look up the real config:
                if (is_string($urlObj->extConf)) {
                    $urlObj->extConf = is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->extConf]) ? $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->extConf] : $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl']['_DEFAULT'];
                }
            }

            if (!$GLOBALS['TSFE']->sys_page) {
                $GLOBALS['TSFE']->sys_page = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Frontend\Page\PageRepository');
            }
            if (!$GLOBALS['TSFE']->csConvObj) {
                $GLOBALS['TSFE']->csConvObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Core\Charset\CharsetConverter');
            }
            if (!$GLOBALS['TSFE']->tmpl->rootLine[0]['uid']) {
                $GLOBALS['TSFE']->tmpl->rootLine[0]['uid'] = $urlObj->extConf['pagePath']['rootpage_id'];
            }
        }

        if (is_array($vv['URLs'])) {
            $configurationHash = md5(serialize($vv));
            $skipInnerCheck = $this->noUnprocessedQueueEntriesForPageWithConfigurationHashExist($pageRow['uid'], $configurationHash);

            foreach ($vv['URLs'] as $urlQuery) {
                if ($this->drawURLs_PIfilter($vv['subCfg']['procInstrFilter'], $incomingProcInstructions)) {

                    // Calculate cHash:
                    if ($vv['subCfg']['cHash']) {
                        /* @var $cacheHash \TYPO3\CMS\Frontend\Page\CacheHashCalculator */
                        $cacheHash = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Frontend\Page\CacheHashCalculator');
                        $urlQuery .= '&cHash=' . $cacheHash->generateForParameters($urlQuery);
                    }

                    // Create key by which to determine uniqueness:
                    $uKey = $urlQuery . '|' . $vv['subCfg']['userGroups'] . '|' . $vv['subCfg']['baseUrl'] . '|' . $vv['subCfg']['procInstrFilter'];

                    // realurl support (thanks to Ingo Renner)
                    $urlQuery = 'index.php' . $urlQuery;
                    if (\TYPO3\CMS\Core\Utility\ExtensionManagementUtility::isLoaded('realurl') && $vv['subCfg']['realurl']) {
                        $params = [
                            'LD' => [
                                'totalURL' => $urlQuery
                            ],
                            'TCEmainHook' => true
                        ];
                        $urlObj->encodeSpURL($params);
Comprehensibility / Best Practice: The variable $urlObj does not seem to be defined for all execution paths leading up to this point.
                        $urlQuery = $params['LD']['totalURL'];
                    }

                    // Scheduled time:
                    $schTime = $scheduledTime + round(count($duplicateTrack) * (60 / $reqMinute));
                    $schTime = floor($schTime / 60) * 60;

                    if (isset($duplicateTrack[$uKey])) {

                        //if the url key is registered just display it and do not resubmit it
                        $urlList = '<em><span class="typo3-dimmed">' . htmlspecialchars($urlQuery) . '</span></em><br/>';
                    } else {
                        $urlList = '[' . date('d.m.y H:i', $schTime) . '] ' . htmlspecialchars($urlQuery);
Bug: Are you sure date('d.m.y H:i', $schTime) of type false|string can be used in concatenation? (Ignorable by Annotation)

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation:

                        $urlList = '[' . /** @scrutinizer ignore-type */ date('d.m.y H:i', $schTime) . '] ' . htmlspecialchars($urlQuery);

Bug: $schTime of type double is incompatible with the type integer expected by parameter $timestamp of date(). (Ignorable by Annotation)

                        $urlList = '[' . date('d.m.y H:i', /** @scrutinizer ignore-type */ $schTime) . '] ' . htmlspecialchars($urlQuery);

                        $this->urlList[] = '[' . date('d.m.y H:i', $schTime) . '] ' . $urlQuery;

                        $theUrl = ($vv['subCfg']['baseUrl'] ? $vv['subCfg']['baseUrl'] : \TYPO3\CMS\Core\Utility\GeneralUtility::getIndpEnv('TYPO3_SITE_URL')) . $urlQuery;

                        // Submit for crawling!
                        if ($submitCrawlUrls) {
                            $added = $this->addUrl(
                                $pageRow['uid'],
                                $theUrl,
                                $vv['subCfg'],
                                $scheduledTime,
                                $configurationHash,
                                $skipInnerCheck
                            );
                            if ($added === false) {
                                $urlList .= ' (Url already existed)';
                            }
                        } elseif ($downloadCrawlUrls) {
                            $downloadUrls[$theUrl] = $theUrl;
                        }

                        $urlList .= '<br />';
                    }
                    $duplicateTrack[$uKey] = true;
                }
            }
        } else {
            $urlList = 'ERROR - no URL generated';
        }

        return $urlList;
Comprehensibility / Best Practice: The variable $urlList does not seem to be defined for all execution paths leading up to this point.
    }

    /**
     * Returns true if input processing instruction is among registered ones.
     *
     * @param string $piString PI to test
     * @param array $incomingProcInstructions Processing instructions
     * @return boolean
     */
    public function drawURLs_PIfilter($piString, array $incomingProcInstructions)
    {
        if (empty($incomingProcInstructions)) {
            return true;
        }

        foreach ($incomingProcInstructions as $pi) {
            if (\TYPO3\CMS\Core\Utility\GeneralUtility::inList($piString, $pi)) {
                return true;
            }
        }

        // No instruction matched: return an explicit boolean, as documented.
        return false;
    }

    public function getPageTSconfigForId($id)
    {
        if (!$this->MP) {
            $pageTSconfig = \TYPO3\CMS\Backend\Utility\BackendUtility::getPagesTSconfig($id);
Bug: The type TYPO3\CMS\Backend\Utility\BackendUtility was not found. Maybe you did not declare it correctly or list all dependencies?
        } else {
            list(, $mountPointId) = explode('-', $this->MP);
Bug: $this->MP of type true is incompatible with the type string expected by parameter $string of explode(). (Ignorable by Annotation)
            $pageTSconfig = \TYPO3\CMS\Backend\Utility\BackendUtility::getPagesTSconfig($mountPointId);
        }

        // Call a hook to alter configuration
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['getPageTSconfigForId'])) {
            $params = [
                'pageId' => $id,
                'pageTSConfig' => &$pageTSconfig
            ];
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['getPageTSconfigForId'] as $userFunc) {
                \TYPO3\CMS\Core\Utility\GeneralUtility::callUserFunction($userFunc, $params, $this);
            }
        }

        return $pageTSconfig;
    }

    /**
     * This method returns an array of configurations.
     * And no urls!
     *
     * @param integer $id Page ID
     * @param bool $forceSsl Use https
     * @return array
     */
    protected function getUrlsForPageId($id, $forceSsl = false)
    {

        /**
         * Get configuration from tsConfig
         */

        // Get page TSconfig for page ID:
        $pageTSconfig = $this->getPageTSconfigForId($id);

        $res = [];

        if (is_array($pageTSconfig) && is_array($pageTSconfig['tx_crawler.']['crawlerCfg.'])) {
            $crawlerCfg = $pageTSconfig['tx_crawler.']['crawlerCfg.'];

            if (is_array($crawlerCfg['paramSets.'])) {
                foreach ($crawlerCfg['paramSets.'] as $key => $values) {
                    if (!is_array($values)) {

                        // Sub configuration for a single configuration string:
                        $subCfg = (array)$crawlerCfg['paramSets.'][$key . '.'];
                        $subCfg['key'] = $key;

                        if (strcmp($subCfg['procInstrFilter'], '')) {
                            $subCfg['procInstrFilter'] = implode(',', \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $subCfg['procInstrFilter']));
                        }
                        $pidOnlyList = implode(',', \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $subCfg['pidsOnly'], 1));

                        // process configuration if it is not page-specific or if the specific page is the current page:
                        if (!strcmp($subCfg['pidsOnly'], '') || \TYPO3\CMS\Core\Utility\GeneralUtility::inList($pidOnlyList, $id)) {

                            // add trailing slash if not present
                            if (!empty($subCfg['baseUrl']) && substr($subCfg['baseUrl'], -1) != '/') {
                                $subCfg['baseUrl'] .= '/';
                            }

                            // Explode, process etc.:
                            $res[$key] = [];
                            $res[$key]['subCfg'] = $subCfg;
                            $res[$key]['paramParsed'] = $this->parseParams($values);
                            $res[$key]['paramExpanded'] = $this->expandParameters($res[$key]['paramParsed'], $id);
                            $res[$key]['origin'] = 'pagets';

                            // recognize MP value
                            if (!$this->MP) {
                                $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $id]);
                            } else {
                                $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $id . '&MP=' . $this->MP]);
Bug: Are you sure $this->MP of type true can be used in concatenation? (Ignorable by Annotation)
                            }
                        }
                    }
                }
            }
        }

        /**
         * Get configuration from tx_crawler_configuration records
         */

        // get records along the rootline
        $rootLine = \TYPO3\CMS\Backend\Utility\BackendUtility::BEgetRootLine($id);

        foreach ($rootLine as $page) {
            $configurationRecordsForCurrentPage = \TYPO3\CMS\Backend\Utility\BackendUtility::getRecordsByField(
                'tx_crawler_configuration',
                'pid',
                intval($page['uid']),
                \TYPO3\CMS\Backend\Utility\BackendUtility::BEenableFields('tx_crawler_configuration') . \TYPO3\CMS\Backend\Utility\BackendUtility::deleteClause('tx_crawler_configuration')
            );

            if (is_array($configurationRecordsForCurrentPage)) {
                foreach ($configurationRecordsForCurrentPage as $configurationRecord) {

                    // check access to the configuration record
                    if (empty($configurationRecord['begroups']) || $GLOBALS['BE_USER']->isAdmin() || $this->hasGroupAccess($GLOBALS['BE_USER']->user['usergroup_cached_list'], $configurationRecord['begroups'])) {
                        $pidOnlyList = implode(',', \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $configurationRecord['pidsonly'], 1));

                        // process configuration if it is not page-specific or if the specific page is the current page:
                        if (!strcmp($configurationRecord['pidsonly'], '') || \TYPO3\CMS\Core\Utility\GeneralUtility::inList($pidOnlyList, $id)) {
                            $key = $configurationRecord['name'];

                            // don't overwrite previously defined paramSets
                            if (!isset($res[$key])) {

                                /* @var $TSparserObject \TYPO3\CMS\Core\TypoScript\Parser\TypoScriptParser */
                                $TSparserObject = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Core\TypoScript\Parser\TypoScriptParser');
                                $TSparserObject->parse($configurationRecord['processing_instruction_parameters_ts']);

                                $subCfg = [
                                    'procInstrFilter' => $configurationRecord['processing_instruction_filter'],
                                    'procInstrParams.' => $TSparserObject->setup,
                                    'baseUrl' => $this->getBaseUrlForConfigurationRecord(
                                        $configurationRecord['base_url'],
                                        $configurationRecord['sys_domain_base_url'],
                                        $forceSsl
                                    ),
                                    'realurl' => $configurationRecord['realurl'],
                                    'cHash' => $configurationRecord['chash'],
                                    'userGroups' => $configurationRecord['fegroups'],
                                    'exclude' => $configurationRecord['exclude'],
                                    'rootTemplatePid' => (int) $configurationRecord['root_template_pid'],
                                    'key' => $key,
                                ];

                                // add trailing slash if not present
                                if (!empty($subCfg['baseUrl']) && substr($subCfg['baseUrl'], -1) != '/') {
                                    $subCfg['baseUrl'] .= '/';
                                }
                                if (!in_array($id, $this->expandExcludeString($subCfg['exclude']))) {
                                    $res[$key] = [];
                                    $res[$key]['subCfg'] = $subCfg;
                                    $res[$key]['paramParsed'] = $this->parseParams($configurationRecord['configuration']);
636
                                    $res[$key]['paramExpanded'] = $this->expandParameters($res[$key]['paramParsed'], $id);
637
                                    $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $id]);
638
                                    $res[$key]['origin'] = 'tx_crawler_configuration_' . $configurationRecord['uid'];
639
                                }
640
                            }
641
                        }
642
                    }
643
                }
644
            }
645
        }
646
647
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['processUrls'])) {
648
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['processUrls'] as $func) {
649
                $params = [
650
                    'res' => &$res,
651
                ];
652
                \TYPO3\CMS\Core\Utility\GeneralUtility::callUserFunction($func, $params, $this);
653
            }
654
        }
655
656
        return $res;
657
    }
658
659
    /**
660
     * Checks whether a domain record exists and returns the base URL built from that record. Otherwise the given $baseUrl string is returned.
661
     *
662
     * @param string $baseUrl
663
     * @param integer $sysDomainUid
664
     * @param bool $ssl
665
     * @return string
666
     */
667
    protected function getBaseUrlForConfigurationRecord($baseUrl, $sysDomainUid, $ssl = false)
668
    {
669
        $sysDomainUid = intval($sysDomainUid);
670
        $urlScheme = ($ssl === false) ? 'http' : 'https';
671
672
        if ($sysDomainUid > 0) {
673
            $res = $this->db->exec_SELECTquery(
674
                '*',
675
                'sys_domain',
676
                'uid = ' . $sysDomainUid .
677
                \TYPO3\CMS\Backend\Utility\BackendUtility::BEenableFields('sys_domain') .
678
                \TYPO3\CMS\Backend\Utility\BackendUtility::deleteClause('sys_domain')
679
            );
680
            $row = $this->db->sql_fetch_assoc($res);
681
            if ($row['domainName'] != '') {
682
                return $urlScheme . '://' . $row['domainName'];
683
            }
684
        }
685
        return $baseUrl;
686
    }
687
688
    /**
     * Collects the names of all crawler configurations available for a page branch:
     * paramSets from page TSconfig plus tx_crawler_configuration records found on the rootline and in the subtree.
     *
     * @param integer $rootid Page ID of the branch root
     * @param integer $depth Number of tree levels below the root to include
     * @return array Configuration names
     */
    public function getConfigurationsForBranch($rootid, $depth)
689
    {
690
        $configurationsForBranch = [];
691
692
        $pageTSconfig = $this->getPageTSconfigForId($rootid);
693
        if (is_array($pageTSconfig) && is_array($pageTSconfig['tx_crawler.']['crawlerCfg.']) && is_array($pageTSconfig['tx_crawler.']['crawlerCfg.']['paramSets.'])) {
694
            $sets = $pageTSconfig['tx_crawler.']['crawlerCfg.']['paramSets.'];
695
            if (is_array($sets)) {
696
                foreach ($sets as $key => $value) {
697
                    if (!is_array($value)) {
698
                        continue;
699
                    }
700
                    $configurationsForBranch[] = substr($key, -1) == '.' ? substr($key, 0, -1) : $key;
701
                }
702
            }
703
        }
704
        $pids = [];
705
        $rootLine = \TYPO3\CMS\Backend\Utility\BackendUtility::BEgetRootLine($rootid);
706
        foreach ($rootLine as $node) {
707
            $pids[] = $node['uid'];
708
        }
709
        /* @var $tree \TYPO3\CMS\Backend\Tree\View\PageTreeView */
710
        $tree = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Backend\Tree\View\PageTreeView');
711
        $perms_clause = $GLOBALS['BE_USER']->getPagePermsClause(1);
712
        $tree->init('AND ' . $perms_clause);
713
        $tree->getTree($rootid, $depth, '');
714
        foreach ($tree->tree as $node) {
715
            $pids[] = $node['row']['uid'];
716
        }
717
718
        $res = $this->db->exec_SELECTquery(
719
            '*',
720
            'tx_crawler_configuration',
721
            'pid IN (' . implode(',', $pids) . ') ' .
722
            \TYPO3\CMS\Backend\Utility\BackendUtility::BEenableFields('tx_crawler_configuration') .
723
            \TYPO3\CMS\Backend\Utility\BackendUtility::deleteClause('tx_crawler_configuration') . ' ' .
724
            \TYPO3\CMS\Backend\Utility\BackendUtility::versioningPlaceholderClause('tx_crawler_configuration') . ' '
725
        );
726
727
        while ($row = $this->db->sql_fetch_assoc($res)) {
728
            $configurationsForBranch[] = $row['name'];
729
        }
730
        $this->db->sql_free_result($res);
731
        return $configurationsForBranch;
732
    }
733
734
    /**
735
     * Check if a user has access to an item
736
     * (e.g. get the group list of the current logged in user from $GLOBALS['TSFE']->gr_list)
737
     *
738
     * @see \TYPO3\CMS\Frontend\Page\PageRepository::getMultipleGroupsWhereClause()
739
     * @param  string $groupList    Comma-separated list of (fe_)group UIDs from a user
740
     * @param  string $accessList   Comma-separated list of (fe_)group UIDs of the item to access
741
     * @return bool                 TRUE if at least one of the user's group UIDs is in the access list or the access list is empty
742
     */
743
    public function hasGroupAccess($groupList, $accessList)
744
    {
745
        if (empty($accessList)) {
746
            return true;
747
        }
748
        foreach (\TYPO3\CMS\Core\Utility\GeneralUtility::intExplode(',', $groupList) as $groupUid) {
749
            if (\TYPO3\CMS\Core\Utility\GeneralUtility::inList($accessList, $groupUid)) {
750
                return true;
751
            }
752
        }
753
        return false;
754
    }
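
    /*
    Illustrative sketch, not part of the original class: the group check above can be read as "grant access when the access list is empty or when it shares at least one UID with the user's group list". The standalone function below restates that rule with core PHP only. The name hasGroupAccessSketch() is hypothetical, and the TYPO3 helpers GeneralUtility::intExplode() and GeneralUtility::inList() are approximated with array_map() and array_intersect(), so edge cases (for example whitespace inside the lists or an access list of "0") may behave differently from the real method.

```php
<?php
// Hypothetical standalone restatement of the access rule, using only core PHP.
function hasGroupAccessSketch(string $groupList, string $accessList): bool
{
    // An empty access list means the item is not restricted at all.
    if ($accessList === '') {
        return true;
    }
    $userGroups = array_map('intval', explode(',', $groupList));
    $allowedGroups = array_map('intval', explode(',', $accessList));

    // Access is granted as soon as one group UID appears in both lists.
    return count(array_intersect($userGroups, $allowedGroups)) > 0;
}

var_dump(hasGroupAccessSketch('1,2,3', '3,7')); // bool(true)
var_dump(hasGroupAccessSketch('1,2', '7'));     // bool(false)
var_dump(hasGroupAccessSketch('1,2', ''));      // bool(true)
```

    In urlListFromUrlArray() the same check decides whether a backend user may use a tx_crawler_configuration record that is restricted through its begroups field.
    */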
755
756
    /**
757
     * Parse GET vars of input Query into array with key=>value pairs
758
     *
759
     * @param string $inputQuery Input query string
760
     * @return array
761
     */
762
    public function parseParams($inputQuery)
763
    {
764
        // Extract all GET parameters into an ARRAY:
765
        $paramKeyValues = [];
766
        $GETparams = explode('&', $inputQuery);
767
768
        foreach ($GETparams as $paramAndValue) {
769
            // array_pad() keeps $v defined for parameters that have no "=" part
            list($p, $v) = array_pad(explode('=', $paramAndValue, 2), 2, '');
770
            if (strlen($p)) {
771
                $paramKeyValues[rawurldecode($p)] = rawurldecode($v);
772
            }
773
        }
774
775
        return $paramKeyValues;
776
    }
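
    /*
    Illustrative sketch, not part of the original class: a standalone copy of the parsing logic may help to see what parseParams() produces for a typical crawler configuration string. The function name parseParamsSketch() and the sample input are assumptions made for this example; the array_pad() call is a defensive addition so that parameters without a "=" part do not leave the value undefined.

```php
<?php
// Hypothetical standalone version of the query-string parsing shown above.
function parseParamsSketch(string $inputQuery): array
{
    $paramKeyValues = [];
    foreach (explode('&', $inputQuery) as $paramAndValue) {
        // Split only at the first "=", so values may contain further "=" characters.
        list($p, $v) = array_pad(explode('=', $paramAndValue, 2), 2, '');
        // Empty parameter names (e.g. caused by a leading "&") are skipped.
        if (strlen($p)) {
            $paramKeyValues[rawurldecode($p)] = rawurldecode($v);
        }
    }
    return $paramKeyValues;
}

$parsed = parseParamsSketch('&L=[0-2]&tx_ttnews%5Btt_news%5D=[_TABLE:tt_news]');
// $parsed is ['L' => '[0-2]', 'tx_ttnews[tt_news]' => '[_TABLE:tt_news]']
```

    The bracketed values are kept as plain strings at this stage; they are only interpreted later by expandParameters().
    */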
777
778
    /**
779
     * Will expand the parameters configuration to individual values. This follows a certain syntax of the value of each parameter.
780
     * Syntax of values:
781
     * - Basically: If the value is wrapped in [...] it will be expanded according to the following syntax, otherwise the value is taken literally
782
     * - Configuration is split by "|" and the parts are processed individually and finally added together
783
     * - For each configuration part:
784
     *         - "[int]-[int]" = Integer range, will be expanded to all values in between, values included, starting from low to high (max. 1000). Example "1-34" or "-40--30"
785
     *         - "_TABLE:[TCA table name];[_PID:[optional page id, default is current page]];[_ENABLELANG:1]" = Look up of table records from PID, filtering out deleted records. Example "_TABLE:tt_content; _PID:123"
786
     *        _ENABLELANG:1 picks only original records without their language overlays
787
     *         - Default: Literal value
788
     *
789
     * @param array $paramArray Array with key (GET var name) and values (value of GET var which is configuration for expansion)
790
     * @param integer $pid Current page ID
791
     * @return array
792
     */
793
    public function expandParameters($paramArray, $pid)
794
    {
795
        global $TCA;
796
797
        // Traverse parameter names:
798
        foreach ($paramArray as $p => $v) {
799
            $v = trim($v);
800
801
            // If value is encapsulated in square brackets it means there are some ranges of values to find, otherwise the value is literal
802
            if (substr($v, 0, 1) === '[' && substr($v, -1) === ']') {
803
                // So, find the value inside brackets and reset the paramArray value as an array.
804
                $v = substr($v, 1, -1);
805
                $paramArray[$p] = [];
806
807
                // Explode parts and traverse them:
808
                // Scrutinizer reported that $v can be false at this point (substr() can return false on older PHP versions),
                // while explode() expects a string, so the value is cast explicitly.
                $parts = explode('|', (string)$v);
809
                foreach ($parts as $pV) {
810
811
                    // Look for integer range (e.g. "1-34" or "-40--30", which reads minus 40 to minus 30):
812
                    if (preg_match('/^(-?[0-9]+)\s*-\s*(-?[0-9]+)$/', trim($pV), $reg)) {
813
814
                        // Swap if first is larger than last:
815
                        if ($reg[1] > $reg[2]) {
816
                            $temp = $reg[2];
817
                            $reg[2] = $reg[1];
818
                            $reg[1] = $temp;
819
                        }
820
821
                        // Traverse range, add values:
822
                        $runAwayBrake = 1000; // Limit to size of range!
823
                        for ($a = $reg[1]; $a <= $reg[2]; $a++) {
824
                            $paramArray[$p][] = $a;
825
                            $runAwayBrake--;
826
                            if ($runAwayBrake <= 0) {
827
                                break;
828
                            }
829
                        }
830
                    } elseif (substr(trim($pV), 0, 7) == '_TABLE:') {
831
832
                        // Parse parameters:
833
                        $subparts = \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(';', $pV);
834
                        $subpartParams = [];
835
                        foreach ($subparts as $spV) {
836
                            list($pKey, $pVal) = \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(':', $spV);
837
                            $subpartParams[$pKey] = $pVal;
838
                        }
839
840
                        // Table exists:
841
                        if (isset($TCA[$subpartParams['_TABLE']])) {
842
                            $lookUpPid = isset($subpartParams['_PID']) ? intval($subpartParams['_PID']) : $pid;
843
                            $pidField = isset($subpartParams['_PIDFIELD']) ? trim($subpartParams['_PIDFIELD']) : 'pid';
844
                            $where = isset($subpartParams['_WHERE']) ? $subpartParams['_WHERE'] : '';
845
                            $addTable = isset($subpartParams['_ADDTABLE']) ? $subpartParams['_ADDTABLE'] : '';
846
847
                            $fieldName = !empty($subpartParams['_FIELD']) ? $subpartParams['_FIELD'] : 'uid';
848
                            if ($fieldName === 'uid' || $TCA[$subpartParams['_TABLE']]['columns'][$fieldName]) {
849
                                $andWhereLanguage = '';
850
                                $transOrigPointerField = $TCA[$subpartParams['_TABLE']]['ctrl']['transOrigPointerField'];
851
852
                                if (!empty($subpartParams['_ENABLELANG']) && $transOrigPointerField) {
853
                                    $andWhereLanguage = ' AND ' . $this->db->quoteStr($transOrigPointerField, $subpartParams['_TABLE']) . ' <= 0 ';
854
                                }
855
856
                                $where = $this->db->quoteStr($pidField, $subpartParams['_TABLE']) . '=' . intval($lookUpPid) . ' ' .
857
                                    $andWhereLanguage . $where;
858
859
                                $rows = $this->db->exec_SELECTgetRows(
860
                                    $fieldName,
861
                                    $subpartParams['_TABLE'] . $addTable,
862
                                    $where . \TYPO3\CMS\Backend\Utility\BackendUtility::deleteClause($subpartParams['_TABLE']),
863
                                    '',
864
                                    '',
865
                                    '',
866
                                    $fieldName
867
                                );
868
869
                                if (is_array($rows)) {
870
                                    $paramArray[$p] = array_merge($paramArray[$p], array_keys($rows));
871
                                }
872
                            }
873
                        }
874
                    } else { // Just add value:
875
                        $paramArray[$p][] = $pV;
876
                    }
877
                    // Hook for processing own expandParameters place holder
878
                    if (is_array($GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['crawler/class.tx_crawler_lib.php']['expandParameters'])) {
879
                        $_params = [
880
                            'pObj' => &$this,
881
                            'paramArray' => &$paramArray,
882
                            'currentKey' => $p,
883
                            'currentValue' => $pV,
884
                            'pid' => $pid
885
                        ];
886
                        foreach ($GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['crawler/class.tx_crawler_lib.php']['expandParameters'] as $key => $_funcRef) {
887
                            \TYPO3\CMS\Core\Utility\GeneralUtility::callUserFunction($_funcRef, $_params, $this);
888
                        }
889
                    }
890
                }
891
892
                // Make unique set of values and sort array by key:
893
                $paramArray[$p] = array_unique($paramArray[$p]);
894
                ksort($paramArray);
895
            } else {
896
                // Set the literal value as only value in array:
897
                $paramArray[$p] = [$v];
898
            }
899
        }
900
901
        return $paramArray;
902
    }
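
    /*
    Illustrative sketch, not part of the original class: the integer-range branch of expandParameters() can be tried in isolation. The helper below uses the same regular expression as the method above, swaps the bounds when they are given in descending order and stops after a fixed number of values, mirroring the run-away brake. The name expandRangeSketch() and the $limit parameter are hypothetical; the _TABLE lookup and the hook call are left out on purpose.

```php
<?php
// Hypothetical helper that expands one "low-high" part into a list of integers.
function expandRangeSketch(string $part, int $limit = 1000): array
{
    // Anything that is not an integer range is returned as a literal value.
    if (!preg_match('/^(-?[0-9]+)\s*-\s*(-?[0-9]+)$/', trim($part), $reg)) {
        return [$part];
    }
    // Order the bounds so that the loop always counts upwards.
    $low = min((int)$reg[1], (int)$reg[2]);
    $high = max((int)$reg[1], (int)$reg[2]);

    $values = [];
    for ($a = $low; $a <= $high && count($values) < $limit; $a++) {
        $values[] = $a;
    }
    return $values;
}

$ascending = expandRangeSketch('1-3');      // [1, 2, 3]
$negative = expandRangeSketch('-40--38');   // [-40, -39, -38]
$swapped = expandRangeSketch('5-3');        // [3, 4, 5]
$literal = expandRangeSketch('news');       // ['news']
```

    A configuration value such as "[1-3|news]" would be split at "|" first, so each part would pass through a helper like this one separately.
    */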
903
904
    /**
905
     * Compiling URLs from parameter array (output of expandParameters())
906
     * The number of URLs will be the product of the number of parameter values for each key
907
     *
908
     * @param array $paramArray Output of expandParameters(): Array with keys (GET var names) and for each an array of values
909
     * @param array $urls URLs accumulated in this array (for recursion)
910
     * @return array
911
     */
912
    public function compileUrls($paramArray, $urls = [])
913
    {
914
        if (count($paramArray) && is_array($urls)) {
915
            // shift first off stack:
916
            reset($paramArray);
917
            $varName = key($paramArray);
918
            $valueSet = array_shift($paramArray);
919
920
            // Traverse value set:
921
            $newUrls = [];
922
            foreach ($urls as $url) {
923
                foreach ($valueSet as $val) {
924
                    $newUrls[] = $url . (strcmp($val, '') ? '&' . rawurlencode($varName) . '=' . rawurlencode($val) : '');
925
926
                    if (count($newUrls) > \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($this->extensionSettings['maxCompileUrls'], 1, 1000000000, 10000)) {
927
                        break;
928
                    }
929
                }
930
            }
931
            $urls = $newUrls;
932
            $urls = $this->compileUrls($paramArray, $urls);
933
        }
934
935
        return $urls;
936
    }
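
    /*
    Illustrative sketch, not part of the original class: compileUrls() builds the cartesian product of all parameter values. The standalone function below shows the same idea with a loop instead of recursion. The name compileUrlsSketch() and the $maxUrls parameter are assumptions for this example; the real method reads its limit from $this->extensionSettings['maxCompileUrls'] and handles the limit slightly differently.

```php
<?php
// Hypothetical iterative variant of the URL compilation.
function compileUrlsSketch(array $paramArray, array $urls, int $maxUrls = 10000): array
{
    foreach ($paramArray as $varName => $valueSet) {
        $newUrls = [];
        foreach ($urls as $url) {
            foreach ($valueSet as $val) {
                // Stop completely once the limit is reached.
                if (count($newUrls) >= $maxUrls) {
                    break 2;
                }
                // Empty values do not add a GET parameter to the URL.
                $suffix = (string)$val !== ''
                    ? '&' . rawurlencode((string)$varName) . '=' . rawurlencode((string)$val)
                    : '';
                $newUrls[] = $url . $suffix;
            }
        }
        $urls = $newUrls;
    }
    return $urls;
}

$urls = compileUrlsSketch(['L' => [0, 1], 'type' => ['', 98]], ['?id=5']);
// $urls is ['?id=5&L=0', '?id=5&L=0&type=98', '?id=5&L=1', '?id=5&L=1&type=98']
```

    Two values for L and two for type give 2 x 2 = 4 URLs, which is why a limit on the number of compiled URLs is useful for configurations with many ranges.
    */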
937
938
    /************************************
939
     *
940
     * Crawler log
941
     *
942
     ************************************/
943
944
    /**
945
     * Return array of records from crawler queue for input page ID
946
     *
947
     * @param integer $id Page ID for which to look up log entries.
948
     * @param string $filter Filter: "all" => all entries, "pending" => all entries that have not run yet, "finished" => all completed ones
949
     * @param boolean $doFlush If TRUE, the matching entries are DELETED(!) instead of returned
950
     * @param boolean $doFullFlush If TRUE (together with $doFlush), entries of all pages are flushed, not only those of $id
951
     * @param integer $itemsPerPage Limits the number of entries per page; the default is 10
952
     * @return array
953
     */
954
    public function getLogEntriesForPageId($id, $filter = '', $doFlush = false, $doFullFlush = false, $itemsPerPage = 10)
955
    {
956
        // FIXME: Write Unit tests for Filters
957
        switch ($filter) {
958
            case 'pending':
959
                $addWhere = ' AND exec_time=0';
960
                break;
961
            case 'finished':
962
                $addWhere = ' AND exec_time>0';
963
                break;
964
            default:
965
                $addWhere = '';
966
                break;
967
        }
968
969
        // FIXME: Write unit test that ensures that the right records are deleted.
970
        if ($doFlush) {
971
            $this->flushQueue(($doFullFlush ? '1=1' : ('page_id=' . intval($id))) . $addWhere);
972
            return [];
973
        } else {
974
            return $this->db->exec_SELECTgetRows(
975
                '*',
976
                'tx_crawler_queue',
977
                'page_id=' . intval($id) . $addWhere,
978
                '',
979
                'scheduled DESC',
980
                (intval($itemsPerPage) > 0 ? intval($itemsPerPage) : '')
981
            );
982
        }
983
    }
984
985
    /**
986
     * Return array of records from crawler queue for input set ID
987
     *
988
     * @param integer $set_id Set ID for which to look up log entries.
989
     * @param string $filter Filter: "all" => all entries, "pending" => all entries that have not run yet, "finished" => all completed ones
990
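     * @param boolean $doFlush If TRUE, the matching entries are DELETED(!) instead of returned
     * @param boolean $doFullFlush If TRUE (together with $doFlush), the whole queue is flushed regardless of set ID and filter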
991
     * @param integer $itemsPerPage Limits the number of entries per page; the default is 10
992
     * @return array
993
     */
994
    public function getLogEntriesForSetId($set_id, $filter = '', $doFlush = false, $doFullFlush = false, $itemsPerPage = 10)
995
    {
996
        // FIXME: Write Unit tests for Filters
997
        switch ($filter) {
998
            case 'pending':
999
                $addWhere = ' AND exec_time=0';
1000
                break;
1001
            case 'finished':
1002
                $addWhere = ' AND exec_time>0';
1003
                break;
1004
            default:
1005
                $addWhere = '';
1006
                break;
1007
        }
1008
        // FIXME: Write unit test that ensures that the right records are deleted.
1009
        if ($doFlush) {
1010
            $this->flushQueue($doFullFlush ? '' : ('set_id=' . intval($set_id) . $addWhere));
1011
            return [];
1012
        } else {
1013
            return $this->db->exec_SELECTgetRows(
1014
                '*',
1015
                'tx_crawler_queue',
1016
                'set_id=' . intval($set_id) . $addWhere,
1017
                '',
1018
                'scheduled DESC',
1019
                (intval($itemsPerPage) > 0 ? intval($itemsPerPage) : '')
1020
            );
1021
        }
1022
    }
1023
1024
    /**
1025
     * Removes queue entries
1026
     *
1027
     * @param string $where SQL related filter for the entries which should be removed
1028
     * @return void
1029
     */
1030
    protected function flushQueue($where = '')
1031
    {
1032
        $realWhere = strlen($where) > 0 ? $where : '1=1';
1033
1034
        if (tx_crawler_domain_events_dispatcher::getInstance()->hasObserver('queueEntryFlush')) {
1035
            $groups = $this->db->exec_SELECTgetRows('DISTINCT set_id', 'tx_crawler_queue', $realWhere);
1036
            foreach ($groups as $group) {
1037
                tx_crawler_domain_events_dispatcher::getInstance()->post('queueEntryFlush', $group['set_id'], $this->db->exec_SELECTgetRows('uid, set_id', 'tx_crawler_queue', $realWhere . ' AND set_id=' . intval($group['set_id'])));
1038
            }
1039
        }
1040
1041
        $this->db->exec_DELETEquery('tx_crawler_queue', $realWhere);
1042
    }
1043
1044
    /**
1045
     * Adds callback entries to the log (typically called from hooks, see the indexed search class "class.crawler.php")
1046
     *
1047
     * @param integer $setId Set ID
1048
     * @param array $params Parameters to pass to call back function
1049
     * @param string $callBack Call back object reference, eg. 'EXT:indexed_search/class.crawler.php:&tx_indexedsearch_crawler'
1050
     * @param integer $page_id Page ID to attach it to
1051
     * @param integer $schedule Time at which to activate
1052
     * @return void
1053
     */
1054
    public function addQueueEntry_callBack($setId, $params, $callBack, $page_id = 0, $schedule = 0)
1055
    {
1056
        if (!is_array($params)) {
1057
            $params = [];
1058
        }
1059
        $params['_CALLBACKOBJ'] = $callBack;
1060
1061
        // Compile value array:
1062
        $fieldArray = [
1063
            'page_id' => intval($page_id),
1064
            'parameters' => serialize($params),
1065
            'scheduled' => intval($schedule) ? intval($schedule) : $this->getCurrentTime(),
1066
            'exec_time' => 0,
1067
            'set_id' => intval($setId),
1068
            'result_data' => '',
1069
        ];
1070
1071
        $this->db->exec_INSERTquery('tx_crawler_queue', $fieldArray);
1072
    }
1073
1074
    /************************************
1075
     *
1076
     * URL setting
1077
     *
1078
     ************************************/
1079
1080
    /**
1081
     * Setting a URL for crawling:
1082
     *
1083
     * @param integer $id Page ID
1084
     * @param string $url Complete URL
1085
     * @param array $subCfg Sub configuration array (from TS config)
1086
     * @param integer $tstamp Scheduled-time
1087
     * @param string $configurationHash (optional) configuration hash
1088
     * @param bool $skipInnerDuplicationCheck (optional) skip inner duplication check
1089
     * @return bool
1090
     */
1091
    public function addUrl(
1092
        $id,
1093
        $url,
1094
        array $subCfg,
1095
        $tstamp,
1096
        $configurationHash = '',
1097
        $skipInnerDuplicationCheck = false
1098
    ) {
1099
        $urlAdded = false;
1100
1101
        // Creating parameters:
1102
        $parameters = [
1103
            'url' => $url
1104
        ];
1105
1106
        // fe user group simulation:
1107
        $uGs = implode(',', array_unique(\TYPO3\CMS\Core\Utility\GeneralUtility::intExplode(',', $subCfg['userGroups'], 1)));
1108
        if ($uGs) {
1109
            $parameters['feUserGroupList'] = $uGs;
1110
        }
1111
1112
        // Setting processing instructions
1113
        $parameters['procInstructions'] = \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $subCfg['procInstrFilter']);
1114
        if (is_array($subCfg['procInstrParams.'])) {
1115
            $parameters['procInstrParams'] = $subCfg['procInstrParams.'];
1116
        }
1117
1118
        // Possible TypoScript Template Parents
1119
        $parameters['rootTemplatePid'] = $subCfg['rootTemplatePid'];
1120
1121
        // Compile value array:
1122
        $parameters_serialized = serialize($parameters);
1123
        $fieldArray = [
1124
            'page_id' => intval($id),
1125
            'parameters' => $parameters_serialized,
1126
            'parameters_hash' => \TYPO3\CMS\Core\Utility\GeneralUtility::shortMD5($parameters_serialized),
1127
            'configuration_hash' => $configurationHash,
1128
            'scheduled' => $tstamp,
1129
            'exec_time' => 0,
1130
            'set_id' => intval($this->setID),
1131
            'result_data' => '',
1132
            'configuration' => $subCfg['key'],
1133
        ];
1134
1135
        // Explicit emptiness check, as suggested by Scrutinizer, instead of an implicit array-to-boolean conversion.
        if (!empty($this->registerQueueEntriesInternallyOnly)) {
1136
            //the entries will only be registered and not stored to the database
1137
            $this->queueEntries[] = $fieldArray;
1138
        } else {
1139
            // $rows is initialised here so that it is defined on every execution path
            // before count() is called (Scrutinizer reported it as possibly undefined).
            $rows = [];
            if (!$skipInnerDuplicationCheck) {
                // check if there is already an equal entry
                $rows = $this->getDuplicateRowsIfExist($tstamp, $fieldArray);
            }

            if (count($rows) == 0) {
1145
                $this->db->exec_INSERTquery('tx_crawler_queue', $fieldArray);
1146
                $uid = $this->db->sql_insert_id();
1147
                $rows[] = $uid;
1148
                $urlAdded = true;
1149
                tx_crawler_domain_events_dispatcher::getInstance()->post('urlAddedToQueue', $this->setID, ['uid' => $uid, 'fieldArray' => $fieldArray]);
1150
            } else {
1151
                tx_crawler_domain_events_dispatcher::getInstance()->post('duplicateUrlInQueue', $this->setID, ['rows' => $rows, 'fieldArray' => $fieldArray]);
1152
            }
1153
        }
1154
1155
        return $urlAdded;
1156
    }
1157
1158
    /**
1159
     * This method determines duplicates for a queue entry with the same parameters and this timestamp.
1160
     * If the timestamp is in the past, it will check if there is any unprocessed queue entry in the past.
1161
     * If the timestamp is in the future, it will check whether the queued entry has exactly the same timestamp.
1162
     *
1163
     * @param int $tstamp
1164
     * @param array $fieldArray
1165
     *
1166
     * @return array
1167
     */
1168
    protected function getDuplicateRowsIfExist($tstamp, $fieldArray)
1169
    {
1170
        $rows = [];
1171
1172
        $currentTime = $this->getCurrentTime();
1173
1174
        //if this entry is scheduled with "now"
1175
        if ($tstamp <= $currentTime) {
1176
            if ($this->extensionSettings['enableTimeslot']) {
1177
                $timeBegin = $currentTime - 100;
1178
                $timeEnd = $currentTime + 100;
1179
                $where = ' ((scheduled BETWEEN ' . $timeBegin . ' AND ' . $timeEnd . ' ) OR scheduled <= ' . $currentTime . ') ';
1180
            } else {
1181
                $where = 'scheduled <= ' . $currentTime;
1182
            }
1183
        } elseif ($tstamp > $currentTime) {
1184
            // entries with a timestamp in the future need to have the same schedule time
1185
            $where = 'scheduled = ' . intval($tstamp);
1186
        }
1187
1188
        if (!empty($where)) {
1189
            $result = $this->db->exec_SELECTgetRows(
1190
                'qid',
1191
                'tx_crawler_queue',
1192
                $where .
1193
                ' AND NOT exec_time' .
1194
                ' AND NOT process_id ' .
1195
                ' AND page_id=' . intval($fieldArray['page_id']) .
1196
                ' AND parameters_hash = ' . $this->db->fullQuoteStr($fieldArray['parameters_hash'], 'tx_crawler_queue')
1197
            );
1198
1199
            if (is_array($result)) {
1200
                foreach ($result as $value) {
1201
                    $rows[] = $value['qid'];
1202
                }
1203
            }
1204
        }
1205
1206
        return $rows;
1207
    }
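
    /*
    Illustrative sketch, not part of the original class: the duplicate check above depends mainly on how the "scheduled" condition is chosen. The function below isolates that decision so the three cases can be compared side by side. The name duplicateWhereSketch() and its parameters are hypothetical; in the real method the current time comes from getCurrentTime() and the timeslot switch from $this->extensionSettings['enableTimeslot'].

```php
<?php
// Hypothetical helper that returns only the "scheduled" part of the WHERE clause.
function duplicateWhereSketch(int $tstamp, int $currentTime, bool $enableTimeslot): string
{
    if ($tstamp <= $currentTime) {
        if ($enableTimeslot) {
            // Entries scheduled within +/- 100 seconds of "now", or earlier, count as duplicates.
            return ' ((scheduled BETWEEN ' . ($currentTime - 100) . ' AND ' . ($currentTime + 100)
                . ' ) OR scheduled <= ' . $currentTime . ') ';
        }
        // Without timeslots, any entry scheduled up to "now" is a candidate.
        return 'scheduled <= ' . $currentTime;
    }
    // Future entries are duplicates only if they share the exact schedule time.
    return 'scheduled = ' . $tstamp;
}

$past = duplicateWhereSketch(900, 1000, false);   // 'scheduled <= 1000'
$future = duplicateWhereSketch(2000, 1000, true); // 'scheduled = 2000'
```

    The real query additionally restricts the result to unprocessed entries of the same page with the same parameters_hash.
    */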
1208
1209
    /**
1210
     * Returns the current system time
1211
     *
1212
     * @return int
1213
     */
1214
    public function getCurrentTime()
1215
    {
1216
        return time();
1217
    }
1218
1219
    /************************************
1220
     *
1221
     * URL reading
1222
     *
1223
     ************************************/
1224
1225
    /**
1226
     * Read URL for single queue entry
1227
     *
1228
     * @param integer $queueId
1229
     * @param boolean $force If set, will process even if exec_time has been set!
1230
     * @return integer
1231
     */
1232
    public function readUrl($queueId, $force = false)
1233
    {
1234
        $ret = 0;
1235
        if ($this->debugMode) {
1236
            \TYPO3\CMS\Core\Utility\GeneralUtility::devlog('crawler-readurl start ' . microtime(true), __FUNCTION__);
1237
        }
1238
        // Get entry:
1239
        list($queueRec) = $this->db->exec_SELECTgetRows(
1240
            '*',
1241
            'tx_crawler_queue',
1242
            'qid=' . intval($queueId) . ($force ? '' : ' AND exec_time=0 AND process_scheduled > 0')
1243
        );
1244
1245
        if (!is_array($queueRec)) {
1246
            return $ret;
1247
        }
1248
1249
        $parameters = unserialize($queueRec['parameters']);
1250
        if ($parameters['rootTemplatePid']) {
1251
            $this->initTSFE((int)$parameters['rootTemplatePid']);
1252
        } else {
1253
            \TYPO3\CMS\Core\Utility\GeneralUtility::sysLog(
1254
                'Page with (' . $queueRec['page_id'] . ') could not be crawled, please check your crawler configuration. Perhaps no Root Template Pid is set',
1255
                'crawler',
1256
                \TYPO3\CMS\Core\Utility\GeneralUtility::SYSLOG_SEVERITY_WARNING
1257
            );
1258
        }
1259
1260
        \AOE\Crawler\Utility\SignalSlotUtility::emitSignal(
1261
            __CLASS__,
1262
            \AOE\Crawler\Utility\SignalSlotUtility::SIGNNAL_QUEUEITEM_PREPROCESS,
1263
            [$queueId, &$queueRec]
1264
        );
1265
1266
        // Set exec_time to lock record:
1267
        $field_array = ['exec_time' => $this->getCurrentTime()];
1268
1269
        if (isset($this->processID)) {
1270
            // If multiprocessing is used, we need to store the ID of the process that handled this entry
1271
            $field_array['process_id_completed'] = $this->processID;
1272
        }
1273
        $this->db->exec_UPDATEquery('tx_crawler_queue', 'qid=' . intval($queueId), $field_array);
1274
1275
        $result = $this->readUrl_exec($queueRec);
1276
        $resultData = unserialize($result['content']);
1277
1278
        // At the moment there is no need to point to specific pollable extensions
1279
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pollSuccess'])) {
1280
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pollSuccess'] as $pollable) {
1281
                // only check the success value if the instruction is running
1282
                // it is important to name the pollSuccess key the same as the procInstructions key
1283
                if (is_array($resultData['parameters']['procInstructions']) && in_array(
1284
                    $pollable,
1285
                        $resultData['parameters']['procInstructions']
1286
                )
1287
                ) {
1288
                    if (!empty($resultData['success'][$pollable])) {
1289
                        $ret |= self::CLI_STATUS_POLLABLE_PROCESSED;
1290
                    }
1291
                }
1292
            }
1293
        }
1294
1295
        // Set result in log which also denotes the end of the processing of this entry.
1296
        $field_array = ['result_data' => serialize($result)];
1297
1298
        \AOE\Crawler\Utility\SignalSlotUtility::emitSignal(
1299
            __CLASS__,
1300
            \AOE\Crawler\Utility\SignalSlotUtility::SIGNNAL_QUEUEITEM_POSTPROCESS,
1301
            [$queueId, &$field_array]
1302
        );
1303
1304
        $this->db->exec_UPDATEquery('tx_crawler_queue', 'qid=' . intval($queueId), $field_array);
1305
1306
        if ($this->debugMode) {
1307
            \TYPO3\CMS\Core\Utility\GeneralUtility::devlog('crawler-readurl stop ' . microtime(true), __FUNCTION__);
1308
        }
1309
1310
        return $ret;
1311
    }
1312
1313
    /**
1314
     * Read URL for not-yet-inserted log-entry
1315
     *
1316
     * @param integer $field_array Queue field array,
1317
     * @return string
1318
     */
1319
    public function readUrlFromArray($field_array)
1320
    {
1321
1322
        // Set exec_time to lock record:
1323
        $field_array['exec_time'] = $this->getCurrentTime();
1324
        $this->db->exec_INSERTquery('tx_crawler_queue', $field_array);
1325
        $queueId = $field_array['qid'] = $this->db->sql_insert_id();
1326
1327
        $result = $this->readUrl_exec($field_array);
Bug introduced by
$field_array of type integer is incompatible with the type array expected by parameter $queueRec of tx_crawler_lib::readUrl_exec(). ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type  annotation

1327
        $result = $this->readUrl_exec(/** @scrutinizer ignore-type */ $field_array);
1328
1329
        // Set result in log which also denotes the end of the processing of this entry.
1330
        $field_array = ['result_data' => serialize($result)];
1331
1332
        \AOE\Crawler\Utility\SignalSlotUtility::emitSignal(
1333
            __CLASS__,
1334
            \AOE\Crawler\Utility\SignalSlotUtility::SIGNNAL_QUEUEITEM_POSTPROCESS,
1335
            [$queueId, &$field_array]
1336
        );
1337
1338
        $this->db->exec_UPDATEquery('tx_crawler_queue', 'qid=' . intval($queueId), $field_array);
1339
1340
        return $result;
1341
    }
1342
1343
    /**
1344
     * Read URL for a queue record
1345
     *
1346
     * @param array $queueRec Queue record
1347
     * @return string
1348
     */
1349
    public function readUrl_exec($queueRec)
1350
    {
1351
        // Decode parameters:
1352
        $parameters = unserialize($queueRec['parameters']);
1353
        $result = 'ERROR';
1354
        if (is_array($parameters)) {
1355
            if ($parameters['_CALLBACKOBJ']) { // Calling object:
1356
                $objRef = $parameters['_CALLBACKOBJ'];
1357
                $callBackObj = &\TYPO3\CMS\Core\Utility\GeneralUtility::getUserObj($objRef);
1358
                if (is_object($callBackObj)) {
1359
                    unset($parameters['_CALLBACKOBJ']);
1360
                    $result = ['content' => serialize($callBackObj->crawler_execute($parameters, $this))];
1361
                } else {
1362
                    $result = ['content' => 'No object: ' . $objRef];
1363
                }
1364
            } else { // Regular FE request:
1365
1366
                // Prepare:
1367
                $crawlerId = $queueRec['qid'] . ':' . md5($queueRec['qid'] . '|' . $queueRec['set_id'] . '|' . $GLOBALS['TYPO3_CONF_VARS']['SYS']['encryptionKey']);
1368
1369
                // Get result:
1370
                $result = $this->requestUrl($parameters['url'], $crawlerId);
1371
1372
                tx_crawler_domain_events_dispatcher::getInstance()->post('urlCrawled', $queueRec['set_id'], ['url' => $parameters['url'], 'result' => $result]);
1373
            }
1374
        }
1375
1376
        return $result;
Bug Best Practice introduced by
The expression return $result also could return the type array<string,string>|array which is incompatible with the documented return type string.
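One possible way to address this is sketched below, assuming callers only rely on the `content` key: wrap every outcome in the same array shape before it is returned. The function name `normalizeCrawlResult` is hypothetical and not part of the crawler extension.

```php
<?php
/**
 * Hypothetical helper (not part of the crawler extension): wraps any
 * value produced by a request into an array with a 'content' key.
 *
 * @param array|string|bool $result Raw result: array, 'ERROR' string, or false
 * @return array Always an array with at least a 'content' key
 */
function normalizeCrawlResult($result): array
{
    if (is_array($result)) {
        // Already an array: only make sure the 'content' key exists.
        return $result + ['content' => ''];
    }
    if ($result === false) {
        // A failed request is reported the same way as the 'ERROR' default.
        return ['content' => 'ERROR'];
    }
    return ['content' => (string)$result];
}
```

With a single return shape, the documented return type could simply be `array`. Whether existing callers tolerate that change is not verified here.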
1377
    }
1378
1379
    /**
1380
     * Gets the content of a URL.
1381
     *
1382
     * @param string $originalUrl URL to read
1383
     * @param string $crawlerId Crawler ID string (qid + hash to verify)
1384
     * @param integer $timeout Timeout time
1385
     * @param integer $recursion Recursion limiter for 302 redirects
1386
     * @return array
1387
     */
1388
    public function requestUrl($originalUrl, $crawlerId, $timeout = 2, $recursion = 10)
1389
    {
1390
        if (!$recursion) {
1391
            return false;
Bug Best Practice introduced by
The expression return false returns the type false which is incompatible with the documented return type array.
1392
        }
1393
1394
        // Parse URL, checking for scheme:
1395
        $url = parse_url($originalUrl);
1396
1397
        if ($url === false) {
1398
            if (TYPO3_DLOG) {
Bug introduced by
The constant TYPO3_DLOG was not found. Maybe you did not declare it correctly or list all dependencies?
1399
                \TYPO3\CMS\Core\Utility\GeneralUtility::devLog(sprintf('Could not parse_url() for string "%s"', $originalUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1400
            }
1401
            return false;
1402
        }
1403
1404
        if (!in_array($url['scheme'], ['','http','https'])) {
1405
            if (TYPO3_DLOG) {
1406
                \TYPO3\CMS\Core\Utility\GeneralUtility::devLog(sprintf('Scheme does not match for url "%s"', $originalUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1407
            }
1408
            return false;
Bug Best Practice introduced by
The expression return false returns the type false which is incompatible with the documented return type array.
1409
        }
1410
1411
        // direct request
1412
        if ($this->extensionSettings['makeDirectRequests']) {
1413
            $result = $this->sendDirectRequest($originalUrl, $crawlerId);
Bug introduced by
$crawlerId of type string is incompatible with the type integer expected by parameter $crawlerId of tx_crawler_lib::sendDirectRequest(). ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type  annotation

1413
            $result = $this->sendDirectRequest($originalUrl, /** @scrutinizer ignore-type */ $crawlerId);
1414
            return $result;
1415
        }
1416
1417
        $reqHeaders = $this->buildRequestHeaderArray($url, $crawlerId);
1418
1419
        // thanks to Pierrick Caillon for adding proxy support
1420
        $rurl = $url;
1421
1422
        if ($GLOBALS['TYPO3_CONF_VARS']['SYS']['curlUse'] && $GLOBALS['TYPO3_CONF_VARS']['SYS']['curlProxyServer']) {
1423
            $rurl = parse_url($GLOBALS['TYPO3_CONF_VARS']['SYS']['curlProxyServer']);
1424
            $url['path'] = $url['scheme'] . '://' . $url['host'] . ($url['port'] > 0 ? ':' . $url['port'] : '') . $url['path'];
1425
            $reqHeaders = $this->buildRequestHeaderArray($url, $crawlerId);
1426
        }
1427
1428
        $host = $rurl['host'];
1429
1430
        if ($url['scheme'] == 'https') {
1431
            $host = 'ssl://' . $host;
1432
            $port = ($rurl['port'] > 0) ? $rurl['port'] : 443;
1433
        } else {
1434
            $port = ($rurl['port'] > 0) ? $rurl['port'] : 80;
1435
        }
1436
1437
        $startTime = microtime(true);
1438
        $fp = fsockopen($host, $port, $errno, $errstr, $timeout);
1439
1440
        if (!$fp) {
1441
            if (TYPO3_DLOG) {
1442
                \TYPO3\CMS\Core\Utility\GeneralUtility::devLog(sprintf('Error while opening "%s"', $originalUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1443
            }
1444
            return false;
Bug Best Practice introduced by
The expression return false returns the type false which is incompatible with the documented return type array.
1445
        } else {
1446
            // Request message:
1447
            $msg = implode("\r\n", $reqHeaders) . "\r\n\r\n";
1448
            fputs($fp, $msg);
Bug introduced by
The call to fputs() has too few arguments starting with length. ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-call  annotation

1448
            /** @scrutinizer ignore-call */ 
1449
            fputs($fp, $msg);

This check compares calls to functions or methods with their respective definitions. If the call has less arguments than are defined, it raises an issue.

If a function is defined several times with a different number of parameters, the check may pick up the wrong definition and report false positives. One codebase where this has been known to happen is Wordpress. Please note the @ignore annotation hint above.
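In PHP, `fputs()` is an alias of `fwrite()`, whose length argument is optional, so this report is probably a false positive of the kind described above. A more useful hardening might be to check how many bytes were actually written. The sketch below assumes a hypothetical helper `writeAll()` and uses an in-memory stream in place of the socket.

```php
<?php
/**
 * Hypothetical helper: writes the whole string to a stream and reports
 * failure instead of silently ignoring a short or failed write.
 *
 * @param resource $stream Open, writable stream
 * @param string $data Bytes to write
 * @return bool True if every byte was written
 */
function writeAll($stream, string $data): bool
{
    $remaining = $data;
    while ($remaining !== '') {
        $written = fwrite($stream, $remaining);
        if ($written === false || $written === 0) {
            return false;
        }
        // Keep only the part that has not been written yet.
        $remaining = (string)substr($remaining, $written);
    }
    return true;
}

// Usage with an in-memory stream standing in for the socket:
$stream = fopen('php://memory', 'w+');
$ok = writeAll($stream, "GET / HTTP/1.0\r\nHost: example.org\r\n\r\n");
rewind($stream);
$echoed = stream_get_contents($stream);
fclose($stream);
```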

1449
1450
            // Read response:
1451
            $d = $this->getHttpResponseFromStream($fp);
1452
            fclose($fp);
1453
1454
            $time = microtime(true) - $startTime;
1455
            $this->log($originalUrl . ' ' . $time);
1456
1457
            // Implode content and headers:
1458
            $result = [
1459
                'request' => $msg,
1460
                'headers' => implode('', $d['headers']),
1461
                'content' => implode('', (array)$d['content'])
1462
            ];
1463
1464
            if (($this->extensionSettings['follow30x']) && ($newUrl = $this->getRequestUrlFrom302Header($d['headers'], $url['user'], $url['pass']))) {
1465
                $newRequestUrl = $this->requestUrl($newUrl, $crawlerId, $timeout, --$recursion);
1467
1468
                if (is_array($newRequestUrl)) {
1469
                    $result = array_merge(['parentRequest' => $result], $newRequestUrl);
1470
                } else {
1471
                    if (TYPO3_DLOG) {
1472
                        \TYPO3\CMS\Core\Utility\GeneralUtility::devLog(sprintf('Error while opening "%s"', $newUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1473
                    }
1474
                    return false;
1475
                }
1476
            }
1477
1478
            return $result;
1479
        }
1480
    }
1481
1482
    /**
1483
     * Gets the base path of the website frontend.
1484
     * (e.g. if you call http://mydomain.com/cms/index.php in
1485
     * the browser the base path is "/cms/")
1486
     *
1487
     * @return string Base path of the website frontend
1488
     */
1489
    protected function getFrontendBasePath()
1490
    {
1491
        $frontendBasePath = '/';
1492
1493
        // Get the path from the extension settings:
1494
        if (isset($this->extensionSettings['frontendBasePath']) && $this->extensionSettings['frontendBasePath']) {
1495
            $frontendBasePath = $this->extensionSettings['frontendBasePath'];
1496
            // If empty, try to use config.absRefPrefix:
1497
        } elseif (isset($GLOBALS['TSFE']->absRefPrefix) && !empty($GLOBALS['TSFE']->absRefPrefix)) {
1498
            $frontendBasePath = $GLOBALS['TSFE']->absRefPrefix;
1499
            // If not in CLI mode the base path can be determined from $_SERVER environment:
1500
        } elseif (!defined('TYPO3_REQUESTTYPE_CLI') || !TYPO3_REQUESTTYPE_CLI) {
Bug introduced by
The constant TYPO3_REQUESTTYPE_CLI was not found. Maybe you did not declare it correctly or list all dependencies?
1501
            $frontendBasePath = \TYPO3\CMS\Core\Utility\GeneralUtility::getIndpEnv('TYPO3_SITE_PATH');
1502
        }
1503
1504
        // Base path must be '/<pathSegments>/':
1505
        if ($frontendBasePath != '/') {
1506
            $frontendBasePath = '/' . ltrim($frontendBasePath, '/');
1507
            $frontendBasePath = rtrim($frontendBasePath, '/') . '/';
1508
        }
1509
1510
        return $frontendBasePath;
1511
    }
1512
1513
    /**
1514
     * Executes a shell command and returns the outputted result.
1515
     *
1516
     * @param string $command Shell command to be executed
1517
     * @return string Outputted result of the command execution
1518
     */
1519
    protected function executeShellCommand($command)
1520
    {
1521
        $result = shell_exec($command);
1522
        return $result;
1523
    }
1524
1525
    /**
1526
     * Reads HTTP response from the given stream.
1527
     *
1528
     * @param  resource $streamPointer  Pointer to connection stream.
1529
     * @return array                    Associative array with the following items:
1530
     *                                  headers <array> Response headers sent by server.
1531
     *                                  content <array> Content, with each line as an array item.
1532
     */
1533
    protected function getHttpResponseFromStream($streamPointer)
1534
    {
1535
        $response = ['headers' => [], 'content' => []];
1536
1537
        if (is_resource($streamPointer)) {
1538
            // read headers
1539
            while (($line = fgets($streamPointer, 2048)) !== false) {
1540
                $line = trim($line);
1541
                if ($line !== '') {
1542
                    $response['headers'][] = $line;
1543
                } else {
1544
                    break;
1545
                }
1546
            }
1547
1548
            // read content
1549
            while (($line = fgets($streamPointer, 2048)) !== false) {
1550
                $response['content'][] = $line;
1551
            }
1552
        }
1553
1554
        return $response;
1555
    }
1556
1557
    /**
1558
     * @param string $message Message to append to the log file
1559
     */
1560
    protected function log($message)
1561
    {
1562
        if (!empty($this->extensionSettings['logFileName'])) {
1563
            @file_put_contents($this->extensionSettings['logFileName'], date('Ymd His') . ' ' . $message . PHP_EOL, FILE_APPEND);
Security Best Practice introduced by
It seems like you do not handle an error condition for file_put_contents(). This can introduce security issues, and is generally not recommended. ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-unhandled  annotation

1563
            /** @scrutinizer ignore-unhandled */ @file_put_contents($this->extensionSettings['logFileName'], date('Ymd His') . ' ' . $message . PHP_EOL, FILE_APPEND);

If you suppress an error, we recommend checking for the error condition explicitly:

// For example instead of
@mkdir($dir);

// Better use
if (@mkdir($dir) === false) {
    throw new \RuntimeException('The directory '.$dir.' could not be created.');
}
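Applied to the logging call flagged here, the same pattern might look like the sketch below. It is a stand-alone illustration rather than the extension's own method: the function name `appendLogLine` and the choice to throw a `\RuntimeException` are assumptions.

```php
<?php
/**
 * Hypothetical stand-alone variant of the log() method.
 *
 * @param string $logFileName Target file (assumed to be writable)
 * @param string $message Text to append
 * @throws \RuntimeException If the line cannot be written
 */
function appendLogLine(string $logFileName, string $message): void
{
    $line = date('Ymd His') . ' ' . $message . PHP_EOL;
    // The warning is still suppressed, but the failure is no longer ignored.
    if (@file_put_contents($logFileName, $line, FILE_APPEND) === false) {
        throw new \RuntimeException('Could not write to log file ' . $logFileName);
    }
}

// Usage with a temporary file:
$logFile = tempnam(sys_get_temp_dir(), 'crawler');
appendLogLine($logFile, 'http://example.org/ 0.12');
$logged = file_get_contents($logFile);
unlink($logFile);
```

Whether a crawler run should abort on a failed log write is a design decision. Returning `false` or logging through another channel would be equally valid.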
Bug introduced by
Are you sure date('Ymd His') of type false|string can be used in concatenation? ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type  annotation

1563
            @file_put_contents($this->extensionSettings['logFileName'], /** @scrutinizer ignore-type */ date('Ymd His') . ' ' . $message . PHP_EOL, FILE_APPEND);
1564
        }
1565
    }
1566
1567
    /**
1568
     * Builds HTTP request headers.
1569
     *
1570
     * @param array $url
1571
     * @param string $crawlerId
1572
     *
1573
     * @return array
1574
     */
1575
    protected function buildRequestHeaderArray(array $url, $crawlerId)
1576
    {
1577
        $reqHeaders = [];
1578
        $reqHeaders[] = 'GET ' . $url['path'] . ($url['query'] ? '?' . $url['query'] : '') . ' HTTP/1.0';
1579
        $reqHeaders[] = 'Host: ' . $url['host'];
1580
        if (stristr($url['query'], 'ADMCMD_previewWS')) {
1581
            $reqHeaders[] = 'Cookie: $Version="1"; be_typo_user="1"; $Path=/';
1582
        }
1583
        $reqHeaders[] = 'Connection: close';
1584
        if ($url['user'] != '') {
1585
            $reqHeaders[] = 'Authorization: Basic ' . base64_encode($url['user'] . ':' . $url['pass']);
1586
        }
1587
        $reqHeaders[] = 'X-T3crawler: ' . $crawlerId;
1588
        $reqHeaders[] = 'User-Agent: TYPO3 crawler';
1589
        return $reqHeaders;
1590
    }
1591
1592
    /**
1593
     * Checks whether the submitted HTTP header contains a redirect location and builds the new crawler URL
1594
     *
1595
     * @param array $headers HTTP Header
1596
     * @param string $user HTTP Auth. User
1597
     * @param string $pass HTTP Auth. Password
1598
     * @return string
1599
     */
1600
    protected function getRequestUrlFrom302Header($headers, $user = '', $pass = '')
1601
    {
1602
        if (!is_array($headers)) {
1603
            return false;
1604
        }
1605
        if (!(stristr($headers[0], '301 Moved') || stristr($headers[0], '302 Found') || stristr($headers[0], '302 Moved'))) {
1606
            return false;
Bug Best Practice introduced by
The expression return false returns the type false which is incompatible with the documented return type string.
1607
        }
1608
1609
        foreach ($headers as $hl) {
1610
            $tmp = explode(": ", $hl);
1611
            $header[trim($tmp[0])] = trim($tmp[1]);
1612
            if (trim($tmp[0]) == 'Location') {
1613
                break;
1614
            }
1615
        }
1616
        if (!array_key_exists('Location', $header)) {
Comprehensibility Best Practice introduced by
The variable $header seems to be defined by a foreach iteration on line 1609. Are you sure the iterator is never empty, otherwise this variable is not defined?
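A sketch of one way to avoid the undefined variable: search for the header directly and return `null` when nothing is found. The helper name `findLocationHeader` is hypothetical. Unlike the original loop, it splits on the first colon only and compares the header name case-insensitively, since HTTP header names are case-insensitive.

```php
<?php
/**
 * Hypothetical helper: returns the value of the Location header,
 * or null if the response headers contain none.
 *
 * @param array $headers Response header lines, status line first
 * @return string|null
 */
function findLocationHeader(array $headers)
{
    foreach ($headers as $headerLine) {
        // Split on the first colon only, so URLs in the value stay intact.
        $parts = explode(':', $headerLine, 2);
        if (count($parts) === 2 && strcasecmp(trim($parts[0]), 'Location') === 0) {
            return trim($parts[1]);
        }
    }
    // Also reached for an empty header list, so nothing is left undefined.
    return null;
}
```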
1617
            return false;
Bug Best Practice introduced by
The expression return false returns the type false which is incompatible with the documented return type string.
1618
        }
1619
1620
        if ($user != '') {
1621
            if (!($tmp = parse_url($header['Location']))) {
1622
                return false;
Bug Best Practice introduced by
The expression return false returns the type false which is incompatible with the documented return type string.
1623
            }
1624
            $newUrl = $tmp['scheme'] . '://' . $user . ':' . $pass . '@' . $tmp['host'] . $tmp['path'];
1625
            if ($tmp['query'] != '') {
1626
                $newUrl .= '?' . $tmp['query'];
1627
            }
1628
        } else {
1629
            $newUrl = $header['Location'];
1630
        }
1631
        return $newUrl;
1632
    }
1633
1634
    /**************************
1635
     *
1636
     * tslib_fe hooks:
1637
     *
1638
     **************************/
1639
1640
    /**
1641
     * Initialization hook (called after database connection)
1642
     * Takes the "HTTP_X_T3CRAWLER" header, looks up the queue record and verifies that the request comes from the system (by comparing hashes)
1643
     *
1644
     * @param array $params Parameters from frontend
1645
     * @param object $ref TSFE object (reference under PHP5)
1646
     * @return void
1647
     */
1648
    public function fe_init(&$params, $ref)
Unused Code introduced by
The parameter $ref is not used and could be removed. ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-unused  annotation

1648
    public function fe_init(&$params, /** @scrutinizer ignore-unused */ $ref)

This check looks for parameters that have been defined for a function or method, but which are not used in the method body.

1649
    {
1650
1651
        // Authenticate crawler request:
1652
        if (isset($_SERVER['HTTP_X_T3CRAWLER'])) {
1653
            list($queueId, $hash) = explode(':', $_SERVER['HTTP_X_T3CRAWLER']);
1654
            list($queueRec) = $this->db->exec_SELECTgetRows('*', 'tx_crawler_queue', 'qid=' . intval($queueId));
1655
1656
            // If a crawler record was found and hash was matching, set it up:
1657
            if (is_array($queueRec) && $hash === md5($queueRec['qid'] . '|' . $queueRec['set_id'] . '|' . $GLOBALS['TYPO3_CONF_VARS']['SYS']['encryptionKey'])) {
1658
                $params['pObj']->applicationData['tx_crawler']['running'] = true;
1659
                $params['pObj']->applicationData['tx_crawler']['parameters'] = unserialize($queueRec['parameters']);
1660
                $params['pObj']->applicationData['tx_crawler']['log'] = [];
1661
            } else {
1662
                die('No crawler entry found!');
Best Practice introduced by
Using exit here is not recommended.

In general, usage of exit should be done with care and only when running in a scripting context like a CLI script.
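One way to get rid of the `die()` call is to move the hash check into a function that only returns a boolean and let the caller decide how to react, for example by throwing an exception or sending an error response. The sketch below is an assumption-laden illustration: `isValidCrawlerHeader` is a hypothetical name, and the use of `hash_equals()` (available since PHP 5.6) for a constant-time comparison is a suggested change, not what the code above does.

```php
<?php
/**
 * Hypothetical helper: checks the "qid:hash" value of the X-T3crawler header.
 *
 * @param string $headerValue Raw header value, expected as "<qid>:<md5 hash>"
 * @param array $queueRec Queue record with the keys 'qid' and 'set_id'
 * @param string $encryptionKey Secret that was used when the hash was built
 * @return bool True if the hash matches the queue record
 */
function isValidCrawlerHeader(string $headerValue, array $queueRec, string $encryptionKey): bool
{
    $parts = explode(':', $headerValue, 2);
    if (count($parts) !== 2) {
        return false;
    }
    $expected = md5($queueRec['qid'] . '|' . $queueRec['set_id'] . '|' . $encryptionKey);
    // hash_equals() compares both strings in constant time.
    return hash_equals($expected, $parts[1]);
}

// Usage with made-up values:
$rec = ['qid' => 42, 'set_id' => 7];
$header = '42:' . md5('42|7|secret');
$valid = isValidCrawlerHeader($header, $rec, 'secret');
```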

1663
            }
1664
        }
1665
    }
1666
1667
    /*****************************
1668
     *
1669
     * Compiling URLs to crawl - tools
1670
     *
1671
     *****************************/
1672
1673
    /**
1674
     * @param integer $id Root page id to start from.
1675
     * @param integer $depth Depth of tree, 0 = only id-page, 1 = one sublevel, 99 = infinite
1676
     * @param integer $scheduledTime Unix Time when the URL is timed to be visited when put in queue
1677
     * @param integer $reqMinute Number of requests per minute (creates the interleave between requests)
1678
     * @param boolean $submitCrawlUrls If set, submits the URLs to queue in database (real crawling)
1679
     * @param boolean $downloadCrawlUrls If set (and $submitCrawlUrls is false), fills $downloadUrls with entries
1680
     * @param array $incomingProcInstructions Array of processing instructions
1681
     * @param array $configurationSelection Array of configuration keys
1682
     * @return string
1683
     */
1684
    public function getPageTreeAndUrls(
1685
        $id,
1686
        $depth,
1687
        $scheduledTime,
1688
        $reqMinute,
1689
        $submitCrawlUrls,
1690
        $downloadCrawlUrls,
1691
        array $incomingProcInstructions,
1692
        array $configurationSelection
1693
    ) {
1694
        global $BACK_PATH;
1695
        global $LANG;
1696
        if (!is_object($LANG)) {
1697
            $LANG = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('language');
1698
            $LANG->init(0);
1699
        }
1700
        $this->scheduledTime = $scheduledTime;
Bug Best Practice introduced by
The property scheduledTime does not exist. Although not strictly required by PHP, it is generally a best practice to declare properties explicitly.
1701
        $this->reqMinute = $reqMinute;
Bug Best Practice introduced by
The property reqMinute does not exist. Although not strictly required by PHP, it is generally a best practice to declare properties explicitly.
1702
        $this->submitCrawlUrls = $submitCrawlUrls;
Bug Best Practice introduced by
The property submitCrawlUrls does not exist. Although not strictly required by PHP, it is generally a best practice to declare properties explicitly.
1703
        $this->downloadCrawlUrls = $downloadCrawlUrls;
Bug Best Practice introduced by
The property downloadCrawlUrls does not exist. Although not strictly required by PHP, it is generally a best practice to declare properties explicitly.
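The four property reports above could be resolved by declaring the properties on the class. The sketch below uses a reduced, hypothetical stand-in class. The types and defaults are inferred from the `@param` tags of `getPageTreeAndUrls()` and are assumptions. Creating undeclared properties dynamically is also deprecated as of PHP 8.2.

```php
<?php
/**
 * Hypothetical, reduced stand-in for the crawler class; only the
 * four flagged properties are shown.
 */
class CrawlerSettingsSketch
{
    /** @var int Unix time at which queued URLs should be visited */
    public $scheduledTime = 0;

    /** @var int Number of requests per minute */
    public $reqMinute = 0;

    /** @var bool Whether URLs are submitted to the queue */
    public $submitCrawlUrls = false;

    /** @var bool Whether URLs are collected for download */
    public $downloadCrawlUrls = false;
}

// Usage: assignments now hit declared properties.
$sketch = new CrawlerSettingsSketch();
$sketch->scheduledTime = 1500000000;
```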
1704
        $this->incomingProcInstructions = $incomingProcInstructions;
1705
        $this->incomingConfigurationSelection = $configurationSelection;
1706
1707
        $this->duplicateTrack = [];
1708
        $this->downloadUrls = [];
1709
1710
        // Drawing tree:
1711
        /* @var $tree \TYPO3\CMS\Backend\Tree\View\PageTreeView */
1712
        $tree = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Backend\Tree\View\PageTreeView');
1713
        $perms_clause = $GLOBALS['BE_USER']->getPagePermsClause(1);
1714
        $tree->init('AND ' . $perms_clause);
1715
1716
        $pageinfo = \TYPO3\CMS\Backend\Utility\BackendUtility::readPageAccess($id, $perms_clause);
1717
1718
        // Set root row:
1719
        $tree->tree[] = [
1720
            'row' => $pageinfo,
1721
            'HTML' => \AOE\Crawler\Utility\IconUtility::getIconForRecord('pages', $pageinfo)
1722
        ];
1723
1724
        // Get branch beneath:
1725
        if ($depth) {
1726
            $tree->getTree($id, $depth, '');
1727
        }
1728
1729
        // Traverse page tree:
1730
        $code = '';
1731
1732
        foreach ($tree->tree as $data) {
1733
            $this->MP = false;
1734
1735
            // recognize mount points
1736
            if ($data['row']['doktype'] == 7) {
1737
                $mountpage = $this->db->exec_SELECTgetRows('*', 'pages', 'uid = ' . intval($data['row']['uid']));
1738
1739
                // fetch mounted pages
1740
                $this->MP = $mountpage[0]['mount_pid'] . '-' . $data['row']['uid'];
Documentation Bug introduced by
The property $MP was declared of type boolean, but $mountpage[0]['mount_pid...' . $data['row']['uid'] is of type string. Maybe add a type cast?

This check looks for assignments to scalar types that may be of the wrong type.

To ensure the code behaves as expected, it may be a good idea to add an explicit type cast.

$answer = 42;

$correct = false;

$correct = (bool) $answer;
1741
1742
                $mountTree = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Backend\Tree\View\PageTreeView');
1743
                $mountTree->init('AND ' . $perms_clause);
1744
                $mountTree->getTree($mountpage[0]['mount_pid'], $depth, '');
1745
1746
                foreach ($mountTree->tree as $mountData) {
1747
                    $code .= $this->drawURLs_addRowsForPage(
1748
                        $mountData['row'],
1749
                        $mountData['HTML'] . \TYPO3\CMS\Backend\Utility\BackendUtility::getRecordTitle('pages', $mountData['row'], true)
1750
                    );
1751
                }
1752
1753
                // replace page when mount_pid_ol is enabled
1754
                if ($mountpage[0]['mount_pid_ol']) {
1755
                    $data['row']['uid'] = $mountpage[0]['mount_pid'];
1756
                } else {
1757
                    // if the mount_pid_ol is not set the MP must not be used for the mountpoint page
1758
                    $this->MP = false;
1759
                }
1760
            }
1761
1762
            $code .= $this->drawURLs_addRowsForPage(
1763
                $data['row'],
1764
                $data['HTML'] . \TYPO3\CMS\Backend\Utility\BackendUtility::getRecordTitle('pages', $data['row'], true)
1765
            );
1766
        }
1767
1768
        return $code;
1769
    }
1770
1771
    /**
1772
     * Expands exclude string
1773
     *
1774
     * @param string $excludeString Exclude string
1775
     * @return array
1776
     */
1777
    public function expandExcludeString($excludeString)
1778
    {
1779
        // internal static caches;
1780
        static $expandedExcludeStringCache;
1781
        static $treeCache;
1782
1783
        if (empty($expandedExcludeStringCache[$excludeString])) {
1784
            $pidList = [];
1785
1786
            if (!empty($excludeString)) {
1787
                /* @var $tree \TYPO3\CMS\Backend\Tree\View\PageTreeView */
1788
                $tree = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Backend\Tree\View\PageTreeView');
1789
                $tree->init('AND ' . $this->backendUser->getPagePermsClause(1));
1790
1791
                $excludeParts = \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $excludeString);
1792
1793
                foreach ($excludeParts as $excludePart) {
1794
                    list($pid, $depth) = \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode('+', $excludePart);
1795
1796
                    // default is "page only" = "depth=0"
1797
                    if (empty($depth)) {
1798
                        $depth = (stristr($excludePart, '+')) ? 99 : 0;
1799
                    }
1800
1801
                    $pidList[] = $pid;
1802
1803
                    if ($depth > 0) {
1804
                        if (empty($treeCache[$pid][$depth])) {
1805
                            $tree->reset();
1806
                            $tree->getTree($pid, $depth);
1807
                            $treeCache[$pid][$depth] = $tree->tree;
1808
                        }
1809
1810
                        foreach ($treeCache[$pid][$depth] as $data) {
1811
                            $pidList[] = $data['row']['uid'];
1812
                        }
1813
                    }
1814
                }
1815
            }
1816
1817
            $expandedExcludeStringCache[$excludeString] = array_unique($pidList);
1818
        }
1819
1820
        return $expandedExcludeStringCache[$excludeString];
1821
    }
1822
1823
    /**
1824
     * Create the rows for display of the page tree
1825
     * For each page a number of rows are shown displaying GET variable configuration
1826
     *
1827
     * @param    array        $pageRow Page row
1828
     * @param    string        $pageTitleAndIcon Page icon and title for row
1829
     * @return    string        HTML <tr> content (one or more)
1830
     */
    public function drawURLs_addRowsForPage(array $pageRow, $pageTitleAndIcon)
    {
        $skipMessage = '';

        // Get list of configurations
        $configurations = $this->getUrlsForPageRow($pageRow, $skipMessage);

        if (count($this->incomingConfigurationSelection) > 0) {
            // Remove configurations that do not match the current selection
            foreach ($configurations as $confKey => $confArray) {
                if (!in_array($confKey, $this->incomingConfigurationSelection)) {
                    unset($configurations[$confKey]);
                }
            }
        }

        // Traverse parameter combinations:
        $c = 0;
        $content = '';
        if (count($configurations)) {
            foreach ($configurations as $confKey => $confArray) {

                // Title column:
                if (!$c) {
                    $titleClm = '<td rowspan="' . count($configurations) . '">' . $pageTitleAndIcon . '</td>';
                } else {
                    $titleClm = '';
                }

                if (!in_array($pageRow['uid'], $this->expandExcludeString($confArray['subCfg']['exclude']))) {

                    // URL list:
                    $urlList = $this->urlListFromUrlArray(
                        $confArray,
                        $pageRow,
                        $this->scheduledTime,
                        $this->reqMinute,
                        $this->submitCrawlUrls,
                        $this->downloadCrawlUrls,
                        $this->duplicateTrack,
                        $this->downloadUrls,
                        $this->incomingProcInstructions // if empty the urls won't be filtered by processing instructions
                    );

                    // Expanded parameters:
                    $paramExpanded = '';
                    $calcAccu = [];
                    $calcRes = 1;
                    foreach ($confArray['paramExpanded'] as $gVar => $gVal) {
                        $paramExpanded .= '
                            <tr>
                                <td class="bgColor4-20">' . htmlspecialchars('&' . $gVar . '=') . '<br/>' .
                                                '(' . count($gVal) . ')' .
                                                '</td>
                                <td class="bgColor4" nowrap="nowrap">' . nl2br(htmlspecialchars(implode(chr(10), $gVal))) . '</td>
                            </tr>
                        ';
                        $calcRes *= count($gVal);
                        $calcAccu[] = count($gVal);
                    }
                    $paramExpanded = '<table class="lrPadding c-list param-expanded">' . $paramExpanded . '</table>';
                    $paramExpanded .= 'Comb: ' . implode('*', $calcAccu) . '=' . $calcRes;

                    // Options
                    $optionValues = '';
                    if ($confArray['subCfg']['userGroups']) {
                        $optionValues .= 'User Groups: ' . $confArray['subCfg']['userGroups'] . '<br/>';
                    }
                    if ($confArray['subCfg']['baseUrl']) {
                        $optionValues .= 'Base Url: ' . $confArray['subCfg']['baseUrl'] . '<br/>';
                    }
                    if ($confArray['subCfg']['procInstrFilter']) {
                        $optionValues .= 'ProcInstr: ' . $confArray['subCfg']['procInstrFilter'] . '<br/>';
                    }

                    // Compile row:
                    $content .= '
                        <tr class="bgColor' . ($c % 2 ? '-20' : '-10') . '">
                            ' . $titleClm . '
                            <td>' . htmlspecialchars($confKey) . '</td>
                            <td>' . nl2br(htmlspecialchars(rawurldecode(trim(str_replace('&', chr(10) . '&', \TYPO3\CMS\Core\Utility\GeneralUtility::implodeArrayForUrl('', $confArray['paramParsed'])))))) . '</td>
                            <td>' . $paramExpanded . '</td>
                            <td nowrap="nowrap">' . $urlList . '</td>
                            <td nowrap="nowrap">' . $optionValues . '</td>
                            <td nowrap="nowrap">' . \TYPO3\CMS\Core\Utility\DebugUtility::viewArray($confArray['subCfg']['procInstrParams.']) . '</td>
                        </tr>';
                } else {
                    $content .= '<tr class="bgColor' . ($c % 2 ? '-20' : '-10') . '">
                            ' . $titleClm . '
                            <td>' . htmlspecialchars($confKey) . '</td>
                            <td colspan="5"><em>No entries</em> (Page is excluded in this configuration)</td>
                        </tr>';
                }

                $c++;
            }
        } else {
            $message = !empty($skipMessage) ? ' (' . $skipMessage . ')' : '';

            // Compile row:
            $content .= '
                <tr class="bgColor-20" style="border-bottom: 1px solid black;">
                    <td>' . $pageTitleAndIcon . '</td>
                    <td colspan="6"><em>No entries</em>' . $message . '</td>
                </tr>';
        }

        return $content;
    }

    /**
     * @return int
     */
    public function getUnprocessedItemsCount()
    {
        $res = $this->db->exec_SELECTquery(
            'count(*) as num',
            'tx_crawler_queue',
            'exec_time=0 AND process_scheduled=0 AND scheduled<=' . $this->getCurrentTime()
        );

        $count = $this->db->sql_fetch_assoc($res);
        return $count['num'];
    }

    /*****************************
     *
     * CLI functions
     *
     *****************************/

    /**
     * Main function for running from Command Line PHP script (cron job)
     * See ext/crawler/cli/crawler_cli.phpsh for details
     *
     * @return int Bitmask of CLI_STATUS_* flags describing the outcome of the run
     */
    public function CLI_main()
    {
        $this->setAccessMode('cli');
        $result = self::CLI_STATUS_NOTHING_PROCCESSED;
        $cliObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('tx_crawler_cli');

        if (isset($cliObj->cli_args['-h']) || isset($cliObj->cli_args['--help'])) {
            $cliObj->cli_validateArgs();
            $cliObj->cli_help();
            exit;
        }

        if (!$this->getDisabled() && $this->CLI_checkAndAcquireNewProcess($this->CLI_buildProcessId())) {
            $countInARun = $cliObj->cli_argValue('--countInARun') ? intval($cliObj->cli_argValue('--countInARun')) : $this->extensionSettings['countInARun'];
            // Seconds
            $sleepAfterFinish = $cliObj->cli_argValue('--sleepAfterFinish') ? intval($cliObj->cli_argValue('--sleepAfterFinish')) : $this->extensionSettings['sleepAfterFinish'];
            // Milliseconds
            $sleepTime = $cliObj->cli_argValue('--sleepTime') ? intval($cliObj->cli_argValue('--sleepTime')) : $this->extensionSettings['sleepTime'];

            try {
                // Run process:
                $result = $this->CLI_run($countInARun, $sleepTime, $sleepAfterFinish);
            } catch (Exception $e) {
                $this->CLI_debug(get_class($e) . ': ' . $e->getMessage());
                $result = self::CLI_STATUS_ABORTED;
            }

            // Cleanup
            $this->db->exec_DELETEquery('tx_crawler_process', 'assigned_items_count = 0');

            //TODO can't we do that in a clean way?
            $this->CLI_releaseProcesses($this->CLI_buildProcessId());

            $this->CLI_debug("Unprocessed Items remaining:" . $this->getUnprocessedItemsCount() . " (" . $this->CLI_buildProcessId() . ")");
            $result |= ($this->getUnprocessedItemsCount() > 0 ? self::CLI_STATUS_REMAIN : self::CLI_STATUS_NOTHING_PROCCESSED);
        } else {
            $result |= self::CLI_STATUS_ABORTED;
        }

        return $result;
    }

    /**
     * Function executed by crawler_im.php cli script.
     *
     * @return void
     */
    public function CLI_main_im()
    {
        $this->setAccessMode('cli_im');

        $cliObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('tx_crawler_cli_im');

        // Force user to admin state and set workspace to "Live":
        $this->backendUser->user['admin'] = 1;
        $this->backendUser->setWorkspace(0);

        // Print help
        if (!isset($cliObj->cli_args['_DEFAULT'][1])) {
            $cliObj->cli_validateArgs();
            $cliObj->cli_help();
            exit;
        }

        $cliObj->cli_validateArgs();

        if ($cliObj->cli_argValue('-o') === 'exec') {
            $this->registerQueueEntriesInternallyOnly = true;
        }

        if (isset($cliObj->cli_args['_DEFAULT'][2])) {
            // Crawler is called over TYPO3 BE
            $pageId = \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][2], 0);
        } else {
            // Crawler is called over cli
            $pageId = \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][1], 0);
        }

        $configurationKeys = $this->getConfigurationKeys($cliObj);

        if (!is_array($configurationKeys)) {
            $configurations = $this->getUrlsForPageId($pageId);
            if (is_array($configurations)) {
                $configurationKeys = array_keys($configurations);
            } else {
                $configurationKeys = [];
            }
        }

        if ($cliObj->cli_argValue('-o') === 'queue' || $cliObj->cli_argValue('-o') === 'exec') {
            $reason = new tx_crawler_domain_reason();
            $reason->setReason(tx_crawler_domain_reason::REASON_GUI_SUBMIT);
            $reason->setDetailText('The cli script of the crawler added to the queue');
            tx_crawler_domain_events_dispatcher::getInstance()->post(
                'invokeQueueChange',
                $this->setID,
                ['reason' => $reason]
            );
        }

        if ($this->extensionSettings['cleanUpOldQueueEntries']) {
            $this->cleanUpOldQueueEntries();
        }

        $this->setID = \TYPO3\CMS\Core\Utility\GeneralUtility::md5int(microtime());
        $this->getPageTreeAndUrls(
            $pageId,
            \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($cliObj->cli_argValue('-d'), 0, 99),
            $this->getCurrentTime(),
            \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($cliObj->cli_isArg('-n') ? $cliObj->cli_argValue('-n') : 30, 1, 1000),
            $cliObj->cli_argValue('-o') === 'queue' || $cliObj->cli_argValue('-o') === 'exec',
            $cliObj->cli_argValue('-o') === 'url',
            \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $cliObj->cli_argValue('-proc'), 1),
            $configurationKeys
        );

        if ($cliObj->cli_argValue('-o') === 'url') {
            $cliObj->cli_echo(implode(chr(10), $this->downloadUrls) . chr(10), 1);
        } elseif ($cliObj->cli_argValue('-o') === 'exec') {
            $cliObj->cli_echo("Executing " . count($this->urlList) . " requests right away:\n\n");
            $cliObj->cli_echo(implode(chr(10), $this->urlList) . chr(10));
            $cliObj->cli_echo("\nProcessing:\n");

            foreach ($this->queueEntries as $queueRec) {
                $p = unserialize($queueRec['parameters']);
                $cliObj->cli_echo($p['url'] . ' (' . implode(',', $p['procInstructions']) . ') => ');

                $result = $this->readUrlFromArray($queueRec);

                $requestResult = unserialize($result['content']);
                if (is_array($requestResult)) {
                    $resLog = is_array($requestResult['log']) ? chr(10) . chr(9) . chr(9) . implode(chr(10) . chr(9) . chr(9), $requestResult['log']) : '';
                    $cliObj->cli_echo('OK: ' . $resLog . chr(10));
                } else {
                    $cliObj->cli_echo('Error checking Crawler Result: ' . substr(preg_replace('/\s+/', ' ', strip_tags($result['content'])), 0, 30000) . '...' . chr(10));
                }
            }
        } elseif ($cliObj->cli_argValue('-o') === 'queue') {
            $cliObj->cli_echo("Putting " . count($this->urlList) . " entries in queue:\n\n");
            $cliObj->cli_echo(implode(chr(10), $this->urlList) . chr(10));
        } else {
            $cliObj->cli_echo(count($this->urlList) . " entries found for processing. (Use -o to decide action):\n\n", 1);
            $cliObj->cli_echo(implode(chr(10), $this->urlList) . chr(10), 1);
        }
    }

    /**
     * Function executed by crawler_im.php cli script.
     *
     * @return bool
     */
    public function CLI_main_flush()
    {
        $this->setAccessMode('cli_flush');
        $cliObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('tx_crawler_cli_flush');

        // Force user to admin state and set workspace to "Live":
        $this->backendUser->user['admin'] = 1;
        $this->backendUser->setWorkspace(0);

        // Print help
        if (!isset($cliObj->cli_args['_DEFAULT'][1])) {
            $cliObj->cli_validateArgs();
            $cliObj->cli_help();
            exit;
        }

        $cliObj->cli_validateArgs();
        $pageId = \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][1], 0);
        $fullFlush = ($pageId == 0);

        $mode = $cliObj->cli_argValue('-o');

        switch ($mode) {
            case 'all':
                $result = $this->getLogEntriesForPageId($pageId, '', true, $fullFlush);
                break;
            case 'finished':
            case 'pending':
                $result = $this->getLogEntriesForPageId($pageId, $mode, true, $fullFlush);
                break;
            default:
                $cliObj->cli_validateArgs();
                $cliObj->cli_help();
                $result = false;
        }

        return $result !== false;
    }

    /**
     * Obtains configuration keys from the CLI arguments
     *
     * @param  tx_crawler_cli_im $cliObj    Command line object
     * @return array                        Array of keys, or an empty array if no keys were given
     */
    protected function getConfigurationKeys(tx_crawler_cli_im &$cliObj)
    {
        $parameter = trim($cliObj->cli_argValue('-conf'));
        return ($parameter != '' ? \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $parameter) : []);
    }

    /**
     * Running the functionality of the CLI (crawling URLs from queue)
     *
     * @param int $countInARun
     * @param int $sleepTime
     * @param int $sleepAfterFinish
     * @return int Result status, combined from CLI_STATUS_* flags
     */
    public function CLI_run($countInARun, $sleepTime, $sleepAfterFinish)
    {
        $result = 0;
        $counter = 0;

        // First, run hooks:
        $this->CLI_runHooks();

        // Clean up the queue
        if (intval($this->extensionSettings['purgeQueueDays']) > 0) {
            $purgeDate = $this->getCurrentTime() - 24 * 60 * 60 * intval($this->extensionSettings['purgeQueueDays']);
            $this->db->exec_DELETEquery(
                'tx_crawler_queue',
                'exec_time!=0 AND exec_time<' . $purgeDate
            );
        }

        // Select entries:
        //TODO Shouldn't this reside within the transaction?
        $rows = $this->db->exec_SELECTgetRows(
            'qid,scheduled',
            'tx_crawler_queue',
            'exec_time=0
                AND process_scheduled= 0
                AND scheduled<=' . $this->getCurrentTime(),
            '',
            'scheduled, qid',
            intval($countInARun)
        );

        if (count($rows) > 0) {
            $quidList = [];

            foreach ($rows as $r) {
                $quidList[] = $r['qid'];
            }

            $processId = $this->CLI_buildProcessId();

            // reserve queue entries for process
            $this->db->sql_query('BEGIN');
            //TODO make sure we're not taking assigned queue entries
            $this->db->exec_UPDATEquery(
                'tx_crawler_queue',
                'qid IN (' . implode(',', $quidList) . ')',
                [
                    'process_scheduled' => intval($this->getCurrentTime()),
                    'process_id' => $processId
                ]
            );

            // save the number of assigned queue entries to determine how many have been processed later
            $numberOfAffectedRows = $this->db->sql_affected_rows();
            $this->db->exec_UPDATEquery(
                'tx_crawler_process',
                "process_id = '" . $processId . "'",
                [
                    'assigned_items_count' => intval($numberOfAffectedRows)
                ]
            );

            if ($numberOfAffectedRows == count($quidList)) {
                $this->db->sql_query('COMMIT');
            } else {
                $this->db->sql_query('ROLLBACK');
                $this->CLI_debug("Nothing processed due to multi-process collision (" . $this->CLI_buildProcessId() . ")");
                return ($result | self::CLI_STATUS_ABORTED);
            }

            foreach ($rows as $r) {
                $result |= $this->readUrl($r['qid']);

                $counter++;
                usleep(intval($sleepTime)); // Just to relax the system

                // If the CLI has been disabled between the start and the current readUrl call,
                // return from the function and do NOT mark the process as ended.
                if ($this->getDisabled()) {
                    return ($result | self::CLI_STATUS_ABORTED);
                }

                if (!$this->CLI_checkIfProcessIsActive($this->CLI_buildProcessId())) {
                    $this->CLI_debug("conflict / timeout (" . $this->CLI_buildProcessId() . ")");

                    //TODO might need an additional returncode
                    $result |= self::CLI_STATUS_ABORTED;
                    break; //possible timeout
                }
            }

            sleep(intval($sleepAfterFinish));

            $msg = 'Rows: ' . $counter;
            $this->CLI_debug($msg . " (" . $this->CLI_buildProcessId() . ")");
        } else {
            $this->CLI_debug("Nothing within queue which needs to be processed (" . $this->CLI_buildProcessId() . ")");
        }

        if ($counter > 0) {
            $result |= self::CLI_STATUS_PROCESSED;
        }

        return $result;
    }

    /**
     * Activate hooks
     *
     * @return void
     */
    public function CLI_runHooks()
    {
        global $TYPO3_CONF_VARS;
        if (is_array($TYPO3_CONF_VARS['EXTCONF']['crawler']['cli_hooks'])) {
            foreach ($TYPO3_CONF_VARS['EXTCONF']['crawler']['cli_hooks'] as $objRef) {
                $hookObj = &\TYPO3\CMS\Core\Utility\GeneralUtility::getUserObj($objRef);
                if (is_object($hookObj)) {
                    $hookObj->crawler_init($this);
                }
            }
        }
    }

    /**
     * Try to acquire a new process with the given id
     * also performs some auto-cleanup for orphan processes
     * @todo preemption might not be the most elegant way to clean up
     *
     * @param string $id identification string for the process
     * @return boolean
     */
    public function CLI_checkAndAcquireNewProcess($id)
    {
        $ret = true;

        $systemProcessId = getmypid();
        if ($systemProcessId < 1) {
            return false;
        }

        $processCount = 0;
        $orphanProcesses = [];

        $this->db->sql_query('BEGIN');

        $res = $this->db->exec_SELECTquery(
            'process_id,ttl',
            'tx_crawler_process',
            'active=1 AND deleted=0'
        );

        $currentTime = $this->getCurrentTime();

        while ($row = $this->db->sql_fetch_assoc($res)) {
            if ($row['ttl'] < $currentTime) {
                $orphanProcesses[] = $row['process_id'];
            } else {
                $processCount++;
            }
        }

        // if there are fewer active processes than allowed, add a new one
        if ($processCount < intval($this->extensionSettings['processLimit'])) {
            $this->CLI_debug("add process " . $this->CLI_buildProcessId() . " (" . ($processCount + 1) . "/" . intval($this->extensionSettings['processLimit']) . ")");

            // create new process record
            $this->db->exec_INSERTquery(
                'tx_crawler_process',
                [
                    'process_id' => $id,
                    'active' => '1',
                    'ttl' => ($currentTime + intval($this->extensionSettings['processMaxRunTime'])),
                    'system_process_id' => $systemProcessId
                ]
            );
        } else {
            $this->CLI_debug("Processlimit reached (" . ($processCount) . "/" . intval($this->extensionSettings['processLimit']) . ")");
            $ret = false;
        }

        $this->CLI_releaseProcesses($orphanProcesses, true); // maybe this should be somehow included into the current lock
        $this->CLI_deleteProcessesMarkedDeleted();

        $this->db->sql_query('COMMIT');

        return $ret;
    }

    /**
     * Release a process and the required resources
     *
     * @param  mixed    $releaseIds   string with a single process-id or array with multiple process-ids
     * @param  boolean  $withinLock   show whether the DB-actions are included within an existing lock
     * @return boolean
     */
    public function CLI_releaseProcesses($releaseIds, $withinLock = false)
    {
        if (!is_array($releaseIds)) {
            $releaseIds = [$releaseIds];
        }

        if (count($releaseIds) === 0) {
            return false;   //nothing to release
        }

        if (!$withinLock) {
            $this->db->sql_query('BEGIN');
        }

        // some kind of 2nd chance algo - this way you need at least 2 processes to have a real cleanup
        // this ensures that a single process can't mess up the entire process table

        // mark all processes as deleted which have no "waiting" queue entries and which are not active
        $this->db->exec_UPDATEquery(
            'tx_crawler_queue',
            'process_id IN (SELECT process_id FROM tx_crawler_process WHERE active=0 AND deleted=0)',
            [
                'process_scheduled' => 0,
                'process_id' => ''
            ]
        );
        $this->db->exec_UPDATEquery(
            'tx_crawler_process',
            'active=0 AND deleted=0
            AND NOT EXISTS (
                SELECT * FROM tx_crawler_queue
                WHERE tx_crawler_queue.process_id = tx_crawler_process.process_id
                AND tx_crawler_queue.exec_time = 0
            )',
            [
                'deleted' => '1',
                'system_process_id' => 0
            ]
        );
        // mark all requested processes as non-active
        $this->db->exec_UPDATEquery(
            'tx_crawler_process',
            'process_id IN (\'' . implode('\',\'', $releaseIds) . '\') AND deleted=0',
            [
                'active' => '0'
            ]
        );
        $this->db->exec_UPDATEquery(
            'tx_crawler_queue',
            'exec_time=0 AND process_id IN ("' . implode('","', $releaseIds) . '")',
            [
                'process_scheduled' => 0,
                'process_id' => ''
            ]
        );

        if (!$withinLock) {
            $this->db->sql_query('COMMIT');
        }

        return true;
    }

    /**
     * Delete processes marked as deleted
     *
     * @return void
     */
    public function CLI_deleteProcessesMarkedDeleted()
    {
        $this->db->exec_DELETEquery('tx_crawler_process', 'deleted = 1');
    }

    /**
     * Check if there are still resources left for the process with the given id
     * Used to determine timeouts and to ensure a proper cleanup if there's a timeout
     *
     * @param  string  $pid  identification string for the process
     * @return boolean determines if the process is still active / has resources
     *
     * FIXME: Please remove Transaction, not needed as only a select query.
     */
    public function CLI_checkIfProcessIsActive($pid)
    {
        $ret = false;
        $this->db->sql_query('BEGIN');
        $res = $this->db->exec_SELECTquery(
            'process_id,active,ttl',
            'tx_crawler_process',
            'process_id = \'' . $pid . '\'  AND deleted=0',
            '',
            'ttl',
            '0,1'
        );
        if ($row = $this->db->sql_fetch_assoc($res)) {
            $ret = intval($row['active']) == 1;
        }
        $this->db->sql_query('COMMIT');

        return $ret;
    }

    /**
     * Create a unique Id for the current process
     *
     * @return string  the ID
     */
    public function CLI_buildProcessId()
    {
        if (!$this->processID) {
            $this->processID = \TYPO3\CMS\Core\Utility\GeneralUtility::shortMD5($this->microtime(true));
        }
        return $this->processID;
    }

    /**
     * @param bool $get_as_float
     *
     * @return mixed
     */
    protected function microtime($get_as_float = false)
    {
        return microtime($get_as_float);
    }

    /**
     * Prints a message to the stdout (only if debug-mode is enabled)
     *
     * @param  string $msg  the message
     */
2504
    public function CLI_debug($msg)
2505
    {
2506
        if (intval($this->extensionSettings['processDebug'])) {
2507
            echo $msg . "\n";
2508
            flush();
2509
        }
2510
    }

    /**
     * Get URL content by making direct request to TYPO3.
     *
     * @param  string $url          Page URL
     * @param  int    $crawlerId    Crawler-ID
     * @return array
     */
    protected function sendDirectRequest($url, $crawlerId)
    {
        $requestHeaders = $this->buildRequestHeaderArray(parse_url($url), $crawlerId);

        $cmd = escapeshellcmd($this->extensionSettings['phpPath']);
        $cmd .= ' ';
        $cmd .= escapeshellarg(\TYPO3\CMS\Core\Utility\ExtensionManagementUtility::extPath('crawler') . 'cli/bootstrap.php');
        $cmd .= ' ';
        $cmd .= escapeshellarg($this->getFrontendBasePath());
        $cmd .= ' ';
        $cmd .= escapeshellarg($url);
        $cmd .= ' ';
        $cmd .= escapeshellarg(base64_encode(serialize($requestHeaders)));

        $startTime = microtime(true);
        $content = $this->executeShellCommand($cmd);
        $this->log($url . ' ' . (microtime(true) - $startTime));

        $result = [
            'request' => implode("\r\n", $requestHeaders) . "\r\n\r\n",
            'headers' => '',
            'content' => $content
        ];

        return $result;
    }

    /**
     * Cleans up entries that stayed for too long in the queue. These are:
     * - processed entries that are over 1.5 days in age
     * - scheduled entries that are over 7 days old
     *
     * @return void
     */
    protected function cleanUpOldQueueEntries()
    {
        $processedAgeInSeconds = $this->extensionSettings['cleanUpProcessedAge'] * 86400; // 24*60*60 Seconds in 24 hours
        $scheduledAgeInSeconds = $this->extensionSettings['cleanUpScheduledAge'] * 86400;

        $now = time();
        $condition = '(exec_time<>0 AND exec_time<' . ($now - $processedAgeInSeconds) . ') OR scheduled<=' . ($now - $scheduledAgeInSeconds);
        $this->flushQueue($condition);
    }

    /**
     * Initializes a TypoScript Frontend necessary for using TypoScript and TypoLink functions
     *
     * @param int $id
     * @param int $typeNum
     *
     * @return void
     */
    protected function initTSFE($id = 1, $typeNum = 0)
    {
        // Scrutinizer: The type TYPO3\CMS\Frontend\Utility\EidUtility was not found. Maybe you did not declare it correctly or list all dependencies?
        \TYPO3\CMS\Frontend\Utility\EidUtility::initTCA();
        if (!is_object($GLOBALS['TT'])) {
            // Scrutinizer: The type TYPO3\CMS\Core\TimeTracker\NullTimeTracker was not found. Maybe you did not declare it correctly or list all dependencies?
            $GLOBALS['TT'] = new \TYPO3\CMS\Core\TimeTracker\NullTimeTracker;
            $GLOBALS['TT']->start();
        }

        // Scrutinizer: The type TYPO3\CMS\Frontend\Contr...criptFrontendController was not found. Maybe you did not declare it correctly or list all dependencies?
        $GLOBALS['TSFE'] = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance(\TYPO3\CMS\Frontend\Controller\TypoScriptFrontendController::class, $GLOBALS['TYPO3_CONF_VARS'], $id, $typeNum);
        // Scrutinizer: The type TYPO3\CMS\Frontend\Page\PageRepository was not found. Maybe you did not declare it correctly or list all dependencies?
        $GLOBALS['TSFE']->sys_page = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance(\TYPO3\CMS\Frontend\Page\PageRepository::class);
        $GLOBALS['TSFE']->sys_page->init(true);
        $GLOBALS['TSFE']->connectToDB();
        $GLOBALS['TSFE']->initFEuser();
        $GLOBALS['TSFE']->determineId();
        $GLOBALS['TSFE']->initTemplate();
        $GLOBALS['TSFE']->rootLine = $GLOBALS['TSFE']->sys_page->getRootLine($id, '');
        $GLOBALS['TSFE']->getConfigArray();
        // Scrutinizer: The type TYPO3\CMS\Frontend\Page\PageGenerator was not found. Maybe you did not declare it correctly or list all dependencies?
        \TYPO3\CMS\Frontend\Page\PageGenerator::pagegenInit();
    }
}

// Scrutinizer: The constant TYPO3_MODE was not found. Maybe you did not declare it correctly or list all dependencies?
if (defined('TYPO3_MODE') && $TYPO3_CONF_VARS[TYPO3_MODE]['XCLASS']['ext/crawler/class.tx_crawler_lib.php']) {
    include_once($TYPO3_CONF_VARS[TYPO3_MODE]['XCLASS']['ext/crawler/class.tx_crawler_lib.php']);
}
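
The cutoff arithmetic in cleanUpOldQueueEntries() can be sketched in isolation. The helper below is hypothetical: buildCleanUpCondition() is not part of the crawler extension. It assumes both age settings are given in days, as the 86400 multiplier and the docblock values (1.5 and 7 days) suggest, and it takes the current time as a parameter so the result is reproducible.

```php
<?php

/**
 * Hypothetical standalone helper mirroring the condition built in
 * cleanUpOldQueueEntries(). Not part of the crawler extension.
 *
 * @param int   $now              current Unix timestamp
 * @param float $processedAgeDays max age of processed entries, in days (assumed unit)
 * @param float $scheduledAgeDays max age of scheduled entries, in days (assumed unit)
 * @return string SQL WHERE fragment
 */
function buildCleanUpCondition($now, $processedAgeDays, $scheduledAgeDays)
{
    $secondsPerDay = 86400; // 24 * 60 * 60

    // Cast to int so fractional day values (e.g. 1.5) yield integer timestamps.
    $processedCutoff = $now - (int) round($processedAgeDays * $secondsPerDay);
    $scheduledCutoff = $now - (int) round($scheduledAgeDays * $secondsPerDay);

    // Processed entries need exec_time set (<>0) and older than the cutoff;
    // scheduled entries are matched on their scheduled timestamp alone.
    return '(exec_time<>0 AND exec_time<' . $processedCutoff . ')'
        . ' OR scheduled<=' . $scheduledCutoff;
}

echo buildCleanUpCondition(1000000, 1.5, 7), "\n";
// prints: (exec_time<>0 AND exec_time<870400) OR scheduled<=395200
```

Passing the timestamp in, rather than calling time() inside the helper, is a choice made for this sketch only; it keeps the example deterministic and easy to check by hand.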