Test Failed
Push — 6-0 ( cfb4d5...b26d37 )
by
unknown
04:59
created

tx_crawler_lib::CLI_main_im()   F

Complexity

Conditions 17
Paths 336

Size

Total Lines 96
Code Lines 64

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric Value
cc 17
eloc 64
nc 336
nop 0
dl 0
loc 96
rs 3.6909
c 0
b 0
f 0

How to fix: Long Method, Complexity

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
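As a minimal sketch of that idea, the example below turns two commented steps into two named functions. It is a hypothetical, simplified illustration loosely modelled on a page-skip check; the function names (`skipReasonBefore`, `skipReasonAfter`, `isHiddenAndExcluded`, `hasDisallowedDoktype`) are assumptions made for this example and are not part of tx_crawler_lib.

```php
<?php
// Hypothetical, simplified sketch; all names here are illustrative only.

// Before: one function, with comments marking each step.
function skipReasonBefore(array $pageRow, bool $crawlHiddenPages)
{
    // if page is hidden
    if (!$crawlHiddenPages && !empty($pageRow['hidden'])) {
        return 'Because page is hidden';
    }

    // if doktype is not allowed
    $doktype = (int)$pageRow['doktype'];
    if (in_array($doktype, [3, 4], true) || $doktype >= 199) {
        return 'Because doktype is not allowed';
    }

    return false;
}

// After: each commented step became a small function named after its comment.
function isHiddenAndExcluded(array $pageRow, bool $crawlHiddenPages): bool
{
    return !$crawlHiddenPages && !empty($pageRow['hidden']);
}

function hasDisallowedDoktype(array $pageRow): bool
{
    $doktype = (int)$pageRow['doktype'];

    return in_array($doktype, [3, 4], true) || $doktype >= 199;
}

function skipReasonAfter(array $pageRow, bool $crawlHiddenPages)
{
    if (isHiddenAndExcluded($pageRow, $crawlHiddenPages)) {
        return 'Because page is hidden';
    }
    if (hasDisallowedDoktype($pageRow)) {
        return 'Because doktype is not allowed';
    }

    return false;
}
```

Both variants return the same result for the same input; the second one reads as a list of named rules, and each rule can be tested on its own.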

Commonly applied refactorings include:

<?php

/***************************************************************
 *  Copyright notice
 *
 *  (c) 2016 AOE GmbH <[email protected]>
 *
 *  All rights reserved
 *
 *  This script is part of the TYPO3 project. The TYPO3 project is
 *  free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  The GNU General Public License can be found at
 *  http://www.gnu.org/copyleft/gpl.html.
 *
 *  This script is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  This copyright notice MUST APPEAR in all copies of the script!
 ***************************************************************/

/**
 * Class tx_crawler_lib
 */
class tx_crawler_lib
{
    /**
     * @var integer
     */
    public $setID = 0;

    /**
     * @var string
     */
    public $processID = '';

    /**
     * One hour is max stalled time for the CLI
     * If the process had the status "start" for 3600 seconds, it will be regarded stalled and a new process is started
     *
     * @var integer
     */
    public $max_CLI_exec_time = 3600;

    /**
     * @var array
     */
    public $duplicateTrack = [];

    /**
     * @var array
     */
    public $downloadUrls = [];

    /**
     * @var array
     */
    public $incomingProcInstructions = [];

    /**
     * @var array
     */
    public $incomingConfigurationSelection = [];

    /**
     * @var array
     */
    public $registerQueueEntriesInternallyOnly = [];

    /**
     * @var array
     */
    public $queueEntries = [];

    /**
     * @var array
     */
    public $urlList = [];

    /**
     * @var boolean
     */
    public $debugMode = false;

    /**
     * @var array
     */
    public $extensionSettings = [];

    /**
     * Mount Point
     *
     * @var boolean
     */
    public $MP = false;

    /**
     * @var string
     */
    protected $processFilename;

    /**
     * Holds the internal access mode; can be 'gui', 'cli' or 'cli_im'
     *
     * @var string
     */
    protected $accessMode;

    /**
     * @var \TYPO3\CMS\Core\Database\DatabaseConnection
Bug: The type TYPO3\CMS\Core\Database\DatabaseConnection was not found. Maybe you did not declare it correctly or list all dependencies?

The issue could also be caused by a filter entry in the build configuration. If the path has been excluded in your configuration, e.g. excluded_paths: ["lib/*"], you can move it to the dependency path list as follows:

filter:
    dependency_paths: ["lib/*"]

For further information see https://scrutinizer-ci.com/docs/tools/php/php-scrutinizer/#list-dependency-paths
     */
    private $db;

    /**
     * @var TYPO3\CMS\Core\Authentication\BackendUserAuthentication
Bug: The type TYPO3\CMS\Core\Authentic...ckendUserAuthentication was not found. Maybe you did not declare it correctly or list all dependencies?
     */
    private $backendUser;

    const CLI_STATUS_NOTHING_PROCCESSED = 0;
    const CLI_STATUS_REMAIN = 1; //queue not empty
    const CLI_STATUS_PROCESSED = 2; //(some) queue items were processed
    const CLI_STATUS_ABORTED = 4; //instance didn't finish
    const CLI_STATUS_POLLABLE_PROCESSED = 8;

    /**
     * Method to get the accessMode; can be gui, cli or cli_im
     *
     * @return string
     */
    public function getAccessMode()
    {
        return $this->accessMode;
    }

    /**
     * @param string $accessMode
     */
    public function setAccessMode($accessMode)
    {
        $this->accessMode = $accessMode;
    }

    /**
     * Set disabled status to prevent processes from being processed
     *
     * @param  bool $disabled (optional, defaults to true)
     * @return void
     */
    public function setDisabled($disabled = true)
    {
        if ($disabled) {
            \TYPO3\CMS\Core\Utility\GeneralUtility::writeFile($this->processFilename, '');
Bug: The type TYPO3\CMS\Core\Utility\GeneralUtility was not found. Maybe you did not declare it correctly or list all dependencies?
        } else {
            if (is_file($this->processFilename)) {
                unlink($this->processFilename);
            }
        }
    }

    /**
     * Get disable status
     *
     * @return bool true if disabled
     */
    public function getDisabled()
    {
        if (is_file($this->processFilename)) {
            return true;
        } else {
            return false;
        }
    }

    /**
     * @param string $filenameWithPath
     *
     * @return void
     */
    public function setProcessFilename($filenameWithPath)
    {
        $this->processFilename = $filenameWithPath;
    }

    /**
     * @return string
     */
    public function getProcessFilename()
    {
        return $this->processFilename;
    }

    /************************************
     *
     * Getting URLs based on Page TSconfig
     *
     ************************************/

    public function __construct()
    {
        $this->db = $GLOBALS['TYPO3_DB'];
        $this->backendUser = $GLOBALS['BE_USER'];
        $this->processFilename = PATH_site . 'typo3temp/tx_crawler.proc';
Bug: The constant PATH_site was not found. Maybe you did not declare it correctly or list all dependencies?

        $settings = unserialize($GLOBALS['TYPO3_CONF_VARS']['EXT']['extConf']['crawler']);
        $settings = is_array($settings) ? $settings : [];

        // read ext_em_conf_template settings and set
        $this->setExtensionSettings($settings);

        // set defaults:
        if (\TYPO3\CMS\Core\Utility\MathUtility::convertToPositiveInteger($this->extensionSettings['countInARun']) == 0) {
Bug: The type TYPO3\CMS\Core\Utility\MathUtility was not found. Maybe you did not declare it correctly or list all dependencies?
            $this->extensionSettings['countInARun'] = 100;
        }

        $this->extensionSettings['processLimit'] = \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($this->extensionSettings['processLimit'], 1, 99, 1);
    }

    /**
     * Sets the extension settings (unserialized counterpart of $TYPO3_CONF_VARS['EXT']['extConf']['crawler']).
     *
     * @param array $extensionSettings
     * @return void
     */
    public function setExtensionSettings(array $extensionSettings)
    {
        $this->extensionSettings = $extensionSettings;
    }

    /**
     * Check if the given page should be crawled
     *
     * @param array $pageRow
     * @return false|string false if the page should be crawled (not excluded), true / skipMessage if it should be skipped
     */
    public function checkIfPageShouldBeSkipped(array $pageRow)
    {
        $skipPage = false;
        $skipMessage = 'Skipped'; // message will be overwritten later

        // if page is hidden
        if (!$this->extensionSettings['crawlHiddenPages']) {
            if ($pageRow['hidden']) {
                $skipPage = true;
                $skipMessage = 'Because page is hidden';
            }
        }

        if (!$skipPage) {
            if (\TYPO3\CMS\Core\Utility\GeneralUtility::inList('3,4', $pageRow['doktype']) || $pageRow['doktype'] >= 199) {
                $skipPage = true;
                $skipMessage = 'Because doktype is not allowed';
            }
        }

        if (!$skipPage) {
            if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['excludeDoktype'])) {
                foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['excludeDoktype'] as $key => $doktypeList) {
                    if (\TYPO3\CMS\Core\Utility\GeneralUtility::inList($doktypeList, $pageRow['doktype'])) {
                        $skipPage = true;
                        $skipMessage = 'Doktype was excluded by "' . $key . '"';
                        break;
                    }
                }
            }
        }

        if (!$skipPage) {
            // veto hook
            if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pageVeto'])) {
                foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pageVeto'] as $key => $func) {
                    $params = [
                        'pageRow' => $pageRow
                    ];
                    // expects "false" if page is ok and "true" or a skipMessage if this page should _not_ be crawled
                    $veto = \TYPO3\CMS\Core\Utility\GeneralUtility::callUserFunction($func, $params, $this);
                    if ($veto !== false) {
                        $skipPage = true;
                        if (is_string($veto)) {
                            $skipMessage = $veto;
                        } else {
                            $skipMessage = 'Veto from hook "' . htmlspecialchars($key) . '"';
                        }
                        // no need to execute other hooks if a previous one returned a veto
                        break;
                    }
                }
            }
        }

        return $skipPage ? $skipMessage : false;
    }

    /**
     * Wrapper method for getUrlsForPageId()
     * It returns an array of configurations and no urls!
     *
     * @param array $pageRow Page record with at least dok-type and uid columns.
     * @param string $skipMessage
     * @return array
     * @see getUrlsForPageId()
     */
    public function getUrlsForPageRow(array $pageRow, &$skipMessage = '')
    {
        $message = $this->checkIfPageShouldBeSkipped($pageRow);

        if ($message === false) {
            $forceSsl = ($pageRow['url_scheme'] === 2) ? true : false;
            $res = $this->getUrlsForPageId($pageRow['uid'], $forceSsl);
            $skipMessage = '';
        } else {
            $skipMessage = $message;
            $res = [];
        }

        return $res;
    }

    /**
     * This method is used to count if there are ANY unprocessed queue entries
     * of a given page_id and the configuration which matches a given hash.
     * If there is none, we can skip an inner detail check
     *
     * @param  int $uid
     * @param  string $configurationHash
     * @return boolean
     */
    protected function noUnprocessedQueueEntriesForPageWithConfigurationHashExist($uid, $configurationHash)
    {
        $configurationHash = $this->db->fullQuoteStr($configurationHash, 'tx_crawler_queue');
        $res = $this->db->exec_SELECTquery('count(*) as anz', 'tx_crawler_queue', "page_id=" . intval($uid) . " AND configuration_hash=" . $configurationHash . " AND exec_time=0");
        $row = $this->db->sql_fetch_assoc($res);

        return ($row['anz'] == 0);
    }

    /**
     * Creates a list of URLs from input array (and submits them to queue if asked for)
     * See Web > Info module script + "indexed_search"'s crawler hook-client using this!
     *
     * @param    array        Information about URLs from pageRow to crawl.
     * @param    array        Page row
     * @param    integer        Unix time to schedule indexing to, typically time()
     * @param    integer        Number of requests per minute (creates the interleave between requests)
     * @param    boolean        If set, submits the URLs to queue
Bug: The type If was not found. Maybe you did not declare it correctly or list all dependencies?
     * @param    boolean        If set (and submitcrawlUrls is false) will fill $downloadUrls with entries)
     * @param    array        Array which is passed by reference and contains an id per url to ensure we will not crawl duplicates
     * @param    array        Array which will be filled with URLS for download if flag is set.
     * @param    array        Array of processing instructions
     * @return    string        List of URLs (meant for display in backend module)
     *
     */
    public function urlListFromUrlArray(
    array $vv,
    array $pageRow,
    $scheduledTime,
    $reqMinute,
    $submitCrawlUrls,
    $downloadCrawlUrls,
    array &$duplicateTrack,
    array &$downloadUrls,
    array $incomingProcInstructions
    ) {

        // realurl support (thanks to Ingo Renner)
        if (\TYPO3\CMS\Core\Utility\ExtensionManagementUtility::isLoaded('realurl') && $vv['subCfg']['realurl']) {
Bug: The type TYPO3\CMS\Core\Utility\ExtensionManagementUtility was not found. Maybe you did not declare it correctly or list all dependencies?

            /** @var tx_realurl $urlObj */
            $urlObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('tx_realurl');

            if (!empty($vv['subCfg']['baseUrl'])) {
                $urlParts = parse_url($vv['subCfg']['baseUrl']);
                $host = strtolower($urlParts['host']);
                $urlObj->host = $host;

                // First pass, finding configuration OR pointer string:
                $urlObj->extConf = isset($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->host]) ? $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->host] : $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl']['_DEFAULT'];

                // If it turned out to be a string pointer, then look up the real config:
                if (is_string($urlObj->extConf)) {
                    $urlObj->extConf = is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->extConf]) ? $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->extConf] : $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl']['_DEFAULT'];
                }
            }

            if (!$GLOBALS['TSFE']->sys_page) {
                $GLOBALS['TSFE']->sys_page = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Frontend\Page\PageRepository');
            }
            if (!$GLOBALS['TSFE']->csConvObj) {
                $GLOBALS['TSFE']->csConvObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Core\Charset\CharsetConverter');
            }
            if (!$GLOBALS['TSFE']->tmpl->rootLine[0]['uid']) {
                $GLOBALS['TSFE']->tmpl->rootLine[0]['uid'] = $urlObj->extConf['pagePath']['rootpage_id'];
            }
        }

        if (is_array($vv['URLs'])) {
            $configurationHash = md5(serialize($vv));
            $skipInnerCheck = $this->noUnprocessedQueueEntriesForPageWithConfigurationHashExist($pageRow['uid'], $configurationHash);

            foreach ($vv['URLs'] as $urlQuery) {
                if ($this->drawURLs_PIfilter($vv['subCfg']['procInstrFilter'], $incomingProcInstructions)) {

                    // Calculate cHash:
                    if ($vv['subCfg']['cHash']) {
                        /* @var $cacheHash \TYPO3\CMS\Frontend\Page\CacheHashCalculator */
                        $cacheHash = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Frontend\Page\CacheHashCalculator');
                        $urlQuery .= '&cHash=' . $cacheHash->generateForParameters($urlQuery);
                    }

                    // Create key by which to determine unique-ness:
                    $uKey = $urlQuery . '|' . $vv['subCfg']['userGroups'] . '|' . $vv['subCfg']['baseUrl'] . '|' . $vv['subCfg']['procInstrFilter'];

                    // realurl support (thanks to Ingo Renner)
                    $urlQuery = 'index.php' . $urlQuery;
                    if (\TYPO3\CMS\Core\Utility\ExtensionManagementUtility::isLoaded('realurl') && $vv['subCfg']['realurl']) {
                        $params = [
                            'LD' => [
                                'totalURL' => $urlQuery
                            ],
                            'TCEmainHook' => true
                        ];
                        $urlObj->encodeSpURL($params);
Comprehensibility, Best Practice: The variable $urlObj does not seem to be defined for all execution paths leading up to this point.
                        $urlQuery = $params['LD']['totalURL'];
                    }

                    // Scheduled time:
                    $schTime = $scheduledTime + round(count($duplicateTrack) * (60 / $reqMinute));
                    $schTime = floor($schTime / 60) * 60;

                    if (isset($duplicateTrack[$uKey])) {

                        //if the url key is registered just display it and do not resubmit it
                        $urlList = '<em><span class="typo3-dimmed">' . htmlspecialchars($urlQuery) . '</span></em><br/>';
                    } else {
                        $urlList = '[' . date('d.m.y H:i', $schTime) . '] ' . htmlspecialchars($urlQuery);
Bug: Are you sure date('d.m.y H:i', $schTime) of type false|string can be used in concatenation? (Ignorable by Annotation)

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation:

                        $urlList = '[' . /** @scrutinizer ignore-type */ date('d.m.y H:i', $schTime) . '] ' . htmlspecialchars($urlQuery);
Bug: $schTime of type double is incompatible with the type integer expected by parameter $timestamp of date(). (Ignorable by Annotation)

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation:

                        $urlList = '[' . date('d.m.y H:i', /** @scrutinizer ignore-type */ $schTime) . '] ' . htmlspecialchars($urlQuery);
                        $this->urlList[] = '[' . date('d.m.y H:i', $schTime) . '] ' . $urlQuery;

                        $theUrl = ($vv['subCfg']['baseUrl'] ? $vv['subCfg']['baseUrl'] : \TYPO3\CMS\Core\Utility\GeneralUtility::getIndpEnv('TYPO3_SITE_URL')) . $urlQuery;

                        // Submit for crawling!
                        if ($submitCrawlUrls) {
                            $added = $this->addUrl(
                            $pageRow['uid'],
                            $theUrl,
                            $vv['subCfg'],
                            $scheduledTime,
                            $configurationHash,
                            $skipInnerCheck
                            );
                            if ($added === false) {
                                $urlList .= ' (Url already existed)';
                            }
                        } elseif ($downloadCrawlUrls) {
                            $downloadUrls[$theUrl] = $theUrl;
                        }

                        $urlList .= '<br />';
                    }
                    $duplicateTrack[$uKey] = true;
                }
            }
        } else {
            $urlList = 'ERROR - no URL generated';
        }

        return $urlList;
Comprehensibility, Best Practice: The variable $urlList does not seem to be defined for all execution paths leading up to this point.
    }

    /**
     * Returns true if input processing instruction is among registered ones.
     *
     * @param string $piString PI to test
     * @param array $incomingProcInstructions Processing instructions
     * @return boolean
     */
    public function drawURLs_PIfilter($piString, array $incomingProcInstructions)
    {
        if (empty($incomingProcInstructions)) {
            return true;
        }

        foreach ($incomingProcInstructions as $pi) {
            if (\TYPO3\CMS\Core\Utility\GeneralUtility::inList($piString, $pi)) {
                return true;
            }
        }
    }

    public function getPageTSconfigForId($id)
    {
        if (!$this->MP) {
            $pageTSconfig = \TYPO3\CMS\Backend\Utility\BackendUtility::getPagesTSconfig($id);
Bug: The type TYPO3\CMS\Backend\Utility\BackendUtility was not found. Maybe you did not declare it correctly or list all dependencies?
        } else {
            list(, $mountPointId) = explode('-', $this->MP);
Bug: $this->MP of type true is incompatible with the type string expected by parameter $string of explode(). (Ignorable by Annotation)

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation:

            list(, $mountPointId) = explode('-', /** @scrutinizer ignore-type */ $this->MP);
            $pageTSconfig = \TYPO3\CMS\Backend\Utility\BackendUtility::getPagesTSconfig($mountPointId);
        }

        // Call a hook to alter configuration
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['getPageTSconfigForId'])) {
            $params = [
                'pageId' => $id,
                'pageTSConfig' => &$pageTSconfig
            ];
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['getPageTSconfigForId'] as $userFunc) {
                \TYPO3\CMS\Core\Utility\GeneralUtility::callUserFunction($userFunc, $params, $this);
            }
        }

        return $pageTSconfig;
    }

    /**
     * This method returns an array of configurations.
     * And no urls!
     *
     * @param integer $id Page ID
     * @param bool $forceSsl Use https
     * @return array
     */
    protected function getUrlsForPageId($id, $forceSsl = false)
    {

        /**
         * Get configuration from tsConfig
         */

        // Get page TSconfig for page ID:
        $pageTSconfig = $this->getPageTSconfigForId($id);

        $res = [];

        if (is_array($pageTSconfig) && is_array($pageTSconfig['tx_crawler.']['crawlerCfg.'])) {
            $crawlerCfg = $pageTSconfig['tx_crawler.']['crawlerCfg.'];

            if (is_array($crawlerCfg['paramSets.'])) {
                foreach ($crawlerCfg['paramSets.'] as $key => $values) {
                    if (!is_array($values)) {

                        // Sub configuration for a single configuration string:
                        $subCfg = (array)$crawlerCfg['paramSets.'][$key . '.'];
                        $subCfg['key'] = $key;

                        if (strcmp($subCfg['procInstrFilter'], '')) {
                            $subCfg['procInstrFilter'] = implode(',', \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $subCfg['procInstrFilter']));
                        }
                        $pidOnlyList = implode(',', \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $subCfg['pidsOnly'], 1));

                        // process configuration if it is not page-specific or if the specific page is the current page:
                        if (!strcmp($subCfg['pidsOnly'], '') || \TYPO3\CMS\Core\Utility\GeneralUtility::inList($pidOnlyList, $id)) {

                            // add trailing slash if not present
                            if (!empty($subCfg['baseUrl']) && substr($subCfg['baseUrl'], -1) != '/') {
                                $subCfg['baseUrl'] .= '/';
                            }

                            // Explode, process etc.:
                            $res[$key] = [];
                            $res[$key]['subCfg'] = $subCfg;
                            $res[$key]['paramParsed'] = $this->parseParams($values);
                            $res[$key]['paramExpanded'] = $this->expandParameters($res[$key]['paramParsed'], $id);
                            $res[$key]['origin'] = 'pagets';

                            // recognize MP value
                            if (!$this->MP) {
                                $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $id]);
                            } else {
                                $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $id . '&MP=' . $this->MP]);
Bug: Are you sure $this->MP of type true can be used in concatenation? (Ignorable by Annotation)

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation:

                                $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $id . '&MP=' . /** @scrutinizer ignore-type */ $this->MP]);
                            }
                        }
                    }
                }
            }
        }

        /**
         * Get configuration from tx_crawler_configuration records
         */

        // get records along the rootline
        $rootLine = \TYPO3\CMS\Backend\Utility\BackendUtility::BEgetRootLine($id);

        foreach ($rootLine as $page) {
            $configurationRecordsForCurrentPage = \TYPO3\CMS\Backend\Utility\BackendUtility::getRecordsByField(
                'tx_crawler_configuration',
                'pid',
                intval($page['uid']),
                \TYPO3\CMS\Backend\Utility\BackendUtility::BEenableFields('tx_crawler_configuration') . \TYPO3\CMS\Backend\Utility\BackendUtility::deleteClause('tx_crawler_configuration')
            );

            if (is_array($configurationRecordsForCurrentPage)) {
                foreach ($configurationRecordsForCurrentPage as $configurationRecord) {

                    // check access to the configuration record
                    if (empty($configurationRecord['begroups']) || $GLOBALS['BE_USER']->isAdmin() || $this->hasGroupAccess($GLOBALS['BE_USER']->user['usergroup_cached_list'], $configurationRecord['begroups'])) {
                        $pidOnlyList = implode(',', \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $configurationRecord['pidsonly'], 1));

                        // process configuration if it is not page-specific or if the specific page is the current page:
                        if (!strcmp($configurationRecord['pidsonly'], '') || \TYPO3\CMS\Core\Utility\GeneralUtility::inList($pidOnlyList, $id)) {
                            $key = $configurationRecord['name'];

                            // don't overwrite previously defined paramSets
                            if (!isset($res[$key])) {

                                /* @var $TSparserObject \TYPO3\CMS\Core\TypoScript\Parser\TypoScriptParser */
                                $TSparserObject = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Core\TypoScript\Parser\TypoScriptParser');
                                $TSparserObject->parse($configurationRecord['processing_instruction_parameters_ts']);

                                $subCfg = [
                                    'procInstrFilter' => $configurationRecord['processing_instruction_filter'],
                                    'procInstrParams.' => $TSparserObject->setup,
                                    'baseUrl' => $this->getBaseUrlForConfigurationRecord(
                                        $configurationRecord['base_url'],
                                        $configurationRecord['sys_domain_base_url'],
                                        $forceSsl
                                    ),
                                    'realurl' => $configurationRecord['realurl'],
                                    'cHash' => $configurationRecord['chash'],
                                    'userGroups' => $configurationRecord['fegroups'],
                                    'exclude' => $configurationRecord['exclude'],
                                    'rootTemplatePid' => (int) $configurationRecord['root_template_pid'],
                                    'key' => $key,
                                ];

                                // add trailing slash if not present
                                if (!empty($subCfg['baseUrl']) && substr($subCfg['baseUrl'], -1) != '/') {
                                    $subCfg['baseUrl'] .= '/';
                                }
                                if (!in_array($id, $this->expandExcludeString($subCfg['exclude']))) {
                                    $res[$key] = [];
                                    $res[$key]['subCfg'] = $subCfg;
                                    $res[$key]['paramParsed'] = $this->parseParams($configurationRecord['configuration']);
                                    $res[$key]['paramExpanded'] = $this->expandParameters($res[$key]['paramParsed'], $id);
                                    $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $id]);
638
                                    $res[$key]['origin'] = 'tx_crawler_configuration_' . $configurationRecord['uid'];
639
                                }
640
                            }
641
                        }
642
                    }
643
                }
644
            }
645
        }
646
647
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['processUrls'])) {
648
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['processUrls'] as $func) {
649
                $params = [
650
                    'res' => &$res,
651
                ];
652
                \TYPO3\CMS\Core\Utility\GeneralUtility::callUserFunction($func, $params, $this);
653
            }
654
        }
655
656
        return $res;
657
    }

    /**
     * Checks if a domain record exists and returns the base URL based on that record. If not, the given baseUrl string is used.
     *
     * @param string $baseUrl
     * @param integer $sysDomainUid
     * @param bool $ssl
     * @return string
     */
    protected function getBaseUrlForConfigurationRecord($baseUrl, $sysDomainUid, $ssl = false)
    {
        $sysDomainUid = intval($sysDomainUid);
        $urlScheme = ($ssl === false) ? 'http' : 'https';

        if ($sysDomainUid > 0) {
            $res = $this->db->exec_SELECTquery(
                '*',
                'sys_domain',
                'uid = ' . $sysDomainUid .
                \TYPO3\CMS\Backend\Utility\BackendUtility::BEenableFields('sys_domain') .
                \TYPO3\CMS\Backend\Utility\BackendUtility::deleteClause('sys_domain')
            );
            $row = $this->db->sql_fetch_assoc($res);
            if ($row['domainName'] != '') {
                return $urlScheme . '://' . $row['domainName'];
            }
        }
        return $baseUrl;
    }

    /**
     * Collects the names of the crawler configurations that apply to a page branch:
     * the paramSets from the page TSconfig of the root page plus the
     * tx_crawler_configuration records stored on the rootline and in the subtree.
     *
     * @param integer $rootid Page ID of the branch root
     * @param integer $depth Number of tree levels to descend below the root
     * @return array List of configuration names
     */
    public function getConfigurationsForBranch($rootid, $depth)
    {
        $configurationsForBranch = [];

        $pageTSconfig = $this->getPageTSconfigForId($rootid);
        if (is_array($pageTSconfig) && is_array($pageTSconfig['tx_crawler.']['crawlerCfg.']) && is_array($pageTSconfig['tx_crawler.']['crawlerCfg.']['paramSets.'])) {
            $sets = $pageTSconfig['tx_crawler.']['crawlerCfg.']['paramSets.'];
            if (is_array($sets)) {
                foreach ($sets as $key => $value) {
                    if (!is_array($value)) {
                        continue;
                    }
                    $configurationsForBranch[] = substr($key, -1) == '.' ? substr($key, 0, -1) : $key;
                }
            }
        }
        $pids = [];
        $rootLine = \TYPO3\CMS\Backend\Utility\BackendUtility::BEgetRootLine($rootid);
        foreach ($rootLine as $node) {
            $pids[] = $node['uid'];
        }
        /* @var $tree \TYPO3\CMS\Backend\Tree\View\PageTreeView */
        $tree = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Backend\Tree\View\PageTreeView');
        $perms_clause = $GLOBALS['BE_USER']->getPagePermsClause(1);
        $tree->init('AND ' . $perms_clause);
        $tree->getTree($rootid, $depth, '');
        foreach ($tree->tree as $node) {
            $pids[] = $node['row']['uid'];
        }

        $res = $this->db->exec_SELECTquery(
            '*',
            'tx_crawler_configuration',
            'pid IN (' . implode(',', $pids) . ') ' .
            \TYPO3\CMS\Backend\Utility\BackendUtility::BEenableFields('tx_crawler_configuration') .
            \TYPO3\CMS\Backend\Utility\BackendUtility::deleteClause('tx_crawler_configuration') . ' ' .
            \TYPO3\CMS\Backend\Utility\BackendUtility::versioningPlaceholderClause('tx_crawler_configuration') . ' '
        );

        while ($row = $this->db->sql_fetch_assoc($res)) {
            $configurationsForBranch[] = $row['name'];
        }
        $this->db->sql_free_result($res);
        return $configurationsForBranch;
    }

    /**
     * Check if a user has access to an item
     * (e.g. get the group list of the currently logged-in user from $GLOBALS['TSFE']->gr_list)
     *
     * @see \TYPO3\CMS\Frontend\Page\PageRepository::getMultipleGroupsWhereClause()
     * @param  string $groupList    Comma-separated list of (fe_)group UIDs from a user
     * @param  string $accessList   Comma-separated list of (fe_)group UIDs of the item to access
     * @return bool                 TRUE if at least one of the user's group UIDs is in the access list or the access list is empty
     */
    public function hasGroupAccess($groupList, $accessList)
    {
        if (empty($accessList)) {
            return true;
        }
        foreach (\TYPO3\CMS\Core\Utility\GeneralUtility::intExplode(',', $groupList) as $groupUid) {
            if (\TYPO3\CMS\Core\Utility\GeneralUtility::inList($accessList, $groupUid)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Parse GET vars of input query into an array with key => value pairs
     *
     * @param string $inputQuery Input query string
     * @return array
     */
    public function parseParams($inputQuery)
    {
        // Extract all GET parameters into an array:
        $paramKeyValues = [];
        $GETparams = explode('&', $inputQuery);

        foreach ($GETparams as $paramAndValue) {
            // Pad to two elements so that a parameter without "=" yields an empty value instead of a notice
            list($p, $v) = array_pad(explode('=', $paramAndValue, 2), 2, '');
            if (strlen($p)) {
                $paramKeyValues[rawurldecode($p)] = rawurldecode($v);
            }
        }

        return $paramKeyValues;
    }
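
    // Usage sketch for parseParams() (not part of the original class; $crawlerLib is a
    // hypothetical tx_crawler_lib instance). Each "name=value" pair is split at the
    // first "=", and the bracket syntax is left untouched for expandParameters():
    //
    //     $parsed = $crawlerLib->parseParams('L=[0-1]&type=[|100]');
    //     // $parsed holds 'L' => '[0-1]' and 'type' => '[|100]'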

    /**
     * Will expand the parameters configuration to individual values. This follows a certain syntax of the value of each parameter.
     * Syntax of values:
     * - Basically: If the value is wrapped in [...] it will be expanded according to the following syntax, otherwise the value is taken literally
     * - Configuration is split by "|" and the parts are processed individually and finally added together
     * - For each configuration part:
     *         - "[int]-[int]" = Integer range, will be expanded to all values in between, values included, starting from low to high (max. 1000). Example "1-34" or "-40--30"
     *         - "_TABLE:[TCA table name];[_PID:[optional page id, default is current page]];[_ENABLELANG:1]" = Look up of table records from PID, filtering out deleted records. Example "_TABLE:tt_content; _PID:123"
     *           _ENABLELANG:1 picks only original records without their language overlays
     *         - Default: Literal value
     *
     * @param array $paramArray Array with key (GET var name) and values (value of GET var which is configuration for expansion)
     * @param integer $pid Current page ID
     * @return array
     */
    public function expandParameters($paramArray, $pid)
    {
        global $TCA;

        // Traverse parameter names:
        foreach ($paramArray as $p => $v) {
            $v = trim($v);

            // If value is encapsulated in square brackets it means there are some ranges of values to find, otherwise the value is literal
            if (substr($v, 0, 1) === '[' && substr($v, -1) === ']') {
                // So, find the value inside brackets and reset the paramArray value as an array.
                $v = substr($v, 1, -1);
                $paramArray[$p] = [];

                // Explode parts and traverse them:
                $parts = explode('|', $v);
                foreach ($parts as $pV) {

                    // Look for integer range (e.g. 1-34 or -40--30, read as minus 40 to minus 30)
                    if (preg_match('/^(-?[0-9]+)\s*-\s*(-?[0-9]+)$/', trim($pV), $reg)) {

                        // Swap if first is larger than last:
                        if ($reg[1] > $reg[2]) {
                            $temp = $reg[2];
                            $reg[2] = $reg[1];
                            $reg[1] = $temp;
                        }

                        // Traverse range, add values:
                        $runAwayBrake = 1000; // Limit to size of range!
                        for ($a = $reg[1]; $a <= $reg[2]; $a++) {
                            $paramArray[$p][] = $a;
                            $runAwayBrake--;
                            if ($runAwayBrake <= 0) {
                                break;
                            }
                        }
                    } elseif (substr(trim($pV), 0, 7) == '_TABLE:') {

                        // Parse parameters:
                        $subparts = \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(';', $pV);
                        $subpartParams = [];
                        foreach ($subparts as $spV) {
                            list($pKey, $pVal) = \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(':', $spV);
                            $subpartParams[$pKey] = $pVal;
                        }

                        // Table exists:
                        if (isset($TCA[$subpartParams['_TABLE']])) {
                            $lookUpPid = isset($subpartParams['_PID']) ? intval($subpartParams['_PID']) : $pid;
                            $pidField = isset($subpartParams['_PIDFIELD']) ? trim($subpartParams['_PIDFIELD']) : 'pid';
                            $where = isset($subpartParams['_WHERE']) ? $subpartParams['_WHERE'] : '';
                            $addTable = isset($subpartParams['_ADDTABLE']) ? $subpartParams['_ADDTABLE'] : '';

                            // !empty() keeps the original truthiness check but avoids a notice when the key is missing
                            $fieldName = !empty($subpartParams['_FIELD']) ? $subpartParams['_FIELD'] : 'uid';
                            if ($fieldName === 'uid' || $TCA[$subpartParams['_TABLE']]['columns'][$fieldName]) {
                                $andWhereLanguage = '';
                                $transOrigPointerField = $TCA[$subpartParams['_TABLE']]['ctrl']['transOrigPointerField'];

                                if (!empty($subpartParams['_ENABLELANG']) && $transOrigPointerField) {
                                    $andWhereLanguage = ' AND ' . $this->db->quoteStr($transOrigPointerField, $subpartParams['_TABLE']) . ' <= 0 ';
                                }

                                $where = $this->db->quoteStr($pidField, $subpartParams['_TABLE']) . '=' . intval($lookUpPid) . ' ' .
                                    $andWhereLanguage . $where;

                                $rows = $this->db->exec_SELECTgetRows(
                                    $fieldName,
                                    $subpartParams['_TABLE'] . $addTable,
                                    $where . \TYPO3\CMS\Backend\Utility\BackendUtility::deleteClause($subpartParams['_TABLE']),
                                    '',
                                    '',
                                    '',
                                    $fieldName
                                );

                                if (is_array($rows)) {
                                    $paramArray[$p] = array_merge($paramArray[$p], array_keys($rows));
                                }
                            }
                        }
                    } else { // Just add value:
                        $paramArray[$p][] = $pV;
                    }
                    // Hook for processing custom expandParameters placeholders
                    if (is_array($GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['crawler/class.tx_crawler_lib.php']['expandParameters'])) {
                        $_params = [
                            'pObj' => &$this,
                            'paramArray' => &$paramArray,
                            'currentKey' => $p,
                            'currentValue' => $pV,
                            'pid' => $pid
                        ];
                        foreach ($GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['crawler/class.tx_crawler_lib.php']['expandParameters'] as $key => $_funcRef) {
                            \TYPO3\CMS\Core\Utility\GeneralUtility::callUserFunction($_funcRef, $_params, $this);
                        }
                    }
                }

                // Make unique set of values and sort array by key:
                $paramArray[$p] = array_unique($paramArray[$p]);
                ksort($paramArray);
            } else {
                // Set the literal value as only value in array:
                $paramArray[$p] = [$v];
            }
        }

        return $paramArray;
    }
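
    // Usage sketch for expandParameters() (not part of the original class; $crawlerLib is a
    // hypothetical tx_crawler_lib instance, page ID 5 is an arbitrary example, and no
    // expandParameters hook is assumed to be registered). Only the integer-range and
    // literal forms are shown, because "_TABLE:" lookups need a database:
    //
    //     $expanded = $crawlerLib->expandParameters(['L' => '[0-1]', 'type' => '[|100]'], 5);
    //     // 'L' expands to the two values 0 and 1;
    //     // 'type' expands to the empty string and '100' (the parts on either side of "|")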

    /**
     * Compiling URLs from parameter array (output of expandParameters())
     * The number of URLs will be the multiplication of the number of parameter values for each key
     *
     * @param array $paramArray Output of expandParameters(): Array with keys (GET var names) and for each an array of values
     * @param array $urls URLs accumulated in this array (for recursion)
     * @return array
     */
    public function compileUrls($paramArray, $urls = [])
    {
        if (count($paramArray) && is_array($urls)) {
            // shift first off stack:
            reset($paramArray);
            $varName = key($paramArray);
            $valueSet = array_shift($paramArray);

            // Traverse value set:
            $newUrls = [];
            foreach ($urls as $url) {
                foreach ($valueSet as $val) {
                    $newUrls[] = $url . (strcmp($val, '') ? '&' . rawurlencode($varName) . '=' . rawurlencode($val) : '');

                    if (count($newUrls) > \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($this->extensionSettings['maxCompileUrls'], 1, 1000000000, 10000)) {
                        break;
                    }
                }
            }
            $urls = $newUrls;
            $urls = $this->compileUrls($paramArray, $urls);
        }

        return $urls;
    }
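
    // Usage sketch for compileUrls() (not part of the original class; $crawlerLib is a
    // hypothetical tx_crawler_lib instance and the maxCompileUrls setting is assumed to be
    // large enough not to cut the list short). Two parameters with two values each give
    // 2 x 2 = 4 URLs; an empty value adds nothing to the URL:
    //
    //     $urls = $crawlerLib->compileUrls(['L' => [0, 1], 'type' => ['', '100']], ['?id=5']);
    //     // '?id=5&L=0', '?id=5&L=0&type=100', '?id=5&L=1', '?id=5&L=1&type=100'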

    /************************************
     *
     * Crawler log
     *
     ************************************/

    /**
     * Return array of records from crawler queue for input page ID
     *
     * @param integer $id Page ID for which to look up log entries.
     * @param string $filter Filter: "all" => all entries, "pending" => all that have not yet run, "finished" => all completed ones
     * @param boolean $doFlush If TRUE, the matching entries are DELETED(!) instead of returned
     * @param boolean $doFullFlush If TRUE (and $doFlush is set), matching entries of all pages are flushed, not only those of $id
     * @param integer $itemsPerPage Limit the number of entries per page; default is 10
     * @return array
     */
    public function getLogEntriesForPageId($id, $filter = '', $doFlush = false, $doFullFlush = false, $itemsPerPage = 10)
    {
        // FIXME: Write Unit tests for Filters
        switch ($filter) {
            case 'pending':
                $addWhere = ' AND exec_time=0';
                break;
            case 'finished':
                $addWhere = ' AND exec_time>0';
                break;
            default:
                $addWhere = '';
                break;
        }

        // FIXME: Write unit test that ensures that the right records are deleted.
        if ($doFlush) {
            $this->flushQueue(($doFullFlush ? '1=1' : ('page_id=' . intval($id))) . $addWhere);
            return [];
        } else {
            return $this->db->exec_SELECTgetRows(
                '*',
                'tx_crawler_queue',
                'page_id=' . intval($id) . $addWhere,
                '',
                'scheduled DESC',
                (intval($itemsPerPage) > 0 ? intval($itemsPerPage) : '')
            );
        }
    }

    /**
     * Return array of records from crawler queue for input set ID
     *
     * @param integer $set_id Set ID for which to look up log entries.
     * @param string $filter Filter: "all" => all entries, "pending" => all that have not yet run, "finished" => all completed ones
     * @param boolean $doFlush If TRUE, the matching entries are DELETED(!) instead of returned
     * @param boolean $doFullFlush If TRUE (and $doFlush is set), the whole queue is flushed regardless of set ID and filter
     * @param integer $itemsPerPage Limit the number of entries per page; default is 10
     * @return array
     */
    public function getLogEntriesForSetId($set_id, $filter = '', $doFlush = false, $doFullFlush = false, $itemsPerPage = 10)
    {
        // FIXME: Write Unit tests for Filters
        switch ($filter) {
            case 'pending':
                $addWhere = ' AND exec_time=0';
                break;
            case 'finished':
                $addWhere = ' AND exec_time>0';
                break;
            default:
                $addWhere = '';
                break;
        }
        // FIXME: Write unit test that ensures that the right records are deleted.
        if ($doFlush) {
            $this->flushQueue($doFullFlush ? '' : ('set_id=' . intval($set_id) . $addWhere));
            return [];
        } else {
            return $this->db->exec_SELECTgetRows(
                '*',
                'tx_crawler_queue',
                'set_id=' . intval($set_id) . $addWhere,
                '',
                'scheduled DESC',
                (intval($itemsPerPage) > 0 ? intval($itemsPerPage) : '')
            );
        }
    }

    /**
     * Removes queue entries
     *
     * @param string $where SQL related filter for the entries which should be removed
     * @return void
     */
    protected function flushQueue($where = '')
    {
        $realWhere = strlen($where) > 0 ? $where : '1=1';

        if (tx_crawler_domain_events_dispatcher::getInstance()->hasObserver('queueEntryFlush')) {
            $groups = $this->db->exec_SELECTgetRows('DISTINCT set_id', 'tx_crawler_queue', $realWhere);
            foreach ($groups as $group) {
                // set_id is numeric, so cast it instead of embedding it as a quoted string
                tx_crawler_domain_events_dispatcher::getInstance()->post('queueEntryFlush', $group['set_id'], $this->db->exec_SELECTgetRows('uid, set_id', 'tx_crawler_queue', $realWhere . ' AND set_id=' . intval($group['set_id'])));
            }
        }

        $this->db->exec_DELETEquery('tx_crawler_queue', $realWhere);
    }

    /**
     * Adding call back entries to log (typically called from hooks, see indexed search class "class.crawler.php")
     *
     * @param integer $setId Set ID
     * @param array $params Parameters to pass to call back function
     * @param string $callBack Call back object reference, e.g. 'EXT:indexed_search/class.crawler.php:&tx_indexedsearch_crawler'
     * @param integer $page_id Page ID to attach it to
     * @param integer $schedule Time at which to activate
     * @return void
     */
    public function addQueueEntry_callBack($setId, $params, $callBack, $page_id = 0, $schedule = 0)
    {
        if (!is_array($params)) {
            $params = [];
        }
        $params['_CALLBACKOBJ'] = $callBack;

        // Compile value array:
        $fieldArray = [
            'page_id' => intval($page_id),
            'parameters' => serialize($params),
            'scheduled' => intval($schedule) ? intval($schedule) : $this->getCurrentTime(),
            'exec_time' => 0,
            'set_id' => intval($setId),
            'result_data' => '',
        ];

        $this->db->exec_INSERTquery('tx_crawler_queue', $fieldArray);
    }

    /************************************
     *
     * URL setting
     *
     ************************************/

    /**
     * Setting a URL for crawling:
     *
     * @param integer $id Page ID
     * @param string $url Complete URL
     * @param array $subCfg Sub configuration array (from TS config)
     * @param integer $tstamp Scheduled-time
     * @param string $configurationHash (optional) configuration hash
     * @param bool $skipInnerDuplicationCheck (optional) skip inner duplication check
     * @return bool
     */
    public function addUrl(
        $id,
        $url,
        array $subCfg,
        $tstamp,
        $configurationHash = '',
        $skipInnerDuplicationCheck = false
    ) {
        $urlAdded = false;

        // Creating parameters:
        $parameters = [
            'url' => $url
        ];

        // fe user group simulation:
        $uGs = implode(',', array_unique(\TYPO3\CMS\Core\Utility\GeneralUtility::intExplode(',', $subCfg['userGroups'], 1)));
        if ($uGs) {
            $parameters['feUserGroupList'] = $uGs;
        }

        // Setting processing instructions
        $parameters['procInstructions'] = \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $subCfg['procInstrFilter']);
        if (is_array($subCfg['procInstrParams.'])) {
            $parameters['procInstrParams'] = $subCfg['procInstrParams.'];
        }

        // Possible TypoScript Template Parents
        $parameters['rootTemplatePid'] = $subCfg['rootTemplatePid'];

        // Compile value array:
        $parameters_serialized = serialize($parameters);
        $fieldArray = [
            'page_id' => intval($id),
            'parameters' => $parameters_serialized,
            'parameters_hash' => \TYPO3\CMS\Core\Utility\GeneralUtility::shortMD5($parameters_serialized),
            'configuration_hash' => $configurationHash,
            'scheduled' => $tstamp,
            'exec_time' => 0,
            'set_id' => intval($this->setID),
            'result_data' => '',
            'configuration' => $subCfg['key'],
        ];

        if ($this->registerQueueEntriesInternallyOnly) {
            // the entries will only be registered and not stored to the database
            $this->queueEntries[] = $fieldArray;
        } else {
            // Start with an empty list so that $rows is defined even when the duplication check is skipped
            $rows = [];
            if (!$skipInnerDuplicationCheck) {
                // check if there is already an equal entry
                $rows = $this->getDuplicateRowsIfExist($tstamp, $fieldArray);
            }

            if (count($rows) == 0) {
                $this->db->exec_INSERTquery('tx_crawler_queue', $fieldArray);
                $uid = $this->db->sql_insert_id();
                $rows[] = $uid;
                $urlAdded = true;
                tx_crawler_domain_events_dispatcher::getInstance()->post('urlAddedToQueue', $this->setID, ['uid' => $uid, 'fieldArray' => $fieldArray]);
            } else {
                tx_crawler_domain_events_dispatcher::getInstance()->post('duplicateUrlInQueue', $this->setID, ['rows' => $rows, 'fieldArray' => $fieldArray]);
            }
        }

        return $urlAdded;
    }

    /**
     * This method determines duplicates for a queue entry with the same parameters and this timestamp.
     * If the timestamp is in the past, it checks whether there is any unprocessed queue entry in the past.
     * If the timestamp is in the future, it checks whether the queued entry has exactly the same timestamp.
     *
     * @param int $tstamp
     * @param array $fieldArray
     *
     * @return array
     */
    protected function getDuplicateRowsIfExist($tstamp, $fieldArray)
    {
        $rows = [];

        $currentTime = $this->getCurrentTime();

        // if this entry is scheduled with "now"
        if ($tstamp <= $currentTime) {
            if ($this->extensionSettings['enableTimeslot']) {
                $timeBegin = $currentTime - 100;
                $timeEnd = $currentTime + 100;
                $where = ' ((scheduled BETWEEN ' . $timeBegin . ' AND ' . $timeEnd . ' ) OR scheduled <= ' . $currentTime . ') ';
            } else {
                $where = 'scheduled <= ' . $currentTime;
            }
        } elseif ($tstamp > $currentTime) {
            // entries with a timestamp in the future need to have the same schedule time
            $where = 'scheduled = ' . $tstamp;
        }

        if (!empty($where)) {
            $result = $this->db->exec_SELECTgetRows(
                'qid',
                'tx_crawler_queue',
                $where .
                ' AND NOT exec_time' .
                ' AND NOT process_id ' .
                ' AND page_id=' . intval($fieldArray['page_id']) .
                ' AND parameters_hash = ' . $this->db->fullQuoteStr($fieldArray['parameters_hash'], 'tx_crawler_queue')
            );

            if (is_array($result)) {
                foreach ($result as $value) {
                    $rows[] = $value['qid'];
                }
            }
        }

        return $rows;
    }

    /**
     * Returns the current system time
     *
     * @return int
     */
    public function getCurrentTime()
    {
        return time();
    }
1218
1219
    /************************************
1220
     *
1221
     * URL reading
1222
     *
1223
     ************************************/
1224
1225
    /**
1226
     * Read URL for single queue entry
1227
     *
1228
     * @param integer $queueId
1229
     * @param boolean $force If set, will process even if exec_time has been set!
1230
     * @return integer
1231
     */
1232
    public function readUrl($queueId, $force = false)
1233
    {
1234
        $ret = 0;
1235
        if ($this->debugMode) {
1236
            \TYPO3\CMS\Core\Utility\GeneralUtility::devlog('crawler-readurl start ' . microtime(true), __FUNCTION__);
1237
        }
1238
        // Get entry:
1239
        list($queueRec) = $this->db->exec_SELECTgetRows(
1240
            '*',
1241
            'tx_crawler_queue',
1242
            'qid=' . intval($queueId) . ($force ? '' : ' AND exec_time=0 AND process_scheduled > 0')
1243
        );
1244
1245
        if (!is_array($queueRec)) {
1246
            return;
1247
        }
1248
1249
        $parameters = unserialize($queueRec['parameters']);
1250
        if ($parameters['rootTemplatePid']) {
1251
            $this->initTSFE((int)$parameters['rootTemplatePid']);
1252
        } else {
1253
            \TYPO3\CMS\Core\Utility\GeneralUtility::sysLog(
1254
                'Page with (' . $queueRec['page_id'] . ') could not be crawled, please check your crawler configuration. Perhaps no Root Template Pid is set',
1255
                'crawler',
1256
                \TYPO3\CMS\Core\Utility\GeneralUtility::SYSLOG_SEVERITY_WARNING
1257
            );
1258
        }
1259
1260
        \AOE\Crawler\Utility\SignalSlotUtility::emitSignal(
1261
            __CLASS__,
1262
            \AOE\Crawler\Utility\SignalSlotUtility::SIGNNAL_QUEUEITEM_PREPROCESS,
1263
            [$queueId, &$queueRec]
1264
        );
1265
1266
        // Set exec_time to lock record:
1267
        $field_array = ['exec_time' => $this->getCurrentTime()];
1268
1269
        if (isset($this->processID)) {
1270
            //if mulitprocessing is used we need to store the id of the process which has handled this entry
1271
            $field_array['process_id_completed'] = $this->processID;
1272
        }
1273
        $this->db->exec_UPDATEquery('tx_crawler_queue', 'qid=' . intval($queueId), $field_array);
1274
1275
        $result = $this->readUrl_exec($queueRec);
1276
        $resultData = unserialize($result['content']);
1277
1278
        // At the moment there is no need to point to specific pollable extensions
1279
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pollSuccess'])) {
1280
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pollSuccess'] as $pollable) {
1281
                // only check the success value if the instruction is running
1282
                // it is important to name the pollSuccess key the same as the procInstructions key
1283
                if (is_array($resultData['parameters']['procInstructions']) && in_array(
1284
                    $pollable,
1285
                        $resultData['parameters']['procInstructions']
1286
                )
1287
                ) {
1288
                    if (!empty($resultData['success'][$pollable])) {
1289
                        $ret |= self::CLI_STATUS_POLLABLE_PROCESSED;
1290
                    }
1291
                }
1292
            }
1293
        }
1294
1295
        // Set result in log which also denotes the end of the processing of this entry.
1296
        $field_array = ['result_data' => serialize($result)];
1297
1298
        \AOE\Crawler\Utility\SignalSlotUtility::emitSignal(
1299
            __CLASS__,
1300
            \AOE\Crawler\Utility\SignalSlotUtility::SIGNNAL_QUEUEITEM_POSTPROCESS,
1301
            [$queueId, &$field_array]
1302
        );
1303
1304
        $this->db->exec_UPDATEquery('tx_crawler_queue', 'qid=' . intval($queueId), $field_array);
1305
1306
        if ($this->debugMode) {
1307
            \TYPO3\CMS\Core\Utility\GeneralUtility::devlog('crawler-readurl stop ' . microtime(true), __FUNCTION__);
1308
        }
1309
1310
        return $ret;
1311
    }
1312
1313
    /**
1314
     * Read URL for not-yet-inserted log-entry
1315
     *
1316
     * @param array $field_array Queue field array
1317
     * @return array|boolean|string Result of readUrl_exec()
1318
     */
1319
    public function readUrlFromArray($field_array)
1320
    {
1321
1322
        // Set exec_time to lock record:
1323
        $field_array['exec_time'] = $this->getCurrentTime();
1324
        $this->db->exec_INSERTquery('tx_crawler_queue', $field_array);
1325
        $queueId = $field_array['qid'] = $this->db->sql_insert_id();
1326
1327
        $result = $this->readUrl_exec($field_array);
1328
1329
        // Set result in log which also denotes the end of the processing of this entry.
1330
        $field_array = ['result_data' => serialize($result)];
1331
1332
        \AOE\Crawler\Utility\SignalSlotUtility::emitSignal(
1333
            __CLASS__,
1334
            \AOE\Crawler\Utility\SignalSlotUtility::SIGNNAL_QUEUEITEM_POSTPROCESS,
1335
            [$queueId, &$field_array]
1336
        );
1337
1338
        $this->db->exec_UPDATEquery('tx_crawler_queue', 'qid=' . intval($queueId), $field_array);
1339
1340
        return $result;
1341
    }
1342
1343
    /**
1344
     * Read URL for a queue record
1345
     *
1346
     * @param array $queueRec Queue record
1347
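     * @return array|boolean|string Result array, false if the request failed, or 'ERROR' if the parameters could not be decoded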
1348
     */
1349
    public function readUrl_exec($queueRec)
1350
    {
1351
        // Decode parameters:
1352
        $parameters = unserialize($queueRec['parameters']);
1353
        $result = 'ERROR';
1354
        if (is_array($parameters)) {
1355
            if ($parameters['_CALLBACKOBJ']) { // Calling object:
1356
                $objRef = $parameters['_CALLBACKOBJ'];
1357
                $callBackObj = &\TYPO3\CMS\Core\Utility\GeneralUtility::getUserObj($objRef);
1358
                if (is_object($callBackObj)) {
1359
                    unset($parameters['_CALLBACKOBJ']);
1360
                    $result = ['content' => serialize($callBackObj->crawler_execute($parameters, $this))];
1361
                } else {
1362
                    $result = ['content' => 'No object: ' . $objRef];
1363
                }
1364
            } else { // Regular FE request:
1365
1366
                // Prepare:
1367
                $crawlerId = $queueRec['qid'] . ':' . md5($queueRec['qid'] . '|' . $queueRec['set_id'] . '|' . $GLOBALS['TYPO3_CONF_VARS']['SYS']['encryptionKey']);
1368
1369
                // Get result:
1370
                $result = $this->requestUrl($parameters['url'], $crawlerId);
1371
1372
                tx_crawler_domain_events_dispatcher::getInstance()->post('urlCrawled', $queueRec['set_id'], ['url' => $parameters['url'], 'result' => $result]);
1373
            }
1374
        }
1375
1376
        return $result;
1377
    }
1378
1379
    /**
1380
     * Gets the content of a URL.
1381
     *
1382
     * @param string $originalUrl URL to read
1383
     * @param string $crawlerId Crawler ID string (qid + hash to verify)
1384
     * @param integer $timeout Timeout time
1385
     * @param integer $recursion Recursion limiter for 302 redirects
1386
     * @return array|boolean Result array, or false on failure
1387
     */
1388
    public function requestUrl($originalUrl, $crawlerId, $timeout = 2, $recursion = 10)
1389
    {
1390
        if (!$recursion) {
1391
            return false;
1392
        }
1393
1394
        // Parse URL, checking for scheme:
1395
        $url = parse_url($originalUrl);
1396
1397
        if ($url === false) {
1398
            if (TYPO3_DLOG) {
1399
                \TYPO3\CMS\Core\Utility\GeneralUtility::devLog(sprintf('Could not parse_url() for string "%s"', $originalUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1400
            }
1401
            return false;
1402
        }
1403
1404
        if (!in_array($url['scheme'], ['','http','https'])) {
1405
            if (TYPO3_DLOG) {
1406
                \TYPO3\CMS\Core\Utility\GeneralUtility::devLog(sprintf('Scheme does not match for url "%s"', $originalUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1407
            }
1408
            return false;
1409
        }
1410
1411
        // direct request
1412
        if ($this->extensionSettings['makeDirectRequests']) {
1413
            $result = $this->sendDirectRequest($originalUrl, $crawlerId);
1414
            return $result;
1415
        }
1416
1417
        $reqHeaders = $this->buildRequestHeaderArray($url, $crawlerId);
1418
1419
        // thanks to Pierrick Caillon for adding proxy support
1420
        $rurl = $url;
1421
1422
        if ($GLOBALS['TYPO3_CONF_VARS']['SYS']['curlUse'] && $GLOBALS['TYPO3_CONF_VARS']['SYS']['curlProxyServer']) {
1423
            $rurl = parse_url($GLOBALS['TYPO3_CONF_VARS']['SYS']['curlProxyServer']);
1424
            $url['path'] = $url['scheme'] . '://' . $url['host'] . ($url['port'] > 0 ? ':' . $url['port'] : '') . $url['path'];
1425
            $reqHeaders = $this->buildRequestHeaderArray($url, $crawlerId);
1426
        }
1427
1428
        $host = $rurl['host'];
1429
1430
        if ($url['scheme'] == 'https') {
1431
            $host = 'ssl://' . $host;
1432
            $port = ($rurl['port'] > 0) ? $rurl['port'] : 443;
1433
        } else {
1434
            $port = ($rurl['port'] > 0) ? $rurl['port'] : 80;
1435
        }
1436
1437
        $startTime = microtime(true);
1438
        $fp = fsockopen($host, $port, $errno, $errstr, $timeout);
1439
1440
        if (!$fp) {
1441
            if (TYPO3_DLOG) {
1442
                \TYPO3\CMS\Core\Utility\GeneralUtility::devLog(sprintf('Error while opening "%s"', $originalUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1443
            }
1444
            return false;
1445
        } else {
1446
            // Request message:
1447
            $msg = implode("\r\n", $reqHeaders) . "\r\n\r\n";
1448
            fputs($fp, $msg);
1449
1450
            // Read response:
1451
            $d = $this->getHttpResponseFromStream($fp);
1452
            fclose($fp);
1453
1454
            $time = microtime(true) - $startTime;
1455
            $this->log($originalUrl . ' ' . $time);
1456
1457
            // Implode content and headers:
1458
            $result = [
1459
                'request' => $msg,
1460
                'headers' => implode('', $d['headers']),
1461
                'content' => implode('', (array)$d['content'])
1462
            ];
1463
1464
            if (($this->extensionSettings['follow30x']) && ($newUrl = $this->getRequestUrlFrom302Header($d['headers'], $url['user'], $url['pass']))) {
1465
                // Follow the redirect once; the recursion counter is decremented on every hop:
1466
                $newRequestUrl = $this->requestUrl($newUrl, $crawlerId, $timeout, --$recursion);
1467
1468
                if (is_array($newRequestUrl)) {
1469
                    $result = array_merge(['parentRequest' => $result], $newRequestUrl);
1470
                } else {
1471
                    if (TYPO3_DLOG) {
1472
                        \TYPO3\CMS\Core\Utility\GeneralUtility::devLog(sprintf('Error while opening "%s"', $newUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1473
                    }
1474
                    return false;
1475
                }
1476
            }
1477
1478
            return $result;
1479
        }
1480
    }
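
The scheme guard at the top of requestUrl() can be looked at in isolation. The following is a minimal sketch, not part of tx_crawler_lib: the function name `isCrawlableUrl` is hypothetical, and it only mirrors the two early checks (a `parse_url()` failure and a scheme outside the empty string, `http` and `https`).

```php
<?php
// Hypothetical helper for illustration only; it is not a method of tx_crawler_lib.
// It mirrors the two early-exit checks in requestUrl().
function isCrawlableUrl($originalUrl)
{
    $url = parse_url($originalUrl);

    // parse_url() returns false for seriously malformed URLs.
    if ($url === false) {
        return false;
    }

    // A relative URL has no scheme component, so fall back to an empty string.
    $scheme = isset($url['scheme']) ? $url['scheme'] : '';

    return in_array($scheme, ['', 'http', 'https'], true);
}
```

Under these assumptions, `isCrawlableUrl('/index.php?id=1')` is accepted because the scheme is empty, while `isCrawlableUrl('ftp://example.com/file.txt')` is rejected.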
1481
1482
    /**
1483
     * Gets the base path of the website frontend.
1484
     * (e.g. if you call http://mydomain.com/cms/index.php in
1485
     * the browser the base path is "/cms/")
1486
     *
1487
     * @return string Base path of the website frontend
1488
     */
1489
    protected function getFrontendBasePath()
1490
    {
1491
        $frontendBasePath = '/';
1492
1493
        // Get the path from the extension settings:
1494
        if (isset($this->extensionSettings['frontendBasePath']) && $this->extensionSettings['frontendBasePath']) {
1495
            $frontendBasePath = $this->extensionSettings['frontendBasePath'];
1496
            // If empty, try to use config.absRefPrefix:
1497
        } elseif (isset($GLOBALS['TSFE']->absRefPrefix) && !empty($GLOBALS['TSFE']->absRefPrefix)) {
1498
            $frontendBasePath = $GLOBALS['TSFE']->absRefPrefix;
1499
            // If not in CLI mode the base path can be determined from $_SERVER environment:
1500
        } elseif (!defined('TYPO3_REQUESTTYPE_CLI') || !TYPO3_REQUESTTYPE_CLI) {
1501
            $frontendBasePath = \TYPO3\CMS\Core\Utility\GeneralUtility::getIndpEnv('TYPO3_SITE_PATH');
1502
        }
1503
1504
        // Base path must be '/<pathSegments>/':
1505
        if ($frontendBasePath != '/') {
1506
            $frontendBasePath = '/' . ltrim($frontendBasePath, '/');
1507
            $frontendBasePath = rtrim($frontendBasePath, '/') . '/';
1508
        }
1509
1510
        return $frontendBasePath;
1511
    }
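
The normalization step at the end of getFrontendBasePath() guarantees exactly one leading and one trailing slash. A small standalone sketch of that step follows; the function name `normalizeBasePath` is an assumption made for this example and does not exist in the class.

```php
<?php
// Hypothetical standalone version of the normalization step in getFrontendBasePath().
function normalizeBasePath($path)
{
    if ($path !== '/') {
        // Force exactly one leading slash ...
        $path = '/' . ltrim($path, '/');
        // ... and exactly one trailing slash.
        $path = rtrim($path, '/') . '/';
    }

    return $path;
}
```

With this sketch, `normalizeBasePath('cms')` and `normalizeBasePath('/cms/')` both yield `/cms/`, and an empty string collapses to `/`.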
1512
1513
    /**
1514
     * Executes a shell command and returns the outputted result.
1515
     *
1516
     * @param string $command Shell command to be executed
1517
     * @return string Outputted result of the command execution
1518
     */
1519
    protected function executeShellCommand($command)
1520
    {
1521
        $result = shell_exec($command);
1522
        return $result;
1523
    }
1524
1525
    /**
1526
     * Reads HTTP response from the given stream.
1527
     *
1528
     * @param  resource $streamPointer  Pointer to connection stream.
1529
     * @return array                    Associative array with the following items:
1530
     *                                  headers <array> Response headers sent by server.
1531
     *                                  content <array> Content, with each line as an array item.
1532
     */
1533
    protected function getHttpResponseFromStream($streamPointer)
1534
    {
1535
        $response = ['headers' => [], 'content' => []];
1536
1537
        if (is_resource($streamPointer)) {
1538
            // read headers
1539
            while (($line = fgets($streamPointer, 2048)) !== false) {
1540
                $line = trim($line);
1541
                if ($line !== '') {
1542
                    $response['headers'][] = $line;
1543
                } else {
1544
                    break;
1545
                }
1546
            }
1547
1548
            // read content
1549
            while (($line = fgets($streamPointer, 2048)) !== false) {
1550
                $response['content'][] = $line;
1551
            }
1552
        }
1553
1554
        return $response;
1555
    }
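
The header/body split performed by getHttpResponseFromStream() can be tried without a network connection. The sketch below assumes a `php://memory` stream as a stand-in for the socket; the function name `readHttpResponse` and the sample response are made up for illustration.

```php
<?php
// Hypothetical standalone variant of getHttpResponseFromStream().
function readHttpResponse($stream)
{
    $response = ['headers' => [], 'content' => []];

    // Header lines run until the first empty line.
    while (($line = fgets($stream, 2048)) !== false) {
        $line = trim($line);
        if ($line === '') {
            break;
        }
        $response['headers'][] = $line;
    }

    // Everything after the empty line is body content, kept line by line.
    while (($line = fgets($stream, 2048)) !== false) {
        $response['content'][] = $line;
    }

    return $response;
}

// An in-memory stream plays the role of the socket opened by fsockopen().
$stream = fopen('php://memory', 'r+');
fwrite($stream, "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<html>\n</html>\n");
rewind($stream);
$response = readHttpResponse($stream);
fclose($stream);
```

Comparing the return value of `fgets()` against `false` instead of relying on truthiness avoids stopping early on a final line that consists only of `0`.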
1556
1557
    /**
1558
     * @param string $message
1559
     */
1560
    protected function log($message)
1561
    {
1562
        if (!empty($this->extensionSettings['logFileName'])) {
1563
            @file_put_contents($this->extensionSettings['logFileName'], date('Ymd His') . ' ' . $message . PHP_EOL, FILE_APPEND);
1564
        }
1565
    }
1566
1567
    /**
1568
     * Builds HTTP request headers.
1569
     *
1570
     * @param array $url
1571
     * @param string $crawlerId
1572
     *
1573
     * @return array
1574
     */
1575
    protected function buildRequestHeaderArray(array $url, $crawlerId)
1576
    {
1577
        $reqHeaders = [];
1578
        $reqHeaders[] = 'GET ' . $url['path'] . (!empty($url['query']) ? '?' . $url['query'] : '') . ' HTTP/1.0';
1579
        $reqHeaders[] = 'Host: ' . $url['host'];
1580
        if (isset($url['query']) && stristr($url['query'], 'ADMCMD_previewWS')) {
1581
            $reqHeaders[] = 'Cookie: $Version="1"; be_typo_user="1"; $Path=/';
1582
        }
1583
        $reqHeaders[] = 'Connection: close';
1584
        if (isset($url['user']) && $url['user'] != '') {
1585
            $reqHeaders[] = 'Authorization: Basic ' . base64_encode($url['user'] . ':' . $url['pass']);
1586
        }
1587
        $reqHeaders[] = 'X-T3crawler: ' . $crawlerId;
1588
        $reqHeaders[] = 'User-Agent: TYPO3 crawler';
1589
        return $reqHeaders;
1590
    }
1591
1592
    /**
1593
     * Checks whether the submitted HTTP header contains a redirect location and builds a new crawler URL
1594
     *
1595
     * @param array $headers HTTP Header
1596
     * @param string $user HTTP Auth. User
1597
     * @param string $pass HTTP Auth. Password
1598
     * @return string|boolean New URL, or false if there is no redirect to follow
1599
     */
1600
    protected function getRequestUrlFrom302Header($headers, $user = '', $pass = '')
1601
    {
1602
        if (!is_array($headers)) {
1603
            return false;
1604
        }
1605
        if (!(stristr($headers[0], '301 Moved') || stristr($headers[0], '302 Found') || stristr($headers[0], '302 Moved'))) {
1606
            return false;
1607
        }
1608
1609
        foreach ($headers as $hl) {
1610
            $tmp = array_pad(explode(': ', $hl, 2), 2, '');
1611
            $header[trim($tmp[0])] = trim($tmp[1]);
1612
            if (trim($tmp[0]) == 'Location') {
1613
                break;
1614
            }
1615
        }
1616
        if (!isset($header['Location'])) {
1617
            return false;
1618
        }
1619
1620
        if ($user != '') {
1621
            if (!($tmp = parse_url($header['Location']))) {
1622
                return false;
1623
            }
1624
            $newUrl = $tmp['scheme'] . '://' . $user . ':' . $pass . '@' . $tmp['host'] . $tmp['path'];
1625
            if ($tmp['query'] != '') {
1626
                $newUrl .= '?' . $tmp['query'];
1627
            }
1628
        } else {
1629
            $newUrl = $header['Location'];
1630
        }
1631
        return $newUrl;
1632
    }
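
The redirect detection in getRequestUrlFrom302Header() boils down to two steps: check the status line, then find the Location header. The sketch below is a simplified variant under stated assumptions, not the method itself: `extractRedirectLocation` is a hypothetical name, the status check uses a regular expression instead of the `stristr()` calls above, and the HTTP auth rewriting is left out.

```php
<?php
// Hypothetical, simplified variant of getRequestUrlFrom302Header() without HTTP auth handling.
function extractRedirectLocation(array $headers)
{
    // Only 301 and 302 status lines count as redirects in this sketch.
    if (empty($headers) || !preg_match('#^HTTP/\S+\s+30[12]\b#', $headers[0])) {
        return false;
    }

    foreach ($headers as $line) {
        // Split on the first colon only, so colons inside the URL stay intact.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && strcasecmp(trim($parts[0]), 'Location') === 0) {
            return trim($parts[1]);
        }
    }

    return false;
}
```

Limiting `explode()` to two parts matters here because the Location value itself contains a colon after the scheme.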
1633
1634
    /**************************
1635
     *
1636
     * tslib_fe hooks:
1637
     *
1638
     **************************/
1639
1640
    /**
1641
     * Initialization hook (called after database connection)
1642
     * Takes the "HTTP_X_T3CRAWLER" header, looks up the queue record and verifies that the request comes from the system (by comparing hashes)
1643
     *
1644
     * @param array $params Parameters from frontend
1645
     * @param object $ref TSFE object (reference under PHP5)
1646
     * @return void
1647
     */
1648
    public function fe_init(&$params, $ref)
1649
    {
1650
1651
        // Authenticate crawler request:
1652
        if (isset($_SERVER['HTTP_X_T3CRAWLER'])) {
1653
            list($queueId, $hash) = array_pad(explode(':', $_SERVER['HTTP_X_T3CRAWLER'], 2), 2, '');
1654
            list($queueRec) = $this->db->exec_SELECTgetRows('*', 'tx_crawler_queue', 'qid=' . intval($queueId));
1655
1656
            // If a crawler record was found and hash was matching, set it up:
1657
            if (is_array($queueRec) && $hash === md5($queueRec['qid'] . '|' . $queueRec['set_id'] . '|' . $GLOBALS['TYPO3_CONF_VARS']['SYS']['encryptionKey'])) {
1658
                $params['pObj']->applicationData['tx_crawler']['running'] = true;
1659
                $params['pObj']->applicationData['tx_crawler']['parameters'] = unserialize($queueRec['parameters']);
1660
                $params['pObj']->applicationData['tx_crawler']['log'] = [];
1661
            } else {
1662
                die('No crawler entry found!');
1663
            }
1664
        }
1665
    }
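
The crawler ID sent in the X-T3crawler header by readUrl_exec() and checked in fe_init() is the queue ID followed by an MD5 hash over `qid|set_id|encryptionKey`. The round trip can be sketched as follows. Both function names are hypothetical, the set ID and key are passed in directly instead of being read from the queue table and `$GLOBALS['TYPO3_CONF_VARS']`, and `hash_equals()` (PHP 5.6+) is swapped in for the `===` comparison used above.

```php
<?php
// Hypothetical helpers illustrating the crawler ID format "qid:md5(qid|set_id|key)".
function buildCrawlerId($qid, $setId, $encryptionKey)
{
    return $qid . ':' . md5($qid . '|' . $setId . '|' . $encryptionKey);
}

function verifyCrawlerId($headerValue, $setId, $encryptionKey)
{
    // Expect exactly "qid:hash"; anything else is rejected.
    $parts = explode(':', $headerValue, 2);
    if (count($parts) !== 2) {
        return false;
    }
    list($qid, $hash) = $parts;

    // In fe_init() the set ID comes from the queue record looked up by qid;
    // here it is a plain argument to keep the sketch self-contained.
    $expected = md5((int)$qid . '|' . $setId . '|' . $encryptionKey);

    // Assumption: constant-time comparison instead of the === used in the class.
    return hash_equals($expected, $hash);
}
```

A header built with `buildCrawlerId(42, 7, 'secret')` verifies with the same set ID and key, and fails as soon as the queue ID, the set ID or the key differs.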
1666
1667
    /*****************************
1668
     *
1669
     * Compiling URLs to crawl - tools
1670
     *
1671
     *****************************/
1672
1673
    /**
1674
     * @param integer $id Root page id to start from.
1675
     * @param integer $depth Depth of tree, 0=only id-page, 1=one sublevel, 99=infinite
1676
     * @param integer $scheduledTime Unix Time when the URL is timed to be visited when put in queue
1677
     * @param integer $reqMinute Number of requests per minute (creates the interleave between requests)
1678
     * @param boolean $submitCrawlUrls If set, submits the URLs to queue in database (real crawling)
1679
     * @param boolean $downloadCrawlUrls If set (and $submitCrawlUrls is false), fills $downloadUrls with entries
1680
     * @param array $incomingProcInstructions Array of processing instructions
1681
     * @param array $configurationSelection Array of configuration keys
1682
     * @return string
1683
     */
1684
    public function getPageTreeAndUrls(
1685
        $id,
1686
        $depth,
1687
        $scheduledTime,
1688
        $reqMinute,
1689
        $submitCrawlUrls,
1690
        $downloadCrawlUrls,
1691
        array $incomingProcInstructions,
1692
        array $configurationSelection
1693
    ) {
1694
        global $BACK_PATH;
1695
        global $LANG;
1696
        if (!is_object($LANG)) {
1697
            $LANG = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('language');
1698
            $LANG->init(0);
1699
        }
1700
        $this->scheduledTime = $scheduledTime;
1701
        $this->reqMinute = $reqMinute;
1702
        $this->submitCrawlUrls = $submitCrawlUrls;
1703
        $this->downloadCrawlUrls = $downloadCrawlUrls;
1704
        $this->incomingProcInstructions = $incomingProcInstructions;
1705
        $this->incomingConfigurationSelection = $configurationSelection;
1706
1707
        $this->duplicateTrack = [];
1708
        $this->downloadUrls = [];
1709
1710
        // Drawing tree:
1711
        /* @var $tree \TYPO3\CMS\Backend\Tree\View\PageTreeView */
1712
        $tree = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Backend\Tree\View\PageTreeView');
1713
        $perms_clause = $GLOBALS['BE_USER']->getPagePermsClause(1);
1714
        $tree->init('AND ' . $perms_clause);
1715
1716
        $pageinfo = \TYPO3\CMS\Backend\Utility\BackendUtility::readPageAccess($id, $perms_clause);
1717
1718
        // Set root row:
1719
        $tree->tree[] = [
1720
            'row' => $pageinfo,
1721
            'HTML' => \AOE\Crawler\Utility\IconUtility::getIconForRecord('pages', $pageinfo)
1722
        ];
1723
1724
        // Get branch beneath:
1725
        if ($depth) {
1726
            $tree->getTree($id, $depth, '');
1727
        }
1728
1729
        // Traverse page tree:
1730
        $code = '';
1731
1732
        foreach ($tree->tree as $data) {
1733
            $this->MP = false;
1734
1735
            // recognize mount points
1736
            if ($data['row']['doktype'] == 7) {
1737
                $mountpage = $this->db->exec_SELECTgetRows('*', 'pages', 'uid = ' . (int)$data['row']['uid']);
1738
1739
                // fetch mounted pages
1740
                $this->MP = $mountpage[0]['mount_pid'] . '-' . $data['row']['uid'];
1741
1742
                $mountTree = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Backend\Tree\View\PageTreeView');
1743
                $mountTree->init('AND ' . $perms_clause);
1744
                $mountTree->getTree($mountpage[0]['mount_pid'], $depth, '');
1745
1746
                foreach ($mountTree->tree as $mountData) {
1747
                    $code .= $this->drawURLs_addRowsForPage(
1748
                        $mountData['row'],
1749
                        $mountData['HTML'] . \TYPO3\CMS\Backend\Utility\BackendUtility::getRecordTitle('pages', $mountData['row'], true)
1750
                    );
1751
                }
1752
1753
                // replace page when mount_pid_ol is enabled
1754
                if ($mountpage[0]['mount_pid_ol']) {
1755
                    $data['row']['uid'] = $mountpage[0]['mount_pid'];
1756
                } else {
1757
                    // if the mount_pid_ol is not set the MP must not be used for the mountpoint page
1758
                    $this->MP = false;
1759
                }
1760
            }
1761
1762
            $code .= $this->drawURLs_addRowsForPage(
1763
                $data['row'],
1764
                $data['HTML'] . \TYPO3\CMS\Backend\Utility\BackendUtility::getRecordTitle('pages', $data['row'], true)
1765
            );
1766
        }
1767
1768
        return $code;
1769
    }
1770
1771
    /**
1772
     * Expands exclude string
1773
     *
1774
     * @param string $excludeString Exclude string
1775
     * @return array
1776
     */
1777
    public function expandExcludeString($excludeString)
1778
    {
1779
        // internal static caches;
1780
        static $expandedExcludeStringCache;
1781
        static $treeCache;
1782
1783
        if (empty($expandedExcludeStringCache[$excludeString])) {
1784
            $pidList = [];
1785
1786
            if (!empty($excludeString)) {
1787
                /* @var $tree \TYPO3\CMS\Backend\Tree\View\PageTreeView */
1788
                $tree = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Backend\Tree\View\PageTreeView');
1789
                $tree->init('AND ' . $this->backendUser->getPagePermsClause(1));
1790
1791
                $excludeParts = \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $excludeString);
1792
1793
                foreach ($excludeParts as $excludePart) {
1794
                    list($pid, $depth) = \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode('+', $excludePart);
1795
1796
                    // default is "page only" = "depth=0"
1797
                    if (empty($depth)) {
1798
                        $depth = (stristr($excludePart, '+')) ? 99 : 0;
1799
                    }
1800
1801
                    $pidList[] = $pid;
1802
1803
                    if ($depth > 0) {
1804
                        if (empty($treeCache[$pid][$depth])) {
1805
                            $tree->reset();
1806
                            $tree->getTree($pid, $depth);
1807
                            $treeCache[$pid][$depth] = $tree->tree;
1808
                        }
1809
1810
                        foreach ($treeCache[$pid][$depth] as $data) {
1811
                            $pidList[] = $data['row']['uid'];
1812
                        }
1813
                    }
1814
                }
1815
            }
1816
1817
            $expandedExcludeStringCache[$excludeString] = array_unique($pidList);
1818
        }
1819
1820
        return $expandedExcludeStringCache[$excludeString];
1821
    }
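
The exclude string handled by expandExcludeString() is a comma-separated list of `pid`, `pid+depth` or `pid+` entries. The sketch below covers only the parsing step, without the page tree lookup or the static caches; `parseExcludeString` is a hypothetical name, and plain `explode()`/`trim()` stand in for TYPO3's `GeneralUtility::trimExplode()`.

```php
<?php
// Hypothetical parser for the exclude string syntax; returns [pid => depth].
function parseExcludeString($excludeString)
{
    $result = [];

    foreach (explode(',', $excludeString) as $part) {
        $part = trim($part);
        if ($part === '') {
            continue;
        }

        $pieces = array_map('trim', explode('+', $part, 2));
        $pid = (int)$pieces[0];
        $depth = isset($pieces[1]) ? (int)$pieces[1] : 0;

        // As in the class: an empty depth after "+" means "all levels" (99),
        // no "+" at all means "this page only" (0).
        if ($depth === 0 && strpos($part, '+') !== false) {
            $depth = 99;
        }

        $result[$pid] = $depth;
    }

    return $result;
}
```

For the input `'12+3, 15, 20+'` this sketch maps page 12 to depth 3, page 15 to depth 0 and page 20 to depth 99.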
1822
1823
    /**
1824
     * Create the rows for display of the page tree
1825
     * For each page a number of rows are shown displaying GET variable configuration
1826
     *
1827
     * @param    array        $pageRow Page row
1828
     * @param    string       $pageTitleAndIcon Page icon and title for row
1829
     * @return    string        HTML <tr> content (one or more)
1830
     */
1831
    public function drawURLs_addRowsForPage(array $pageRow, $pageTitleAndIcon)
1832
    {
1833
        $skipMessage = '';
1834
1835
        // Get list of configurations
1836
        $configurations = $this->getUrlsForPageRow($pageRow, $skipMessage);
1837
1838
        if (count($this->incomingConfigurationSelection) > 0) {
1839
            // remove configuration that does not match the current selection
1840
            foreach ($configurations as $confKey => $confArray) {
1841
                if (!in_array($confKey, $this->incomingConfigurationSelection)) {
1842
                    unset($configurations[$confKey]);
1843
                }
1844
            }
1845
        }
1846
1847
        // Traverse parameter combinations:
1848
        $c = 0;
1850
        $content = '';
1851
        if (count($configurations)) {
1852
            foreach ($configurations as $confKey => $confArray) {
1853
1854
                // Title column:
1855
                if (!$c) {
1856
                    $titleClm = '<td rowspan="' . count($configurations) . '">' . $pageTitleAndIcon . '</td>';
1857
                } else {
1858
                    $titleClm = '';
1859
                }
1860
1861
                if (!in_array($pageRow['uid'], $this->expandExcludeString($confArray['subCfg']['exclude']))) {
1862
1863
                    // URL list:
1864
                    $urlList = $this->urlListFromUrlArray(
1865
                        $confArray,
1866
                        $pageRow,
1867
                        $this->scheduledTime,
1868
                        $this->reqMinute,
1869
                        $this->submitCrawlUrls,
1870
                        $this->downloadCrawlUrls,
1871
                        $this->duplicateTrack,
1872
                        $this->downloadUrls,
1873
                        $this->incomingProcInstructions // if empty the urls won't be filtered by processing instructions
1874
                    );
1875
1876
                    // Expanded parameters:
1877
                    $paramExpanded = '';
1878
                    $calcAccu = [];
1879
                    $calcRes = 1;
1880
                    foreach ($confArray['paramExpanded'] as $gVar => $gVal) {
1881
                        $paramExpanded .= '
1882
                            <tr>
1883
                                <td class="bgColor4-20">' . htmlspecialchars('&' . $gVar . '=') . '<br/>' .
1884
                                                '(' . count($gVal) . ')' .
1885
                                                '</td>
1886
                                <td class="bgColor4" nowrap="nowrap">' . nl2br(htmlspecialchars(implode(chr(10), $gVal))) . '</td>
1887
                            </tr>
1888
                        ';
1889
                        $calcRes *= count($gVal);
1890
                        $calcAccu[] = count($gVal);
1891
                    }
1892
                    $paramExpanded = '<table class="lrPadding c-list param-expanded">' . $paramExpanded . '</table>';
1893
                    $paramExpanded .= 'Comb: ' . implode('*', $calcAccu) . '=' . $calcRes;
1894
1895
                    // Options
1896
                    $optionValues = '';
1897
                    if ($confArray['subCfg']['userGroups']) {
1898
                        $optionValues .= 'User Groups: ' . $confArray['subCfg']['userGroups'] . '<br/>';
1899
                    }
1900
                    if ($confArray['subCfg']['baseUrl']) {
1901
                        $optionValues .= 'Base Url: ' . $confArray['subCfg']['baseUrl'] . '<br/>';
1902
                    }
1903
                    if ($confArray['subCfg']['procInstrFilter']) {
1904
                        $optionValues .= 'ProcInstr: ' . $confArray['subCfg']['procInstrFilter'] . '<br/>';
1905
                    }
1906
1907
                    // Compile row:
1908
                    $content .= '
1909
                        <tr class="bgColor' . ($c % 2 ? '-20' : '-10') . '">
1910
                            ' . $titleClm . '
1911
                            <td>' . htmlspecialchars($confKey) . '</td>
1912
                            <td>' . nl2br(htmlspecialchars(rawurldecode(trim(str_replace('&', chr(10) . '&', \TYPO3\CMS\Core\Utility\GeneralUtility::implodeArrayForUrl('', $confArray['paramParsed'])))))) . '</td>
1913
                            <td>' . $paramExpanded . '</td>
1914
                            <td nowrap="nowrap">' . $urlList . '</td>
1915
                            <td nowrap="nowrap">' . $optionValues . '</td>
1916
                            <td nowrap="nowrap">' . \TYPO3\CMS\Core\Utility\DebugUtility::viewArray($confArray['subCfg']['procInstrParams.']) . '</td>
Bug: The type TYPO3\CMS\Core\Utility\DebugUtility was not found. Maybe you did not declare it correctly or list all dependencies?

The issue could also be caused by a filter entry in the build configuration. If the path has been excluded in your configuration, e.g. excluded_paths: ["lib/*"], you can move it to the dependency path list as follows:

filter:
    dependency_paths: ["lib/*"]

For further information see https://scrutinizer-ci.com/docs/tools/php/php-scrutinizer/#list-dependency-paths
1917
                        </tr>';
1918
                } else {
1919
                    $content .= '<tr class="bgColor' . ($c % 2 ? '-20' : '-10') . '">
1920
                            ' . $titleClm . '
1921
                            <td>' . htmlspecialchars($confKey) . '</td>
1922
                            <td colspan="5"><em>No entries</em> (Page is excluded in this configuration)</td>
1923
                        </tr>';
1924
                }
1925
1926
                $c++;
1927
            }
1928
        } else {
1929
            $message = !empty($skipMessage) ? ' (' . $skipMessage . ')' : '';
1930
1931
            // Compile row:
1932
            $content .= '
1933
                <tr class="bgColor-20" style="border-bottom: 1px solid black;">
1934
                    <td>' . $pageTitleAndIcon . '</td>
1935
                    <td colspan="6"><em>No entries</em>' . $message . '</td>
1936
                </tr>';
1937
        }
1938
1939
        return $content;
1940
    }
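The "Comb:" summary built in `drawURLs_addRowsForPage()` is the product of the number of values per expanded GET parameter. A minimal sketch of that arithmetic follows; `describeCombinations()` and the sample parameter names are hypothetical and only serve as illustration.

```php
<?php

/**
 * Hypothetical helper: given expanded parameters (name => list of values),
 * return a summary string in the same "Comb: a*b=total" shape.
 */
function describeCombinations(array $paramExpanded): string
{
    $counts = [];
    $total = 1;
    foreach ($paramExpanded as $values) {
        $counts[] = count($values);
        $total *= count($values);
    }
    return 'Comb: ' . implode('*', $counts) . '=' . $total;
}

echo describeCombinations([
    'L' => [0, 1, 2],
    'tx_example[uid]' => [10, 11],
]) . "\n";
```

Three values for one parameter and two for the other give six URL combinations, which is why a configuration with several multi-valued parameters can grow the queue quickly.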
1941
1942
    /**
1943
     * @return int
1944
     */
1945
    public function getUnprocessedItemsCount()
1946
    {
1947
        $res = $this->db->exec_SELECTquery(
1948
            'count(*) as num',
1949
            'tx_crawler_queue',
1950
            'exec_time=0 AND process_scheduled=0 AND scheduled<=' . $this->getCurrentTime()
1951
        );
1952
1953
        $count = $this->db->sql_fetch_assoc($res);
1954
        return $count['num'];
1955
    }
1956
1957
    /*****************************
1958
     *
1959
     * CLI functions
1960
     *
1961
     *****************************/
1962
1963
    /**
1964
     * Main function for running from Command Line PHP script (cron job)
1965
     * See ext/crawler/cli/crawler_cli.phpsh for details
1966
     *
1967
     * @return int bitmask of CLI_STATUS_* flags describing the outcome of the run
1968
     */
1969
    public function CLI_main()
1970
    {
1971
        $this->setAccessMode('cli');
1972
        $result = self::CLI_STATUS_NOTHING_PROCCESSED;
1973
        $cliObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('tx_crawler_cli');
1974
1975
        if (isset($cliObj->cli_args['-h']) || isset($cliObj->cli_args['--help'])) {
1976
            $cliObj->cli_validateArgs();
1977
            $cliObj->cli_help();
1978
            exit;
Best Practice: Using exit here is not recommended. In general, usage of exit should be done with care and only when running in a scripting context like a CLI script.
1979
        }
1980
1981
        if (!$this->getDisabled() && $this->CLI_checkAndAcquireNewProcess($this->CLI_buildProcessId())) {
1982
            $countInARun = $cliObj->cli_argValue('--countInARun') ? intval($cliObj->cli_argValue('--countInARun')) : $this->extensionSettings['countInARun'];
1983
            // Seconds
1984
            $sleepAfterFinish = $cliObj->cli_argValue('--sleepAfterFinish') ? intval($cliObj->cli_argValue('--sleepAfterFinish')) : $this->extensionSettings['sleepAfterFinish'];
1985
            // Milliseconds
1986
            $sleepTime = $cliObj->cli_argValue('--sleepTime') ? intval($cliObj->cli_argValue('--sleepTime')) : $this->extensionSettings['sleepTime'];
1987
1988
            try {
1989
                // Run process:
1990
                $result = $this->CLI_run($countInARun, $sleepTime, $sleepAfterFinish);
1991
            } catch (Exception $e) {
1992
                $this->CLI_debug(get_class($e) . ': ' . $e->getMessage());
1993
                $result = self::CLI_STATUS_ABORTED;
1994
            }
1995
1996
            // Cleanup
1997
            $this->db->exec_DELETEquery('tx_crawler_process', 'assigned_items_count = 0');
1998
1999
            //TODO can't we do that in a clean way?
2000
            $this->CLI_releaseProcesses($this->CLI_buildProcessId());
2001
2002
            $this->CLI_debug("Unprocessed items remaining: " . $this->getUnprocessedItemsCount() . " (" . $this->CLI_buildProcessId() . ")");
2003
            $result |= ($this->getUnprocessedItemsCount() > 0 ? self::CLI_STATUS_REMAIN : self::CLI_STATUS_NOTHING_PROCCESSED);
2004
        } else {
2005
            $result |= self::CLI_STATUS_ABORTED;
2006
        }
2007
2008
        return $result;
2009
    }
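`CLI_main()` combines several outcomes into one integer with `|=`, which only works if the status constants are distinct bit flags. The sketch below shows the idea; the constant values are assumptions made for illustration (the real `CLI_STATUS_*` constants are defined elsewhere in the class and are not shown in this section), and `describeStatus()` is a hypothetical helper.

```php
<?php

// Assumed flag values for illustration only.
const STATUS_NOTHING_PROCESSED = 0;
const STATUS_REMAIN = 1;
const STATUS_PROCESSED = 2;
const STATUS_ABORTED = 4;

/**
 * Hypothetical helper: turn a combined status value into readable flag names.
 */
function describeStatus(int $status): array
{
    $flags = [
        'REMAIN' => STATUS_REMAIN,
        'PROCESSED' => STATUS_PROCESSED,
        'ABORTED' => STATUS_ABORTED,
    ];
    $names = [];
    foreach ($flags as $name => $flag) {
        if (($status & $flag) === $flag) {
            $names[] = $name;
        }
    }
    return $names ?: ['NOTHING_PROCESSED'];
}

$result = STATUS_NOTHING_PROCESSED;
$result |= STATUS_PROCESSED; // some URLs were read
$result |= STATUS_REMAIN;    // the queue is not empty yet

echo implode(',', describeStatus($result)) . "\n";
```

A caller such as a cron wrapper could test individual bits this way instead of comparing the return value against a single constant.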
2010
2011
    /**
2012
     * Function executed by crawler_im.php cli script.
2013
     *
2014
     * @return void
2015
     */
2016
    public function CLI_main_im()
2017
    {
2018
        $this->setAccessMode('cli_im');
2019
2020
        $cliObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('tx_crawler_cli_im');
2021
2022
        // Force user to admin state and set workspace to "Live":
2023
        $this->backendUser->user['admin'] = 1;
2024
        $this->backendUser->setWorkspace(0);
2025
2026
        // Print help
2027
        if (!isset($cliObj->cli_args['_DEFAULT'][1])) {
2028
            $cliObj->cli_validateArgs();
2029
            $cliObj->cli_help();
2030
            exit;
Best Practice: Using exit here is not recommended.
2031
        }
2032
2033
        $cliObj->cli_validateArgs();
2034
2035
        if ($cliObj->cli_argValue('-o') === 'exec') {
2036
            $this->registerQueueEntriesInternallyOnly = true;
Documentation Bug: It seems like true of type true is incompatible with the declared type array of property $registerQueueEntriesInternallyOnly.

Our type inference engine has found an assignment to a property that is incompatible with the declared type of that property. Either this assignment is in error or the assigned type should be added to the documentation/type hint for that property.
2037
        }
2038
2039
        if (isset($cliObj->cli_args['_DEFAULT'][2])) {
2040
            // Crawler is called over TYPO3 BE
2041
            $pageId = \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][2], 0);
2042
        } else {
2043
            // Crawler is called over cli
2044
            $pageId = \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][1], 0);
2045
        }
2046
2047
        $configurationKeys = $this->getConfigurationKeys($cliObj);
2048
2049
        if (!is_array($configurationKeys)) {
2050
            $configurations = $this->getUrlsForPageId($pageId);
2051
            if (is_array($configurations)) {
2052
                $configurationKeys = array_keys($configurations);
2053
            } else {
2054
                $configurationKeys = [];
2055
            }
2056
        }
2057
2058
        if ($cliObj->cli_argValue('-o') === 'queue' || $cliObj->cli_argValue('-o') === 'exec') {
2059
            $reason = new tx_crawler_domain_reason();
2060
            $reason->setReason(tx_crawler_domain_reason::REASON_GUI_SUBMIT);
2061
            $reason->setDetailText('The cli script of the crawler added to the queue');
2062
            tx_crawler_domain_events_dispatcher::getInstance()->post(
2063
                'invokeQueueChange',
2064
                $this->setID,
2065
                ['reason' => $reason]
2066
            );
2067
        }
2068
2069
        if ($this->extensionSettings['cleanUpOldQueueEntries']) {
2070
            $this->cleanUpOldQueueEntries();
2071
        }
2072
2073
        $this->setID = \TYPO3\CMS\Core\Utility\GeneralUtility::md5int(microtime());
2074
        $this->getPageTreeAndUrls(
2075
            $pageId,
2076
            \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($cliObj->cli_argValue('-d'), 0, 99),
2077
            $this->getCurrentTime(),
2078
            \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($cliObj->cli_isArg('-n') ? $cliObj->cli_argValue('-n') : 30, 1, 1000),
2079
            $cliObj->cli_argValue('-o') === 'queue' || $cliObj->cli_argValue('-o') === 'exec',
2080
            $cliObj->cli_argValue('-o') === 'url',
2081
            \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $cliObj->cli_argValue('-proc'), 1),
2082
            $configurationKeys
2083
        );
2084
2085
        if ($cliObj->cli_argValue('-o') === 'url') {
2086
            $cliObj->cli_echo(implode(chr(10), $this->downloadUrls) . chr(10), 1);
2087
        } elseif ($cliObj->cli_argValue('-o') === 'exec') {
2088
            $cliObj->cli_echo("Executing " . count($this->urlList) . " requests right away:\n\n");
2089
            $cliObj->cli_echo(implode(chr(10), $this->urlList) . chr(10));
2090
            $cliObj->cli_echo("\nProcessing:\n");
2091
2092
            foreach ($this->queueEntries as $queueRec) {
2093
                $p = unserialize($queueRec['parameters']);
2094
                $cliObj->cli_echo($p['url'] . ' (' . implode(',', $p['procInstructions']) . ') => ');
2095
2096
                $result = $this->readUrlFromArray($queueRec);
2097
2098
                $requestResult = unserialize($result['content']);
2099
                if (is_array($requestResult)) {
2100
                    $resLog = is_array($requestResult['log']) ? chr(10) . chr(9) . chr(9) . implode(chr(10) . chr(9) . chr(9), $requestResult['log']) : '';
2101
                    $cliObj->cli_echo('OK: ' . $resLog . chr(10));
2102
                } else {
2103
                    $cliObj->cli_echo('Error checking Crawler Result: ' . substr(preg_replace('/\s+/', ' ', strip_tags($result['content'])), 0, 30000) . '...' . chr(10));
Bug: Are you sure substr(preg_replace('/\s...'content'])), 0, 30000) of type false|string can be used in concatenation?

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation:

                    $cliObj->cli_echo('Error checking Crawler Result: ' . /** @scrutinizer ignore-type */ substr(preg_replace('/\s+/', ' ', strip_tags($result['content'])), 0, 30000) . '...' . chr(10));
2104
                }
2105
            }
2106
        } elseif ($cliObj->cli_argValue('-o') === 'queue') {
2107
            $cliObj->cli_echo("Putting " . count($this->urlList) . " entries in queue:\n\n");
2108
            $cliObj->cli_echo(implode(chr(10), $this->urlList) . chr(10));
2109
        } else {
2110
            $cliObj->cli_echo(count($this->urlList) . " entries found for processing. (Use -o to decide action):\n\n", 1);
2111
            $cliObj->cli_echo(implode(chr(10), $this->urlList) . chr(10), 1);
2112
        }
2113
    }
2114
2115
    /**
2116
     * Function executed by the flush cli script (see tx_crawler_cli_flush).
2117
     *
2118
     * @return bool
2119
     */
2120
    public function CLI_main_flush()
2121
    {
2122
        $this->setAccessMode('cli_flush');
2123
        $cliObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('tx_crawler_cli_flush');
2124
2125
        // Force user to admin state and set workspace to "Live":
2126
        $this->backendUser->user['admin'] = 1;
2127
        $this->backendUser->setWorkspace(0);
2128
2129
        // Print help
2130
        if (!isset($cliObj->cli_args['_DEFAULT'][1])) {
2131
            $cliObj->cli_validateArgs();
2132
            $cliObj->cli_help();
2133
            exit;
Best Practice: Using exit here is not recommended.
2134
        }
2135
2136
        $cliObj->cli_validateArgs();
2137
        $pageId = \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][1], 0);
2138
        $fullFlush = ($pageId == 0);
2139
2140
        $mode = $cliObj->cli_argValue('-o');
2141
2142
        switch ($mode) {
2143
            case 'all':
2144
                $result = $this->getLogEntriesForPageId($pageId, '', true, $fullFlush);
2145
                break;
2146
            case 'finished':
2147
            case 'pending':
2148
                $result = $this->getLogEntriesForPageId($pageId, $mode, true, $fullFlush);
2149
                break;
2150
            default:
2151
                $cliObj->cli_validateArgs();
2152
                $cliObj->cli_help();
2153
                $result = false;
2154
        }
2155
2156
        return $result !== false;
2157
    }
2158
2159
    /**
2160
     * Obtains configuration keys from the CLI arguments
2161
     *
2162
     * @param  tx_crawler_cli_im $cliObj    Command line object
2163
     * @return array                        Array of keys, or an empty array if no keys were given
2164
     */
2165
    protected function getConfigurationKeys(tx_crawler_cli_im &$cliObj)
2166
    {
2167
        $parameter = trim($cliObj->cli_argValue('-conf'));
2168
        return ($parameter != '' ? \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $parameter) : []);
2169
    }
2170
2171
    /**
2172
     * Running the functionality of the CLI (crawling URLs from queue)
2173
     *
2174
     * @param int $countInARun
2175
     * @param int $sleepTime
2176
     * @param int $sleepAfterFinish
2177
     * @return int bitmask of CLI_STATUS_* flags
2178
     */
2179
    public function CLI_run($countInARun, $sleepTime, $sleepAfterFinish)
2180
    {
2181
        $result = 0;
2182
        $counter = 0;
2183
2184
        // First, run hooks:
2185
        $this->CLI_runHooks();
2186
2187
        // Clean up the queue
2188
        if (intval($this->extensionSettings['purgeQueueDays']) > 0) {
2189
            $purgeDate = $this->getCurrentTime() - 24 * 60 * 60 * intval($this->extensionSettings['purgeQueueDays']);
2190
            $this->db->exec_DELETEquery(
2191
                'tx_crawler_queue',
2192
                'exec_time!=0 AND exec_time<' . $purgeDate
2193
            );
2194
        }
2195
2196
        // Select entries:
2197
        //TODO Shouldn't this reside within the transaction?
2198
        $rows = $this->db->exec_SELECTgetRows(
2199
            'qid,scheduled',
2200
            'tx_crawler_queue',
2201
            'exec_time=0
2202
                AND process_scheduled= 0
2203
                AND scheduled<=' . $this->getCurrentTime(),
2204
            '',
2205
            'scheduled, qid',
2206
            intval($countInARun)
2207
        );
2208
2209
        if (count($rows) > 0) {
2210
            $quidList = [];
2211
2212
            foreach ($rows as $r) {
2213
                $quidList[] = $r['qid'];
2214
            }
2215
2216
            $processId = $this->CLI_buildProcessId();
2217
2218
            // reserve queue entries for this process
2219
            $this->db->sql_query('BEGIN');
2220
            //TODO make sure we're not taking assigned queue entries
2221
            $this->db->exec_UPDATEquery(
2222
                'tx_crawler_queue',
2223
                'qid IN (' . implode(',', $quidList) . ')',
2224
                [
2225
                    'process_scheduled' => intval($this->getCurrentTime()),
2226
                    'process_id' => $processId
2227
                ]
2228
            );
2229
2230
            // save the number of assigned queue entries to determine how many have been processed later
2231
            $numberOfAffectedRows = $this->db->sql_affected_rows();
2232
            $this->db->exec_UPDATEquery(
2233
                'tx_crawler_process',
2234
                "process_id = '" . $processId . "'",
2235
                [
2236
                    'assigned_items_count' => intval($numberOfAffectedRows)
2237
                ]
2238
            );
2239
2240
            if ($numberOfAffectedRows == count($quidList)) {
2241
                $this->db->sql_query('COMMIT');
2242
            } else {
2243
                $this->db->sql_query('ROLLBACK');
2244
                $this->CLI_debug("Nothing processed due to multi-process collision (" . $this->CLI_buildProcessId() . ")");
2245
                return ($result | self::CLI_STATUS_ABORTED);
2246
            }
2247
2248
            foreach ($rows as $r) {
2249
                $result |= $this->readUrl($r['qid']);
2250
2251
                $counter++;
2252
                usleep(intval($sleepTime)); // Just to relax the system
2253
2254
                // if the cli has been disabled between the start and the current read url, we need to return from the function
2255
                // without marking the process as ended.
2256
                if ($this->getDisabled()) {
2257
                    return ($result | self::CLI_STATUS_ABORTED);
2258
                }
2259
2260
                if (!$this->CLI_checkIfProcessIsActive($this->CLI_buildProcessId())) {
2261
                    $this->CLI_debug("conflict / timeout (" . $this->CLI_buildProcessId() . ")");
2262
2263
                    //TODO might need an additional returncode
2264
                    $result |= self::CLI_STATUS_ABORTED;
2265
                    break; //possible timeout
2266
                }
2267
            }
2268
2269
            sleep(intval($sleepAfterFinish));
2270
2271
            $msg = 'Rows: ' . $counter;
2272
            $this->CLI_debug($msg . " (" . $this->CLI_buildProcessId() . ")");
2273
        } else {
2274
            $this->CLI_debug("Nothing within queue which needs to be processed (" . $this->CLI_buildProcessId() . ")");
2275
        }
2276
2277
        if ($counter > 0) {
2278
            $result |= self::CLI_STATUS_PROCESSED;
2279
        }
2280
2281
        return $result;
2282
    }
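The queue clean-up at the start of `CLI_run()` deletes processed entries older than `purgeQueueDays` days. The cutoff arithmetic can be sketched on its own; `purgeCutoff()` is a hypothetical helper and the timestamp is an arbitrary sample value, not taken from the source.

```php
<?php

/**
 * Hypothetical helper: timestamp before which processed queue entries
 * would be purged, or null when purging is disabled (0 days or less).
 */
function purgeCutoff(int $now, int $purgeQueueDays): ?int
{
    if ($purgeQueueDays <= 0) {
        return null;
    }
    return $now - 24 * 60 * 60 * $purgeQueueDays;
}

$now = 1500000000; // sample Unix timestamp
var_dump(purgeCutoff($now, 14)); // $now minus 14 days in seconds
var_dump(purgeCutoff($now, 0));  // purging disabled
```

Entries with `exec_time` equal to 0 have not been processed yet, which is why the original condition also requires `exec_time!=0` before comparing against the cutoff.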
2283
2284
    /**
2285
     * Activate hooks
2286
     *
2287
     * @return void
2288
     */
2289
    public function CLI_runHooks()
2290
    {
2291
        global $TYPO3_CONF_VARS;
2292
        if (is_array($TYPO3_CONF_VARS['EXTCONF']['crawler']['cli_hooks'])) {
2293
            foreach ($TYPO3_CONF_VARS['EXTCONF']['crawler']['cli_hooks'] as $objRef) {
2294
                $hookObj = \TYPO3\CMS\Core\Utility\GeneralUtility::getUserObj($objRef);
2295
                if (is_object($hookObj)) {
2296
                    $hookObj->crawler_init($this);
2297
                }
2298
            }
2299
        }
2300
    }
2301
2302
    /**
2303
     * Try to acquire a new process with the given id
2304
     * also performs some auto-cleanup for orphan processes
2305
     * @todo preemption might not be the most elegant way to clean up
2306
     *
2307
     * @param string $id identification string for the process
2308
     * @return boolean
2309
     */
2310
    public function CLI_checkAndAcquireNewProcess($id)
2311
    {
2312
        $ret = true;
2313
2314
        $systemProcessId = getmypid();
2315
        if ($systemProcessId < 1) {
2316
            return false;
2317
        }
2318
2319
        $processCount = 0;
2320
        $orphanProcesses = [];
2321
2322
        $this->db->sql_query('BEGIN');
2323
2324
        $res = $this->db->exec_SELECTquery(
2325
            'process_id,ttl',
2326
            'tx_crawler_process',
2327
            'active=1 AND deleted=0'
2328
        );
2329
2330
        $currentTime = $this->getCurrentTime();
2331
2332
        while ($row = $this->db->sql_fetch_assoc($res)) {
2333
            if ($row['ttl'] < $currentTime) {
2334
                $orphanProcesses[] = $row['process_id'];
2335
            } else {
2336
                $processCount++;
2337
            }
2338
        }
2339
2340
        // if there are fewer active processes than allowed, add a new one
2341
        if ($processCount < intval($this->extensionSettings['processLimit'])) {
2342
            $this->CLI_debug("add process " . $this->CLI_buildProcessId() . " (" . ($processCount + 1) . "/" . intval($this->extensionSettings['processLimit']) . ")");
2343
2344
            // create new process record
2345
            $this->db->exec_INSERTquery(
2346
                'tx_crawler_process',
2347
                [
2348
                    'process_id' => $id,
2349
                    'active' => '1',
2350
                    'ttl' => ($currentTime + intval($this->extensionSettings['processMaxRunTime'])),
2351
                    'system_process_id' => $systemProcessId
2352
                ]
2353
            );
2354
        } else {
2355
            $this->CLI_debug("Processlimit reached (" . ($processCount) . "/" . intval($this->extensionSettings['processLimit']) . ")");
2356
            $ret = false;
2357
        }
2358
2359
        $this->CLI_releaseProcesses($orphanProcesses, true); // maybe this should be somehow included into the current lock
2360
        $this->CLI_deleteProcessesMarkedDeleted();
2361
2362
        $this->db->sql_query('COMMIT');
2363
2364
        return $ret;
2365
    }
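The acquisition logic in `CLI_checkAndAcquireNewProcess()` first sorts the active process rows into expired ones (orphans) and live ones, then compares the live count with the process limit. A minimal sketch of that classification, detached from the database, is shown here; `classifyProcesses()`, the sample rows and the limit of 2 are assumptions for illustration.

```php
<?php

/**
 * Hypothetical helper: split process rows into orphans (TTL expired)
 * and count the ones still considered alive.
 *
 * @return array [string[] $orphanIds, int $activeCount]
 */
function classifyProcesses(array $rows, int $currentTime): array
{
    $orphans = [];
    $active = 0;
    foreach ($rows as $row) {
        if ($row['ttl'] < $currentTime) {
            $orphans[] = $row['process_id'];
        } else {
            $active++;
        }
    }
    return [$orphans, $active];
}

$rows = [
    ['process_id' => 'aaa', 'ttl' => 900],
    ['process_id' => 'bbb', 'ttl' => 1500],
    ['process_id' => 'ccc', 'ttl' => 1000],
];
list($orphans, $active) = classifyProcesses($rows, 1000);
$processLimit = 2; // assumed setting value

echo 'orphans: ' . implode(',', $orphans) . "\n";
echo 'may start new process: ' . ($active < $processLimit ? 'yes' : 'no') . "\n";
```

A row whose TTL equals the current time still counts as alive, because the comparison is strictly "less than", as in the original loop.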
2366
2367
    /**
2368
     * Release a process and the required resources
2369
     *
2370
     * @param  mixed    $releaseIds   string with a single process-id or array with multiple process-ids
2371
     * @param  boolean  $withinLock   whether the DB actions already run within an existing lock
2372
     * @return boolean
2373
     */
2374
    public function CLI_releaseProcesses($releaseIds, $withinLock = false)
2375
    {
2376
        if (!is_array($releaseIds)) {
2377
            $releaseIds = [$releaseIds];
2378
        }
2379
2380
        if (count($releaseIds) === 0) {
2381
            return false;   //nothing to release
2382
        }
2383
2384
        if (!$withinLock) {
2385
            $this->db->sql_query('BEGIN');
2386
        }
2387
2388
        // some kind of 2nd chance algo - this way you need at least 2 processes to have a real cleanup
2389
        // this ensures that a single process can't mess up the entire process table
2390
2391
        // mark all processes as deleted which have no "waiting" queue entries and which are not active
2392
        $this->db->exec_UPDATEquery(
2393
            'tx_crawler_queue',
2394
            'process_id IN (SELECT process_id FROM tx_crawler_process WHERE active=0 AND deleted=0)',
2395
            [
2396
                'process_scheduled' => 0,
2397
                'process_id' => ''
2398
            ]
2399
        );
2400
        $this->db->exec_UPDATEquery(
2401
            'tx_crawler_process',
2402
            'active=0 AND deleted=0
2403
            AND NOT EXISTS (
2404
                SELECT * FROM tx_crawler_queue
2405
                WHERE tx_crawler_queue.process_id = tx_crawler_process.process_id
2406
                AND tx_crawler_queue.exec_time = 0
2407
            )',
2408
            [
2409
                'deleted' => '1',
2410
                'system_process_id' => 0
2411
            ]
2412
        );
2413
        // mark all requested processes as non-active
2414
        $this->db->exec_UPDATEquery(
2415
            'tx_crawler_process',
2416
            'process_id IN (\'' . implode('\',\'', $releaseIds) . '\') AND deleted=0',
2417
            [
2418
                'active' => '0'
2419
            ]
2420
        );
2421
        $this->db->exec_UPDATEquery(
2422
            'tx_crawler_queue',
2423
            'exec_time=0 AND process_id IN (\'' . implode('\',\'', $releaseIds) . '\')',
2424
            [
2425
                'process_scheduled' => 0,
2426
                'process_id' => ''
2427
            ]
2428
        );
2429
2430
        if (!$withinLock) {
2431
            $this->db->sql_query('COMMIT');
2432
        }
2433
2434
        return true;
2435
    }
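`CLI_releaseProcesses()` concatenates the process ids directly into `IN (...)` clauses. One way to make that concatenation more defensive is to accept only ids that look like the hex strings produced by `CLI_buildProcessId()`. The following is a sketch under that assumption; `buildProcessIdList()` is a hypothetical helper and not part of the extension.

```php
<?php

/**
 * Hypothetical helper: build the quoted value list for an SQL IN (...)
 * clause, keeping only ids that consist of hex characters.
 */
function buildProcessIdList(array $ids): string
{
    $safe = [];
    foreach ($ids as $id) {
        if (is_string($id) && preg_match('/^[0-9a-f]+$/i', $id) === 1) {
            $safe[] = $id;
        }
    }
    // An empty list yields '' so the IN clause stays syntactically valid
    return "'" . implode("','", $safe) . "'";
}

$where = 'process_id IN (' . buildProcessIdList(['1a2b3c4d5e', "x' OR '1'='1"]) . ') AND deleted=0';
echo $where . "\n";
```

Prepared statements or the database layer's own quoting functions would be the more general approach; the whitelist here only fits because the ids have a known shape.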
2436
2437
    /**
2438
     * Delete processes marked as deleted
2439
     *
2440
     * @return void
2441
     */
2442
    public function CLI_deleteProcessesMarkedDeleted()
2443
    {
2444
        $this->db->exec_DELETEquery('tx_crawler_process', 'deleted = 1');
2445
    }
2446
2447
    /**
2448
     * Check if there are still resources left for the process with the given id
2449
     * Used to determine timeouts and to ensure a proper cleanup if there's a timeout
2450
     *
2451
     * @param  string  $pid identification string for the process
2452
     * @return boolean determines if the process is still active / has resources
2453
     *
2454
     * FIXME: Please remove Transaction, not needed as only a select query.
2455
     */
2456
    public function CLI_checkIfProcessIsActive($pid)
2457
    {
2458
        $ret = false;
2459
        $this->db->sql_query('BEGIN');
2460
        $res = $this->db->exec_SELECTquery(
2461
            'process_id,active,ttl',
2462
            'tx_crawler_process',
2463
            'process_id = \'' . $pid . '\'  AND deleted=0',
2464
            '',
2465
            'ttl',
2466
            '0,1'
2467
        );
2468
        if ($row = $this->db->sql_fetch_assoc($res)) {
2469
            $ret = intval($row['active']) == 1;
2470
        }
2471
        $this->db->sql_query('COMMIT');
2472
2473
        return $ret;
2474
    }
2475
2476
    /**
2477
     * Create a unique Id for the current process
2478
     *
2479
     * @return string  the ID
2480
     */
2481
    public function CLI_buildProcessId()
2482
    {
2483
        if (!$this->processID) {
2484
            $this->processID = \TYPO3\CMS\Core\Utility\GeneralUtility::shortMD5($this->microtime(true));
2485
        }
2486
        return $this->processID;
2487
    }
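The process id is a short hash of the current microtime, generated once and then reused. The sketch below imitates this outside TYPO3, assuming that `GeneralUtility::shortMD5()` returns the first ten hex characters of the MD5 hash; `shortMd5()` is a stand-in written for this example.

```php
<?php

/**
 * Stand-in for GeneralUtility::shortMD5() (assumed behaviour): the first
 * $len hex characters of the MD5 hash of the input.
 */
function shortMd5(string $input, int $len = 10): string
{
    return substr(md5($input), 0, $len);
}

// Derive an id from the current time, as CLI_buildProcessId() does
$processId = shortMd5((string)microtime(true));
echo $processId . "\n";
```

Two processes started within the same microsecond would get the same id with this scheme, so uniqueness is probabilistic rather than guaranteed.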
2488
2489
    /**
2490
     * @param bool $get_as_float
2491
     *
2492
     * @return string|float
2493
     */
2494
    protected function microtime($get_as_float = false)
2495
    {
2496
        return microtime($get_as_float);
2497
    }
2498
    /**
     * Prints a message to the stdout (only if debug-mode is enabled)
     *
     * @param  string $msg  the message
     */
    public function CLI_debug($msg)
    {
        if (intval($this->extensionSettings['processDebug'])) {
            echo $msg . "\n";
            flush();
        }
    }

    /**
     * Get URL content by making direct request to TYPO3.
     *
     * @param  string $url          Page URL
     * @param  int    $crawlerId    Crawler-ID
     * @return array
     */
    protected function sendDirectRequest($url, $crawlerId)
    {
        $requestHeaders = $this->buildRequestHeaderArray(parse_url($url), $crawlerId);

        $cmd = escapeshellcmd($this->extensionSettings['phpPath']);
        $cmd .= ' ';
        $cmd .= escapeshellarg(\TYPO3\CMS\Core\Utility\ExtensionManagementUtility::extPath('crawler') . 'cli/bootstrap.php');
        $cmd .= ' ';
        $cmd .= escapeshellarg($this->getFrontendBasePath());
        $cmd .= ' ';
        $cmd .= escapeshellarg($url);
        $cmd .= ' ';
        $cmd .= escapeshellarg(base64_encode(serialize($requestHeaders)));

        $startTime = microtime(true);
        $content = $this->executeShellCommand($cmd);
        $this->log($url . ' ' . (microtime(true) - $startTime));

        $result = [
            'request' => implode("\r\n", $requestHeaders) . "\r\n\r\n",
            'headers' => '',
            'content' => $content
        ];

        return $result;
    }

    /**
     * Cleans up entries that stayed for too long in the queue. These are:
     * - processed entries older than the "cleanUpProcessedAge" setting, in days (e.g. 1.5)
     * - scheduled entries older than the "cleanUpScheduledAge" setting, in days (e.g. 7)
     *
     * @return void
     */
    protected function cleanUpOldQueueEntries()
    {
        $processedAgeInSeconds = $this->extensionSettings['cleanUpProcessedAge'] * 86400; // 24*60*60 seconds in 24 hours
        $scheduledAgeInSeconds = $this->extensionSettings['cleanUpScheduledAge'] * 86400;

        $now = time();
        $condition = '(exec_time<>0 AND exec_time<' . ($now - $processedAgeInSeconds) . ') OR scheduled<=' . ($now - $scheduledAgeInSeconds);
        $this->flushQueue($condition);
    }

    /**
     * Initializes a TypoScript Frontend necessary for using TypoScript and TypoLink functions
     *
     * @param int $id
     * @param int $typeNum
     *
     * @return void
     */
    protected function initTSFE($id = 1, $typeNum = 0)
    {
        \TYPO3\CMS\Frontend\Utility\EidUtility::initTCA();
        if (!is_object($GLOBALS['TT'])) {
            $GLOBALS['TT'] = new \TYPO3\CMS\Core\TimeTracker\NullTimeTracker;
            $GLOBALS['TT']->start();
        }

        $GLOBALS['TSFE'] = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance(\TYPO3\CMS\Frontend\Controller\TypoScriptFrontendController::class, $GLOBALS['TYPO3_CONF_VARS'], $id, $typeNum);
        $GLOBALS['TSFE']->sys_page = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance(\TYPO3\CMS\Frontend\Page\PageRepository::class);
        $GLOBALS['TSFE']->sys_page->init(true);
        $GLOBALS['TSFE']->connectToDB();
        $GLOBALS['TSFE']->initFEuser();
        $GLOBALS['TSFE']->determineId();
        $GLOBALS['TSFE']->initTemplate();
        $GLOBALS['TSFE']->rootLine = $GLOBALS['TSFE']->sys_page->getRootLine($id, '');
        $GLOBALS['TSFE']->getConfigArray();
        \TYPO3\CMS\Frontend\Page\PageGenerator::pagegenInit();
    }
}

if (defined('TYPO3_MODE') && $TYPO3_CONF_VARS[TYPO3_MODE]['XCLASS']['ext/crawler/class.tx_crawler_lib.php']) {
    include_once($TYPO3_CONF_VARS[TYPO3_MODE]['XCLASS']['ext/crawler/class.tx_crawler_lib.php']);
}
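
The arithmetic in cleanUpOldQueueEntries() can be checked in isolation. The following is a minimal sketch, not part of the class: buildCleanUpCondition() is a hypothetical helper, the timestamp and age values are made-up sample inputs, and the integer rounding is an addition that the original method does not perform.

```php
<?php

/**
 * Hypothetical helper mirroring the WHERE condition built in cleanUpOldQueueEntries().
 *
 * @param int   $now                 current Unix timestamp
 * @param float $processedAgeInDays  sample value for the cleanUpProcessedAge setting
 * @param float $scheduledAgeInDays  sample value for the cleanUpScheduledAge setting
 * @return string
 */
function buildCleanUpCondition($now, $processedAgeInDays, $scheduledAgeInDays)
{
    $secondsPerDay = 86400; // 24 * 60 * 60

    // Assumption: thresholds are rounded to whole seconds so the SQL contains integers only.
    $processedBefore = $now - (int)round($processedAgeInDays * $secondsPerDay);
    $scheduledBefore = $now - (int)round($scheduledAgeInDays * $secondsPerDay);

    return '(exec_time<>0 AND exec_time<' . $processedBefore . ') OR scheduled<=' . $scheduledBefore;
}

// Sample call with an arbitrary timestamp, 1.5 days and 7 days.
echo buildCleanUpCondition(1000000, 1.5, 7), "\n";
```

With these sample inputs the thresholds are 1000000 - 129600 and 1000000 - 604800 seconds, so processed entries are matched by execution time and all entries by their scheduled time.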
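
The command assembly in sendDirectRequest() follows a simple pattern: escape the binary with escapeshellcmd(), escape every argument with escapeshellarg(), and pass structured data as a base64-encoded serialized string. A minimal standalone sketch of that pattern, assuming a hypothetical buildShellCommand() helper and made-up paths, URL and headers (none of these values come from the extension):

```php
<?php

/**
 * Hypothetical helper: joins a binary path and its arguments into one shell command string.
 *
 * @param string   $binary     path to the executable
 * @param string[] $arguments  raw, unescaped arguments
 * @return string
 */
function buildShellCommand($binary, array $arguments)
{
    $parts = [escapeshellcmd($binary)];
    foreach ($arguments as $argument) {
        // Each argument is quoted separately so spaces or quotes cannot split it.
        $parts[] = escapeshellarg($argument);
    }

    return implode(' ', $parts);
}

// Sample data only; the real headers come from buildRequestHeaderArray().
$headers = ['GET /index.php?id=1 HTTP/1.0', 'Host: example.org'];
$encodedHeaders = base64_encode(serialize($headers));

$cmd = buildShellCommand('/usr/bin/php', [
    '/path/to/cli/bootstrap.php',
    '/var/www/',
    'http://example.org/index.php?id=1',
    $encodedHeaders,
]);

echo $cmd, "\n";
```

The receiving script can presumably restore the array with unserialize(base64_decode(...)); base64 keeps the serialized string free of quotes and line breaks, so it travels as a single argument.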