Test Failed
Push — 6-0 ( cfb4d5 )
by Tomas Norre
03:23
created

tx_crawler_lib::urlListFromUrlArray()   D

Complexity

Conditions 21
Paths 114

Size

Total Lines 114
Code Lines 57

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric Value
cc 21
eloc 57
nc 114
nop 9
dl 0
loc 114
rs 4.4991
c 0
b 0
f 0

How to fix: Long Method, Complexity, Many Parameters

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include:
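
One refactoring commonly applied to long methods is Extract Method. As a minimal sketch, the "Scheduled time" calculation inside urlListFromUrlArray() could be moved into a small helper of its own. The function name calculateScheduledTime() and its signature are assumptions made for illustration; they do not exist in tx_crawler_lib. The sketch assumes PHP 7 or later, a non-negative timestamp, and a request rate of at least 1.

```php
<?php

/**
 * Hypothetical helper (name and signature are assumptions): spreads queued
 * requests evenly over time and rounds the result down to a full minute.
 *
 * @param int $scheduledTime Unix time the first request is scheduled for (assumed >= 0)
 * @param int $alreadyQueued Number of URLs already tracked, e.g. count($duplicateTrack)
 * @param int $reqMinute     Requests per minute (assumed >= 1)
 */
function calculateScheduledTime(int $scheduledTime, int $alreadyQueued, int $reqMinute): int
{
    // Each URL already queued delays the next one by 60 / $reqMinute seconds.
    $offset = (int) round($alreadyQueued * (60 / $reqMinute));

    // intdiv() rounds down for non-negative values and keeps the result an integer.
    return intdiv($scheduledTime + $offset, 60) * 60;
}
```

Because the helper returns an integer, its result could be passed to date() without the double-to-integer mismatch that a floor()-based calculation produces.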

Many Parameters

Methods with many parameters are not only hard to understand, but their parameters also often become inconsistent when you need more, or different data.

There are several approaches to avoid long parameter lists:
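
One well-known approach is Introduce Parameter Object: related scalar arguments are bundled into a single value object. The sketch below groups four of the nine parameters of urlListFromUrlArray(). The class UrlListOptions and the function describeOptions() are hypothetical names chosen for illustration; neither is part of the crawler extension. PHP 7 or later is assumed.

```php
<?php

/**
 * Hypothetical parameter object (class and property names are assumptions).
 */
final class UrlListOptions
{
    /** @var int Unix time to schedule indexing to */
    public $scheduledTime;

    /** @var int Number of requests per minute */
    public $reqMinute;

    /** @var bool Submit the URLs to the queue */
    public $submitCrawlUrls;

    /** @var bool Collect the URLs for download instead of submitting them */
    public $downloadCrawlUrls;

    public function __construct(int $scheduledTime, int $reqMinute, bool $submitCrawlUrls = false, bool $downloadCrawlUrls = false)
    {
        // Validate once here instead of in every method that receives the values.
        if ($reqMinute < 1) {
            throw new InvalidArgumentException('reqMinute must be at least 1');
        }
        $this->scheduledTime = $scheduledTime;
        $this->reqMinute = $reqMinute;
        $this->submitCrawlUrls = $submitCrawlUrls;
        $this->downloadCrawlUrls = $downloadCrawlUrls;
    }
}

/**
 * Hypothetical consumer: four scalar parameters collapse into one argument.
 */
function describeOptions(UrlListOptions $options): string
{
    $mode = $options->submitCrawlUrls ? 'submit' : ($options->downloadCrawlUrls ? 'download' : 'list');

    return sprintf('%s at %d, %d req/min', $mode, $options->scheduledTime, $options->reqMinute);
}
```

A call such as describeOptions(new UrlListOptions(1000, 30, true)) then carries the scheduling and submission settings as one argument, and adding a further option later does not change any method signature.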

<?php

/***************************************************************
 *  Copyright notice
 *
 *  (c) 2016 AOE GmbH <[email protected]>
 *
 *  All rights reserved
 *
 *  This script is part of the TYPO3 project. The TYPO3 project is
 *  free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  The GNU General Public License can be found at
 *  http://www.gnu.org/copyleft/gpl.html.
 *
 *  This script is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  This copyright notice MUST APPEAR in all copies of the script!
 ***************************************************************/

/**
 * Class tx_crawler_lib
 */
class tx_crawler_lib
{
    /**
     * @var integer
     */
    public $setID = 0;

    /**
     * @var string
     */
    public $processID = '';

    /**
     * One hour is max stalled time for the CLI
     * If the process had the status "start" for 3600 seconds, it will be regarded stalled and a new process is started
     *
     * @var integer
     */
    public $max_CLI_exec_time = 3600;

    /**
     * @var array
     */
    public $duplicateTrack = [];

    /**
     * @var array
     */
    public $downloadUrls = [];

    /**
     * @var array
     */
    public $incomingProcInstructions = [];

    /**
     * @var array
     */
    public $incomingConfigurationSelection = [];

    /**
     * @var array
     */
    public $registerQueueEntriesInternallyOnly = [];

    /**
     * @var array
     */
    public $queueEntries = [];

    /**
     * @var array
     */
    public $urlList = [];

    /**
     * @var boolean
     */
    public $debugMode = false;

    /**
     * @var array
     */
    public $extensionSettings = [];

    /**
     * Mount Point
     *
     * @var boolean
     */
    public $MP = false;

    /**
     * @var string
     */
    protected $processFilename;

    /**
     * Holds the internal access mode; can be 'gui', 'cli' or 'cli_im'
     *
     * @var string
     */
    protected $accessMode;

    /**
     * @var \TYPO3\CMS\Core\Database\DatabaseConnection
Bug: The type TYPO3\CMS\Core\Database\DatabaseConnection was not found. Maybe you did not declare it correctly or list all dependencies?

The issue could also be caused by a filter entry in the build configuration. If the path has been excluded in your configuration, e.g. excluded_paths: ["lib/*"], you can move it to the dependency path list as follows:

filter:
    dependency_paths: ["lib/*"]

For further information see https://scrutinizer-ci.com/docs/tools/php/php-scrutinizer/#list-dependency-paths
     */
    private $db;

    /**
     * @var TYPO3\CMS\Core\Authentication\BackendUserAuthentication
Bug: The type TYPO3\CMS\Core\Authentic...ckendUserAuthentication was not found. Maybe you did not declare it correctly or list all dependencies?
     */
    private $backendUser;

    const CLI_STATUS_NOTHING_PROCCESSED = 0;
    const CLI_STATUS_REMAIN = 1; //queue not empty
    const CLI_STATUS_PROCESSED = 2; //(some) queue items were processed
    const CLI_STATUS_ABORTED = 4; //instance didn't finish
    const CLI_STATUS_POLLABLE_PROCESSED = 8;

    /**
     * Method to get the accessMode; can be gui, cli or cli_im
     *
     * @return string
     */
    public function getAccessMode()
    {
        return $this->accessMode;
    }

    /**
     * @param string $accessMode
     */
    public function setAccessMode($accessMode)
    {
        $this->accessMode = $accessMode;
    }

    /**
     * Set disabled status to prevent processes from being processed
     *
     * @param  bool $disabled (optional, defaults to true)
     * @return void
     */
    public function setDisabled($disabled = true)
    {
        if ($disabled) {
            \TYPO3\CMS\Core\Utility\GeneralUtility::writeFile($this->processFilename, '');
Bug: The type TYPO3\CMS\Core\Utility\GeneralUtility was not found. Maybe you did not declare it correctly or list all dependencies?
        } else {
            if (is_file($this->processFilename)) {
                unlink($this->processFilename);
            }
        }
    }

    /**
     * Get disable status
     *
     * @return bool true if disabled
     */
    public function getDisabled()
    {
        if (is_file($this->processFilename)) {
            return true;
        } else {
            return false;
        }
    }

    /**
     * @param string $filenameWithPath
     *
     * @return void
     */
    public function setProcessFilename($filenameWithPath)
    {
        $this->processFilename = $filenameWithPath;
    }

    /**
     * @return string
     */
    public function getProcessFilename()
    {
        return $this->processFilename;
    }

    /************************************
     *
     * Getting URLs based on Page TSconfig
     *
     ************************************/

    public function __construct()
    {
        $this->db = $GLOBALS['TYPO3_DB'];
        $this->backendUser = $GLOBALS['BE_USER'];
        $this->processFilename = PATH_site . 'typo3temp/tx_crawler.proc';
Bug: The constant PATH_site was not found. Maybe you did not declare it correctly or list all dependencies?

        $settings = unserialize($GLOBALS['TYPO3_CONF_VARS']['EXT']['extConf']['crawler']);
        $settings = is_array($settings) ? $settings : [];

        // read ext_em_conf_template settings and set
        $this->setExtensionSettings($settings);

        // set defaults:
        if (\TYPO3\CMS\Core\Utility\MathUtility::convertToPositiveInteger($this->extensionSettings['countInARun']) == 0) {
Bug: The type TYPO3\CMS\Core\Utility\MathUtility was not found. Maybe you did not declare it correctly or list all dependencies?
            $this->extensionSettings['countInARun'] = 100;
        }

        $this->extensionSettings['processLimit'] = \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($this->extensionSettings['processLimit'], 1, 99, 1);
    }

    /**
     * Sets the extensions settings (unserialized counterpart of $TYPO3_CONF_VARS['EXT']['extConf']['crawler']).
     *
     * @param array $extensionSettings
     * @return void
     */
    public function setExtensionSettings(array $extensionSettings)
    {
        $this->extensionSettings = $extensionSettings;
    }

    /**
     * Check if the given page should be crawled
     *
     * @param array $pageRow
     * @return false|string false if the page should be crawled (not excluded), true / skipMessage if it should be skipped
     */
    public function checkIfPageShouldBeSkipped(array $pageRow)
    {
        $skipPage = false;
        $skipMessage = 'Skipped'; // message will be overwritten later

        // if page is hidden
        if (!$this->extensionSettings['crawlHiddenPages']) {
            if ($pageRow['hidden']) {
                $skipPage = true;
                $skipMessage = 'Because page is hidden';
            }
        }

        if (!$skipPage) {
            if (\TYPO3\CMS\Core\Utility\GeneralUtility::inList('3,4', $pageRow['doktype']) || $pageRow['doktype'] >= 199) {
                $skipPage = true;
                $skipMessage = 'Because doktype is not allowed';
            }
        }

        if (!$skipPage) {
            if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['excludeDoktype'])) {
                foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['excludeDoktype'] as $key => $doktypeList) {
                    if (\TYPO3\CMS\Core\Utility\GeneralUtility::inList($doktypeList, $pageRow['doktype'])) {
                        $skipPage = true;
                        $skipMessage = 'Doktype was excluded by "' . $key . '"';
                        break;
                    }
                }
            }
        }

        if (!$skipPage) {
            // veto hook
            if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pageVeto'])) {
                foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pageVeto'] as $key => $func) {
                    $params = [
                        'pageRow' => $pageRow
                    ];
                    // expects "false" if page is ok and "true" or a skipMessage if this page should _not_ be crawled
                    $veto = \TYPO3\CMS\Core\Utility\GeneralUtility::callUserFunction($func, $params, $this);
                    if ($veto !== false) {
                        $skipPage = true;
                        if (is_string($veto)) {
                            $skipMessage = $veto;
                        } else {
                            $skipMessage = 'Veto from hook "' . htmlspecialchars($key) . '"';
                        }
                        // no need to execute other hooks if a previous one returned a veto
                        break;
                    }
                }
            }
        }

        return $skipPage ? $skipMessage : false;
    }

    /**
     * Wrapper method for getUrlsForPageId()
     * It returns an array of configurations and no urls!
     *
     * @param array $pageRow Page record with at least dok-type and uid columns.
     * @param string $skipMessage
     * @return array
     * @see getUrlsForPageId()
     */
    public function getUrlsForPageRow(array $pageRow, &$skipMessage = '')
    {
        $message = $this->checkIfPageShouldBeSkipped($pageRow);

        if ($message === false) {
            $forceSsl = ($pageRow['url_scheme'] === 2) ? true : false;
            $res = $this->getUrlsForPageId($pageRow['uid'], $forceSsl);
            $skipMessage = '';
        } else {
            $skipMessage = $message;
            $res = [];
        }

        return $res;
    }

    /**
     * This method is used to count if there are ANY unprocessed queue entries
     * of a given page_id and the configuration which matches a given hash.
     * If there are none, we can skip an inner detail check
     *
     * @param  int $uid
     * @param  string $configurationHash
     * @return boolean
     */
    protected function noUnprocessedQueueEntriesForPageWithConfigurationHashExist($uid, $configurationHash)
    {
        $configurationHash = $this->db->fullQuoteStr($configurationHash, 'tx_crawler_queue');
        $res = $this->db->exec_SELECTquery('count(*) as anz', 'tx_crawler_queue', "page_id=" . intval($uid) . " AND configuration_hash=" . $configurationHash . " AND exec_time=0");
        $row = $this->db->sql_fetch_assoc($res);

        return ($row['anz'] == 0);
    }

    /**
     * Creates a list of URLs from input array (and submits them to queue if asked for)
     * See Web > Info module script + "indexed_search"'s crawler hook-client using this!
     *
     * @param    array        Information about URLs from pageRow to crawl.
     * @param    array        Page row
     * @param    integer        Unix time to schedule indexing to, typically time()
     * @param    integer        Number of requests per minute (creates the interleave between requests)
     * @param    boolean        If set, submits the URLs to queue
Bug: The type If was not found. Maybe you did not declare it correctly or list all dependencies?
     * @param    boolean        If set (and submitCrawlUrls is false), will fill $downloadUrls with entries
     * @param    array        Array which is passed by reference and contains an id per URL to ensure we do not crawl duplicates
     * @param    array        Array which will be filled with URLS for download if flag is set.
     * @param    array        Array of processing instructions
     * @return    string        List of URLs (meant for display in backend module)
     *
     */
    public function urlListFromUrlArray(
    array $vv,
    array $pageRow,
    $scheduledTime,
    $reqMinute,
    $submitCrawlUrls,
    $downloadCrawlUrls,
    array &$duplicateTrack,
    array &$downloadUrls,
    array $incomingProcInstructions
    ) {

        // realurl support (thanks to Ingo Renner)
        if (\TYPO3\CMS\Core\Utility\ExtensionManagementUtility::isLoaded('realurl') && $vv['subCfg']['realurl']) {
Bug: The type TYPO3\CMS\Core\Utility\ExtensionManagementUtility was not found. Maybe you did not declare it correctly or list all dependencies?

            /** @var tx_realurl $urlObj */
            $urlObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('tx_realurl');

            if (!empty($vv['subCfg']['baseUrl'])) {
                $urlParts = parse_url($vv['subCfg']['baseUrl']);
                $host = strtolower($urlParts['host']);
                $urlObj->host = $host;

                // First pass, finding configuration OR pointer string:
                $urlObj->extConf = isset($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->host]) ? $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->host] : $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl']['_DEFAULT'];

                // If it turned out to be a string pointer, then look up the real config:
                if (is_string($urlObj->extConf)) {
                    $urlObj->extConf = is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->extConf]) ? $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl'][$urlObj->extConf] : $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['realurl']['_DEFAULT'];
                }
            }

            if (!$GLOBALS['TSFE']->sys_page) {
                $GLOBALS['TSFE']->sys_page = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Frontend\Page\PageRepository');
            }
            if (!$GLOBALS['TSFE']->csConvObj) {
                $GLOBALS['TSFE']->csConvObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Core\Charset\CharsetConverter');
            }
            if (!$GLOBALS['TSFE']->tmpl->rootLine[0]['uid']) {
                $GLOBALS['TSFE']->tmpl->rootLine[0]['uid'] = $urlObj->extConf['pagePath']['rootpage_id'];
            }
        }

        if (is_array($vv['URLs'])) {
            $configurationHash = md5(serialize($vv));
            $skipInnerCheck = $this->noUnprocessedQueueEntriesForPageWithConfigurationHashExist($pageRow['uid'], $configurationHash);

            foreach ($vv['URLs'] as $urlQuery) {
                if ($this->drawURLs_PIfilter($vv['subCfg']['procInstrFilter'], $incomingProcInstructions)) {

                    // Calculate cHash:
                    if ($vv['subCfg']['cHash']) {
                        /* @var $cacheHash \TYPO3\CMS\Frontend\Page\CacheHashCalculator */
                        $cacheHash = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Frontend\Page\CacheHashCalculator');
                        $urlQuery .= '&cHash=' . $cacheHash->generateForParameters($urlQuery);
                    }

                    // Create key by which to determine unique-ness:
                    $uKey = $urlQuery . '|' . $vv['subCfg']['userGroups'] . '|' . $vv['subCfg']['baseUrl'] . '|' . $vv['subCfg']['procInstrFilter'];

                    // realurl support (thanks to Ingo Renner)
                    $urlQuery = 'index.php' . $urlQuery;
                    if (\TYPO3\CMS\Core\Utility\ExtensionManagementUtility::isLoaded('realurl') && $vv['subCfg']['realurl']) {
                        $params = [
                            'LD' => [
                                'totalURL' => $urlQuery
                            ],
                            'TCEmainHook' => true
                        ];
                        $urlObj->encodeSpURL($params);
Comprehensibility, Best Practice: The variable $urlObj does not seem to be defined for all execution paths leading up to this point.
                        $urlQuery = $params['LD']['totalURL'];
                    }

                    // Scheduled time:
                    $schTime = $scheduledTime + round(count($duplicateTrack) * (60 / $reqMinute));
                    $schTime = floor($schTime / 60) * 60;

                    if (isset($duplicateTrack[$uKey])) {

                        //if the url key is registered just display it and do not resubmit it
                        $urlList = '<em><span class="typo3-dimmed">' . htmlspecialchars($urlQuery) . '</span></em><br/>';
                    } else {
                        $urlList = '[' . date('d.m.y H:i', $schTime) . '] ' . htmlspecialchars($urlQuery);
Bug: Are you sure date('d.m.y H:i', $schTime) of type false|string can be used in concatenation? ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation

                        $urlList = '[' . /** @scrutinizer ignore-type */ date('d.m.y H:i', $schTime) . '] ' . htmlspecialchars($urlQuery);

Bug: $schTime of type double is incompatible with the type integer expected by parameter $timestamp of date(). ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation

                        $urlList = '[' . date('d.m.y H:i', /** @scrutinizer ignore-type */ $schTime) . '] ' . htmlspecialchars($urlQuery);
                        $this->urlList[] = '[' . date('d.m.y H:i', $schTime) . '] ' . $urlQuery;

                        $theUrl = ($vv['subCfg']['baseUrl'] ? $vv['subCfg']['baseUrl'] : \TYPO3\CMS\Core\Utility\GeneralUtility::getIndpEnv('TYPO3_SITE_URL')) . $urlQuery;

                        // Submit for crawling!
                        if ($submitCrawlUrls) {
                            $added = $this->addUrl(
                            $pageRow['uid'],
                            $theUrl,
                            $vv['subCfg'],
                            $scheduledTime,
                            $configurationHash,
                            $skipInnerCheck
                            );
                            if ($added === false) {
                                $urlList .= ' (Url already existed)';
                            }
                        } elseif ($downloadCrawlUrls) {
                            $downloadUrls[$theUrl] = $theUrl;
                        }

                        $urlList .= '<br />';
                    }
                    $duplicateTrack[$uKey] = true;
                }
            }
        } else {
            $urlList = 'ERROR - no URL generated';
        }

        return $urlList;
Comprehensibility, Best Practice: The variable $urlList does not seem to be defined for all execution paths leading up to this point.
    }

    /**
     * Returns true if input processing instruction is among registered ones.
     *
     * @param string $piString PI to test
     * @param array $incomingProcInstructions Processing instructions
     * @return boolean
     */
    public function drawURLs_PIfilter($piString, array $incomingProcInstructions)
    {
        if (empty($incomingProcInstructions)) {
            return true;
        }

        foreach ($incomingProcInstructions as $pi) {
            if (\TYPO3\CMS\Core\Utility\GeneralUtility::inList($piString, $pi)) {
                return true;
            }
        }
    }

    public function getPageTSconfigForId($id)
    {
        if (!$this->MP) {
            $pageTSconfig = \TYPO3\CMS\Backend\Utility\BackendUtility::getPagesTSconfig($id);
Bug: The type TYPO3\CMS\Backend\Utility\BackendUtility was not found. Maybe you did not declare it correctly or list all dependencies?
        } else {
            list(, $mountPointId) = explode('-', $this->MP);
Bug: $this->MP of type true is incompatible with the type string expected by parameter $string of explode(). ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation

            list(, $mountPointId) = explode('-', /** @scrutinizer ignore-type */ $this->MP);
            $pageTSconfig = \TYPO3\CMS\Backend\Utility\BackendUtility::getPagesTSconfig($mountPointId);
        }

        // Call a hook to alter configuration
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['getPageTSconfigForId'])) {
            $params = [
                'pageId' => $id,
                'pageTSConfig' => &$pageTSconfig
            ];
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['getPageTSconfigForId'] as $userFunc) {
                \TYPO3\CMS\Core\Utility\GeneralUtility::callUserFunction($userFunc, $params, $this);
            }
        }

        return $pageTSconfig;
    }

    /**
     * This method returns an array of configurations.
     * And no urls!
     *
     * @param integer $id Page ID
     * @param bool $forceSsl Use https
     * @return array
     */
    protected function getUrlsForPageId($id, $forceSsl = false)
    {

        /**
         * Get configuration from tsConfig
         */

        // Get page TSconfig for page ID:
        $pageTSconfig = $this->getPageTSconfigForId($id);

        $res = [];

        if (is_array($pageTSconfig) && is_array($pageTSconfig['tx_crawler.']['crawlerCfg.'])) {
            $crawlerCfg = $pageTSconfig['tx_crawler.']['crawlerCfg.'];

            if (is_array($crawlerCfg['paramSets.'])) {
                foreach ($crawlerCfg['paramSets.'] as $key => $values) {
                    if (!is_array($values)) {

                        // Sub configuration for a single configuration string:
                        $subCfg = (array)$crawlerCfg['paramSets.'][$key . '.'];
                        $subCfg['key'] = $key;

                        if (strcmp($subCfg['procInstrFilter'], '')) {
                            $subCfg['procInstrFilter'] = implode(',', \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $subCfg['procInstrFilter']));
                        }
                        $pidOnlyList = implode(',', \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $subCfg['pidsOnly'], 1));

                        // process configuration if it is not page-specific or if the specific page is the current page:
                        if (!strcmp($subCfg['pidsOnly'], '') || \TYPO3\CMS\Core\Utility\GeneralUtility::inList($pidOnlyList, $id)) {

                                // add trailing slash if not present
                            if (!empty($subCfg['baseUrl']) && substr($subCfg['baseUrl'], -1) != '/') {
                                $subCfg['baseUrl'] .= '/';
                            }

                            // Explode, process etc.:
                            $res[$key] = [];
                            $res[$key]['subCfg'] = $subCfg;
                            $res[$key]['paramParsed'] = $this->parseParams($values);
                            $res[$key]['paramExpanded'] = $this->expandParameters($res[$key]['paramParsed'], $id);
                            $res[$key]['origin'] = 'pagets';

                            // recognize MP value
                            if (!$this->MP) {
                                $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $id]);
                            } else {
                                $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $id . '&MP=' . $this->MP]);
Bug: Are you sure $this->MP of type true can be used in concatenation? ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation

                                $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $id . '&MP=' . /** @scrutinizer ignore-type */ $this->MP]);
                            }
                        }
                    }
                }
            }
        }

        /**
         * Get configuration from tx_crawler_configuration records
         */

        // get records along the rootline
        $rootLine = \TYPO3\CMS\Backend\Utility\BackendUtility::BEgetRootLine($id);

        foreach ($rootLine as $page) {
            $configurationRecordsForCurrentPage = \TYPO3\CMS\Backend\Utility\BackendUtility::getRecordsByField(
                'tx_crawler_configuration',
                'pid',
                intval($page['uid']),
                \TYPO3\CMS\Backend\Utility\BackendUtility::BEenableFields('tx_crawler_configuration') . \TYPO3\CMS\Backend\Utility\BackendUtility::deleteClause('tx_crawler_configuration')
            );

            if (is_array($configurationRecordsForCurrentPage)) {
                foreach ($configurationRecordsForCurrentPage as $configurationRecord) {

                        // check access to the configuration record
                    if (empty($configurationRecord['begroups']) || $GLOBALS['BE_USER']->isAdmin() || $this->hasGroupAccess($GLOBALS['BE_USER']->user['usergroup_cached_list'], $configurationRecord['begroups'])) {
                        $pidOnlyList = implode(',', \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $configurationRecord['pidsonly'], 1));

                        // process configuration if it is not page-specific or if the specific page is the current page:
                        if (!strcmp($configurationRecord['pidsonly'], '') || \TYPO3\CMS\Core\Utility\GeneralUtility::inList($pidOnlyList, $id)) {
                            $key = $configurationRecord['name'];

                            // don't overwrite previously defined paramSets
                            if (!isset($res[$key])) {

                                    /* @var $TSparserObject \TYPO3\CMS\Core\TypoScript\Parser\TypoScriptParser */
                                $TSparserObject = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Core\TypoScript\Parser\TypoScriptParser');
                                $TSparserObject->parse($configurationRecord['processing_instruction_parameters_ts']);

                                $subCfg = [
                                    'procInstrFilter' => $configurationRecord['processing_instruction_filter'],
                                    'procInstrParams.' => $TSparserObject->setup,
                                    'baseUrl' => $this->getBaseUrlForConfigurationRecord(
                                        $configurationRecord['base_url'],
                                        $configurationRecord['sys_domain_base_url'],
                                        $forceSsl
                                    ),
                                    'realurl' => $configurationRecord['realurl'],
                                    'cHash' => $configurationRecord['chash'],
                                    'userGroups' => $configurationRecord['fegroups'],
                                    'exclude' => $configurationRecord['exclude'],
                                    'rootTemplatePid' => (int) $configurationRecord['root_template_pid'],
                                    'key' => $key,
                                ];

                                // add trailing slash if not present
                                if (!empty($subCfg['baseUrl']) && substr($subCfg['baseUrl'], -1) != '/') {
                                    $subCfg['baseUrl'] .= '/';
                                }
                                if (!in_array($id, $this->expandExcludeString($subCfg['exclude']))) {
                                    $res[$key] = [];
                                    $res[$key]['subCfg'] = $subCfg;
                                    $res[$key]['paramParsed'] = $this->parseParams($configurationRecord['configuration']);
636
                                    $res[$key]['paramExpanded'] = $this->expandParameters($res[$key]['paramParsed'], $id);
637
                                    $res[$key]['URLs'] = $this->compileUrls($res[$key]['paramExpanded'], ['?id=' . $id]);
638
                                    $res[$key]['origin'] = 'tx_crawler_configuration_' . $configurationRecord['uid'];
639
                                }
640
                            }
641
                        }
642
                    }
643
                }
644
            }
645
        }
646
647
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['processUrls'])) {
648
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['processUrls'] as $func) {
649
                $params = [
650
                    'res' => &$res,
651
                ];
652
                \TYPO3\CMS\Core\Utility\GeneralUtility::callUserFunction($func, $params, $this);
653
            }
654
        }
655
656
        return $res;
657
    }
658
659
    /**
660
     * Checks whether a domain record exists and returns the base URL built from it; otherwise the given $baseUrl string is returned.
661
     *
662
     * @param string $baseUrl
663
     * @param integer $sysDomainUid
664
     * @param bool $ssl
665
     * @return string
666
     */
667
    protected function getBaseUrlForConfigurationRecord($baseUrl, $sysDomainUid, $ssl = false)
668
    {
669
        $sysDomainUid = intval($sysDomainUid);
670
        $urlScheme = ($ssl === false) ? 'http' : 'https';
671
672
        if ($sysDomainUid > 0) {
673
            $res = $this->db->exec_SELECTquery(
674
                '*',
675
                'sys_domain',
676
                'uid = ' . $sysDomainUid .
677
                \TYPO3\CMS\Backend\Utility\BackendUtility::BEenableFields('sys_domain') .
678
                \TYPO3\CMS\Backend\Utility\BackendUtility::deleteClause('sys_domain')
679
            );
680
            $row = $this->db->sql_fetch_assoc($res);
681
            if ($row['domainName'] != '') {
682
                return $urlScheme . '://' . $row['domainName'];
683
            }
684
        }
685
        return $baseUrl;
686
    }
687
688
    public function getConfigurationsForBranch($rootid, $depth)
689
    {
690
        $configurationsForBranch = [];
691
692
        $pageTSconfig = $this->getPageTSconfigForId($rootid);
693
        if (is_array($pageTSconfig) && is_array($pageTSconfig['tx_crawler.']['crawlerCfg.']) && is_array($pageTSconfig['tx_crawler.']['crawlerCfg.']['paramSets.'])) {
694
            $sets = $pageTSconfig['tx_crawler.']['crawlerCfg.']['paramSets.'];
695
            if (is_array($sets)) {
696
                foreach ($sets as $key => $value) {
697
                    if (!is_array($value)) {
698
                        continue;
699
                    }
700
                    $configurationsForBranch[] = substr($key, -1) == '.' ? substr($key, 0, -1) : $key;
701
                }
702
            }
703
        }
704
        $pids = [];
705
        $rootLine = \TYPO3\CMS\Backend\Utility\BackendUtility::BEgetRootLine($rootid);
706
        foreach ($rootLine as $node) {
707
            $pids[] = $node['uid'];
708
        }
709
        /* @var $tree \TYPO3\CMS\Backend\Tree\View\PageTreeView */
710
        $tree = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Backend\Tree\View\PageTreeView');
711
        $perms_clause = $GLOBALS['BE_USER']->getPagePermsClause(1);
712
        $tree->init('AND ' . $perms_clause);
713
        $tree->getTree($rootid, $depth, '');
714
        foreach ($tree->tree as $node) {
715
            $pids[] = $node['row']['uid'];
716
        }
717
718
        $res = $this->db->exec_SELECTquery(
719
            '*',
720
            'tx_crawler_configuration',
721
            'pid IN (' . implode(',', $pids) . ') ' .
722
            \TYPO3\CMS\Backend\Utility\BackendUtility::BEenableFields('tx_crawler_configuration') .
723
            \TYPO3\CMS\Backend\Utility\BackendUtility::deleteClause('tx_crawler_configuration') . ' ' .
724
            \TYPO3\CMS\Backend\Utility\BackendUtility::versioningPlaceholderClause('tx_crawler_configuration') . ' '
725
        );
726
727
        while ($row = $this->db->sql_fetch_assoc($res)) {
728
            $configurationsForBranch[] = $row['name'];
729
        }
730
        $this->db->sql_free_result($res);
731
        return $configurationsForBranch;
732
    }
733
734
    /**
735
     * Check if a user has access to an item
736
     * (e.g. get the group list of the current logged in user from $GLOBALS['TSFE']->gr_list)
737
     *
738
     * @see \TYPO3\CMS\Frontend\Page\PageRepository::getMultipleGroupsWhereClause()
739
     * @param  string $groupList    Comma-separated list of (fe_)group UIDs from a user
740
     * @param  string $accessList   Comma-separated list of (fe_)group UIDs of the item to access
741
     * @return bool                 TRUE if at least one of the user's group UIDs is in the access list or the access list is empty
742
     */
743
    public function hasGroupAccess($groupList, $accessList)
744
    {
745
        if (empty($accessList)) {
746
            return true;
747
        }
748
        foreach (\TYPO3\CMS\Core\Utility\GeneralUtility::intExplode(',', $groupList) as $groupUid) {
749
            if (\TYPO3\CMS\Core\Utility\GeneralUtility::inList($accessList, $groupUid)) {
750
                return true;
751
            }
752
        }
753
        return false;
754
    }
755
756
    /**
757
     * Parse GET vars of input Query into array with key=>value pairs
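     *
     * Example (illustrative, derived from the code below): parseParams('&L=[0|1]&tx_x[uid]=5')
     * yields ['L' => '[0|1]', 'tx_x[uid]' => '5']; empty parameter names are skipped.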
758
     *
759
     * @param string $inputQuery Input query string
760
     * @return array
761
     */
762
    public function parseParams($inputQuery)
763
    {
764
        // Extract all GET parameters into an ARRAY:
765
        $paramKeyValues = [];
766
        $GETparams = explode('&', $inputQuery);
767
768
        foreach ($GETparams as $paramAndValue) {
769
            list($p, $v) = explode('=', $paramAndValue, 2);
770
            if (strlen($p)) {
771
                $paramKeyValues[rawurldecode($p)] = rawurldecode($v);
772
            }
773
        }
774
775
        return $paramKeyValues;
776
    }
777
778
    /**
779
     * Will expand the parameters configuration to individual values. This follows a certain syntax of the value of each parameter.
780
     * Syntax of values:
781
     * - Basically: If the value is wrapped in [...] it will be expanded according to the following syntax, otherwise the value is taken literally
782
     * - Configuration is split by "|" and the parts are processed individually and finally added together
783
     * - For each configuration part:
784
     *         - "[int]-[int]" = Integer range, will be expanded to all values in between, values included, starting from low to high (max. 1000). Example "1-34" or "-40--30"
785
     *         - "_TABLE:[TCA table name];[_PID:[optional page id, default is current page]];[_ENABLELANG:1]" = Look up of table records from PID, filtering out deleted records. Example "_TABLE:tt_content; _PID:123"
786
     *        _ENABLELANG:1 picks only original records without their language overlays
787
     *         - Default: Literal value
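     *
     * Example (illustrative, derived from the code below): the value "[0-2|5]" expands to the values 0, 1, 2 and 5,
     * while the value "98" (no square brackets) stays the single literal value "98".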
788
     *
789
     * @param array $paramArray Array with key (GET var name) and values (value of GET var which is configuration for expansion)
790
     * @param integer $pid Current page ID
791
     * @return array
792
     */
793
    public function expandParameters($paramArray, $pid)
794
    {
795
        global $TCA;
796
797
        // Traverse parameter names:
798
        foreach ($paramArray as $p => $v) {
799
            $v = trim($v);
800
801
            // If value is encapsulated in square brackets it means there are some ranges of values to find, otherwise the value is literal
802
            if (substr($v, 0, 1) === '[' && substr($v, -1) === ']') {
803
                // So, find the value inside brackets and reset the paramArray value as an array.
804
                $v = substr($v, 1, -1);
805
                $paramArray[$p] = [];
806
807
                // Explode parts and traverse them:
808
                $parts = explode('|', $v);
Bug: It seems like $v can also be of type false; however, parameter $string of explode() only seems to accept string. Consider adding a type check.

If this is a false positive, you can also ignore this issue in your code via the ignore-type annotation:

                $parts = explode('|', /** @scrutinizer ignore-type */ $v);
809
                foreach ($parts as $pV) {
810
811
                        // Look for integer range (e.g. 1-34, or -40--30, which reads minus 40 to minus 30)
812
                    if (preg_match('/^(-?[0-9]+)\s*-\s*(-?[0-9]+)$/', trim($pV), $reg)) {
813
814
                        // Swap if first is larger than last:
815
                        if ($reg[1] > $reg[2]) {
816
                            $temp = $reg[2];
817
                            $reg[2] = $reg[1];
818
                            $reg[1] = $temp;
819
                        }
820
821
                        // Traverse range, add values:
822
                        $runAwayBrake = 1000; // Limit to size of range!
823
                        for ($a = $reg[1]; $a <= $reg[2];$a++) {
824
                            $paramArray[$p][] = $a;
825
                            $runAwayBrake--;
826
                            if ($runAwayBrake <= 0) {
827
                                break;
828
                            }
829
                        }
830
                    } elseif (substr(trim($pV), 0, 7) == '_TABLE:') {
831
832
                        // Parse parameters:
833
                        $subparts = \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(';', $pV);
834
                        $subpartParams = [];
835
                        foreach ($subparts as $spV) {
836
                            list($pKey, $pVal) = \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(':', $spV);
837
                            $subpartParams[$pKey] = $pVal;
838
                        }
839
840
                        // Table exists:
841
                        if (isset($TCA[$subpartParams['_TABLE']])) {
842
                            $lookUpPid = isset($subpartParams['_PID']) ? intval($subpartParams['_PID']) : $pid;
843
                            $pidField = isset($subpartParams['_PIDFIELD']) ? trim($subpartParams['_PIDFIELD']) : 'pid';
844
                            $where = isset($subpartParams['_WHERE']) ? $subpartParams['_WHERE'] : '';
845
                            $addTable = isset($subpartParams['_ADDTABLE']) ? $subpartParams['_ADDTABLE'] : '';
846
847
                            $fieldName = !empty($subpartParams['_FIELD']) ? $subpartParams['_FIELD'] : 'uid';
848
                            if ($fieldName === 'uid' || $TCA[$subpartParams['_TABLE']]['columns'][$fieldName]) {
849
                                $andWhereLanguage = '';
850
                                $transOrigPointerField = $TCA[$subpartParams['_TABLE']]['ctrl']['transOrigPointerField'];
851
852
                                if ($subpartParams['_ENABLELANG'] && $transOrigPointerField) {
853
                                    $andWhereLanguage = ' AND ' . $this->db->quoteStr($transOrigPointerField, $subpartParams['_TABLE']) . ' <= 0 ';
854
                                }
855
856
                                $where = $this->db->quoteStr($pidField, $subpartParams['_TABLE']) . '=' . intval($lookUpPid) . ' ' .
857
                                    $andWhereLanguage . $where;
858
859
                                $rows = $this->db->exec_SELECTgetRows(
860
                                    $fieldName,
861
                                    $subpartParams['_TABLE'] . $addTable,
862
                                    $where . \TYPO3\CMS\Backend\Utility\BackendUtility::deleteClause($subpartParams['_TABLE']),
863
                                    '',
864
                                    '',
865
                                    '',
866
                                    $fieldName
867
                                );
868
869
                                if (is_array($rows)) {
870
                                    $paramArray[$p] = array_merge($paramArray[$p], array_keys($rows));
871
                                }
872
                            }
873
                        }
874
                    } else { // Just add value:
875
                        $paramArray[$p][] = $pV;
876
                    }
877
                    // Hook for processing own expandParameters place holder
878
                    if (is_array($GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['crawler/class.tx_crawler_lib.php']['expandParameters'])) {
879
                        $_params = [
880
                            'pObj' => &$this,
881
                            'paramArray' => &$paramArray,
882
                            'currentKey' => $p,
883
                            'currentValue' => $pV,
884
                            'pid' => $pid
885
                        ];
886
                        foreach ($GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['crawler/class.tx_crawler_lib.php']['expandParameters'] as $key => $_funcRef) {
887
                            \TYPO3\CMS\Core\Utility\GeneralUtility::callUserFunction($_funcRef, $_params, $this);
888
                        }
889
                    }
890
                }
891
892
                // Make unique set of values and sort array by key:
893
                $paramArray[$p] = array_unique($paramArray[$p]);
894
                ksort($paramArray);
895
            } else {
896
                // Set the literal value as only value in array:
897
                $paramArray[$p] = [$v];
898
            }
899
        }
900
901
        return $paramArray;
902
    }
903
904
    /**
905
     * Compiling URLs from parameter array (output of expandParameters())
906
     * The number of URLs will be the multiplication of the number of parameter values for each key
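     *
     * Example (illustrative, derived from the code below): ['L' => ['0', '1'], 'type' => ['98']] with $urls = ['?id=1']
     * yields ['?id=1&L=0&type=98', '?id=1&L=1&type=98'], i.e. 2 x 1 = 2 URLs.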
907
     *
908
     * @param array $paramArray Output of expandParameters(): Array with keys (GET var names) and for each an array of values
909
     * @param array $urls URLs accumulated in this array (for recursion)
910
     * @return array
911
     */
912
    public function compileUrls($paramArray, $urls = [])
913
    {
914
        if (count($paramArray) && is_array($urls)) {
915
            // shift first off stack:
916
            reset($paramArray);
917
            $varName = key($paramArray);
918
            $valueSet = array_shift($paramArray);
919
920
            // Traverse value set:
921
            $newUrls = [];
922
            foreach ($urls as $url) {
923
                foreach ($valueSet as $val) {
924
                    $newUrls[] = $url . (strcmp($val, '') ? '&' . rawurlencode($varName) . '=' . rawurlencode($val) : '');
925
926
                    if (count($newUrls) > \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($this->extensionSettings['maxCompileUrls'], 1, 1000000000, 10000)) {
927
                        break;
928
                    }
929
                }
930
            }
931
            $urls = $newUrls;
932
            $urls = $this->compileUrls($paramArray, $urls);
933
        }
934
935
        return $urls;
936
    }
937
938
    /************************************
939
     *
940
     * Crawler log
941
     *
942
     ************************************/
943
944
    /**
945
     * Return array of records from crawler queue for input page ID
946
     *
947
     * @param integer $id Page ID for which to look up log entries.
948
     * @param string $filter Filter: "all" => all entries, "pending" => all that is not yet run, "finished" => all complete ones
949
     * @param boolean $doFlush If TRUE, the selected entries are DELETED(!) instead of returned.
950
     * @param boolean $doFullFlush If TRUE (together with $doFlush), matching entries of all pages are deleted, not only those of $id
951
     * @param integer $itemsPerPage Maximum number of entries per page (default: 10)
952
     * @return array
953
     */
954
    public function getLogEntriesForPageId($id, $filter = '', $doFlush = false, $doFullFlush = false, $itemsPerPage = 10)
955
    {
956
        // FIXME: Write Unit tests for Filters
957
        switch ($filter) {
958
            case 'pending':
959
                $addWhere = ' AND exec_time=0';
960
                break;
961
            case 'finished':
962
                $addWhere = ' AND exec_time>0';
963
                break;
964
            default:
965
                $addWhere = '';
966
                break;
967
        }
968
969
        // FIXME: Write unit test that ensures that the right records are deleted.
970
        if ($doFlush) {
971
            $this->flushQueue(($doFullFlush ? '1=1' : ('page_id=' . intval($id))) . $addWhere);
972
            return [];
973
        } else {
974
            return $this->db->exec_SELECTgetRows(
975
                '*',
976
                'tx_crawler_queue',
977
                'page_id=' . intval($id) . $addWhere,
978
                '',
979
                'scheduled DESC',
980
                (intval($itemsPerPage) > 0 ? intval($itemsPerPage) : '')
981
            );
982
        }
983
    }
984
985
    /**
986
     * Return array of records from crawler queue for input set ID
987
     *
988
     * @param integer $set_id Set ID for which to look up log entries.
989
     * @param string $filter Filter: "all" => all entries, "pending" => all that is not yet run, "finished" => all complete ones
990
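     * @param boolean $doFlush If TRUE, the selected entries are DELETED(!) instead of returned.
     * @param boolean $doFullFlush If TRUE (together with $doFlush), the whole queue is deleted, regardless of $set_id and $filter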
991
     * @param integer $itemsPerPage Maximum number of entries per page (default: 10)
992
     * @return array
993
     */
994
    public function getLogEntriesForSetId($set_id, $filter = '', $doFlush = false, $doFullFlush = false, $itemsPerPage = 10)
995
    {
996
        // FIXME: Write Unit tests for Filters
997
        switch ($filter) {
998
            case 'pending':
999
                $addWhere = ' AND exec_time=0';
1000
                break;
1001
            case 'finished':
1002
                $addWhere = ' AND exec_time>0';
1003
                break;
1004
            default:
1005
                $addWhere = '';
1006
                break;
1007
        }
1008
        // FIXME: Write unit test that ensures that the right records are deleted.
1009
        if ($doFlush) {
1010
            $this->flushQueue($doFullFlush ? '' : ('set_id=' . intval($set_id) . $addWhere));
1011
            return [];
1012
        } else {
1013
            return $this->db->exec_SELECTgetRows(
1014
                '*',
1015
                'tx_crawler_queue',
1016
                'set_id=' . intval($set_id) . $addWhere,
1017
                '',
1018
                'scheduled DESC',
1019
                (intval($itemsPerPage) > 0 ? intval($itemsPerPage) : '')
1020
            );
1021
        }
1022
    }
1023
1024
    /**
1025
     * Removes queue entries
1026
     *
1027
     * @param string $where SQL related filter for the entries which should be removed
1028
     * @return void
1029
     */
1030
    protected function flushQueue($where = '')
1031
    {
1032
        $realWhere = strlen($where) > 0 ? $where : '1=1';
1033
1034
        if (tx_crawler_domain_events_dispatcher::getInstance()->hasObserver('queueEntryFlush')) {
1035
            $groups = $this->db->exec_SELECTgetRows('DISTINCT set_id', 'tx_crawler_queue', $realWhere);
1036
            foreach ($groups as $group) {
1037
                tx_crawler_domain_events_dispatcher::getInstance()->post('queueEntryFlush', $group['set_id'], $this->db->exec_SELECTgetRows('uid, set_id', 'tx_crawler_queue', $realWhere . ' AND set_id=' . intval($group['set_id'])));
1038
            }
1039
        }
1040
1041
        $this->db->exec_DELETEquery('tx_crawler_queue', $realWhere);
1042
    }
1043
1044
    /**
1045
     * Adds callback entries to the log (typically called from hooks, see indexed search class "class.crawler.php")
1046
     *
1047
     * @param integer $setId Set ID
1048
     * @param array $params Parameters to pass to call back function
1049
     * @param string $callBack Call back object reference, e.g. 'EXT:indexed_search/class.crawler.php:&tx_indexedsearch_crawler'
1050
     * @param integer $page_id Page ID to attach it to
1051
     * @param integer $schedule Time at which to activate
1052
     * @return void
1053
     */
1054
    public function addQueueEntry_callBack($setId, $params, $callBack, $page_id = 0, $schedule = 0)
1055
    {
1056
        if (!is_array($params)) {
1057
            $params = [];
1058
        }
1059
        $params['_CALLBACKOBJ'] = $callBack;
1060
1061
        // Compile value array:
1062
        $fieldArray = [
1063
            'page_id' => intval($page_id),
1064
            'parameters' => serialize($params),
1065
            'scheduled' => intval($schedule) ? intval($schedule) : $this->getCurrentTime(),
1066
            'exec_time' => 0,
1067
            'set_id' => intval($setId),
1068
            'result_data' => '',
1069
        ];
1070
1071
        $this->db->exec_INSERTquery('tx_crawler_queue', $fieldArray);
1072
    }
1073
1074
    /************************************
1075
     *
1076
     * URL setting
1077
     *
1078
     ************************************/
1079
1080
    /**
1081
     * Setting a URL for crawling:
1082
     *
1083
     * @param integer $id Page ID
1084
     * @param string $url Complete URL
1085
     * @param array $subCfg Sub configuration array (from TS config)
1086
     * @param integer $tstamp Scheduled-time
1087
     * @param string $configurationHash (optional) configuration hash
1088
     * @param bool $skipInnerDuplicationCheck (optional) skip inner duplication check
1089
     * @return bool
1090
     */
1091
    public function addUrl(
1092
        $id,
1093
        $url,
1094
        array $subCfg,
1095
        $tstamp,
1096
        $configurationHash = '',
1097
        $skipInnerDuplicationCheck = false
1098
    ) {
1099
        $urlAdded = false;
1100
1101
        // Creating parameters:
1102
        $parameters = [
1103
            'url' => $url
1104
        ];
1105
1106
        // fe user group simulation:
1107
        $uGs = implode(',', array_unique(\TYPO3\CMS\Core\Utility\GeneralUtility::intExplode(',', $subCfg['userGroups'], 1)));
1108
        if ($uGs) {
1109
            $parameters['feUserGroupList'] = $uGs;
1110
        }
1111
1112
        // Setting processing instructions
1113
        $parameters['procInstructions'] = \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $subCfg['procInstrFilter']);
1114
        if (is_array($subCfg['procInstrParams.'])) {
1115
            $parameters['procInstrParams'] = $subCfg['procInstrParams.'];
1116
        }
1117
1118
        // Possible TypoScript Template Parents
1119
        $parameters['rootTemplatePid'] = $subCfg['rootTemplatePid'];
1120
1121
        // Compile value array:
1122
        $parameters_serialized = serialize($parameters);
1123
        $fieldArray = [
1124
            'page_id' => intval($id),
1125
            'parameters' => $parameters_serialized,
1126
            'parameters_hash' => \TYPO3\CMS\Core\Utility\GeneralUtility::shortMD5($parameters_serialized),
1127
            'configuration_hash' => $configurationHash,
1128
            'scheduled' => $tstamp,
1129
            'exec_time' => 0,
1130
            'set_id' => intval($this->setID),
1131
            'result_data' => '',
1132
            'configuration' => $subCfg['key'],
1133
        ];
1134
1135
        if ($this->registerQueueEntriesInternallyOnly) {
Bug, Best Practice: The expression $this->registerQueueEntriesInternallyOnly of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using !empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(...) or !empty(...) instead.
1136
            //the entries will only be registered and not stored to the database
1137
            $this->queueEntries[] = $fieldArray;
1138
        } else {
1139
            if (!$skipInnerDuplicationCheck) {
1140
                // check if there is already an equal entry
1141
                $rows = $this->getDuplicateRowsIfExist($tstamp, $fieldArray);
1142
            }
1143
1144
            if (count($rows) == 0) {
Comprehensibility, Best Practice: The variable $rows does not seem to be defined for all execution paths leading up to this point.
1145
                $this->db->exec_INSERTquery('tx_crawler_queue', $fieldArray);
1146
                $uid = $this->db->sql_insert_id();
1147
                $rows[] = $uid;
1148
                $urlAdded = true;
1149
                tx_crawler_domain_events_dispatcher::getInstance()->post('urlAddedToQueue', $this->setID, ['uid' => $uid, 'fieldArray' => $fieldArray]);
1150
            } else {
1151
                tx_crawler_domain_events_dispatcher::getInstance()->post('duplicateUrlInQueue', $this->setID, ['rows' => $rows, 'fieldArray' => $fieldArray]);
1152
            }
1153
        }
1154
1155
        return $urlAdded;
1156
    }
1157
1158
    /**
1159
     * This method determines duplicates for a queue entry with the same parameters and this timestamp.
1160
     * If the timestamp is in the past, it will check if there is any unprocessed queue entry in the past.
1161
     * If the timestamp is in the future, it will check whether the queued entry has exactly the same timestamp.
1162
     *
1163
     * @param int $tstamp
1164
     * @param array $fieldArray
1165
     *
1166
     * @return array
1167
     */
1168
    protected function getDuplicateRowsIfExist($tstamp, $fieldArray)
1169
    {
1170
        $rows = [];
1171
1172
        $currentTime = $this->getCurrentTime();
1173
1174
        //if this entry is scheduled with "now"
1175
        if ($tstamp <= $currentTime) {
1176
            if ($this->extensionSettings['enableTimeslot']) {
1177
                $timeBegin = $currentTime - 100;
1178
                $timeEnd = $currentTime + 100;
1179
                $where = ' ((scheduled BETWEEN ' . $timeBegin . ' AND ' . $timeEnd . ' ) OR scheduled <= ' . $currentTime . ') ';
1180
            } else {
1181
                $where = 'scheduled <= ' . $currentTime;
1182
            }
1183
        } elseif ($tstamp > $currentTime) {
1184
            //entries with a timestamp in the future need to have the same schedule time
1185
            $where = 'scheduled = ' . $tstamp ;
1186
        }
1187
1188
        if (!empty($where)) {
1189
            $result = $this->db->exec_SELECTgetRows(
1190
                'qid',
1191
                'tx_crawler_queue',
1192
                $where .
1193
                ' AND NOT exec_time' .
1194
                ' AND NOT process_id ' .
1195
                ' AND page_id=' . intval($fieldArray['page_id']) .
1196
                ' AND parameters_hash = ' . $this->db->fullQuoteStr($fieldArray['parameters_hash'], 'tx_crawler_queue')
1197
            );
1198
1199
            if (is_array($result)) {
1200
                foreach ($result as $value) {
1201
                    $rows[] = $value['qid'];
1202
                }
1203
            }
1204
        }
1205
1206
        return $rows;
1207
    }
1208
1209
    /**
1210
     * Returns the current system time
1211
     *
1212
     * @return int
1213
     */
1214
    public function getCurrentTime()
1215
    {
1216
        return time();
1217
    }
1218
1219
    /************************************
1220
     *
1221
     * URL reading
1222
     *
1223
     ************************************/
1224
1225
    /**
1226
     * Read URL for single queue entry
1227
     *
1228
     * @param integer $queueId
1229
     * @param boolean $force If set, will process even if exec_time has been set!
1230
     * @return integer
1231
     */
1232
    public function readUrl($queueId, $force = false)
1233
    {
1234
        $ret = 0;
1235
        if ($this->debugMode) {
1236
            \TYPO3\CMS\Core\Utility\GeneralUtility::devlog('crawler-readurl start ' . microtime(true), __FUNCTION__);
1237
        }
1238
        // Get entry:
1239
        list($queueRec) = $this->db->exec_SELECTgetRows(
1240
            '*',
1241
            'tx_crawler_queue',
1242
            'qid=' . intval($queueId) . ($force ? '' : ' AND exec_time=0 AND process_scheduled > 0')
1243
        );
1244
1245
        if (!is_array($queueRec)) {
1246
            return;
1247
        }
1248
1249
        $parameters = unserialize($queueRec['parameters']);
1250
        if ($parameters['rootTemplatePid']) {
1251
            $this->initTSFE((int)$parameters['rootTemplatePid']);
1252
        } else {
1253
            \TYPO3\CMS\Core\Utility\GeneralUtility::sysLog(
1254
                'Page with (' . $queueRec['page_id'] . ') could not be crawled, please check your crawler configuration. Perhaps no Root Template Pid is set',
1255
                'crawler',
1256
                \TYPO3\CMS\Core\Utility\GeneralUtility::SYSLOG_SEVERITY_WARNING
1257
            );
1258
        }
1259
1260
        \AOE\Crawler\Utility\SignalSlotUtility::emitSignal(
1261
            __CLASS__,
1262
            \AOE\Crawler\Utility\SignalSlotUtility::SIGNNAL_QUEUEITEM_PREPROCESS,
1263
            [$queueId, &$queueRec]
1264
        );
1265
1266
        // Set exec_time to lock record:
1267
        $field_array = ['exec_time' => $this->getCurrentTime()];
1268
1269
        if (isset($this->processID)) {
1270
            // If multiprocessing is used, we need to store the ID of the process that handled this entry
1271
            $field_array['process_id_completed'] = $this->processID;
1272
        }
1273
        $this->db->exec_UPDATEquery('tx_crawler_queue', 'qid=' . intval($queueId), $field_array);
1274
1275
        $result = $this->readUrl_exec($queueRec);
1276
        $resultData = unserialize($result['content']);
1277
1278
        // At the moment there is no need to point to specific pollable extensions
1279
        if (is_array($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pollSuccess'])) {
1280
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pollSuccess'] as $pollable) {
1281
                // Only check the success value if the instruction is running
1282
                // It is important to name the pollSuccess key the same as the procInstructions key
1283
                if (is_array($resultData['parameters']['procInstructions']) && in_array(
1284
                    $pollable,
1285
                        $resultData['parameters']['procInstructions']
1286
                )
1287
                ) {
1288
                    if (!empty($resultData['success'][$pollable]) && $resultData['success'][$pollable]) {
1289
                        $ret |= self::CLI_STATUS_POLLABLE_PROCESSED;
1290
                    }
1291
                }
1292
            }
1293
        }
1294
1295
        // Set result in log which also denotes the end of the processing of this entry.
1296
        $field_array = ['result_data' => serialize($result)];
1297
1298
        \AOE\Crawler\Utility\SignalSlotUtility::emitSignal(
1299
            __CLASS__,
1300
            \AOE\Crawler\Utility\SignalSlotUtility::SIGNNAL_QUEUEITEM_POSTPROCESS,
1301
            [$queueId, &$field_array]
1302
        );
1303
1304
        $this->db->exec_UPDATEquery('tx_crawler_queue', 'qid=' . intval($queueId), $field_array);
1305
1306
        if ($this->debugMode) {
1307
            \TYPO3\CMS\Core\Utility\GeneralUtility::devlog('crawler-readurl stop ' . microtime(true), __FUNCTION__);
1308
        }
1309
1310
        return $ret;
1311
    }
1312
1313
    /**
1314
     * Read URL for not-yet-inserted log-entry
1315
     *
1316
     * @param integer $field_array Queue field array,
1317
     * @return string
1318
     */
1319
    public function readUrlFromArray($field_array)
1320
    {
1321
1322
        // Set exec_time to lock record:
1323
        $field_array['exec_time'] = $this->getCurrentTime();
1324
        $this->db->exec_INSERTquery('tx_crawler_queue', $field_array);
1325
        $queueId = $field_array['qid'] = $this->db->sql_insert_id();
1326
1327
        $result = $this->readUrl_exec($field_array);
Bug: $field_array of type integer is incompatible with the type array expected by parameter $queueRec of tx_crawler_lib::readUrl_exec(). If this is a false positive, the issue can be ignored via the ignore-type annotation.
1328
1329
        // Set result in log which also denotes the end of the processing of this entry.
1330
        $field_array = ['result_data' => serialize($result)];
1331
1332
        \AOE\Crawler\Utility\SignalSlotUtility::emitSignal(
1333
            __CLASS__,
1334
            \AOE\Crawler\Utility\SignalSlotUtility::SIGNNAL_QUEUEITEM_POSTPROCESS,
1335
            [$queueId, &$field_array]
1336
        );
1337
1338
        $this->db->exec_UPDATEquery('tx_crawler_queue', 'qid=' . intval($queueId), $field_array);
1339
1340
        return $result;
1341
    }
1342
1343
    /**
1344
     * Read URL for a queue record
1345
     *
1346
     * @param array $queueRec Queue record
1347
     * @return string
1348
     */
1349
    public function readUrl_exec($queueRec)
1350
    {
1351
        // Decode parameters:
1352
        $parameters = unserialize($queueRec['parameters']);
1353
        $result = 'ERROR';
1354
        if (is_array($parameters)) {
1355
            if ($parameters['_CALLBACKOBJ']) { // Calling object:
1356
                $objRef = $parameters['_CALLBACKOBJ'];
1357
                $callBackObj = &\TYPO3\CMS\Core\Utility\GeneralUtility::getUserObj($objRef);
1358
                if (is_object($callBackObj)) {
1359
                    unset($parameters['_CALLBACKOBJ']);
1360
                    $result = ['content' => serialize($callBackObj->crawler_execute($parameters, $this))];
1361
                } else {
1362
                    $result = ['content' => 'No object: ' . $objRef];
1363
                }
1364
            } else { // Regular FE request:
1365
1366
                // Prepare:
1367
                $crawlerId = $queueRec['qid'] . ':' . md5($queueRec['qid'] . '|' . $queueRec['set_id'] . '|' . $GLOBALS['TYPO3_CONF_VARS']['SYS']['encryptionKey']);
1368
1369
                // Get result:
1370
                $result = $this->requestUrl($parameters['url'], $crawlerId);
1371
1372
                tx_crawler_domain_events_dispatcher::getInstance()->post('urlCrawled', $queueRec['set_id'], ['url' => $parameters['url'], 'result' => $result]);
1373
            }
1374
        }
1375
1376
        return $result;
Bug Best Practice: The expression return $result could also return the type array<string,string>|array, which is incompatible with the documented return type string.
1377
    }
1378
1379
    /**
1380
     * Gets the content of a URL.
1381
     *
1382
     * @param string $originalUrl URL to read
1383
     * @param string $crawlerId Crawler ID string (qid + hash to verify)
1384
     * @param integer $timeout Timeout time
1385
     * @param integer $recursion Recursion limiter for 302 redirects
1386
     * @return array
1387
     */
1388
    public function requestUrl($originalUrl, $crawlerId, $timeout = 2, $recursion = 10)
1389
    {
1390
        if (!$recursion) {
1391
            return false;
Bug Best Practice: The expression return false returns the type false, which is incompatible with the documented return type array.
1392
        }
1393
1394
        // Parse URL, checking for scheme:
1395
        $url = parse_url($originalUrl);
1396
1397
        if ($url === false) {
1398
            if (TYPO3_DLOG) {
Bug: The constant TYPO3_DLOG was not found. Maybe you did not declare it correctly or list all dependencies?
1399
                \TYPO3\CMS\Core\Utility\GeneralUtility::devLog(sprintf('Could not parse_url() for string "%s"', $originalUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1400
            }
1401
            return false;
1402
        }
1403
1404
        if (!in_array($url['scheme'], ['','http','https'])) {
1405
            if (TYPO3_DLOG) {
1406
                \TYPO3\CMS\Core\Utility\GeneralUtility::devLog(sprintf('Scheme does not match for url "%s"', $url), 'crawler', 4, ['crawlerId' => $crawlerId]);
Bug: $url of type array is incompatible with the type string expected by parameter $args of sprintf(). If this is a false positive, the issue can be ignored via the ignore-type annotation.
1407
            }
1408
            return false;
Bug Best Practice: The expression return false returns the type false, which is incompatible with the documented return type array.
1409
        }
1410
1411
        // direct request
1412
        if ($this->extensionSettings['makeDirectRequests']) {
1413
            $result = $this->sendDirectRequest($originalUrl, $crawlerId);
Bug: $crawlerId of type string is incompatible with the type integer expected by parameter $crawlerId of tx_crawler_lib::sendDirectRequest(). If this is a false positive, the issue can be ignored via the ignore-type annotation.
1414
            return $result;
1415
        }
1416
1417
        $reqHeaders = $this->buildRequestHeaderArray($url, $crawlerId);
1418
1419
        // thanks to Pierrick Caillon for adding proxy support
1420
        $rurl = $url;
1421
1422
        if ($GLOBALS['TYPO3_CONF_VARS']['SYS']['curlUse'] && $GLOBALS['TYPO3_CONF_VARS']['SYS']['curlProxyServer']) {
1423
            $rurl = parse_url($GLOBALS['TYPO3_CONF_VARS']['SYS']['curlProxyServer']);
1424
            $url['path'] = $url['scheme'] . '://' . $url['host'] . ($url['port'] > 0 ? ':' . $url['port'] : '') . $url['path'];
1425
            $reqHeaders = $this->buildRequestHeaderArray($url, $crawlerId);
1426
        }
1427
1428
        $host = $rurl['host'];
1429
1430
        if ($url['scheme'] == 'https') {
1431
            $host = 'ssl://' . $host;
1432
            $port = ($rurl['port'] > 0) ? $rurl['port'] : 443;
1433
        } else {
1434
            $port = ($rurl['port'] > 0) ? $rurl['port'] : 80;
1435
        }
1436
1437
        $startTime = microtime(true);
1438
        $fp = fsockopen($host, $port, $errno, $errstr, $timeout);
1439
1440
        if (!$fp) {
1441
            if (TYPO3_DLOG) {
1442
                \TYPO3\CMS\Core\Utility\GeneralUtility::devLog(sprintf('Error while opening "%s"', $url), 'crawler', 4, ['crawlerId' => $crawlerId]);
Bug: $url of type array<mixed,mixed|string>|array is incompatible with the type string expected by parameter $args of sprintf(). If this is a false positive, the issue can be ignored via the ignore-type annotation.
1443
            }
1444
            return false;
Bug Best Practice: The expression return false returns the type false, which is incompatible with the documented return type array.
1445
        } else {
1446
            // Request message:
1447
            $msg = implode("\r\n", $reqHeaders) . "\r\n\r\n";
1448
            fputs($fp, $msg);
Bug: The call to fputs() has too few arguments starting with length. If this is a false positive, the issue can be ignored via the ignore-call annotation.

This check compares calls to functions or methods with their respective definitions. If the call has fewer arguments than are defined, it raises an issue.

If a function is defined several times with a different number of parameters, the check may pick up the wrong definition and report false positives. One codebase where this has been known to happen is WordPress.
1449
1450
            // Read response:
1451
            $d = $this->getHttpResponseFromStream($fp);
1452
            fclose($fp);
1453
1454
            $time = microtime(true) - $startTime;
1455
            $this->log($originalUrl . ' ' . $time);
1456
1457
            // Implode content and headers:
1458
            $result = [
1459
                'request' => $msg,
1460
                'headers' => implode('', $d['headers']),
1461
                'content' => implode('', (array)$d['content'])
1462
            ];
1463
1464
            if (($this->extensionSettings['follow30x']) && ($newUrl = $this->getRequestUrlFrom302Header($d['headers'], $url['user'], $url['pass']))) {
1466
                $newRequestUrl = $this->requestUrl($newUrl, $crawlerId, $timeout, --$recursion);
1467
1468
                if (is_array($newRequestUrl)) {
1469
                    $result = array_merge(['parentRequest' => $result], $newRequestUrl);
1470
                } else {
1471
                    if (TYPO3_DLOG) {
1472
                        \TYPO3\CMS\Core\Utility\GeneralUtility::devLog(sprintf('Error while opening "%s"', $newUrl), 'crawler', 4, ['crawlerId' => $crawlerId]);
1473
                    }
1474
                    return false;
1475
                }
1476
            }
1477
1478
            return $result;
1479
        }
1480
    }
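The redirect handling in requestUrl() only terminates if the recursion counter really shrinks on every hop. A minimal sketch of that idea, separated from the socket code, might look like the following. The function followRedirects() and the $fetch callable are hypothetical names introduced for illustration; they are not part of tx_crawler_lib, and the response shape (status, location, body) is an assumption.

```php
<?php
/**
 * Follows 301/302 redirects until a non-redirect response is reached
 * or the redirect budget is used up.
 *
 * @param callable $fetch Hypothetical fetcher: takes a URL, returns
 *                        ['status' => int, 'location' => ?string, 'body' => string]
 * @param string $url URL to request
 * @param int $maxRedirects Remaining redirect budget
 * @return array|false Final response (with 'parentRequest' chain) or false
 */
function followRedirects(callable $fetch, string $url, int $maxRedirects = 10)
{
    if ($maxRedirects <= 0) {
        return false; // budget exhausted, mirrors "if (!$recursion) return false"
    }

    $response = $fetch($url);
    $isRedirect = in_array($response['status'], [301, 302], true);

    if ($isRedirect && !empty($response['location'])) {
        // Pass the decremented value so the callee sees a smaller budget.
        $next = followRedirects($fetch, $response['location'], $maxRedirects - 1);
        if (!is_array($next)) {
            return false;
        }
        return array_merge(['parentRequest' => $response], $next);
    }

    return $response;
}

// Usage with an in-memory stand-in for real HTTP requests.
$pages = [
    '/old'  => ['status' => 302, 'location' => '/new', 'body' => ''],
    '/new'  => ['status' => 200, 'location' => null, 'body' => 'ok'],
    '/loop' => ['status' => 302, 'location' => '/loop', 'body' => ''],
];
$fetch = function (string $url) use ($pages): array {
    return $pages[$url];
};

$resolved = followRedirects($fetch, '/old');
$looping = followRedirects($fetch, '/loop', 3);
```

Passing $maxRedirects - 1 (or --$recursion) matters here: a post-decrement such as $recursion-- would hand the unchanged value to the recursive call, so a redirect loop would never hit the limit.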
1481
1482
    /**
1483
     * Gets the base path of the website frontend.
1484
     * (e.g. if you call http://mydomain.com/cms/index.php in
1485
     * the browser the base path is "/cms/")
1486
     *
1487
     * @return string Base path of the website frontend
1488
     */
1489
    protected function getFrontendBasePath()
1490
    {
1491
        $frontendBasePath = '/';
1492
1493
        // Get the path from the extension settings:
1494
        if (isset($this->extensionSettings['frontendBasePath']) && $this->extensionSettings['frontendBasePath']) {
1495
            $frontendBasePath = $this->extensionSettings['frontendBasePath'];
1496
            // If empty, try to use config.absRefPrefix:
1497
        } elseif (isset($GLOBALS['TSFE']->absRefPrefix) && !empty($GLOBALS['TSFE']->absRefPrefix)) {
1498
            $frontendBasePath = $GLOBALS['TSFE']->absRefPrefix;
1499
            // If not in CLI mode the base path can be determined from $_SERVER environment:
1500
        } elseif (!defined('TYPO3_REQUESTTYPE_CLI') || !TYPO3_REQUESTTYPE_CLI) {
Bug: The constant TYPO3_REQUESTTYPE_CLI was not found. Maybe you did not declare it correctly or list all dependencies?
1501
            $frontendBasePath = \TYPO3\CMS\Core\Utility\GeneralUtility::getIndpEnv('TYPO3_SITE_PATH');
1502
        }
1503
1504
        // Base path must be '/<pathSegments>/':
1505
        if ($frontendBasePath != '/') {
1506
            $frontendBasePath = '/' . ltrim($frontendBasePath, '/');
1507
            $frontendBasePath = rtrim($frontendBasePath, '/') . '/';
1508
        }
1509
1510
        return $frontendBasePath;
1511
    }
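The normalization step at the end of getFrontendBasePath() can be checked in isolation. The sketch below assumes a standalone helper named normalizeBasePath(), a hypothetical name that does not exist in the extension, and reproduces only the slash handling, not the lookup of the configured path.

```php
<?php
/**
 * Normalizes a base path to the form '/<pathSegments>/'.
 * An empty path or a path made only of slashes becomes '/'.
 */
function normalizeBasePath(string $path): string
{
    $trimmed = trim($path, '/');
    if ($trimmed === '') {
        return '/';
    }
    return '/' . $trimmed . '/';
}

$examples = [
    normalizeBasePath('cms'),
    normalizeBasePath('/cms/'),
    normalizeBasePath('/'),
    normalizeBasePath('a/b/'),
];
```

For these inputs the helper yields '/cms/', '/cms/', '/' and '/a/b/', which matches what the ltrim()/rtrim() pair in the method produces.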
1512
1513
    /**
1514
     * Executes a shell command and returns the outputted result.
1515
     *
1516
     * @param string $command Shell command to be executed
1517
     * @return string Outputted result of the command execution
1518
     */
1519
    protected function executeShellCommand($command)
1520
    {
1521
        $result = shell_exec($command);
1522
        return $result;
1523
    }
1524
1525
    /**
1526
     * Reads HTTP response from the given stream.
1527
     *
1528
     * @param  resource $streamPointer  Pointer to connection stream.
1529
     * @return array                    Associative array with the following items:
1530
     *                                  headers <array> Response headers sent by server.
1531
     *                                  content <array> Content, with each line as an array item.
1532
     */
1533
    protected function getHttpResponseFromStream($streamPointer)
1534
    {
1535
        $response = ['headers' => [], 'content' => []];
1536
1537
        if (is_resource($streamPointer)) {
1538
            // read headers
1539
            while (($line = fgets($streamPointer, 2048)) !== false) {
1540
                $line = trim($line);
1541
                if ($line !== '') {
1542
                    $response['headers'][] = $line;
1543
                } else {
1544
                    break;
1545
                }
1546
            }
1547
1548
            // read content
1549
            while (($line = fgets($streamPointer, 2048)) !== false) {
1550
                $response['content'][] = $line;
1551
            }
1552
        }
1553
1554
        return $response;
1555
    }
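The header/body split performed by getHttpResponseFromStream() can be exercised without a network connection by feeding it an in-memory stream. The following is a sketch under that assumption; readHttpResponse() is a hypothetical standalone copy of the logic, written with an integer length and an explicit comparison against false so that a final line consisting of "0" is not mistaken for end of input.

```php
<?php
/**
 * Reads an HTTP response from a stream.
 *
 * @param resource $stream Readable stream positioned at the status line
 * @return array ['headers' => string[], 'content' => string[]]
 */
function readHttpResponse($stream): array
{
    $response = ['headers' => [], 'content' => []];
    if (!is_resource($stream)) {
        return $response;
    }

    // Header lines end at the first blank line.
    while (($line = fgets($stream, 2048)) !== false) {
        $line = trim($line);
        if ($line === '') {
            break;
        }
        $response['headers'][] = $line;
    }

    // Everything after the blank line is content, kept line by line.
    while (($line = fgets($stream, 2048)) !== false) {
        $response['content'][] = $line;
    }

    return $response;
}

// php://memory stands in for the socket opened by fsockopen().
$stream = fopen('php://memory', 'r+');
fwrite($stream, "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<p>0</p>\n0");
rewind($stream);
$parsed = readHttpResponse($stream);
fclose($stream);
```

The sample body deliberately ends with a bare "0": a loop written as while ($line = fgets(...)) would drop that last line, because the string "0" is falsy in PHP.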
1556
1557
    /**
1558
     * @param string $message Message to append to the log file
1559
     */
1560
    protected function log($message)
1561
    {
1562
        if (!empty($this->extensionSettings['logFileName'])) {
1563
            @file_put_contents($this->extensionSettings['logFileName'], date('Ymd His') . ' ' . $message . PHP_EOL, FILE_APPEND);
Security Best Practice: It seems like you do not handle an error condition for file_put_contents(). This can introduce security issues, and is generally not recommended. If this is a false positive, the issue can be ignored via the ignore-unhandled annotation.

If you suppress an error, we recommend checking for the error condition explicitly:

// For example instead of
@mkdir($dir);

// Better use
if (@mkdir($dir) === false) {
    throw new \RuntimeException('The directory '.$dir.' could not be created.');
}

Bug: Are you sure date('Ymd His') of type false|string can be used in concatenation? If this is a false positive, the issue can be ignored via the ignore-type annotation.
1564
        }
1565
    }
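Applied to the log() method, the recommendation above could be sketched roughly as follows. The helper appendLogLine() and its boolean return value are assumptions made for illustration; the extension itself does not report write failures to the caller.

```php
<?php
/**
 * Appends a timestamped line to a log file.
 *
 * @param string $logFile Target file, created if it does not exist
 * @param string $message Message to log
 * @return bool True if the line was written, false otherwise
 */
function appendLogLine(string $logFile, string $message): bool
{
    $line = date('Ymd His') . ' ' . $message . PHP_EOL;

    // The warning is still suppressed, but the failure is now visible
    // to the caller through the return value.
    $written = @file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX);

    return $written !== false;
}

$logFile = tempnam(sys_get_temp_dir(), 'crawler');
$ok = appendLogLine($logFile, 'http://example.com/ 0.123');
$logged = file_get_contents($logFile);
unlink($logFile);

// Writing below a directory that does not exist fails and returns false.
$failed = appendLogLine(sys_get_temp_dir() . '/no-such-dir-' . uniqid() . '/crawler.log', 'x');
```

Whether a failed write should throw, be logged elsewhere, or be ignored is a design decision for the extension; the sketch only makes the error condition observable.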
1566
1567
    /**
1568
     * Builds HTTP request headers.
1569
     *
1570
     * @param array $url
1571
     * @param string $crawlerId
1572
     *
1573
     * @return array
1574
     */
1575
    protected function buildRequestHeaderArray(array $url, $crawlerId)
1576
    {
1577
        $reqHeaders = [];
1578
        $reqHeaders[] = 'GET ' . $url['path'] . ($url['query'] ? '?' . $url['query'] : '') . ' HTTP/1.0';
1579
        $reqHeaders[] = 'Host: ' . $url['host'];
1580
        if (stristr($url['query'], 'ADMCMD_previewWS')) {
1581
            $reqHeaders[] = 'Cookie: $Version="1"; be_typo_user="1"; $Path=/';
1582
        }
1583
        $reqHeaders[] = 'Connection: close';
1584
        if ($url['user'] != '') {
1585
            $reqHeaders[] = 'Authorization: Basic ' . base64_encode($url['user'] . ':' . $url['pass']);
1586
        }
1587
        $reqHeaders[] = 'X-T3crawler: ' . $crawlerId;
1588
        $reqHeaders[] = 'User-Agent: TYPO3 crawler';
1589
        return $reqHeaders;
1590
    }
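parse_url() omits keys for URL components that are absent, so $url['query'], $url['user'] and $url['pass'] may be undefined when buildRequestHeaderArray() reads them. A defensive variant could be sketched as below. buildRequestHeaders() is a hypothetical standalone function, and the fallback of '/' for a missing path is an assumption that differs from the original, which would emit an empty path.

```php
<?php
/**
 * Builds HTTP/1.0 request header lines from a parse_url() result.
 *
 * @param array $url Result of parse_url()
 * @param string $crawlerId Value for the X-T3crawler header
 * @return string[]
 */
function buildRequestHeaders(array $url, string $crawlerId): array
{
    // Missing components default to safe values instead of raising notices.
    $path  = $url['path'] ?? '/';
    $query = $url['query'] ?? '';
    $user  = $url['user'] ?? '';
    $pass  = $url['pass'] ?? '';

    $headers = [];
    $headers[] = 'GET ' . $path . ($query !== '' ? '?' . $query : '') . ' HTTP/1.0';
    $headers[] = 'Host: ' . ($url['host'] ?? '');
    if (stripos($query, 'ADMCMD_previewWS') !== false) {
        $headers[] = 'Cookie: $Version="1"; be_typo_user="1"; $Path=/';
    }
    $headers[] = 'Connection: close';
    if ($user !== '') {
        $headers[] = 'Authorization: Basic ' . base64_encode($user . ':' . $pass);
    }
    $headers[] = 'X-T3crawler: ' . $crawlerId;
    $headers[] = 'User-Agent: TYPO3 crawler';

    return $headers;
}

$plain = buildRequestHeaders(parse_url('http://example.com/index.php?id=1'), '42:abc');
$withAuth = buildRequestHeaders(parse_url('http://user:secret@example.com/'), '42:abc');
```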
1591
1592
    /**
1593
     * Checks whether the submitted HTTP headers contain a redirect location and builds a new crawler URL
1594
     *
1595
     * @param array $headers HTTP Header
1596
     * @param string $user HTTP Auth. User
1597
     * @param string $pass HTTP Auth. Password
1598
     * @return string
1599
     */
1600
    protected function getRequestUrlFrom302Header($headers, $user = '', $pass = '')
1601
    {
1602
        if (!is_array($headers)) {
1603
            return false;
1604
        }
1605
        if (!(stristr($headers[0], '301 Moved') || stristr($headers[0], '302 Found') || stristr($headers[0], '302 Moved'))) {
1606
            return false;
Bug Best Practice: The expression return false returns the type false, which is incompatible with the documented return type string.
1607
        }
1608
1609
        foreach ($headers as $hl) {
1610
            $tmp = explode(": ", $hl);
1611
            $header[trim($tmp[0])] = trim($tmp[1]);
1612
            if (trim($tmp[0]) == 'Location') {
1613
                break;
1614
            }
1615
        }
1616
        if (!array_key_exists('Location', $header)) {
Comprehensibility Best Practice: The variable $header seems to be defined by a foreach iteration on line 1609. Are you sure the iterator is never empty, otherwise this variable is not defined?
1617
            return false;
Bug Best Practice: The expression return false returns the type false, which is incompatible with the documented return type string.
1618
        }
1619
1620
        if ($user != '') {
1621
            if (!($tmp = parse_url($header['Location']))) {
1622
                return false;
Bug Best Practice: The expression return false returns the type false, which is incompatible with the documented return type string.
1623
            }
1624
            $newUrl = $tmp['scheme'] . '://' . $user . ':' . $pass . '@' . $tmp['host'] . $tmp['path'];
1625
            if ($tmp['query'] != '') {
1626
                $newUrl .= '?' . $tmp['query'];
1627
            }
1628
        } else {
1629
            $newUrl = $header['Location'];
1630
        }
1631
        return $newUrl;
1632
    }
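Several of the issues reported for getRequestUrlFrom302Header() (the possibly undefined $header variable and the mixed false/string return) go away if the Location lookup returns as soon as it finds a match. The sketch below assumes a hypothetical helper extractRedirectLocation(); it matches on the 301/302 status code rather than on the reason phrase, treats header names case-insensitively, and leaves out the re-insertion of user and password into the new URL.

```php
<?php
/**
 * Returns the redirect target from a list of response header lines.
 *
 * @param string[] $headers Header lines, status line first
 * @return string|false Location value, or false if there is no 301/302 redirect
 */
function extractRedirectLocation(array $headers)
{
    $statusLine = $headers[0] ?? '';
    if (!preg_match('/\s30[12](\s|$)/', $statusLine)) {
        return false;
    }

    foreach ($headers as $line) {
        // Limit of 2 keeps the colons inside the URL intact.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && strcasecmp(trim($parts[0]), 'Location') === 0) {
            return trim($parts[1]);
        }
    }

    return false;
}

$redirect = extractRedirectLocation([
    'HTTP/1.1 302 Found',
    'Content-Length: 0',
    'location: http://example.com/new?x=1',
]);
$noRedirect = extractRedirectLocation(['HTTP/1.1 200 OK', 'Location: http://example.com/']);
$noLocation = extractRedirectLocation(['HTTP/1.1 301 Moved Permanently']);
```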
1633
1634
    /**************************
1635
     *
1636
     * tslib_fe hooks:
1637
     *
1638
     **************************/
1639
1640
    /**
1641
     * Initialization hook (called after database connection)
1642
     * Takes the "HTTP_X_T3CRAWLER" header and looks up queue record and verifies if the session comes from the system (by comparing hashes)
1643
     *
1644
     * @param array $params Parameters from frontend
1645
     * @param object $ref TSFE object (reference under PHP5)
1646
     * @return void
1647
     */
1648
    public function fe_init(&$params, $ref)
Unused Code: The parameter $ref is not used and could be removed. If this is a false positive, the issue can be ignored via the ignore-unused annotation.

This check looks for parameters that have been defined for a function or method, but which are not used in the method body.
1649
    {
1650
1651
        // Authenticate crawler request:
1652
        if (isset($_SERVER['HTTP_X_T3CRAWLER'])) {
1653
            list($queueId, $hash) = explode(':', $_SERVER['HTTP_X_T3CRAWLER']);
1654
            list($queueRec) = $this->db->exec_SELECTgetRows('*', 'tx_crawler_queue', 'qid=' . intval($queueId));
1655
1656
            // If a crawler record was found and hash was matching, set it up:
1657
            if (is_array($queueRec) && $hash === md5($queueRec['qid'] . '|' . $queueRec['set_id'] . '|' . $GLOBALS['TYPO3_CONF_VARS']['SYS']['encryptionKey'])) {
1658
                $params['pObj']->applicationData['tx_crawler']['running'] = true;
1659
                $params['pObj']->applicationData['tx_crawler']['parameters'] = unserialize($queueRec['parameters']);
1660
                $params['pObj']->applicationData['tx_crawler']['log'] = [];
1661
            } else {
1662
                die('No crawler entry found!');
Best Practice: Using exit here is not recommended.

In general, usage of exit should be done with care and only when running in a scripting context like a CLI script.
1663
            }
1664
        }
1665
    }
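The X-T3crawler value checked in fe_init() is the one built in readUrl_exec(): the queue ID, a colon, and an MD5 hash over the queue ID, set ID and encryption key. A sketch of both halves is shown below. The function names are hypothetical, the set ID is passed in directly instead of being read from tx_crawler_queue, and the use of hash_equals() for a timing-safe comparison is a suggested variation, not what the method currently does.

```php
<?php
/**
 * Builds the crawler ID sent in the X-T3crawler request header.
 */
function buildCrawlerId(int $qid, int $setId, string $encryptionKey): string
{
    return $qid . ':' . md5($qid . '|' . $setId . '|' . $encryptionKey);
}

/**
 * Verifies a received crawler ID against the expected hash.
 * In the extension the set ID would come from the queue record for the given qid.
 */
function verifyCrawlerId(string $headerValue, int $setId, string $encryptionKey): bool
{
    $parts = explode(':', $headerValue, 2);
    if (count($parts) !== 2 || !ctype_digit($parts[0])) {
        return false;
    }

    $expected = md5($parts[0] . '|' . $setId . '|' . $encryptionKey);

    // hash_equals() compares in constant time.
    return hash_equals($expected, $parts[1]);
}

$crawlerId = buildCrawlerId(42, 7, 'secret');
$valid = verifyCrawlerId($crawlerId, 7, 'secret');
$tampered = verifyCrawlerId('42:' . str_repeat('0', 32), 7, 'secret');
$malformed = verifyCrawlerId('garbage', 7, 'secret');
```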
1666
1667
    /*****************************
1668
     *
1669
     * Compiling URLs to crawl - tools
1670
     *
1671
     *****************************/
1672
1673
    /**
1674
     * @param integer $id Root page id to start from.
1675
     * @param integer $depth Depth of tree: 0 = only the id page, 1 = one sublevel, 99 = infinite
1676
     * @param integer $scheduledTime Unix Time when the URL is timed to be visited when put in queue
1677
     * @param integer $reqMinute Number of requests per minute (creates the interleave between requests)
1678
     * @param boolean $submitCrawlUrls If set, submits the URLs to queue in database (real crawling)
1679
     * @param boolean $downloadCrawlUrls If set (and $submitCrawlUrls is false), will fill $downloadUrls with entries
1680
     * @param array $incomingProcInstructions Array of processing instructions
1681
     * @param array $configurationSelection Array of configuration keys
1682
     * @return string
1683
     */
1684
    public function getPageTreeAndUrls(
1685
        $id,
1686
        $depth,
1687
        $scheduledTime,
1688
        $reqMinute,
1689
        $submitCrawlUrls,
1690
        $downloadCrawlUrls,
1691
        array $incomingProcInstructions,
1692
        array $configurationSelection
1693
    ) {
1694
        global $BACK_PATH;
1695
        global $LANG;
1696
        if (!is_object($LANG)) {
1697
            $LANG = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('language');
1698
            $LANG->init(0);
1699
        }
1700
        $this->scheduledTime = $scheduledTime;
Bug Best Practice: The property scheduledTime does not exist. Although not strictly required by PHP, it is generally a best practice to declare properties explicitly.
1701
        $this->reqMinute = $reqMinute;
Bug Best Practice: The property reqMinute does not exist. Although not strictly required by PHP, it is generally a best practice to declare properties explicitly.
1702
        $this->submitCrawlUrls = $submitCrawlUrls;
Bug Best Practice: The property submitCrawlUrls does not exist. Although not strictly required by PHP, it is generally a best practice to declare properties explicitly.
1703
        $this->downloadCrawlUrls = $downloadCrawlUrls;
Bug Best Practice: The property downloadCrawlUrls does not exist. Although not strictly required by PHP, it is generally a best practice to declare properties explicitly.
1704
        $this->incomingProcInstructions = $incomingProcInstructions;
1705
        $this->incomingConfigurationSelection = $configurationSelection;
1706
1707
        $this->duplicateTrack = [];
1708
        $this->downloadUrls = [];
1709
1710
        // Drawing tree:
1711
        /* @var $tree \TYPO3\CMS\Backend\Tree\View\PageTreeView */
1712
        $tree = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Backend\Tree\View\PageTreeView');
1713
        $perms_clause = $GLOBALS['BE_USER']->getPagePermsClause(1);
1714
        $tree->init('AND ' . $perms_clause);
1715
1716
        $pageinfo = \TYPO3\CMS\Backend\Utility\BackendUtility::readPageAccess($id, $perms_clause);
1717
1718
        // Set root row:
1719
        $tree->tree[] = [
1720
            'row' => $pageinfo,
1721
            'HTML' => \AOE\Crawler\Utility\IconUtility::getIconForRecord('pages', $pageinfo)
1722
        ];
1723
1724
        // Get branch beneath:
1725
        if ($depth) {
1726
            $tree->getTree($id, $depth, '');
1727
        }
1728
1729
        // Traverse page tree:
1730
        $code = '';
1731
1732
        foreach ($tree->tree as $data) {
1733
            $this->MP = false;
1734
1735
            // recognize mount points
1736
            if ($data['row']['doktype'] == 7) {
1737
                $mountpage = $this->db->exec_SELECTgetRows('*', 'pages', 'uid = ' . $data['row']['uid']);
1738
1739
                // fetch mounted pages
1740
                $this->MP = $mountpage[0]['mount_pid'] . '-' . $data['row']['uid'];
Documentation Bug: The property $MP was declared of type boolean, but $mountpage[0]['mount_pid...' . $data['row']['uid'] is of type string. Maybe add a type cast?

This check looks for assignments to scalar types that may be of the wrong type.

To ensure the code behaves as expected, it may be a good idea to add an explicit type cast.

$answer = 42;

$correct = false;

$correct = (bool) $answer;
1741
1742
                $mountTree = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Backend\Tree\View\PageTreeView');
1743
                $mountTree->init('AND ' . $perms_clause);
1744
                $mountTree->getTree($mountpage[0]['mount_pid'], $depth, '');
1745
1746
                foreach ($mountTree->tree as $mountData) {
1747
                    $code .= $this->drawURLs_addRowsForPage(
1748
                        $mountData['row'],
1749
                        $mountData['HTML'] . \TYPO3\CMS\Backend\Utility\BackendUtility::getRecordTitle('pages', $mountData['row'], true)
1750
                    );
1751
                }
1752
1753
                // replace page when mount_pid_ol is enabled
1754
                if ($mountpage[0]['mount_pid_ol']) {
1755
                    $data['row']['uid'] = $mountpage[0]['mount_pid'];
1756
                } else {
1757
                    // if the mount_pid_ol is not set the MP must not be used for the mountpoint page
1758
                    $this->MP = false;
1759
                }
1760
            }
1761
1762
            $code .= $this->drawURLs_addRowsForPage(
1763
                $data['row'],
1764
                $data['HTML'] . \TYPO3\CMS\Backend\Utility\BackendUtility::getRecordTitle('pages', $data['row'], true)
1765
            );
1766
        }
1767
1768
        return $code;
1769
    }
1770
1771
    /**
1772
     * Expands exclude string
1773
     *
1774
     * @param string $excludeString Exclude string
1775
     * @return array
1776
     */
1777
    public function expandExcludeString($excludeString)
1778
    {
1779
        // internal static caches;
1780
        static $expandedExcludeStringCache;
1781
        static $treeCache;
1782
1783
        if (empty($expandedExcludeStringCache[$excludeString])) {
1784
            $pidList = [];
1785
1786
            if (!empty($excludeString)) {
1787
                /* @var $tree \TYPO3\CMS\Backend\Tree\View\PageTreeView */
1788
                $tree = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\CMS\Backend\Tree\View\PageTreeView');
1789
                $tree->init('AND ' . $this->backendUser->getPagePermsClause(1));
1790
1791
                $excludeParts = \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $excludeString);
1792
1793
                foreach ($excludeParts as $excludePart) {
1794
                    list($pid, $depth) = \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode('+', $excludePart);
1795
1796
                    // default is "page only" = "depth=0"
1797
                    if (empty($depth)) {
1798
                        $depth = (stristr($excludePart, '+')) ? 99 : 0;
1799
                    }
1800
1801
                    $pidList[] = $pid;
1802
1803
                    if ($depth > 0) {
1804
                        if (empty($treeCache[$pid][$depth])) {
1805
                            $tree->reset();
1806
                            $tree->getTree($pid, $depth);
1807
                            $treeCache[$pid][$depth] = $tree->tree;
1808
                        }
1809
1810
                        foreach ($treeCache[$pid][$depth] as $data) {
1811
                            $pidList[] = $data['row']['uid'];
1812
                        }
1813
                    }
1814
                }
1815
            }
1816
1817
            $expandedExcludeStringCache[$excludeString] = array_unique($pidList);
1818
        }
1819
1820
        return $expandedExcludeStringCache[$excludeString];
1821
    }
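The exclude string handled by expandExcludeString() is a comma-separated list of entries of the form pid, pid+depth or pid+ (a trailing plus meaning "all levels", stored as depth 99). A sketch of just the parsing step, without the page tree lookup or the static caches, might look like this. parseExcludeString() is a hypothetical helper, and skipping empty entries is an assumption the original does not make.

```php
<?php
/**
 * Parses an exclude string such as "12+3, 45, 7+" into pid/depth pairs.
 *
 * @param string $excludeString Comma-separated list of "pid", "pid+depth" or "pid+"
 * @return array[] List of ['pid' => int, 'depth' => int]
 */
function parseExcludeString(string $excludeString): array
{
    $result = [];

    foreach (explode(',', $excludeString) as $part) {
        $part = trim($part);
        if ($part === '') {
            continue;
        }

        $pieces = array_map('trim', explode('+', $part, 2));
        $depth = $pieces[1] ?? '';

        // Default is "page only" (depth 0); a bare "+" means all levels.
        if (empty($depth)) {
            $depth = (strpos($part, '+') !== false) ? 99 : 0;
        }

        $result[] = ['pid' => (int)$pieces[0], 'depth' => (int)$depth];
    }

    return $result;
}

$parsed = parseExcludeString('12+3, 45, 7+');
```

One consequence of the empty() check, carried over from the method, is that an explicit "12+0" is treated like "12+" and expands to depth 99 rather than 0.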
1822
1823
    /**
1824
     * Create the rows for display of the page tree
1825
     * For each page a number of rows are shown displaying GET variable configuration
1826
     *
1827
     * @param    array        Page row
Bug: The type Page was not found. Maybe you did not declare it correctly or list all dependencies?

The issue could also be caused by a filter entry in the build configuration. If the path has been excluded in your configuration, e.g. excluded_paths: ["lib/*"], you can move it to the dependency path list as follows:

filter:
    dependency_paths: ["lib/*"]

For further information see https://scrutinizer-ci.com/docs/tools/php/php-scrutinizer/#list-dependency-paths
1828
     * @param    string        Page icon and title for row
1829
     * @return    string        HTML <tr> content (one or more)
1830
     */
1831
    public function drawURLs_addRowsForPage(array $pageRow, $pageTitleAndIcon)
1832
    {
1833
        $skipMessage = '';
1834
1835
        // Get list of configurations
1836
        $configurations = $this->getUrlsForPageRow($pageRow, $skipMessage);
1837
1838
        if (count($this->incomingConfigurationSelection) > 0) {
1839
            // remove configuration that does not match the current selection
1840
            foreach ($configurations as $confKey => $confArray) {
1841
                if (!in_array($confKey, $this->incomingConfigurationSelection)) {
1842
                    unset($configurations[$confKey]);
1843
                }
1844
            }
1845
        }

        // Traverse parameter combinations:
        $c = 0;
        $content = '';
        if (count($configurations)) {
            foreach ($configurations as $confKey => $confArray) {

                // Title column:
                if (!$c) {
                    $titleClm = '<td rowspan="' . count($configurations) . '">' . $pageTitleAndIcon . '</td>';
                } else {
                    $titleClm = '';
                }

                if (!in_array($pageRow['uid'], $this->expandExcludeString($confArray['subCfg']['exclude']))) {

                    // URL list:
                    $urlList = $this->urlListFromUrlArray(
                        $confArray,
                        $pageRow,
                        $this->scheduledTime,
                        $this->reqMinute,
                        $this->submitCrawlUrls,
                        $this->downloadCrawlUrls,
                        $this->duplicateTrack,
                        $this->downloadUrls,
                        $this->incomingProcInstructions // if empty, the URLs won't be filtered by processing instructions
                    );

                    // Expanded parameters:
                    $paramExpanded = '';
                    $calcAccu = [];
                    $calcRes = 1;
                    foreach ($confArray['paramExpanded'] as $gVar => $gVal) {
                        $paramExpanded .= '
                            <tr>
                                <td class="bgColor4-20">' . htmlspecialchars('&' . $gVar . '=') . '<br/>' .
                                                '(' . count($gVal) . ')' .
                                                '</td>
                                <td class="bgColor4" nowrap="nowrap">' . nl2br(htmlspecialchars(implode(chr(10), $gVal))) . '</td>
                            </tr>
                        ';
                        $calcRes *= count($gVal);
                        $calcAccu[] = count($gVal);
                    }
                    $paramExpanded = '<table class="lrPadding c-list param-expanded">' . $paramExpanded . '</table>';
                    $paramExpanded .= 'Comb: ' . implode('*', $calcAccu) . '=' . $calcRes;

                    // Options
                    $optionValues = '';
                    if ($confArray['subCfg']['userGroups']) {
                        $optionValues .= 'User Groups: ' . $confArray['subCfg']['userGroups'] . '<br/>';
                    }
                    if ($confArray['subCfg']['baseUrl']) {
                        $optionValues .= 'Base Url: ' . $confArray['subCfg']['baseUrl'] . '<br/>';
                    }
                    if ($confArray['subCfg']['procInstrFilter']) {
                        $optionValues .= 'ProcInstr: ' . $confArray['subCfg']['procInstrFilter'] . '<br/>';
                    }

                    // Compile row:
                    $content .= '
                        <tr class="bgColor' . ($c % 2 ? '-20' : '-10') . '">
                            ' . $titleClm . '
                            <td>' . htmlspecialchars($confKey) . '</td>
                            <td>' . nl2br(htmlspecialchars(rawurldecode(trim(str_replace('&', chr(10) . '&', \TYPO3\CMS\Core\Utility\GeneralUtility::implodeArrayForUrl('', $confArray['paramParsed'])))))) . '</td>
                            <td>' . $paramExpanded . '</td>
                            <td nowrap="nowrap">' . $urlList . '</td>
                            <td nowrap="nowrap">' . $optionValues . '</td>
                            <td nowrap="nowrap">' . \TYPO3\CMS\Core\Utility\DebugUtility::viewArray($confArray['subCfg']['procInstrParams.']) . '</td>
                        </tr>';
                } else {
                    $content .= '<tr class="bgColor' . ($c % 2 ? '-20' : '-10') . '">
                            ' . $titleClm . '
                            <td>' . htmlspecialchars($confKey) . '</td>
                            <td colspan="5"><em>No entries</em> (Page is excluded in this configuration)</td>
                        </tr>';
                }

                $c++;
            }
        } else {
            $message = !empty($skipMessage) ? ' (' . $skipMessage . ')' : '';

            // Compile row:
            $content .= '
                <tr class="bgColor-20" style="border-bottom: 1px solid black;">
                    <td>' . $pageTitleAndIcon . '</td>
                    <td colspan="6"><em>No entries</em>' . $message . '</td>
                </tr>';
        }

        return $content;
    }
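
The "Comb:" summary built in `drawURLs_addRowsForPage()` is the product of the number of expanded values per GET variable. As a small standalone illustration (a sketch only; `summarizeCombinations()` and the sample parameter names are hypothetical):

```php
<?php

/**
 * Hypothetical helper: builds the "Comb: a*b=n" summary from a map of
 * GET variable name => list of expanded values.
 */
function summarizeCombinations(array $paramExpanded): string
{
    $counts = array_map('count', $paramExpanded);

    // array_product() of an empty array is 1, matching the initial $calcRes = 1.
    return 'Comb: ' . implode('*', $counts) . '=' . array_product($counts);
}

echo summarizeCombinations([
    'L' => [0, 1],
    'tx_example[id]' => [10, 11, 12],
]) . PHP_EOL;
```

With two values for the first variable and three for the second, six URL combinations result.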

    /**
     * Counts the queue entries that are due but neither executed nor assigned to a process
     *
     * @return int
     */
    public function getUnprocessedItemsCount()
    {
        $res = $this->db->exec_SELECTquery(
            'count(*) as num',
            'tx_crawler_queue',
            'exec_time=0 AND process_scheduled=0 AND scheduled<=' . $this->getCurrentTime()
        );

        $count = $this->db->sql_fetch_assoc($res);
        return (int)$count['num'];
    }

    /*****************************
     *
     * CLI functions
     *
     *****************************/

    /**
     * Main function for running from Command Line PHP script (cron job)
     * See ext/crawler/cli/crawler_cli.phpsh for details
     *
     * @return int Bitmask of self::CLI_STATUS_* flags
     */
    public function CLI_main()
    {
        $this->setAccessMode('cli');
        $result = self::CLI_STATUS_NOTHING_PROCCESSED;
        $cliObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('tx_crawler_cli');

        if (isset($cliObj->cli_args['-h']) || isset($cliObj->cli_args['--help'])) {
            $cliObj->cli_validateArgs();
            $cliObj->cli_help();
            exit;
        }

        if (!$this->getDisabled() && $this->CLI_checkAndAcquireNewProcess($this->CLI_buildProcessId())) {
            $countInARun = $cliObj->cli_argValue('--countInARun') ? intval($cliObj->cli_argValue('--countInARun')) : $this->extensionSettings['countInARun'];
            // Seconds
            $sleepAfterFinish = $cliObj->cli_argValue('--sleepAfterFinish') ? intval($cliObj->cli_argValue('--sleepAfterFinish')) : $this->extensionSettings['sleepAfterFinish'];
            // Labelled as milliseconds, but CLI_run() passes the value to usleep(), which takes microseconds
            $sleepTime = $cliObj->cli_argValue('--sleepTime') ? intval($cliObj->cli_argValue('--sleepTime')) : $this->extensionSettings['sleepTime'];

            try {
                // Run process:
                $result = $this->CLI_run($countInARun, $sleepTime, $sleepAfterFinish);
            } catch (Exception $e) {
                $result = self::CLI_STATUS_ABORTED;
            }

            // Cleanup
            $this->db->exec_DELETEquery('tx_crawler_process', 'assigned_items_count = 0');

            //TODO can't we do that in a clean way?
            $this->CLI_releaseProcesses($this->CLI_buildProcessId());

            $this->CLI_debug("Unprocessed Items remaining:" . $this->getUnprocessedItemsCount() . " (" . $this->CLI_buildProcessId() . ")");
            $result |= ($this->getUnprocessedItemsCount() > 0 ? self::CLI_STATUS_REMAIN : self::CLI_STATUS_NOTHING_PROCCESSED);
        } else {
            $result |= self::CLI_STATUS_ABORTED;
        }

        return $result;
    }
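
`CLI_main()` combines its outcome into a single integer with `|=`, so a caller has to test individual bits. The sketch below shows one way to decode such a value. The numeric values are assumptions chosen for illustration (the `CLI_STATUS_*` constants are defined outside this section), and `describeStatus()` is a hypothetical helper.

```php
<?php

// Assumed values for illustration only; the real CLI_STATUS_* constants
// are defined elsewhere in the class and are not shown in this section.
const STATUS_NOTHING_PROCESSED = 0;
const STATUS_REMAIN = 1;
const STATUS_PROCESSED = 2;
const STATUS_ABORTED = 4;

/**
 * Hypothetical helper: lists the flag names set in a status bitmask.
 */
function describeStatus(int $status): array
{
    $flags = [
        'REMAIN' => STATUS_REMAIN,
        'PROCESSED' => STATUS_PROCESSED,
        'ABORTED' => STATUS_ABORTED,
    ];

    $set = [];
    foreach ($flags as $name => $bit) {
        if (($status & $bit) === $bit) {
            $set[] = $name;
        }
    }

    // A value with no bits set means nothing was processed.
    return $set === [] ? ['NOTHING_PROCESSED'] : $set;
}

$status = STATUS_PROCESSED | STATUS_REMAIN;
echo implode(', ', describeStatus($status)) . PHP_EOL;
```

Because the "nothing processed" status is zero under this assumption, OR-ing it into a result leaves the other flags unchanged.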

    /**
     * Function executed by crawler_im.php cli script.
     *
     * @return void
     */
    public function CLI_main_im()
    {
        $this->setAccessMode('cli_im');

        $cliObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('tx_crawler_cli_im');

        // Force user to admin state and set workspace to "Live":
        $this->backendUser->user['admin'] = 1;
        $this->backendUser->setWorkspace(0);

        // Print help
        if (!isset($cliObj->cli_args['_DEFAULT'][1])) {
            $cliObj->cli_validateArgs();
            $cliObj->cli_help();
            exit;
        }

        $cliObj->cli_validateArgs();

        if ($cliObj->cli_argValue('-o') === 'exec') {
            // NOTE: the property is documented as array, but a boolean is assigned here
            $this->registerQueueEntriesInternallyOnly = true;
        }

        if (isset($cliObj->cli_args['_DEFAULT'][2])) {
            // Crawler is called from the TYPO3 backend
            $pageId = \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][2], 0);
        } else {
            // Crawler is called from the CLI
            $pageId = \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][1], 0);
        }

        $configurationKeys = $this->getConfigurationKeys($cliObj);

        if (!is_array($configurationKeys)) {
            $configurations = $this->getUrlsForPageId($pageId);
            if (is_array($configurations)) {
                $configurationKeys = array_keys($configurations);
            } else {
                $configurationKeys = [];
            }
        }

        if ($cliObj->cli_argValue('-o') === 'queue' || $cliObj->cli_argValue('-o') === 'exec') {
            $reason = new tx_crawler_domain_reason();
            $reason->setReason(tx_crawler_domain_reason::REASON_GUI_SUBMIT);
            $reason->setDetailText('The cli script of the crawler added to the queue');
            tx_crawler_domain_events_dispatcher::getInstance()->post(
                'invokeQueueChange',
                $this->setID,
                ['reason' => $reason]
            );
        }

        if ($this->extensionSettings['cleanUpOldQueueEntries']) {
            $this->cleanUpOldQueueEntries();
        }

        $this->setID = \TYPO3\CMS\Core\Utility\GeneralUtility::md5int(microtime());
        $this->getPageTreeAndUrls(
            $pageId,
            \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($cliObj->cli_argValue('-d'), 0, 99),
            $this->getCurrentTime(),
            \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($cliObj->cli_isArg('-n') ? $cliObj->cli_argValue('-n') : 30, 1, 1000),
            $cliObj->cli_argValue('-o') === 'queue' || $cliObj->cli_argValue('-o') === 'exec',
            $cliObj->cli_argValue('-o') === 'url',
            \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $cliObj->cli_argValue('-proc'), 1),
            $configurationKeys
        );

        if ($cliObj->cli_argValue('-o') === 'url') {
            $cliObj->cli_echo(implode(chr(10), $this->downloadUrls) . chr(10), 1);
        } elseif ($cliObj->cli_argValue('-o') === 'exec') {
            $cliObj->cli_echo("Executing " . count($this->urlList) . " requests right away:\n\n");
            $cliObj->cli_echo(implode(chr(10), $this->urlList) . chr(10));
            $cliObj->cli_echo("\nProcessing:\n");

            foreach ($this->queueEntries as $queueRec) {
                $p = unserialize($queueRec['parameters']);
                $cliObj->cli_echo($p['url'] . ' (' . implode(',', $p['procInstructions']) . ') => ');

                $result = $this->readUrlFromArray($queueRec);

                $requestResult = unserialize($result['content']);
                if (is_array($requestResult)) {
                    $resLog = is_array($requestResult['log']) ? chr(10) . chr(9) . chr(9) . implode(chr(10) . chr(9) . chr(9), $requestResult['log']) : '';
                    $cliObj->cli_echo('OK: ' . $resLog . chr(10));
                } else {
                    $cliObj->cli_echo('Error checking Crawler Result: ' . substr(preg_replace('/\s+/', ' ', strip_tags($result['content'])), 0, 30000) . '...' . chr(10));
                }
            }
        } elseif ($cliObj->cli_argValue('-o') === 'queue') {
            $cliObj->cli_echo("Putting " . count($this->urlList) . " entries in queue:\n\n");
            $cliObj->cli_echo(implode(chr(10), $this->urlList) . chr(10));
        } else {
            $cliObj->cli_echo(count($this->urlList) . " entries found for processing. (Use -o to decide action):\n\n", 1);
            $cliObj->cli_echo(implode(chr(10), $this->urlList) . chr(10), 1);
        }
    }

    /**
     * Function executed by the flush cli script.
     *
     * @return bool
     */
    public function CLI_main_flush()
    {
        $this->setAccessMode('cli_flush');
        $cliObj = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('tx_crawler_cli_flush');

        // Force user to admin state and set workspace to "Live":
        $this->backendUser->user['admin'] = 1;
        $this->backendUser->setWorkspace(0);

        // Print help
        if (!isset($cliObj->cli_args['_DEFAULT'][1])) {
            $cliObj->cli_validateArgs();
            $cliObj->cli_help();
            exit;
        }

        $cliObj->cli_validateArgs();
        $pageId = \TYPO3\CMS\Core\Utility\MathUtility::forceIntegerInRange($cliObj->cli_args['_DEFAULT'][1], 0);
        $fullFlush = ($pageId == 0);

        $mode = $cliObj->cli_argValue('-o');

        switch ($mode) {
            case 'all':
                $result = $this->getLogEntriesForPageId($pageId, '', true, $fullFlush);
                break;
            case 'finished':
            case 'pending':
                $result = $this->getLogEntriesForPageId($pageId, $mode, true, $fullFlush);
                break;
            default:
                $cliObj->cli_validateArgs();
                $cliObj->cli_help();
                $result = false;
        }

        return $result !== false;
    }

    /**
     * Obtains configuration keys from the CLI arguments
     *
     * @param  tx_crawler_cli_im $cliObj    Command line object
     * @return array                        Array of keys (empty if no keys were given)
     */
    protected function getConfigurationKeys(tx_crawler_cli_im &$cliObj)
    {
        $parameter = trim($cliObj->cli_argValue('-conf'));
        return ($parameter != '' ? \TYPO3\CMS\Core\Utility\GeneralUtility::trimExplode(',', $parameter) : []);
    }

    /**
     * Running the functionality of the CLI (crawling URLs from queue)
     *
     * @param int $countInARun
     * @param int $sleepTime
     * @param int $sleepAfterFinish
     * @return int Bitmask of self::CLI_STATUS_* flags
     */
    public function CLI_run($countInARun, $sleepTime, $sleepAfterFinish)
    {
        $result = 0;
        $counter = 0;

        // First, run hooks:
        $this->CLI_runHooks();

        // Clean up the queue
        if (intval($this->extensionSettings['purgeQueueDays']) > 0) {
            $purgeDate = $this->getCurrentTime() - 24 * 60 * 60 * intval($this->extensionSettings['purgeQueueDays']);
            $this->db->exec_DELETEquery(
                'tx_crawler_queue',
                'exec_time!=0 AND exec_time<' . $purgeDate
            );
        }

        // Select entries:
        //TODO Shouldn't this reside within the transaction?
        $rows = $this->db->exec_SELECTgetRows(
            'qid,scheduled',
            'tx_crawler_queue',
            'exec_time=0
                AND process_scheduled= 0
                AND scheduled<=' . $this->getCurrentTime(),
            '',
            'scheduled, qid',
            intval($countInARun)
        );

        if (count($rows) > 0) {
            $quidList = [];

            foreach ($rows as $r) {
                $quidList[] = $r['qid'];
            }

            $processId = $this->CLI_buildProcessId();

            // Reserve queue entries for this process
            $this->db->sql_query('BEGIN');
            //TODO make sure we're not taking assigned queue entries
            $this->db->exec_UPDATEquery(
                'tx_crawler_queue',
                'qid IN (' . implode(',', $quidList) . ')',
                [
                    'process_scheduled' => intval($this->getCurrentTime()),
                    'process_id' => $processId
                ]
            );

            // Save the number of assigned queue entries to determine how many have been processed later
            $numberOfAffectedRows = $this->db->sql_affected_rows();
            $this->db->exec_UPDATEquery(
                'tx_crawler_process',
                "process_id = '" . $processId . "'",
                [
                    'assigned_items_count' => intval($numberOfAffectedRows)
                ]
            );

            if ($numberOfAffectedRows == count($quidList)) {
                $this->db->sql_query('COMMIT');
            } else {
                $this->db->sql_query('ROLLBACK');
                $this->CLI_debug("Nothing processed due to multi-process collision (" . $this->CLI_buildProcessId() . ")");
                return ($result | self::CLI_STATUS_ABORTED);
            }

            foreach ($rows as $r) {
                $result |= $this->readUrl($r['qid']);

                $counter++;
                usleep(intval($sleepTime)); // Just to relax the system

                // If the CLI has been disabled since this run started, return from the function
                // without marking the process as ended.
                if ($this->getDisabled()) {
                    return ($result | self::CLI_STATUS_ABORTED);
                }

                if (!$this->CLI_checkIfProcessIsActive($this->CLI_buildProcessId())) {
                    $this->CLI_debug("conflict / timeout (" . $this->CLI_buildProcessId() . ")");

                    //TODO might need an additional return code
                    $result |= self::CLI_STATUS_ABORTED;
                    break; // possible timeout
                }
            }

            sleep(intval($sleepAfterFinish));

            $msg = 'Rows: ' . $counter;
            $this->CLI_debug($msg . " (" . $this->CLI_buildProcessId() . ")");
        } else {
            $this->CLI_debug("Nothing within queue which needs to be processed (" . $this->CLI_buildProcessId() . ")");
        }

        if ($counter > 0) {
            $result |= self::CLI_STATUS_PROCESSED;
        }

        return $result;
    }
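
`CLI_run()` reserves queue rows by id and treats a mismatch between the affected-row count and the number of selected ids as a multi-process collision. One possible way to address the TODO about already assigned entries, shown here only as a sketch and not as the extension's implementation, is to repeat the selection conditions in the reservation's WHERE clause so that rows claimed by another process in the meantime no longer match. `buildReservationWhere()` is a hypothetical helper.

```php
<?php

/**
 * Hypothetical helper: WHERE clause for reserving queue rows that only
 * matches rows which are still unexecuted and unassigned.
 */
function buildReservationWhere(array $qids): string
{
    if ($qids === []) {
        // "qid IN ()" is not valid SQL, so refuse an empty list.
        throw new InvalidArgumentException('At least one queue id is required');
    }

    // Cast every id to int so the list is safe to embed in SQL.
    $idList = implode(',', array_map('intval', $qids));

    return 'qid IN (' . $idList . ') AND exec_time=0 AND process_scheduled=0';
}

echo buildReservationWhere(['3', '7', '11']) . PHP_EOL;
```

With such a clause, a row taken by a competing process reduces the affected-row count, so the existing count comparison would trigger the rollback path.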

    /**
     * Activate hooks
     *
     * @return void
     */
    public function CLI_runHooks()
    {
        global $TYPO3_CONF_VARS;
        if (is_array($TYPO3_CONF_VARS['EXTCONF']['crawler']['cli_hooks'])) {
            foreach ($TYPO3_CONF_VARS['EXTCONF']['crawler']['cli_hooks'] as $objRef) {
                $hookObj = &\TYPO3\CMS\Core\Utility\GeneralUtility::getUserObj($objRef);
                if (is_object($hookObj)) {
                    $hookObj->crawler_init($this);
                }
            }
        }
    }

    /**
     * Try to acquire a new process with the given id.
     * Also performs some auto-cleanup for orphan processes.
     * @todo preemption might not be the most elegant way to clean up
     *
     * @param string $id identification string for the process
     * @return boolean
     */
    public function CLI_checkAndAcquireNewProcess($id)
    {
        $ret = true;

        $systemProcessId = getmypid();
        if ($systemProcessId < 1) {
            return false;
        }

        $processCount = 0;
        $orphanProcesses = [];

        $this->db->sql_query('BEGIN');

        $res = $this->db->exec_SELECTquery(
            'process_id,ttl',
            'tx_crawler_process',
            'active=1 AND deleted=0'
        );

        $currentTime = $this->getCurrentTime();

        while ($row = $this->db->sql_fetch_assoc($res)) {
            if ($row['ttl'] < $currentTime) {
                $orphanProcesses[] = $row['process_id'];
            } else {
                $processCount++;
            }
        }

        // If there are fewer active processes than allowed, add a new one
        if ($processCount < intval($this->extensionSettings['processLimit'])) {
            $this->CLI_debug("add " . $this->CLI_buildProcessId() . " (" . ($processCount + 1) . "/" . intval($this->extensionSettings['processLimit']) . ")");

            // create new process record
            $this->db->exec_INSERTquery(
                'tx_crawler_process',
                [
                    'process_id' => $id,
                    'active' => '1',
                    'ttl' => ($currentTime + intval($this->extensionSettings['processMaxRunTime'])),
                    'system_process_id' => $systemProcessId
                ]
            );
        } else {
            $this->CLI_debug("Processlimit reached (" . ($processCount) . "/" . intval($this->extensionSettings['processLimit']) . ")");
            $ret = false;
        }

        $this->CLI_releaseProcesses($orphanProcesses, true); // maybe this should be somehow included into the current lock
        $this->CLI_deleteProcessesMarkedDeleted();

        $this->db->sql_query('COMMIT');

        return $ret;
    }
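
The ttl check in `CLI_checkAndAcquireNewProcess()` sorts active process records into orphans (ttl already passed) and processes that still count against the limit. A standalone sketch of that classification, using plain arrays instead of database rows (`partitionProcesses()`, the sample rows and the limit value are hypothetical):

```php
<?php

/**
 * Hypothetical helper: splits active process rows into orphans (ttl expired)
 * and a count of processes that are still within their ttl.
 */
function partitionProcesses(array $rows, int $currentTime): array
{
    $orphans = [];
    $running = 0;

    foreach ($rows as $row) {
        if ($row['ttl'] < $currentTime) {
            $orphans[] = $row['process_id'];
        } else {
            $running++;
        }
    }

    return ['orphans' => $orphans, 'running' => $running];
}

$now = 1000;
$state = partitionProcesses([
    ['process_id' => 'a1', 'ttl' => 900],
    ['process_id' => 'b2', 'ttl' => 1300],
    ['process_id' => 'c3', 'ttl' => 1000],
], $now);

$processLimit = 2; // assumed setting value, for illustration only
echo 'orphans: ' . implode(',', $state['orphans']) . PHP_EOL;
echo 'may start new process: ' . ($state['running'] < $processLimit ? 'yes' : 'no') . PHP_EOL;
```

A process whose ttl equals the current time still counts as running, because the comparison is strictly "less than".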

    /**
     * Release a process and the required resources
     *
     * @param  mixed    $releaseIds   string with a single process-id or array with multiple process-ids
     * @param  boolean  $withinLock   whether the DB actions are executed within an existing lock
     * @return boolean
     */
    public function CLI_releaseProcesses($releaseIds, $withinLock = false)
    {
        if (!is_array($releaseIds)) {
            $releaseIds = [$releaseIds];
        }

        if (count($releaseIds) === 0) {
            return false;   // nothing to release
        }

        if (!$withinLock) {
            $this->db->sql_query('BEGIN');
        }

        // some kind of 2nd chance algo - this way you need at least 2 processes to have a real cleanup
        // this ensures that a single process can't mess up the entire process table

        // mark all processes as deleted which have no "waiting" queue entries and which are not active
        $this->db->exec_UPDATEquery(
            'tx_crawler_queue',
            'process_id IN (SELECT process_id FROM tx_crawler_process WHERE active=0 AND deleted=0)',
            [
                'process_scheduled' => 0,
                'process_id' => ''
            ]
        );
        $this->db->exec_UPDATEquery(
            'tx_crawler_process',
            'active=0 AND deleted=0
            AND NOT EXISTS (
                SELECT * FROM tx_crawler_queue
                WHERE tx_crawler_queue.process_id = tx_crawler_process.process_id
                AND tx_crawler_queue.exec_time = 0
            )',
            [
                'deleted' => '1',
                'system_process_id' => 0
            ]
        );
        // mark all requested processes as non-active
        $this->db->exec_UPDATEquery(
            'tx_crawler_process',
            'process_id IN (\'' . implode('\',\'', $releaseIds) . '\') AND deleted=0',
            [
                'active' => '0'
            ]
        );
        $this->db->exec_UPDATEquery(
            'tx_crawler_queue',
            'exec_time=0 AND process_id IN ("' . implode('","', $releaseIds) . '")',
            [
                'process_scheduled' => 0,
                'process_id' => ''
            ]
        );

        if (!$withinLock) {
            $this->db->sql_query('COMMIT');
        }

        return true;
    }

    /**
     * Delete processes marked as deleted
     *
     * @return void
     */
    public function CLI_deleteProcessesMarkedDeleted()
    {
        $this->db->exec_DELETEquery('tx_crawler_process', 'deleted = 1');
    }

    /**
     * Check if there are still resources left for the process with the given id
     * Used to determine timeouts and to ensure a proper cleanup if there's a timeout
     *
     * @param  string  $pid  identification string for the process
     * @return boolean determines if the process is still active / has resources
     *
     * FIXME: Please remove the transaction; it is not needed for a single SELECT query.
     */
    public function CLI_checkIfProcessIsActive($pid)
    {
        $ret = false;
        $this->db->sql_query('BEGIN');
        $res = $this->db->exec_SELECTquery(
            'process_id,active,ttl',
            'tx_crawler_process',
            'process_id = \'' . $pid . '\'  AND deleted=0',
            '',
            'ttl',
            '0,1'
        );
        if ($row = $this->db->sql_fetch_assoc($res)) {
            $ret = intval($row['active']) == 1;
        }
        $this->db->sql_query('COMMIT');

        return $ret;
    }

    /**
     * Create a unique Id for the current process
     *
     * @return string  the ID
     */
    public function CLI_buildProcessId()
    {
        if (!$this->processID) {
            $this->processID = \TYPO3\CMS\Core\Utility\GeneralUtility::shortMD5($this->microtime(true));
        }
        return $this->processID;
    }

    /**
     * @param bool $get_as_float
     *
     * @return mixed
     */
    protected function microtime($get_as_float = false)
    {
        return microtime($get_as_float);
    }

    /**
     * Prints a message to the stdout (only if debug-mode is enabled)
     *
     * @param  string $msg  the message
     */
    public function CLI_debug($msg)
    {
        if (intval($this->extensionSettings['processDebug'])) {
            echo $msg . "\n";
            flush();
        }
    }
2510
    /**
     * Get URL content by making direct request to TYPO3.
     *
     * @param  string $url          Page URL
     * @param  int    $crawlerId    Crawler-ID
     * @return array
     */
    protected function sendDirectRequest($url, $crawlerId)
    {
        $requestHeaders = $this->buildRequestHeaderArray(parse_url($url), $crawlerId);

        $cmd = escapeshellcmd($this->extensionSettings['phpPath']);
        $cmd .= ' ';
        $cmd .= escapeshellarg(\TYPO3\CMS\Core\Utility\ExtensionManagementUtility::extPath('crawler') . 'cli/bootstrap.php');
        $cmd .= ' ';
        $cmd .= escapeshellarg($this->getFrontendBasePath());
        $cmd .= ' ';
        $cmd .= escapeshellarg($url);
        $cmd .= ' ';
        $cmd .= escapeshellarg(base64_encode(serialize($requestHeaders)));

        $startTime = microtime(true);
        $content = $this->executeShellCommand($cmd);
        $this->log($url . ' ' . (microtime(true) - $startTime));

        $result = [
            'request' => implode("\r\n", $requestHeaders) . "\r\n\r\n",
            'headers' => '',
            'content' => $content
        ];

        return $result;
    }

    /**
     * Cleans up entries that stayed for too long in the queue. These are:
     * - processed entries that are over 1.5 days in age
     * - scheduled entries that are over 7 days old
     *
     * @return void
     */
    protected function cleanUpOldQueueEntries()
    {
        $processedAgeInSeconds = $this->extensionSettings['cleanUpProcessedAge'] * 86400; // 24*60*60 Seconds in 24 hours
        $scheduledAgeInSeconds = $this->extensionSettings['cleanUpScheduledAge'] * 86400;

        $now = time();
        $condition = '(exec_time<>0 AND exec_time<' . ($now - $processedAgeInSeconds) . ') OR scheduled<=' . ($now - $scheduledAgeInSeconds);
        $this->flushQueue($condition);
    }

    /**
     * Initializes a TypoScript Frontend necessary for using TypoScript and TypoLink functions
     *
     * @param int $id
     * @param int $typeNum
     *
     * @return void
     */
    protected function initTSFE($id = 1, $typeNum = 0)
    {
        \TYPO3\CMS\Frontend\Utility\EidUtility::initTCA();
The type TYPO3\CMS\Frontend\Utility\EidUtility was not found. Maybe you did not declare it correctly or list all dependencies?

        if (!is_object($GLOBALS['TT'])) {
            $GLOBALS['TT'] = new \TYPO3\CMS\Core\TimeTracker\NullTimeTracker;
The type TYPO3\CMS\Core\TimeTracker\NullTimeTracker was not found. Maybe you did not declare it correctly or list all dependencies?

            $GLOBALS['TT']->start();
        }

        $GLOBALS['TSFE'] = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance(\TYPO3\CMS\Frontend\Controller\TypoScriptFrontendController::class, $GLOBALS['TYPO3_CONF_VARS'], $id, $typeNum);
The type TYPO3\CMS\Frontend\Contr...criptFrontendController was not found. Maybe you did not declare it correctly or list all dependencies?

        $GLOBALS['TSFE']->sys_page = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance(\TYPO3\CMS\Frontend\Page\PageRepository::class);
The type TYPO3\CMS\Frontend\Page\PageRepository was not found. Maybe you did not declare it correctly or list all dependencies?

        $GLOBALS['TSFE']->sys_page->init(true);
        $GLOBALS['TSFE']->connectToDB();
        $GLOBALS['TSFE']->initFEuser();
        $GLOBALS['TSFE']->determineId();
        $GLOBALS['TSFE']->initTemplate();
        $GLOBALS['TSFE']->rootLine = $GLOBALS['TSFE']->sys_page->getRootLine($id, '');
        $GLOBALS['TSFE']->getConfigArray();
        \TYPO3\CMS\Frontend\Page\PageGenerator::pagegenInit();
The type TYPO3\CMS\Frontend\Page\PageGenerator was not found. Maybe you did not declare it correctly or list all dependencies?

    }
}

if (defined('TYPO3_MODE') && $TYPO3_CONF_VARS[TYPO3_MODE]['XCLASS']['ext/crawler/class.tx_crawler_lib.php']) {
The constant TYPO3_MODE was not found. Maybe you did not declare it correctly or list all dependencies?
    include_once($TYPO3_CONF_VARS[TYPO3_MODE]['XCLASS']['ext/crawler/class.tx_crawler_lib.php']);
}
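
The age-based condition that cleanUpOldQueueEntries() assembles may be easier to follow in isolation. The following is a minimal sketch, not part of the extension: the function name buildCleanUpCondition() and its parameters are hypothetical, the current time is passed in so the result is reproducible, and the integer casts are an added assumption to keep fractional day settings (such as 1.5) from producing a non-integer cutoff in the SQL fragment.

```php
<?php

/**
 * Hypothetical helper: builds the WHERE fragment used to flush old queue entries.
 *
 * @param float|int $processedAgeInDays age limit for already processed entries
 * @param float|int $scheduledAgeInDays age limit for scheduled entries
 * @param int       $now                current Unix timestamp
 * @return string
 */
function buildCleanUpCondition($processedAgeInDays, $scheduledAgeInDays, $now)
{
    // 86400 = 24 * 60 * 60 seconds per day
    $processedCutoff = (int)($now - $processedAgeInDays * 86400);
    $scheduledCutoff = (int)($now - $scheduledAgeInDays * 86400);

    // Processed entries have exec_time set; unprocessed ones are matched by their schedule time.
    return '(exec_time<>0 AND exec_time<' . $processedCutoff . ') OR scheduled<=' . $scheduledCutoff;
}

echo buildCleanUpCondition(2, 7, 1000000) . "\n";
```

With limits of 2 and 7 days and a timestamp of 1000000, the cutoffs are 827200 and 395200 respectively, so only entries older than those timestamps match the condition.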