Test Failed
Push — 6-0 ( cfb4d5 )
by Tomas Norre
03:23
created

CrawlerApi   C

Complexity

Total Complexity 55

Size/Duplication

Total Lines 532
Duplicated Lines 0 %

Importance

Changes 0
Metric Value
dl 0
loc 532
rs 6
c 0
b 0
f 0
wmc 55

23 Methods

Rating   Name                                        Duplication   Size   Complexity
A        addPageToQueue()                            0             5      1
A        filterUnallowedConfigurations()             0             12     4
A        setAllowedConfigurations()                  0             3      1
B        addPageToQueueTimed()                       0             32     3
B        getCurrentCrawlingSpeed()                   0             34     5
A        getQueueStatistics()                        0             5      1
A        removeQueueEntrie()                         0             6      1
A        getQueueRepository()                        0             7      2
A        getLastProcessedQueueEntries()              0             3      1
B        getLatestCrawlTimestampForPage()            0             22     4
A        overwriteSetId()                            0             3      1
A        isPageInQueueTimed()                        0             5      1
A        getSetId()                                  0             3      1
A        getCrawlHistoryForPage()                    0             13     2
A        countUnprocessedItems()                     0             8      1
C        getPerformanceData()                        0             48     8
A        getUnprocessedItems()                       0             7      1
A        findCrawler()                               0             11     3
A        getQueueStatisticsByConfiguration()         0             14     2
A        countEntriesInQueueForPageByScheduletime()  0             20     2
A        getCrawlerProcInstructions()                0             7      2
C        isPageInQueue()                             0             33     7
A        getActiveProcessesCount()                   0             5      1

How to fix   Complexity   

Complex Class

Complex classes like CrawlerApi often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common way to find such a component is to look for fields and methods that share a prefix or suffix.
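As a rough illustration of the prefix heuristic, the sketch below buckets a few of the method names from the table above by their leading lowercase word. The helper name groupByPrefix() is hypothetical and is not part of the crawler extension; a shared verb such as "get" is only a first hint, so the groups still need a manual look.

```php
<?php
// Hypothetical helper: bucket method names by their leading lowercase word.
function groupByPrefix(array $methodNames): array
{
    $groups = [];
    foreach ($methodNames as $name) {
        // The prefix is everything before the first uppercase letter.
        if (preg_match('/^[a-z]+/', $name, $match) === 1) {
            $groups[$match[0]][] = $name;
        }
    }
    // Sort by prefix so the output is stable.
    ksort($groups);

    return $groups;
}

// A subset of the CrawlerApi methods listed in the table above.
$methods = [
    'addPageToQueue',
    'addPageToQueueTimed',
    'isPageInQueue',
    'isPageInQueueTimed',
    'getQueueStatistics',
    'getQueueStatisticsByConfiguration',
    'countUnprocessedItems',
];

foreach (groupByPrefix($methods) as $prefix => $names) {
    echo $prefix . ': ' . implode(', ', $names) . PHP_EOL;
}
```

The same loop could be pointed at a suffix or a shared word such as "Queue" or "Statistics" by changing the regular expression.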

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
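A minimal sketch of what Extract Class could look like for the statistics-related methods, assuming they only need the two counting calls used by getQueueStatistics(). The names PendingItemCounter, QueueStatistics and FixedCounter are hypothetical; the real queue repository is replaced by an in-memory stub so the example runs on its own.

```php
<?php
// Hypothetical interface: only the two counting calls the statistics code needs.
interface PendingItemCounter
{
    public function countAllAssignedPendingItems(): int;

    public function countAllPendingItems(): int;
}

// Extracted class: owns the statistics responsibility that currently lives in CrawlerApi.
final class QueueStatistics
{
    private $counter;

    public function __construct(PendingItemCounter $counter)
    {
        $this->counter = $counter;
    }

    public function summary(): array
    {
        return [
            'assignedButUnprocessed' => $this->counter->countAllAssignedPendingItems(),
            'unprocessed' => $this->counter->countAllPendingItems(),
        ];
    }
}

// In-memory stub standing in for the real queue repository.
final class FixedCounter implements PendingItemCounter
{
    private $assigned;
    private $pending;

    public function __construct(int $assigned, int $pending)
    {
        $this->assigned = $assigned;
        $this->pending = $pending;
    }

    public function countAllAssignedPendingItems(): int
    {
        return $this->assigned;
    }

    public function countAllPendingItems(): int
    {
        return $this->pending;
    }
}

$statistics = new QueueStatistics(new FixedCounter(3, 10));
print_r($statistics->summary());
```

CrawlerApi could then keep a thin getQueueStatistics() that delegates to the extracted class, which keeps existing callers working while the responsibility moves out.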

While breaking up the class, it is a good idea to analyze how other classes use CrawlerApi, and based on these observations, apply Extract Interface, too.
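Extract Interface can be sketched along the same lines. The example assumes that some callers only add pages and check whether a page is queued; PageQueue, InMemoryPageQueue and queueOnce() are hypothetical names used for illustration, and the simplified signatures do not match the real CrawlerApi methods exactly.

```php
<?php
// Hypothetical interface covering only what this kind of caller uses.
interface PageQueue
{
    public function addPageToQueue(int $uid): void;

    public function isPageInQueue(int $uid): bool;
}

// Simple in-memory implementation, useful as a test double.
final class InMemoryPageQueue implements PageQueue
{
    private $uids = [];

    public function addPageToQueue(int $uid): void
    {
        $this->uids[$uid] = true;
    }

    public function isPageInQueue(int $uid): bool
    {
        return isset($this->uids[$uid]);
    }
}

// A caller that depends on the interface instead of the concrete class.
function queueOnce(PageQueue $queue, int $uid): bool
{
    if ($queue->isPageInQueue($uid)) {
        return false;
    }
    $queue->addPageToQueue($uid);

    return true;
}

$queue = new InMemoryPageQueue();
var_dump(queueOnce($queue, 42));
var_dump(queueOnce($queue, 42));
```

Callers typed against the narrow interface no longer depend on the remaining methods of the class, which makes later splits less disruptive.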

<?php
namespace AOE\Crawler\Api;

/***************************************************************
 *  Copyright notice
 *
 *  (c) 2016 AOE GmbH <[email protected]>
 *
 *  All rights reserved
 *
 *  This script is part of the TYPO3 project. The TYPO3 project is
 *  free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  The GNU General Public License can be found at
 *  http://www.gnu.org/copyleft/gpl.html.
 *
 *  This script is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  This copyright notice MUST APPEAR in all copies of the script!
 ***************************************************************/

use TYPO3\CMS\Core\Utility\GeneralUtility;
Bug: The type TYPO3\CMS\Core\Utility\GeneralUtility was not found. Maybe you did not declare it correctly or list all dependencies?

The issue could also be caused by a filter entry in the build configuration. If the path has been excluded in your configuration, e.g. excluded_paths: ["lib/*"], you can move it to the dependency path list as follows:

filter:
    dependency_paths: ["lib/*"]

For further information see https://scrutinizer-ci.com/docs/tools/php/php-scrutinizer/#list-dependency-paths
use TYPO3\CMS\Core\Utility\MathUtility;
Bug: The type TYPO3\CMS\Core\Utility\MathUtility was not found. Maybe you did not declare it correctly or list all dependencies?
use TYPO3\CMS\Frontend\Page\PageRepository;
Bug: The type TYPO3\CMS\Frontend\Page\PageRepository was not found. Maybe you did not declare it correctly or list all dependencies?

/**
 * Class CrawlerApi
 *
 * @package AOE\Crawler\Api
 */
class CrawlerApi
{
    /**
     * @var \tx_crawler_lib
     */
    private $crawlerObj;

    /**
     * @var \tx_crawler_domain_queue_repository queue repository
     */
    protected $queueRepository;

    /**
     * @var array
     */
    protected $allowedConfigrations = [];

    /**
     * Each crawler run has a setid; this facade method delegates
     * it to the crawler object.
     *
     * @param int $id
     */
    public function overwriteSetId($id)
    {
        $this->findCrawler()->setID = intval($id);
    }

    /**
     * This method is used to limit the configuration selection to
     * a set of configurations.
     *
     * @param array $allowedConfigurations
     */
    public function setAllowedConfigurations(array $allowedConfigurations)
    {
        $this->allowedConfigrations = $allowedConfigurations;
    }

    /**
     * Returns the setID of the crawler
     *
     * @return int
     */
    public function getSetId()
    {
        return $this->findCrawler()->setID;
    }

    /**
     * Method to get an instance of the internal crawler singleton
     *
     * @return \tx_crawler_lib Instance of the crawler lib
     *
     * @throws \Exception
     */
    protected function findCrawler()
    {
        if (!is_object($this->crawlerObj)) {
            $this->crawlerObj = GeneralUtility::makeInstance(\tx_crawler_lib::class);
            $this->crawlerObj->setID = GeneralUtility::md5int(microtime());
        }

        if (is_object($this->crawlerObj)) {
            return $this->crawlerObj;
        } else {
            throw new \Exception('no crawler object', 1512659759);
        }
    }

    /**
     * Adds a page to the crawler queue by uid
     *
     * @param int $uid uid
     */
    public function addPageToQueue($uid)
    {
        $uid = intval($uid);
        //non timed elements will be added with timestamp 0
        $this->addPageToQueueTimed($uid, 0);
    }

    /**
     * This method is used to limit the processing instructions to the processing instructions
     * that are allowed.
     *
     * @param array $configurations
     *
     * @return array
     */
    protected function filterUnallowedConfigurations($configurations)
    {
        if (count($this->allowedConfigrations) > 0) {
            // remove configurations that do not match the current selection
            foreach ($configurations as $confKey => $confArray) {
                if (!in_array($confKey, $this->allowedConfigrations)) {
                    unset($configurations[$confKey]);
                }
            }
        }

        return $configurations;
    }

    /**
     * Adds a page to the crawlerqueue by uid and sets a
     * timestamp when the page should be crawled.
     *
     * @param int $uid pageid
     * @param int $time timestamp
     */
    public function addPageToQueueTimed($uid, $time)
    {
        $uid = intval($uid);
        $time = intval($time);

        $crawler = $this->findCrawler();
        $pageData = GeneralUtility::makeInstance(PageRepository::class)->getPage($uid);
        $configurations = $crawler->getUrlsForPageRow($pageData);
        $configurations = $this->filterUnallowedConfigurations($configurations);
        $downloadUrls = [];
        $duplicateTrack = [];

        if (is_array($configurations)) {
            foreach ($configurations as $cv) {
                //enable inserting of entries
                $crawler->registerQueueEntriesInternallyOnly = false;
Documentation Bug: It seems like false of type false is incompatible with the declared type array of property $registerQueueEntriesInternallyOnly.

Our type inference engine has found an assignment to a property that is incompatible with the declared type of that property.

Either this assignment is in error or the assigned type should be added to the documentation/type hint for that property.
                $crawler->urlListFromUrlArray(
                    $cv,
                    $pageData,
                    $time,
                    300,
                    true,
                    false,
                    $duplicateTrack,
                    $downloadUrls,
                    array_keys($this->getCrawlerProcInstructions())
                );

                //reset the queue because the entries have been written to the db
                unset($crawler->queueEntries);
            }
        } else {
            //no configuration found
        }
    }

    /**
     * Counts all entries in the database which are scheduled for a given page id and a schedule timestamp.
     *
     * @param int $page_uid
     * @param int $schedule_timestamp
     *
     * @return int
     */
    protected function countEntriesInQueueForPageByScheduletime($page_uid, $schedule_timestamp)
    {
        // cast both values before they are concatenated into the WHERE clause
        $page_uid = intval($page_uid);
        $schedule_timestamp = intval($schedule_timestamp);

        //is the same page scheduled for the same time and not yet executed?
        if ($schedule_timestamp == 0) {
            //untimed elements need an exec_time of 0 because they can occur multiple times
            $where = 'page_id=' . $page_uid . ' AND exec_time = 0 AND scheduled=' . $schedule_timestamp;
        } else {
            //timed elements have a fixed schedule time; if a record with this time
            //exists, it is either queued for the future or was queued in the past and
            //has therefore already been processed.
            $where = 'page_id=' . $page_uid . ' AND scheduled=' . $schedule_timestamp;
        }

        $row = $GLOBALS['TYPO3_DB']->sql_fetch_assoc($GLOBALS['TYPO3_DB']->exec_SELECTquery(
            'count(*) as cnt',
            'tx_crawler_queue',
            $where
        ));

        return intval($row['cnt']);
    }

    /**
     * Determines if a page is queued
     *
     * @param int $uid
     * @param bool $unprocessed_only
     * @param bool $timed_only
     * @param int|bool $timestamp
     *
     * @return bool
     */
    public function isPageInQueue($uid, $unprocessed_only = true, $timed_only = false, $timestamp = false)
    {
        // reject anything that is NOT an integer-like value
        if (!MathUtility::canBeInterpretedAsInteger($uid)) {
            throw new \InvalidArgumentException('Invalid parameter type', 1468931945);
        }

        $isPageInQueue = false;

        $whereClause = 'page_id = ' . (int)$uid;

        if (false !== $unprocessed_only) {
            $whereClause .= ' AND exec_time = 0';
        }

        if (false !== $timed_only) {
            $whereClause .= ' AND scheduled != 0';
        }

        if (false !== $timestamp) {
            $whereClause .= ' AND scheduled = ' . (int)$timestamp;
        }

        $count = $GLOBALS['TYPO3_DB']->exec_SELECTcountRows(
            '*',
            'tx_crawler_queue',
            $whereClause
        );

        if (false !== $count && $count > 0) {
            $isPageInQueue = true;
        }

        return $isPageInQueue;
    }

    /**
     * Method to return the latest crawl timestamp for a page.
     *
     * @param int $uid uid of the page
     * @param bool $future_crawldates_only
     * @param bool $unprocessed_only
     *
     * @return int
     */
    public function getLatestCrawlTimestampForPage($uid, $future_crawldates_only = false, $unprocessed_only = false)
    {
        $uid = intval($uid);
        $query = 'max(scheduled) as latest';
        $where = ' page_id = ' . $uid;

        if ($future_crawldates_only) {
            $where .= ' AND scheduled > ' . time();
        }

        if ($unprocessed_only) {
            $where .= ' AND exec_time = 0';
        }

        $rs = $GLOBALS['TYPO3_DB']->exec_SELECTquery($query, 'tx_crawler_queue', $where);
        if ($row = $GLOBALS['TYPO3_DB']->sql_fetch_assoc($rs)) {
            $res = $row['latest'];
        } else {
            $res = 0;
        }

        return $res;
    }

    /**
     * Returns an array with timestamps when the page has been scheduled for crawling and
     * at what time the scheduled crawl has been executed. The array also contains items that are
     * scheduled but have not been crawled yet.
     *
     * @param int $uid uid of the page
     * @param int|bool $limit maximum number of rows, or false for no limit
     *
     * @return array array with the crawl history of a page => 0 : scheduled time, 1 : executed time, 2 : set_id
     */
    public function getCrawlHistoryForPage($uid, $limit = false)
    {
        $uid = intval($uid);

        $query = 'scheduled, exec_time, set_id';
        $where = ' page_id = ' . $uid;

        // LIMIT must be a bare integer (a quoted string is not valid here); '' means no limit
        $limit_query = ($limit) ? (string)intval($limit) : '';

        $rows = $GLOBALS['TYPO3_DB']->exec_SELECTgetRows($query, 'tx_crawler_queue', $where, '', '', $limit_query);

        return $rows;
    }

    /**
     * Method to determine unprocessed items in the crawler queue.
     *
     * @return array
     */
    public function getUnprocessedItems()
    {
        $query = '*';
        $where = 'exec_time = 0';
        $rows = $GLOBALS['TYPO3_DB']->exec_SELECTgetRows($query, 'tx_crawler_queue', $where, '', 'page_id, scheduled');

        return $rows;
    }

    /**
     * Method to get the number of unprocessed items in the crawler
     *
     * @return int number of unprocessed items in the queue
     */
    public function countUnprocessedItems()
    {
        $query = 'count(page_id) as anz';
        $where = 'exec_time = 0';
        $rs = $GLOBALS['TYPO3_DB']->exec_SELECTquery($query, 'tx_crawler_queue', $where);
        $row = $GLOBALS['TYPO3_DB']->sql_fetch_assoc($rs);

        return $row['anz'];
    }

    /**
     * Method to check if a page is in the queue which is timed for a
     * date when it should be crawled
     *
     * @param int $uid uid of the page
     * @param boolean $show_unprocessed only respect unprocessed pages
     *
     * @return boolean
     */
    public function isPageInQueueTimed($uid, $show_unprocessed = true)
    {
        $uid = intval($uid);

        return $this->isPageInQueue($uid, $show_unprocessed);
    }

    /**
     * Reads the registered processingInstructions of the crawler
     *
     * @return array
     */
    private function getCrawlerProcInstructions()
    {
        if (isset($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['procInstructions'])) {
            return $GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['procInstructions'];
        }

        return [];
    }

    /**
     * Removes a queue entry with a given queue id
     *
     * @param int $qid
     */
    public function removeQueueEntrie($qid)
    {
        $qid = intval($qid);
        $table = 'tx_crawler_queue';
        $where = ' qid=' . $qid;
        $GLOBALS['TYPO3_DB']->exec_DELETEquery($table, $where);
    }

    /**
     * Get queue statistics
     *
     * @param void
     *
     * @return array array('assignedButUnprocessed' => <>, 'unprocessed' => <>);
     */
    public function getQueueStatistics()
    {
        return [
            'assignedButUnprocessed' => $this->getQueueRepository()->countAllAssignedPendingItems(),
            'unprocessed' => $this->getQueueRepository()->countAllPendingItems()
        ];
    }

    /**
     * Get queue repository
     *
     * @param void
     *
     * @return \tx_crawler_domain_queue_repository queue repository
     */
    protected function getQueueRepository()
    {
        if (!$this->queueRepository instanceof \tx_crawler_domain_queue_repository) {
            $this->queueRepository = new \tx_crawler_domain_queue_repository();
        }

        return $this->queueRepository;
    }

    /**
     * Get queue statistics by configuration
     *
     * @param void
     *
     * @return array array of array('configuration' => <>, 'assignedButUnprocessed' => <>, 'unprocessed' => <>)
     */
    public function getQueueStatisticsByConfiguration()
    {
        $statistics = $this->getQueueRepository()->countPendingItemsGroupedByConfigurationKey();

        $setIds = $this->getQueueRepository()->getSetIdWithUnprocessedEntries();

        $totals = $this->getQueueRepository()->getTotalQueueEntriesByConfiguration($setIds);

        // "merge" arrays
        foreach ($statistics as $key => &$value) {
            $value['total'] = $totals[$value['configuration']];
        }

        return $statistics;
    }

    /**
     * Get active processes count
     *
     * @param void
     *
     * @return int
     */
    public function getActiveProcessesCount()
    {
        $processRepository = new \tx_crawler_domain_process_repository();

        return $processRepository->countActive();
    }

    /**
     * Get last processed entries
     *
     * @param int limit
Bug: The type AOE\Crawler\Api\limit was not found. Maybe you did not declare it correctly or list all dependencies?
     *
     * @return array
     */
    public function getLastProcessedQueueEntries($limit)
    {
        return $this->getQueueRepository()->getLastProcessedEntries('*', $limit);
    }

    /**
     * Get current crawling speed
     *
     * @param float|false page speed in pages per minute
Bug: The type AOE\Crawler\Api\page was not found. Maybe you did not declare it correctly or list all dependencies?
     *
     * @return int
     */
    public function getCurrentCrawlingSpeed()
    {
        $lastProcessedEntries = $this->getQueueRepository()->getLastProcessedEntriesTimestamps();

        if (count($lastProcessedEntries) < 10) {
            // not enough information
            return false;
Bug Best Practice: The expression return false returns the type false which is incompatible with the documented return type integer.
        }

        $tooOldDelta = 60; // time between two entries is "too old"

        $compareValue = time();
        $startTime = $lastProcessedEntries[0];

        $pages = 0;

        reset($lastProcessedEntries);
        while (list($key, $timestamp) = each($lastProcessedEntries)) {
Deprecated Code: The function each() has been deprecated: 7.2 (Ignorable by Annotation)

If this is a false-positive, you can also ignore this issue in your code via the ignore-deprecated annotation:

        while (list($key, $timestamp) = /** @scrutinizer ignore-deprecated */ each($lastProcessedEntries)) {

This function has been deprecated. The supplier of the function has supplied an explanatory message.

The explanatory message should give you some clue as to whether and when the function will be removed and what other function to use instead.
            if ($compareValue - $timestamp > $tooOldDelta) {
                break;
            }
            $compareValue = $timestamp;
            $pages++;
        }

        if ($pages < 10) {
            // not enough information
            return false;
Bug Best Practice: The expression return false returns the type false which is incompatible with the documented return type integer.
        }
        $oldestTimestampThatIsNotTooOld = $compareValue;
        $time = $startTime - $oldestTimestampThatIsNotTooOld;
        $speed = $pages / ($time / 60);

        return $speed;
    }

    /**
     * Get some performance data
     *
     * @param integer $start
     * @param integer $end
     * @param integer $resolution
     *
     * @return array data
     *
     * @throws \Exception
     */
    public function getPerformanceData($start, $end, $resolution)
    {
        $data = [];

        $data['urlcount'] = 0;
        $data['start'] = $start;
        $data['end'] = $end;
        $data['duration'] = $data['end'] - $data['start'];

        if ($data['duration'] < 1) {
            throw new \Exception('End timestamp must be after start timestamp', 1512659945);
        }

        for ($slotStart = $start; $slotStart < $end; $slotStart += $resolution) {
            $slotEnd = min($slotStart + $resolution - 1, $end);
            $slotData = $this->getQueueRepository()->getPerformanceData($slotStart, $slotEnd);

            $slotUrlCount = 0;
            foreach ($slotData as $processId => &$processData) {
                $duration = $processData['end'] - $processData['start'];
                if ($processData['urlcount'] > 5 && $duration > 0) {
                    $processData['speed'] = 60 * 1 / ($duration / $processData['urlcount']);
                }
                $slotUrlCount += $processData['urlcount'];
            }
            // break the reference left behind by the by-reference foreach
            unset($processData);

            $data['urlcount'] += $slotUrlCount;

            $data['slots'][$slotEnd] = [
                'amountProcesses' => count($slotData),
                'urlcount' => $slotUrlCount,
                'processes' => $slotData,
            ];

            if ($slotUrlCount > 5) {
                // parenthesize the slot duration: division binds tighter than subtraction
                $data['slots'][$slotEnd]['speed'] = 60 * 1 / (($slotEnd - $slotStart) / $slotUrlCount);
            } else {
                $data['slots'][$slotEnd]['speed'] = 0;
            }
        }

        if ($data['urlcount'] > 5) {
            $data['speed'] = 60 * 1 / ($data['duration'] / $data['urlcount']);
        } else {
            $data['speed'] = 0;
        }

        return $data;
    }
}
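The each() deprecation flagged in getCurrentCrawlingSpeed() can be avoided with a plain foreach. The sketch below pulls the calculation into a standalone function so it can run without TYPO3; crawlingSpeed() and its parameters are hypothetical names. It assumes the timestamps are a zero-indexed list ordered newest first, and it adds a guard against a zero time span that the listing above does not have.

```php
<?php
/**
 * Hypothetical standalone variant of the speed calculation.
 *
 * @param int[] $timestamps processed-entry timestamps, newest first (assumed)
 * @param int $now current timestamp, passed in to keep the function testable
 * @param int $tooOldDelta maximum gap in seconds between two entries
 * @param int $minPages minimum number of entries needed for a result
 *
 * @return float|false pages per minute, or false when there is too little data
 */
function crawlingSpeed(array $timestamps, int $now, int $tooOldDelta = 60, int $minPages = 10)
{
    if (count($timestamps) < $minPages) {
        return false;
    }

    $compareValue = $now;
    $pages = 0;

    // foreach replaces the deprecated while (list(...) = each(...)) loop
    foreach ($timestamps as $timestamp) {
        if ($compareValue - $timestamp > $tooOldDelta) {
            break;
        }
        $compareValue = $timestamp;
        $pages++;
    }

    if ($pages < $minPages) {
        return false;
    }

    // time span between the newest entry and the oldest one that still counts
    $elapsed = $timestamps[0] - $compareValue;
    if ($elapsed <= 0) {
        // added guard (not in the listing above): avoids a division by zero
        return false;
    }

    return $pages / ($elapsed / 60);
}

// Ten entries, six seconds apart, newest first.
$timestamps = [1000, 994, 988, 982, 976, 970, 964, 958, 952, 946];
var_dump(crawlingSpeed($timestamps, 1000));
```

Documenting the return value as float|false in the docblock would also address the "incompatible with the documented return type" findings above.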