Passed
Push — deprecations ( 3cb403...87297a )
by Tomas Norre
10:49 queued 06:55

CrawlerApi   B

Complexity

Total Complexity 45

Size/Duplication

Total Lines 437
Duplicated Lines 0 %

Test Coverage

Coverage 38.17%

Importance

Changes 1
Bugs 0 Features 0
Metric Value
wmc 45
eloc 157
c 1
b 0
f 0
dl 0
loc 437
ccs 92
cts 241
cp 0.3817
rs 8.8

19 Methods

Rating   Name   Duplication   Size   Complexity  
A setAllowedConfigurations() 0 3 1
A getAllowedConfigurations() 0 3 1
A getSetId() 0 3 1
A overwriteSetId() 0 3 1
A __construct() 0 6 1
A addPageToQueue() 0 5 1
A filterUnallowedConfigurations() 0 12 4
A addPageToQueueTimed() 0 30 3
A getCurrentCrawlingSpeed() 0 33 5
A getQueueStatistics() 0 5 1
A getLastProcessedQueueEntries() 0 3 1
A getLatestCrawlTimestampForPage() 0 32 4
A getCrawlHistoryForPage() 0 17 2
B getPerformanceData() 0 48 8
A countEntriesInQueueForPageByScheduleTime() 0 26 2
A findCrawler() 0 11 3
A getQueueStatisticsByConfiguration() 0 12 2
A getCrawlerProcInstructions() 0 10 3
A getActiveProcessesCount() 0 5 1

How to fix   Complexity   

Complex Class

Complex classes like CrawlerApi often do a lot of different things. To break such a class down, we need to identify a cohesive component within it. A common way to find such a component is to look for fields and methods that share the same prefixes or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

While breaking up the class, it is a good idea to analyze how other classes use CrawlerApi, and based on these observations, apply Extract Interface, too.
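In CrawlerApi, the statistics accessors (getQueueStatistics(), getQueueStatisticsByConfiguration(), getLastProcessedQueueEntries(), getCurrentCrawlingSpeed()) only delegate to the QueueRepository and share no state with the queue-building methods, so they are one natural candidate for Extract Class. Below is a minimal sketch of how such an extraction could look; the class name CrawlerQueueStatistics is hypothetical and not part of the extension, and only two of the methods are shown for brevity.

<?php

declare(strict_types=1);

namespace AOE\Crawler\Api;

use AOE\Crawler\Domain\Repository\QueueRepository;

/**
 * Hypothetical extraction target for the statistics-related methods of CrawlerApi.
 */
class CrawlerQueueStatistics
{
    /**
     * @var QueueRepository
     */
    protected $queueRepository;

    public function __construct(QueueRepository $queueRepository)
    {
        $this->queueRepository = $queueRepository;
    }

    /**
     * Same data as CrawlerApi::getQueueStatistics(), but without pulling in
     * the crawler controller or the queue-building logic.
     */
    public function getQueueStatistics(): array
    {
        return [
            'assignedButUnprocessed' => $this->queueRepository->countAllAssignedPendingItems(),
            'unprocessed' => $this->queueRepository->countAllPendingItems(),
        ];
    }

    public function getLastProcessedQueueEntries(int $limit): array
    {
        return $this->queueRepository->getLastProcessedEntries($limit);
    }
}

CrawlerApi could then keep thin delegating methods for the remainder of the deprecation period, so existing callers keep working while new code targets the extracted class (or the QueueRepository directly, as the deprecation notice suggests).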

<?php

declare(strict_types=1);

namespace AOE\Crawler\Api;

/***************************************************************
 *  Copyright notice
 *
 *  (c) 2018 AOE GmbH <[email protected]>
 *
 *  All rights reserved
 *
 *  This script is part of the TYPO3 project. The TYPO3 project is
 *  free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  The GNU General Public License can be found at
 *  http://www.gnu.org/copyleft/gpl.html.
 *
 *  This script is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  This copyright notice MUST APPEAR in all copies of the script!
 ***************************************************************/

use AOE\Crawler\Controller\CrawlerController;
use AOE\Crawler\Domain\Repository\ProcessRepository;
use AOE\Crawler\Domain\Repository\QueueRepository;
use TYPO3\CMS\Core\Database\ConnectionPool;
use TYPO3\CMS\Core\Database\Query\QueryBuilder;
use TYPO3\CMS\Core\Utility\GeneralUtility;
use TYPO3\CMS\Extbase\Object\ObjectManager;
use TYPO3\CMS\Frontend\Page\PageRepository;
/**
 * Class CrawlerApi
 * @deprecated The CrawlerApi is deprecated and will be removed in v10.x
 */
class CrawlerApi
{
    /**
     * @var QueueRepository
     */
    protected $queueRepository;

    /**
     * @var array
     */
    protected $allowedConfigurations = [];

    /**
     * @var QueryBuilder
     */
    protected $queryBuilder;

    /**
     * @var string
     */
    protected $tableName = 'tx_crawler_queue';

    /**
     * @var CrawlerController
     */
    protected $crawlerController;

    /**
     * CrawlerApi constructor.
     */
    public function __construct()
    {
        trigger_error('The ' . __CLASS__ . ' is deprecated and will be removed in v10.x; most of the functionality has moved to the respective Domain Model and Repository, and some to the CrawlerController.');
        /** @var ObjectManager $objectManager */
        $objectManager = GeneralUtility::makeInstance(ObjectManager::class);
        $this->queueRepository = $objectManager->get(QueueRepository::class);
    }
    /**
     * Each crawler run has a setid; this facade method delegates
     * it to the crawler object
     *
     * @throws \Exception
     */
    public function overwriteSetId(int $id): void
    {
        $this->findCrawler()->setID = $id;
    }

    /**
     * This method is used to limit the configuration selection to
     * a set of configurations.
     */
    public function setAllowedConfigurations(array $allowedConfigurations): void
    {
        $this->allowedConfigurations = $allowedConfigurations;
    }

    /**
     * @return array
     */
    public function getAllowedConfigurations()
    {
        return $this->allowedConfigurations;
    }

    /**
     * Returns the setID of the crawler
     *
     * @return int
     */
    public function getSetId()
    {
        return $this->findCrawler()->setID;
    }

    /**
     * Adds a page to the crawler queue by uid
     *
     * @param int $uid uid
     */
    public function addPageToQueue($uid): void
    {
        $uid = intval($uid);
        // non-timed elements will be added with timestamp 0
        $this->addPageToQueueTimed($uid, 0);
    }
    /**
     * Adds a page to the crawler queue by uid and sets a
     * timestamp when the page should be crawled.
     *
     * @param int $uid pageid
     * @param int $time timestamp
     *
     * @throws \Exception
     */
    public function addPageToQueueTimed($uid, $time): void
    {
        $uid = intval($uid);
        $time = intval($time);

        $crawler = $this->findCrawler();
        $pageData = GeneralUtility::makeInstance(PageRepository::class)->getPage($uid);
        $configurations = $crawler->getUrlsForPageRow($pageData);
        $configurations = $this->filterUnallowedConfigurations($configurations);
        $downloadUrls = [];
        $duplicateTrack = [];

        // Inspection remark from the report: the condition is_array($configurations) is always true.
        if (is_array($configurations)) {
            foreach ($configurations as $cv) {
                //enable inserting of entries
                $crawler->registerQueueEntriesInternallyOnly = false;
                $crawler->urlListFromUrlArray(
                    $cv,
                    $pageData,
                    $time,
                    300,
                    true,
                    false,
                    $duplicateTrack,
                    $downloadUrls,
                    array_keys($this->getCrawlerProcInstructions())
                );

                //reset the queue because the entries have been written to the db
                unset($crawler->queueEntries);
            }
        }
    }
    /**
     * Method to return the latest crawl timestamp for a page.
     *
     * @param int $uid uid of the page
     * @param bool $future_crawldates_only
     * @param bool $unprocessed_only
     *
     * @return int
     */
    public function getLatestCrawlTimestampForPage($uid, $future_crawldates_only = false, $unprocessed_only = false)
    {
        $uid = intval($uid);

        $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
        $query = $queryBuilder
            ->from($this->tableName)
            ->selectLiteral('max(scheduled) as latest')
            ->where(
                $queryBuilder->expr()->eq('page_id', $queryBuilder->createNamedParameter($uid))
            );

        if ($future_crawldates_only) {
            $query->andWhere(
                $queryBuilder->expr()->gt('scheduled', time())
            );
        }

        if ($unprocessed_only) {
            $query->andWhere(
                $queryBuilder->expr()->eq('exec_time', 0)
            );
        }

        $row = $query->execute()->fetch(0);
        if ($row['latest']) {
            $res = $row['latest'];
        } else {
            $res = 0;
        }

        return intval($res);
    }
    /**
     * Returns an array with timestamps when the page has been scheduled for crawling and
     * at what time the scheduled crawl has been executed. The array also contains items that are
     * scheduled but have not been crawled yet.
     *
     * @return array array with the crawl-history of a page => 0 : scheduled time , 1 : executed_time, 2 : set_id
     */
    public function getCrawlHistoryForPage(int $uid, int $limit = 0)
    {
        $uid = intval($uid);
        $limit = intval($limit);

        $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
        $statement = $queryBuilder
            ->from($this->tableName)
            ->select('scheduled', 'exec_time', 'set_id')
            ->where(
                $queryBuilder->expr()->eq('page_id', $queryBuilder->createNamedParameter($uid, \PDO::PARAM_INT))
            );
        if ($limit) {
            $statement->setMaxResults($limit);
        }

        return $statement->execute()->fetchAll();
    }
    /**
     * Get queue statistics
     */
    public function getQueueStatistics(): array
    {
        return [
            'assignedButUnprocessed' => $this->queueRepository->countAllAssignedPendingItems(),
            'unprocessed' => $this->queueRepository->countAllPendingItems(),
        ];
    }

    /**
     * Get queue statistics by configuration
     *
     * @return array array of array('configuration' => <>, 'assignedButUnprocessed' => <>, 'unprocessed' => <>)
     */
    public function getQueueStatisticsByConfiguration()
    {
        $statistics = $this->queueRepository->countPendingItemsGroupedByConfigurationKey();
        $setIds = $this->queueRepository->getSetIdWithUnprocessedEntries();
        $totals = $this->queueRepository->getTotalQueueEntriesByConfiguration($setIds);

        // "merge" arrays
        foreach ($statistics as &$value) {
            $value['total'] = $totals[$value['configuration']];
        }

        return $statistics;
    }

    /**
     * Get active processes count
     */
    public function getActiveProcessesCount(): int
    {
        $processRepository = new ProcessRepository();

        return $processRepository->countActive();
    }

    /**
     * @param int $limit
     * @return array
     */
    public function getLastProcessedQueueEntries(int $limit): array
    {
        return $this->queueRepository->getLastProcessedEntries($limit);
    }

    /**
     * Get current crawling speed
     *
     * @return int|float|bool
     */
    public function getCurrentCrawlingSpeed()
    {
        $lastProcessedEntries = $this->queueRepository->getLastProcessedEntriesTimestamps();

        if (count($lastProcessedEntries) < 10) {
            // not enough information
            return false;
        }

        $tooOldDelta = 60; // time between two entries is "too old"

        $compareValue = time();
        $startTime = $lastProcessedEntries[0];

        $pages = 0;

        reset($lastProcessedEntries);
        foreach ($lastProcessedEntries as $timestamp) {
            if ($compareValue - $timestamp > $tooOldDelta) {
                break;
            }
            $compareValue = $timestamp;
            $pages++;
        }

        if ($pages < 10) {
            // not enough information
            return false;
        }
        $oldestTimestampThatIsNotTooOld = $compareValue;
        $time = $startTime - $oldestTimestampThatIsNotTooOld;

        return $pages / ($time / 60);
    }
    /**
     * Get some performance data
     *
     * @param integer $start
     * @param integer $end
     * @param integer $resolution
     *
     * @return array data
     *
     * @throws \Exception
     */
    public function getPerformanceData($start, $end, $resolution)
    {
        $data = [];

        $data['urlcount'] = 0;
        $data['start'] = $start;
        $data['end'] = $end;
        $data['duration'] = $data['end'] - $data['start'];

        if ($data['duration'] < 1) {
            throw new \Exception('End timestamp must be after start timestamp', 1512659945);
        }

        for ($slotStart = $start; $slotStart < $end; $slotStart += $resolution) {
            $slotEnd = min($slotStart + $resolution - 1, $end);
            $slotData = $this->queueRepository->getPerformanceData($slotStart, $slotEnd);

            $slotUrlCount = 0;
            foreach ($slotData as &$processData) {
                $duration = $processData['end'] - $processData['start'];
                if ($processData['urlcount'] > 5 && $duration > 0) {
                    $processData['speed'] = 60 * 1 / ($duration / $processData['urlcount']);
                }
                $slotUrlCount += $processData['urlcount'];
            }

            $data['urlcount'] += $slotUrlCount;

            $data['slots'][$slotEnd] = [
                'amountProcesses' => count($slotData),
                'urlcount' => $slotUrlCount,
                'processes' => $slotData,
            ];

            if ($slotUrlCount > 5) {
                // parentheses ensure the slot duration is divided by the url count (urls per minute)
                $data['slots'][$slotEnd]['speed'] = 60 * 1 / (($slotEnd - $slotStart) / $slotUrlCount);
            } else {
                $data['slots'][$slotEnd]['speed'] = 0;
            }
        }

        if ($data['urlcount'] > 5) {
            $data['speed'] = 60 * 1 / ($data['duration'] / $data['urlcount']);
        } else {
            $data['speed'] = 0;
        }

        return $data;
    }
    /**
     * Method to get an instance of the internal crawler singleton
     *
     * @return CrawlerController Instance of the crawler lib
     *
     * @throws \Exception
     */
    protected function findCrawler()
    {
        if (! is_object($this->crawlerController)) {
            $this->crawlerController = GeneralUtility::makeInstance(CrawlerController::class);
            $this->crawlerController->setID = GeneralUtility::md5int(microtime());
        }

        if (is_object($this->crawlerController)) {
            return $this->crawlerController;
        }
        throw new \Exception('no crawler object', 1512659759);
    }

    /**
     * This method is used to limit the processing instructions to the processing instructions
     * that are allowed.
     */
    protected function filterUnallowedConfigurations(array $configurations): array
    {
        if (count($this->allowedConfigurations) > 0) {
            // remove configurations that do not match the current selection
            foreach ($configurations as $confKey => $confArray) {
                if (! in_array($confKey, $this->allowedConfigurations, true)) {
                    unset($configurations[$confKey]);
                }
            }
        }

        return $configurations;
    }

    /**
     * Counts all entries in the database which are scheduled for a given page id and a schedule timestamp.
     *
     * @param int $page_uid
     * @param int $schedule_timestamp
     *
     * @return int
     */
    protected function countEntriesInQueueForPageByScheduleTime($page_uid, $schedule_timestamp)
    {
        $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable($this->tableName);
        $count = $queryBuilder
            ->count('*')
            ->from($this->tableName);

        // Is the same page scheduled for the same time and has not been executed?
        // Un-timed elements need an exec_time of 0 because they can occur multiple times.
        if ($schedule_timestamp === 0) {
            $count->where(
                $queryBuilder->expr()->eq('page_id', $page_uid),
                $queryBuilder->expr()->eq('exec_time', 0),
                $queryBuilder->expr()->eq('scheduled', $schedule_timestamp)
            );
        } else {
            // Timed elements have a fixed schedule time; if a record with this time
            // exists it is either queued for the future, or it has been queued in the past
            // and therefore also been processed.
            $count->where(
                $queryBuilder->expr()->eq('page_id', $page_uid),
                $queryBuilder->expr()->eq('scheduled', $schedule_timestamp)
            );
        }

        return $count->execute()->rowCount();
    }

    /**
     * Reads the registered processingInstructions of the crawler
     */
    private function getCrawlerProcInstructions(): array
    {
        $crawlerProcInstructions = [];
        if (! empty($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['procInstructions'])) {
            foreach ($GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['procInstructions'] as $configuration) {
                $crawlerProcInstructions[$configuration['key']] = $configuration['value'];
            }
        }

        return $crawlerProcInstructions;
    }
}
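Because the whole class is deprecated, callers eventually have to bypass the facade. Below is a minimal migration sketch for the statistics call, assuming the QueueRepository methods used above remain available; the v10.x target comes from the deprecation notice, and the rest is illustrative.

<?php

use AOE\Crawler\Api\CrawlerApi;
use AOE\Crawler\Domain\Repository\QueueRepository;
use TYPO3\CMS\Core\Utility\GeneralUtility;
use TYPO3\CMS\Extbase\Object\ObjectManager;

// Before: going through the deprecated facade (triggers the E_USER_NOTICE from the constructor).
$statistics = GeneralUtility::makeInstance(CrawlerApi::class)->getQueueStatistics();

// After (sketch): fetch the same numbers from the QueueRepository directly,
// mirroring what getQueueStatistics() does internally.
$queueRepository = GeneralUtility::makeInstance(ObjectManager::class)->get(QueueRepository::class);
$statistics = [
    'assignedButUnprocessed' => $queueRepository->countAllAssignedPendingItems(),
    'unprocessed' => $queueRepository->countAllPendingItems(),
];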