Passed
Push — articleinfo-bot-percentages ( 0613da )
by MusikAnimal
09:12
created

ArticleInfoApi   B

Complexity

Total Complexity 46

Size/Duplication

Total Lines 456
Duplicated Lines 0 %

Importance

Changes 1
Bugs 0 Features 0
Metric Value
eloc 158
c 1
b 0
f 0
dl 0
loc 456
rs 8.72
wmc 46

24 Methods

Rating   Name   Duplication   Size   Complexity  
A __construct() 0 6 1
A getNumRevisions() 0 6 2
B getArticleInfoApiData() 0 60 6
A getProseStats() 0 30 3
A linksInCount() 0 3 1
A getMaxRevisions() 0 6 2
A getLinksAndRedirects() 0 6 2
A getTransclusionData() 0 7 2
A getBugs() 0 6 2
A tooManyRevisions() 0 3 2
A countCharsAndWords() 0 13 1
A linksExtCount() 0 3 1
A getAssessments() 0 9 2
A getTopEditorsByEditCount() 0 36 3
A getNumFiles() 0 3 1
A numBugs() 0 3 1
A getBots() 0 26 4
A getBotRevisionCount() 0 18 4
A getBasicEditingInfo() 0 3 1
A getNumCategories() 0 3 1
A getNumTemplates() 0 3 1
A linksOutCount() 0 3 1
A redirectsCount() 0 3 1
A getNumBots() 0 3 1

How to fix   Complexity   

Complex Class

Complex classes like ArticleInfoApi often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.
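As a rough illustration of the prefix heuristic, the sketch below groups a few of the method names from the table above by their leading lowercase word. `groupByPrefix()` is a hypothetical helper written for this example; it is not part of the analyzed code or of any analysis tool.

```php
<?php
declare(strict_types = 1);

/**
 * Group method names by their leading lowercase word (e.g. "links", "get").
 * Hypothetical helper, for illustration only.
 *
 * @param string[] $methods
 * @return string[][] Method names keyed by prefix.
 */
function groupByPrefix(array $methods): array
{
    $groups = [];
    foreach ($methods as $method) {
        // The prefix is the run of lowercase letters before the first capital.
        if (1 === preg_match('/^[a-z]+/', $method, $matches)) {
            $groups[$matches[0]][] = $method;
        }
    }
    return $groups;
}

// A subset of the ArticleInfoApi method names listed in the table.
$groups = groupByPrefix([
    'linksInCount', 'linksOutCount', 'linksExtCount',
    'redirectsCount', 'getBots', 'getNumBots', 'getBotRevisionCount',
]);

print_r($groups['links']);
```

The three `links*` methods end up in one group, which hints at a link-statistics component. The `get` group is too generic to mean much on its own, so the heuristic is only a starting point.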

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
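A minimal sketch of what Extract Class could look like for the link statistics, assuming the four link and redirect counters form the cohesive component. `LinkStats` and its loader callable are hypothetical names; the callable stands in for `Page::countLinksAndRedirects()`, and the numbers are made up.

```php
<?php
declare(strict_types = 1);

/**
 * Hypothetical extracted class that owns only the link and redirect counts.
 * The loader callable stands in for Page::countLinksAndRedirects().
 */
final class LinkStats
{
    /** @var callable Returns the int[] of counts when invoked. */
    private $loader;

    /** @var int[]|null Cached counts, null until first use. */
    private $counts;

    public function __construct(callable $loader)
    {
        $this->loader = $loader;
    }

    /** Load the counts once and reuse them afterwards. */
    private function counts(): array
    {
        if (null === $this->counts) {
            $this->counts = ($this->loader)();
        }
        return $this->counts;
    }

    public function linksExtCount(): int
    {
        return $this->counts()['links_ext_count'];
    }

    public function linksInCount(): int
    {
        return $this->counts()['links_in_count'];
    }

    public function linksOutCount(): int
    {
        return $this->counts()['links_out_count'];
    }

    public function redirectsCount(): int
    {
        return $this->counts()['redirects_count'];
    }
}

// Example values are invented for the demonstration.
$calls = 0;
$stats = new LinkStats(function () use (&$calls): array {
    $calls++;
    return [
        'links_ext_count' => 4,
        'links_in_count' => 12,
        'links_out_count' => 7,
        'redirects_count' => 2,
    ];
});

echo $stats->linksInCount() + $stats->redirectsCount(), "\n";
```

With this split, the original class would hold a `LinkStats` instance and delegate to it, so the caching logic for the counts lives in one place.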

While breaking up the class, it is a good idea to analyze how other classes use ArticleInfoApi, and based on these observations, apply Extract Interface, too.
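Extract Interface can be sketched in the same spirit. The example assumes that some callers only need the revision-limit methods; `RevisionLimits`, `FixedRevisionLimits` and `revisionsToProcess()` are hypothetical names introduced for illustration.

```php
<?php
declare(strict_types = 1);

/**
 * Hypothetical interface covering only the revision-limit methods,
 * assuming callers need nothing else from ArticleInfoApi.
 */
interface RevisionLimits
{
    public function getNumRevisions(): int;

    public function getMaxRevisions(): int;

    public function tooManyRevisions(): bool;
}

/** Minimal stand-in implementation with fixed numbers, for the example. */
final class FixedRevisionLimits implements RevisionLimits
{
    /** @var int */
    private $numRevisions;

    /** @var int */
    private $maxRevisions;

    public function __construct(int $numRevisions, int $maxRevisions)
    {
        $this->numRevisions = $numRevisions;
        $this->maxRevisions = $maxRevisions;
    }

    public function getNumRevisions(): int
    {
        return $this->numRevisions;
    }

    public function getMaxRevisions(): int
    {
        return $this->maxRevisions;
    }

    public function tooManyRevisions(): bool
    {
        // Same rule as in the listing: a non-positive maximum disables the check.
        return $this->maxRevisions > 0 && $this->numRevisions > $this->maxRevisions;
    }
}

/** A caller that depends on the interface rather than on the full class. */
function revisionsToProcess(RevisionLimits $limits): int
{
    return $limits->tooManyRevisions()
        ? $limits->getMaxRevisions()
        : $limits->getNumRevisions();
}

echo revisionsToProcess(new FixedRevisionLimits(60000, 50000)), "\n";
```

Callers typed against the interface can be tested with a small stand-in like `FixedRevisionLimits`, without a database or a DI container.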

<?php
declare(strict_types = 1);

namespace AppBundle\Model;

use AppBundle\Repository\ArticleInfoRepository;
use DateTime;
use Doctrine\DBAL\Statement;
use Symfony\Component\DependencyInjection\ContainerInterface;
use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\HttpKernel\Exception\HttpException;
use Symfony\Component\HttpKernel\Exception\ServiceUnavailableHttpException;

/**
 * An ArticleInfoApi is standalone logic for the Article Info tool. These methods perform SQL queries
 * or make API requests and can be called directly, without any knowledge of the child ArticleInfo class.
 * It does require that the ArticleInfoRepository be set, however.
 * @see ArticleInfo
 */
class ArticleInfoApi extends Model
{
    /** @var ContainerInterface The application's DI container. */
    protected $container;

    /** @var int Number of revisions that belong to the page. */
    protected $numRevisions;

    /** @var mixed[] Prose stats, with keys 'characters', 'words', 'references', 'unique_references', 'sections'. */
    protected $proseStats;

    /** @var array Number of categories, templates and files on the page. */
    protected $transclusionData;

    /** @var mixed[] Various statistics about bots that edited the page. */
    protected $bots;

    /** @var int Number of edits made to the page by bots. */
    protected $botRevisionCount;

    /** @var int[] Number of in and outgoing links and redirects to the page. */
    protected $linksAndRedirects;

    /** @var string[] Assessments of the page (see Page::getAssessments). */
    protected $assessments;

    /** @var string[] List of Wikidata and CheckWiki errors. */
    protected $bugs;

    /**
     * ArticleInfoApi constructor.
     * @param Page $page The page to process.
     * @param ContainerInterface $container The DI container.
     * @param false|int $start Start date as Unix timestamp.
     * @param false|int $end End date as Unix timestamp.
     */
    public function __construct(Page $page, ContainerInterface $container, $start = false, $end = false)
    {
        $this->page = $page;
        $this->container = $container;
        $this->start = $start;
        $this->end = $end;
    }

    /**
     * Get the number of revisions belonging to the page.
     * @return int
     */
    public function getNumRevisions(): int
    {
        if (!isset($this->numRevisions)) {
            $this->numRevisions = $this->page->getNumRevisions(null, $this->start, $this->end);
        }
        return $this->numRevisions;
    }

    /**
     * Are there more revisions than we should process, based on the config?
     * @return bool
     */
    public function tooManyRevisions(): bool
    {
        return $this->getMaxRevisions() > 0 && $this->getNumRevisions() > $this->getMaxRevisions();
    }

    /**
     * Get the maximum number of revisions that we should process.
     * @return int
     */
    public function getMaxRevisions(): int
    {
        if (!isset($this->maxRevisions)) {
            $this->maxRevisions = (int) $this->container->getParameter('app.max_page_revisions');
Issue (Bug, Best Practice): The property maxRevisions does not exist. Although not strictly required by PHP, it is generally a best practice to declare properties explicitly.
        }
        return $this->maxRevisions;
    }

    /**
     * Get various basic info used in the API, including the number of revisions, unique authors, initial author
     * and edit count of the initial author. This is combined into one query for better performance. Caching is
     * intentionally disabled, because using the gadget, this will get hit for a different page constantly, where
     * the likelihood of cache benefiting us is slim.
     * @return string[]|false false if the page was not found.
     */
    public function getBasicEditingInfo()
    {
        return $this->getRepository()->getBasicEditingInfo($this->page);
Issue (Bug): The method getBasicEditingInfo() does not exist on AppBundle\Repository\Repository. It seems like you code against a sub-type of AppBundle\Repository\Repository such as AppBundle\Repository\ArticleInfoRepository. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

        return $this->getRepository()->/** @scrutinizer ignore-call */ getBasicEditingInfo($this->page);
    }

    /**
     * Get the top editors to the page by edit count.
     * @param int $limit Default 20, maximum 1,000.
     * @param bool $noBots Set to non-false to exclude bots from the result.
     * @return array
     */
    public function getTopEditorsByEditCount(int $limit = 20, bool $noBots = false): array
    {
        // Quick cache, valid only for the same request.
        static $topEditors = null;
        if (null !== $topEditors) {
            return $topEditors;
        }

        $rows = $this->getRepository()->getTopEditorsByEditCount(
Issue (Bug): The method getTopEditorsByEditCount() does not exist on AppBundle\Repository\Repository. It seems like you code against a sub-type of AppBundle\Repository\Repository such as AppBundle\Repository\ArticleInfoRepository. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

        $rows = $this->getRepository()->/** @scrutinizer ignore-call */ getTopEditorsByEditCount(
            $this->page,
            $this->start,
            $this->end,
            min($limit, 1000),
            $noBots
        );

        $topEditors = [];
        $rank = 0;
        foreach ($rows as $row) {
            $topEditors[] = [
                'rank' => ++$rank,
                'username' => $row['username'],
                'count' => $row['count'],
                'minor' => $row['minor'],
                'first_edit' => [
                    'id' => $row['first_revid'],
                    'timestamp' => $row['first_timestamp'],
                ],
                'latest_edit' => [
                    'id' => $row['latest_revid'],
                    'timestamp' => $row['latest_timestamp'],
                ],
            ];
        }

        return $topEditors;
    }

    /**
     * Get prose and reference information.
     * @return array With keys 'characters', 'words', 'references', 'unique_references', 'sections'
     */
    public function getProseStats(): array
    {
        if (isset($this->proseStats)) {
            return $this->proseStats;
        }

        $datetime = is_int($this->end) ? new DateTime("@{$this->end}") : null;
        $html = $this->page->getHTMLContent($datetime);

        $crawler = new Crawler($html);

        [$chars, $words] = $this->countCharsAndWords($crawler, '#mw-content-text p');

        $refs = $crawler->filter('#mw-content-text .reference');
        $refContent = [];
        $refs->each(function ($ref) use (&$refContent): void {
            $refContent[] = $ref->text();
        });
        $uniqueRefs = count(array_unique($refContent));

        $sections = count($crawler->filter('#mw-content-text .mw-headline'));

        $this->proseStats = [
            'characters' => $chars,
            'words' => $words,
            'references' => $refs->count(),
            'unique_references' => $uniqueRefs,
            'sections' => $sections,
        ];
        return $this->proseStats;
    }

    /**
     * Count the number of characters and words of the plain text within the DOM element matched by the given selector.
     * @param Crawler $crawler
     * @param string $selector HTML selector.
     * @return array [num chars, num words]
     */
    private function countCharsAndWords(Crawler $crawler, string $selector): array
    {
        $totalChars = 0;
        $totalWords = 0;
        $paragraphs = $crawler->filter($selector);
        $paragraphs->each(function ($node) use (&$totalChars, &$totalWords): void {
            /** @var Crawler $node */
            $text = preg_replace('/\[\d+]/', '', trim($node->text(null, true)));
            $totalChars += strlen($text);
            $totalWords += count(explode(' ', $text));
        });

        return [$totalChars, $totalWords];
    }

    /**
     * Get the page assessments of the page.
     * @see https://www.mediawiki.org/wiki/Extension:PageAssessments
     * @return string[]|false False if unsupported.
     * @codeCoverageIgnore
     */
    public function getAssessments()
    {
        if (!is_array($this->assessments)) {
Issue: The condition is_array($this->assessments) is always true.
            $this->assessments = $this->page
                ->getProject()
                ->getPageAssessments()
                ->getAssessments($this->page);
        }
        return $this->assessments;
    }

    /**
     * Get the list of the page's Wikidata and CheckWiki errors.
     * @see Page::getErrors()
     * @return string[]
     */
    public function getBugs(): array
    {
        if (!is_array($this->bugs)) {
Issue: The condition is_array($this->bugs) is always true.
            $this->bugs = $this->page->getErrors();
        }
        return $this->bugs;
    }

    /**
     * Get the number of Wikidata and CheckWiki errors.
     * @return int
     */
    public function numBugs(): int
    {
        return count($this->getBugs());
    }

    /**
     * Generate the data structure that will be used in the ArticleInfo API response.
     * @param Project $project
     * @param Page $page
     * @return array
     * @codeCoverageIgnore
     */
    public function getArticleInfoApiData(Project $project, Page $page): array
    {
        /** @var int $pageviewsOffset Number of days to query for pageviews */
        $pageviewsOffset = 30;

        $data = [
            'project' => $project->getDomain(),
            'page' => $page->getTitle(),
            'watchers' => (int) $page->getWatchers(),
            'pageviews' => $page->getLastPageviews($pageviewsOffset),
            'pageviews_offset' => $pageviewsOffset,
        ];

        $info = false;

        try {
            $articleInfoRepo = new ArticleInfoRepository();
            $articleInfoRepo->setContainer($this->container);
            $info = $articleInfoRepo->getBasicEditingInfo($page);
        } catch (ServiceUnavailableHttpException $e) {
            // No more open database connections.
            $data['error'] = 'Unable to fetch revision data. Please try again later.';
        } catch (HttpException $e) {
            /**
             * The query most likely exceeded the maximum query time,
             * so we'll abort and give only info retrieved by the API.
             */
            $data['error'] = 'Unable to fetch revision data. The query may have timed out.';
        }

        if (false !== $info) {
            $creationDateTime = DateTime::createFromFormat('YmdHis', $info['created_at']);
            $modifiedDateTime = DateTime::createFromFormat('YmdHis', $info['modified_at']);
            $secsSinceLastEdit = (new DateTime)->getTimestamp() - $modifiedDateTime->getTimestamp();

            // Some wikis (such as foundation.wikimedia.org) may be missing the creation date.
            $creationDateTime = false === $creationDateTime
                ? null
                : $creationDateTime->format('Y-m-d');

            $assessment = $page->getProject()
                ->getPageAssessments()
                ->getAssessment($page);

            $data = array_merge($data, [
                'revisions' => (int) $info['num_edits'],
                'editors' => (int) $info['num_editors'],
                'minor_edits' => (int) $info['minor_edits'],
                'author' => $info['author'],
                'author_editcount' => null === $info['author_editcount'] ? null : (int) $info['author_editcount'],
Issue: The condition null === $info['author_editcount'] is always false.
                'created_at' => $creationDateTime,
                'created_rev_id' => $info['created_rev_id'],
                'modified_at' => $modifiedDateTime->format('Y-m-d H:i'),
                'secs_since_last_edit' => $secsSinceLastEdit,
                'last_edit_id' => (int) $info['modified_rev_id'],
                'assessment' => $assessment,
            ]);
        }

        return $data;
    }

    /************************ Link statistics ************************/

    /**
     * Get the number of external links on the page.
     * @return int
     */
    public function linksExtCount(): int
    {
        return $this->getLinksAndRedirects()['links_ext_count'];
    }

    /**
     * Get the number of incoming links to the page.
     * @return int
     */
    public function linksInCount(): int
    {
        return $this->getLinksAndRedirects()['links_in_count'];
    }

    /**
     * Get the number of outgoing links from the page.
     * @return int
     */
    public function linksOutCount(): int
    {
        return $this->getLinksAndRedirects()['links_out_count'];
    }

    /**
     * Get the number of redirects to the page.
     * @return int
     */
    public function redirectsCount(): int
    {
        return $this->getLinksAndRedirects()['redirects_count'];
    }

    /**
     * Get the number of external, incoming and outgoing links, along with the number of redirects to the page.
     * @return int[]
     * @codeCoverageIgnore
     */
    private function getLinksAndRedirects(): array
    {
        if (!is_array($this->linksAndRedirects)) {
Issue: The condition is_array($this->linksAndRedirects) is always true.
            $this->linksAndRedirects = $this->page->countLinksAndRedirects();
        }
        return $this->linksAndRedirects;
    }

    /**
     * Fetch transclusion data (categories, templates and files) that are on the page.
     * @return array With keys 'categories', 'templates' and 'files'.
     */
    public function getTransclusionData(): array
    {
        if (!is_array($this->transclusionData)) {
Issue: The condition is_array($this->transclusionData) is always true.
            $this->transclusionData = $this->getRepository()
                ->getTransclusionData($this->page);
        }
        return $this->transclusionData;
    }

    /**
     * Get the number of categories that are on the page.
     * @return int
     */
    public function getNumCategories(): int
    {
        return $this->getTransclusionData()['categories'];
    }

    /**
     * Get the number of templates that are on the page.
     * @return int
     */
    public function getNumTemplates(): int
    {
        return $this->getTransclusionData()['templates'];
    }

    /**
     * Get the number of files that are on the page.
     * @return int
     */
    public function getNumFiles(): int
    {
        return $this->getTransclusionData()['files'];
    }

    /************************ Bot statistics ************************/

    /**
     * Number of edits made to the page by current or former bots.
     * @param string[] $bots Used only in unit tests, where we supply mock data for the bots that will get processed.
     * @return int
     */
    public function getBotRevisionCount(?array $bots = null): int
    {
        if (isset($this->botRevisionCount)) {
            return $this->botRevisionCount;
        }

        if (null === $bots) {
            $bots = $this->getBots();
        }

        $count = 0;

        foreach (array_values($bots) as $data) {
            $count += $data['count'];
        }

        $this->botRevisionCount = $count;
        return $count;
    }

    /**
     * Get and set $this->bots, the statistics about bots that edited the page. This is done separately from the
     * main query because we use this information when computing the top 10 editors in ArticleInfo, where we
     * don't want to include bots.
     * @return mixed[]
     */
    public function getBots(): array
    {
        if (isset($this->bots)) {
            return $this->bots;
        }

        // Parse the bot edits.
        $this->bots = [];

        $limit = $this->tooManyRevisions() ? $this->getMaxRevisions() : null;

        /** @var Statement $botData */
        $botData = $this->getRepository()->getBotData($this->page, $this->start, $this->end, $limit);
Issue (Bug): The method getBotData() does not exist on AppBundle\Repository\Repository. It seems like you code against a sub-type of AppBundle\Repository\Repository such as AppBundle\Repository\ArticleInfoRepository. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

        $botData = $this->getRepository()->/** @scrutinizer ignore-call */ getBotData($this->page, $this->start, $this->end, $limit);
        while ($bot = $botData->fetch()) {
            $this->bots[$bot['username']] = [
                'count' => (int)$bot['count'],
                'current' => '1' === $bot['current'],
            ];
        }

        // Sort by edit count.
        uasort($this->bots, function ($a, $b) {
            return $b['count'] - $a['count'];
        });

        return $this->bots;
    }

    /**
     * Get the number of bots that edited the page.
     * @return int
     */
    public function getNumBots(): int
    {
        return count($this->getBots());
    }
}
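Two kinds of issues recur in the listing above: the undeclared `maxRevisions` property, and `is_array()` guards that the analysis considers always true (presumably because the properties are documented as plain arrays rather than nullable ones; that reading is an assumption). A minimal sketch of one way to address both, using a hypothetical `LazyValues` class instead of the real `ArticleInfoApi`: declare every lazily loaded property, document it as nullable, and compare against `null`.

```php
<?php
declare(strict_types = 1);

/** Hypothetical class for illustration; names and values are not from the analyzed code. */
final class LazyValues
{
    /** @var int|null Maximum revisions to process, null until first read. */
    private $maxRevisions;

    /** @var string[]|null List of errors, null until first read. */
    private $bugs;

    /** @var int Configured limit, standing in for the container parameter. */
    private $configuredMax;

    public function __construct(int $configuredMax)
    {
        $this->configuredMax = $configuredMax;
    }

    public function getMaxRevisions(): int
    {
        // The property is declared above, so nothing is created dynamically here.
        if (null === $this->maxRevisions) {
            $this->maxRevisions = $this->configuredMax;
        }
        return $this->maxRevisions;
    }

    public function getBugs(): array
    {
        // An explicit null check matches the nullable docblock type.
        if (null === $this->bugs) {
            // Stand-in for Page::getErrors().
            $this->bugs = [];
        }
        return $this->bugs;
    }
}

$values = new LazyValues(50000);
echo $values->getMaxRevisions(), "\n";
```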