Completed
Push — master ( 248067...8de5c0 )
by Andrew
02:29
created

ImageExtractor   D

Complexity

Total Complexity 84

Size/Duplication

Total Lines 616
Duplicated Lines 3.57 %

Coupling/Cohesion

Components 1
Dependencies 10

Importance

Changes 3
Bugs 0 Features 0
Metric Value
wmc 84
c 3
b 0
f 0
lcom 1
cbo 10
dl 22
loc 616
rs 4.8078

24 Methods

Rating   Name   Duplication   Size   Complexity  
A run() 0 12 4
B getBestImage() 0 23 5
A checkForMetaTag() 0 21 4
B isWorthyImage() 0 11 8
B isBannerDimensions() 12 21 6
A filterBadNames() 0 13 3
A checkForLinkTag() 0 3 1
A checkForOpenGraphTag() 0 3 1
A checkForTwitterTag() 0 3 1
A ensureMinimumImageSize() 0 8 3
A getLocallyStoredImage() 0 5 1
A getLocallyStoredImages() 0 3 1
A getCleanDomain() 0 3 1
B scoreLocalImages() 0 27 3
B checkForLargeImages() 0 30 4
B getDepthLevel() 0 28 4
B getAllImages() 0 26 2
A isOkImageFileName() 0 15 3
A getImageCandidates() 0 7 1
B findImagesThatPassByteSizeTest() 0 29 4
B checkForTag() 5 33 5
C checkForKnownElements() 5 56 10
B buildImagePath() 0 23 6
A customSiteMapping() 0 15 3


Duplicated Code

Duplicate code is one of the most pungent code smells. A common rule is to restructure code once it is duplicated in three or more places.

Common duplication problems, and corresponding solutions are:
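
For the block flagged in checkForTag() and checkForKnownElements() in the listing further down, one possible fix is Extract Method. The sketch below is an illustration only: StoredImageStub, ImageStub and copyMeasurements() are hypothetical names that stand in for LocallyStoredImage, Image and a new helper, none of which exist under these names in the project.

```php
<?php
// Hypothetical stand-in for Goose\Images\LocallyStoredImage (read side only).
final class StoredImageStub {
    private $bytes;
    private $height;
    private $width;

    public function __construct(int $bytes, int $height, int $width) {
        $this->bytes = $bytes;
        $this->height = $height;
        $this->width = $width;
    }

    public function getBytes(): int { return $this->bytes; }
    public function getHeight(): int { return $this->height; }
    public function getWidth(): int { return $this->width; }
}

// Hypothetical stand-in for Goose\Images\Image (measurement fields only).
final class ImageStub {
    public $bytes = 0;
    public $height = 0;
    public $width = 0;
}

// The extracted helper: the single place that copies measurements across.
// A null source leaves the target untouched, mirroring the !empty() guard.
function copyMeasurements(ImageStub $target, ?StoredImageStub $source): ImageStub {
    if ($source !== null) {
        $target->bytes = $source->getBytes();
        $target->height = $source->getHeight();
        $target->width = $source->getWidth();
    }

    return $target;
}

$image = copyMeasurements(new ImageStub(), new StoredImageStub(2048, 300, 400));
$untouched = copyMeasurements(new ImageStub(), null);
```

Inside ImageExtractor the helper would more naturally be a private method, called from both flagged places instead of repeating the three setter calls.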

Complex Class

Tip: Before tackling complexity, eliminate any duplication first. This can often reduce the size of classes significantly.

Complex classes like ImageExtractor often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common way to find such a component is to look for fields and methods that share the same prefixes or suffixes. You can also look at the cohesion graph to spot any unconnected or weakly connected components.
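
The prefix heuristic can be tried mechanically. The following is a rough sketch, not part of the analysis tool: groupMethodsByPrefix() is a hypothetical helper that buckets method names by their leading lowercase word and keeps only the prefixes shared by at least two methods. The method names are a subset taken from the table above.

```php
<?php
// Group method names by their leading lowercase word,
// e.g. "checkForTag" and "checkForLinkTag" both fall under "check".
function groupMethodsByPrefix(array $methodNames): array {
    $groups = [];

    foreach ($methodNames as $name) {
        // Leading run of lowercase letters; fall back to the whole name.
        $prefix = preg_match('/^[a-z]+/', $name, $m) ? $m[0] : $name;
        $groups[$prefix][] = $name;
    }

    // Keep only prefixes shared by at least two methods.
    return array_filter($groups, function (array $names) {
        return count($names) > 1;
    });
}

$methods = [
    'run', 'getBestImage', 'checkForMetaTag', 'checkForLinkTag',
    'checkForOpenGraphTag', 'checkForTwitterTag', 'checkForTag',
    'isWorthyImage', 'isBannerDimensions',
];

$groups = groupMethodsByPrefix($methods);
```

With this input the "check" group collects the five tag-lookup methods, which hints that they may form one cohesive component. A shared prefix is only a hint; the call graph still has to confirm it.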

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
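
As a sketch of what Extract Class could look like here, the meta tag priority rule (Twitter, then Open Graph, then link tag) can live in its own small class. Everything below is an assumption made for illustration: MetaTagImageFinder is a hypothetical name, the DOM lookup is injected as a plain callable so the example runs without Goose or DOMWrap, and the Twitter selector is shortened compared with the listing.

```php
<?php
// Hypothetical extracted class: owns only the "which meta tag wins" rule.
final class MetaTagImageFinder {
    // Selector/attribute pairs in priority order (Twitter selector shortened).
    private const SOURCES = [
        'twitter'   => ['meta[property="twitter:image"],meta[name="twitter:image"]', 'content'],
        'opengraph' => ['meta[property="og:image"],meta[name="og:image"]', 'content'],
        'linktag'   => ['link[rel="image_src"]', 'href'],
    ];

    /** @var callable Takes (selector, attribute) and returns a string or null. */
    private $lookup;

    public function __construct(callable $lookup) {
        $this->lookup = $lookup;
    }

    /** @return array|null ['type' => ..., 'src' => ...] for the first hit, else null. */
    public function find(): ?array {
        foreach (self::SOURCES as $type => [$selector, $attr]) {
            $src = ($this->lookup)($selector, $attr);

            if (is_string($src) && $src !== '') {
                return ['type' => $type, 'src' => $src];
            }
        }

        return null;
    }
}

// Usage: pretend the page only carries an Open Graph tag.
$finder = new MetaTagImageFinder(function (string $selector, string $attr): ?string {
    return strpos($selector, 'og:image') !== false ? 'http://example.com/og.jpg' : null;
});

$found = $finder->find();
```

Injecting the lookup keeps the extracted class free of the article and config dependencies, which is one way to lower the coupling count reported above.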

While breaking up the class, it is a good idea to analyze how other classes use ImageExtractor, and based on these observations, apply Extract Interface, too.
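
A minimal sketch of Extract Interface, under the assumption that callers only need "give me the top image for this markup". TopImageFinder, FirstImgTagFinder and describeTopImage() are hypothetical names, and the regex-based implementation is deliberately naive; it only exists so the example runs on its own.

```php
<?php
// Hypothetical interface: what a caller needs from an image extractor.
interface TopImageFinder {
    /** @return string|null URL of the chosen image, or null if none is found. */
    public function findTopImage(string $html): ?string;
}

// Naive implementation for illustration: the first <img src="..."> in the markup.
final class FirstImgTagFinder implements TopImageFinder {
    public function findTopImage(string $html): ?string {
        return preg_match('/<img[^>]+src="([^"]+)"/i', $html, $m) ? $m[1] : null;
    }
}

// A caller that depends on the interface only, not on a concrete extractor.
function describeTopImage(TopImageFinder $finder, string $html): string {
    $src = $finder->findTopImage($html);

    return $src === null ? 'no image' : 'top image: ' . $src;
}

$withImage = describeTopImage(new FirstImgTagFinder(), '<p><img alt="x" src="/a.png"></p>');
$withoutImage = describeTopImage(new FirstImgTagFinder(), '<p>none</p>');
```

Once callers type-hint the interface, the large class can be split or replaced behind it without touching them.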

<?php

namespace Goose\Modules\Extractors;

use Goose\Article;
use Goose\Images\Image;
use Goose\Images\ImageUtils;
use Goose\Images\LocallyStoredImage;
use Goose\Traits\ArticleMutatorTrait;
use Goose\Modules\AbstractModule;
use Goose\Modules\ModuleInterface;
use DOMWrap\Element;
use DOMWrap\NodeList;

/**
 * Image Extractor
 *
 * @package Goose\Modules\Extractors
 * @license http://www.apache.org/licenses/LICENSE-2.0 Apache License 2.0
 */
class ImageExtractor extends AbstractModule implements ModuleInterface {
    use ArticleMutatorTrait;

    /** @var string[] */
    private $badFileNames = [
        '\.html', '\.gif', '\.ico', 'button', 'twitter\.jpg', 'facebook\.jpg',
        'ap_buy_photo', 'digg\.jpg', 'digg\.png', 'delicious\.png',
        'facebook\.png', 'reddit\.jpg', 'doubleclick', 'diggthis',
        'diggThis', 'adserver', '\/ads\/', 'ec\.atdmt\.com', 'mediaplex\.com',
        'adsatt', 'view\.atdmt',
    ];

    /** @var string[] */
    private static $KNOWN_IMG_DOM_NAMES = [
        'yn-story-related-media',
        'cnn_strylccimg300cntr',
        'big_photo',
        'ap-smallphoto-a'
    ];

    /** @var int */
    private static $MAX_PARENT_DEPTH = 2;

    /** @var string[] */
    private static $CUSTOM_SITE_MAPPING = [];

    public function run(Article $article) {
        $this->article($article);

        if ($this->config()->get('image_fetch_best')) {
            $article->setTopImage($this->getBestImage());

            if ($this->config()->get('image_fetch_all')
              && $article->getTopNode() instanceof Element) {
                $article->setAllImages($this->getAllImages());
            }
        }
    }

    /**
     * @return Image|null
     */
    private function getBestImage() {
        $image = $this->checkForKnownElements();

        if ($image) {
            return $image;
        }

        $image = $this->checkForMetaTag();

        if ($image) {
            return $image;
        }

        if ($this->article()->getTopNode() instanceof Element) {
            $image = $this->checkForLargeImages($this->article()->getTopNode(), 0, 0);

            if ($image) {
                return $image;
            }
        }

        return null;
    }

    /**
     * Prefer Twitter images (as they tend to have the right size for us), then Open Graph images
     * (which seem to be smaller), and finally linked images.
     *
     * @return Image|null
     */
    private function checkForMetaTag() {
        $image = $this->checkForTwitterTag();

        if ($image) {
            return $image;
        }

        $image = $this->checkForOpenGraphTag();

        if ($image) {
            return $image;
        }

        $image = $this->checkForLinkTag();

        if ($image) {
            return $image;
        }

        return null;
    }

    /**
     * Although slow, the best way to determine the best image is to download the candidates and check
     * their actual dimensions on disk, so we go through a phased approach:
     * 1. Get a list of ALL images from the parent node.
     * 2. Filter out any bad image names that we know of (gifs, ads, etc.).
     * 3. Do a HEAD request on each file to make sure it meets our bare requirements.
     * 4. Do a full GET request for any images left over, download them to disk and check their dimensions.
     * 5. Score images based on different factors like height/width and possibly things like color density.
     *
     * @param Element $node
     * @param int $parentDepthLevel
     * @param int $siblingDepthLevel
     *
     * @return Image|null
     */
    private function checkForLargeImages(Element $node, $parentDepthLevel, $siblingDepthLevel) {
        $goodLocalImages = $this->getImageCandidates($node);

        $scoredLocalImages = $this->scoreLocalImages($goodLocalImages, $parentDepthLevel);

        ksort($scoredLocalImages);

        if (!empty($scoredLocalImages)) {
            foreach ($scoredLocalImages as $imageScore => $scoredLocalImage) {
                $mainImage = new Image();
                $mainImage->setImageSrc($scoredLocalImage->getImgSrc());
                $mainImage->setImageExtractionType('bigimage');
                $mainImage->setConfidenceScore(100 / count($scoredLocalImages));
                $mainImage->setImageScore($imageScore);
                $mainImage->setBytes($scoredLocalImage->getBytes());
                $mainImage->setHeight($scoredLocalImage->getHeight());
                $mainImage->setWidth($scoredLocalImage->getWidth());

                return $mainImage;
            }
        } else {
            $depthObj = $this->getDepthLevel($node, $parentDepthLevel, $siblingDepthLevel);

            if ($depthObj) {
                return $this->checkForLargeImages($depthObj->node, $depthObj->parentDepth, $depthObj->siblingDepth);
            }
        }

        return null;
    }

    /**
     * @param Element $node
     * @param int $parentDepth
     * @param int $siblingDepth
     *
     * @return object|null
     */
    private function getDepthLevel(Element $node, $parentDepth, $siblingDepth) {
        if (is_null($node)) {
            return null;
        }

        if ($parentDepth > self::$MAX_PARENT_DEPTH) {
            return null;
        }

        // Find previous sibling element node
        $siblingNode = $node->preceding(function($node) {
            return $node instanceof Element;
        });

        if (is_null($siblingNode)) {
            return (object)[
                'node' => $node->parent(),
                'parentDepth' => $parentDepth + 1,
                'siblingDepth' => 0,
            ];
        }

        return (object)[
            'node' => $siblingNode,
            'parentDepth' => $parentDepth,
            'siblingDepth' => $siblingDepth + 1,
        ];
    }

    /**
     * Set image scores on locally downloaded images.
     *
     * We score the images in the order in which they appear, so images higher up have more importance.
     * We count the area of the 1st image as a score of 1 and then calculate how much larger or smaller each image after it is.
     * We also try to weed out banner-type ad blocks that have big widths and small heights or vice versa.
     * So if the image is the 3rd found in the DOM, its sequence score would be 1 / 3 = .33 * diff in area from the first image.
     *
     * @param LocallyStoredImage[] $locallyStoredImages
     * @param int $depthLevel
     *
     * @return LocallyStoredImage[]
     */
    private function scoreLocalImages($locallyStoredImages, $depthLevel) {
        // Review note (Unused Code): The parameter $depthLevel is not used and could be removed.
        // This check looks for parameters that have been defined for a function or method,
        // but which are not used in the method body.
        $results = [];
        $i = 1;
        $initialArea = 0;

        // Limit to the first 30 images
        $locallyStoredImages = array_slice($locallyStoredImages, 0, 30);

        foreach ($locallyStoredImages as $locallyStoredImage) {
            $sequenceScore = 1 / $i;
            $area = $locallyStoredImage->getWidth() * $locallyStoredImage->getHeight();

            if ($initialArea == 0) {
                $initialArea = $area * 1.48;
                $totalScore = 1;
            } else {
                $areaDifference = $area * $initialArea;
                $totalScore = $sequenceScore * $areaDifference;
            }

            $i++;

            $results[$totalScore] = $locallyStoredImage;
        }

        return $results;
    }

    /**
     * @param LocallyStoredImage $locallyStoredImage
     * @param int $depthLevel
     *
     * @return bool
     */
    private function isWorthyImage($locallyStoredImage, $depthLevel) {
        // Review note (Unused Code): This method is not used, and could be removed.
        if ($locallyStoredImage->getWidth() <= $this->config()->get('image_min_width')
          || $locallyStoredImage->getHeight() <= $this->config()->get('image_min_height')
          || $locallyStoredImage->getFileExtension() == 'NA'
          || ($depthLevel < 1 && $locallyStoredImage->getWidth() < 300) || $depthLevel >= 1
          || $this->isBannerDimensions($locallyStoredImage->getWidth(), $locallyStoredImage->getHeight())) {
            return false;
        }

        return true;
    }

    /**
     * @return Image[]
     */
    private function getAllImages() {
        $results = [];

        $images = $this->article()->getTopNode()->find('img');

        // Generate a complete URL for each image
        $imageUrls = array_map(function($image) {
            return $this->buildImagePath($image->attr('src'));
        }, $images->toArray());

        $localImages = $this->getLocallyStoredImages($imageUrls);

        foreach ($localImages as $localImage) {
            $image = new Image();
            $image->setImageSrc($localImage->getImgSrc());
            $image->setBytes($localImage->getBytes());
            $image->setHeight($localImage->getHeight());
            $image->setWidth($localImage->getWidth());
            $image->setImageExtractionType('all');
            $image->setConfidenceScore(0);

            $results[] = $image;
        }

        return $results;
    }

    /**
     * Returns true if the dimensions look banner-like,
     * e.g. 600 / 100 = 6 may be a fishy ratio for a good image.
     *
     * @param int $width
     * @param int $height
     *
     * @return bool
     */
    private function isBannerDimensions($width, $height) {
        if ($width == $height) {
            return false;
        }

        // Review note (Duplication): This code seems to be duplicated across your project.
        // Duplicated code is one of the most pungent code smells. If you need to duplicate the same code
        // in three or more different places, we strongly encourage you to look into extracting the code
        // into a single class or operation. You can also find more detailed suggestions in the "Code"
        // section of your repository.
        if ($width > $height) {
            $diff = $width / $height;
            if ($diff > 5) {
                return true;
            }
        }

        // Review note (Duplication): This code seems to be duplicated across your project.
        if ($height > $width) {
            $diff = $height / $width;
            if ($diff > 5) {
                return true;
            }
        }

        return false;
    }

    /**
     * takes a list of image elements and filters out the ones with bad names
     *
     * @param \DOMWrap\NodeList $images
     *
     * @return Element[]
     */
    private function filterBadNames(NodeList $images) {
        $goodImages = [];

        foreach ($images as $image) {
            if ($this->isOkImageFileName($image)) {
                $goodImages[] = $image;
            } else {
                $image->remove();
            }
        }

        return $goodImages;
    }

    /**
     * will check the image src against a list of bad image files we know of like buttons, etc...
     *
     * @param Element $imageNode
     *
     * @return bool
     */
    private function isOkImageFileName(Element $imageNode) {
        $imgSrc = $imageNode->attr('src');

        if (empty($imgSrc)) {
            return false;
        }

        $regex = '@' . implode('|', $this->badFileNames) . '@i';

        if (preg_match($regex, $imgSrc)) {
            return false;
        }

        return true;
    }

    /**
     * @param Element $node
     *
     * @return LocallyStoredImage[]
     */
    private function getImageCandidates(Element $node) {
        $images = $node->find('img');
        $filteredImages = $this->filterBadNames($images);
        $goodImages = $this->findImagesThatPassByteSizeTest($filteredImages);

        return $goodImages;
    }

    /**
     * Loop through all the images and keep the ones whose byte size makes them a candidate.
     *
     * @todo Re-factor how the LocallyStoredImage => Image relation works? Note: PHP 5.6.x adds a 3rd argument
     *       to array_filter() to pass the key as well as value.
     *
     * @param Element[] $images
     *
     * @return LocallyStoredImage[]
     */
    private function findImagesThatPassByteSizeTest($images) {
        $i = 0;

        // Limit to the first 30 images
        $images = array_slice($images, 0, 30);

        // Generate a complete URL for each image
        $imageUrls = array_map(function($image) {
            return $this->buildImagePath($image->attr('src'));
        }, $images);

        $localImages = $this->getLocallyStoredImages($imageUrls, true);

        // $i is captured by reference so the index advances with each image;
        // a by-value capture would start from 0 on every call.
        $results = array_filter($localImages, function($localImage) use($images, &$i) {
            $image = $images[$i++];

            $bytes = $localImage->getBytes();

            if (($bytes < $this->config()->get('image_min_bytes') && $bytes != 0) || $bytes > $this->config()->get('image_max_bytes')) {
                $image->remove();

                return false;
            }

            return true;
        });

        return $results;
    }

    /**
     * checks to see if we were able to find feature image tags on this page
     *
     * @return Image|null
     */
    private function checkForLinkTag() {
        return $this->checkForTag('link[rel="image_src"]', 'href', 'linktag');
    }

    /**
     * checks to see if we were able to find open graph tags on this page
     *
     * @return Image|null
     */
    private function checkForOpenGraphTag() {
        return $this->checkForTag('meta[property="og:image"],meta[name="og:image"]', 'content', 'opengraph');
    }

    /**
     * checks to see if we were able to find twitter tags on this page
     *
     * @return Image|null
     */
    private function checkForTwitterTag() {
        return $this->checkForTag('meta[property="twitter:image"],meta[name="twitter:image"],meta[property="twitter:image:src"],meta[name="twitter:image:src"]', 'content', 'twitter');
    }

    /**
     * @param string $selector
     * @param string $attr
     * @param string $type
     *
     * @return Image|null
     */
    private function checkForTag($selector, $attr, $type) {
        $meta = $this->article()->getRawDoc()->find($selector);

        if (!$meta->count()) {
            return null;
        }

        $node = $meta->first();

        if (!($node instanceof Element)) {
            return null;
        }

        if (!$node->hasAttribute($attr)) {
            return null;
        }

        // Review note (Bug): It seems like $node->attr($attr) targeting DOMWrap\Traits\ManipulationTrait::attr()
        // can also be of type null or object<DOMWrap\Element>; however,
        // Goose\Modules\Extractors...actor::buildImagePath() does only seem to accept string, maybe add an
        // additional type check? This check looks at variables that are passed out again to other methods.
        // If the outgoing method call has stricter type requirements than the method itself, an issue is
        // raised. An additional type check may prevent trouble.
        $imagePath = $this->buildImagePath($node->attr($attr));
        $mainImage = new Image();
        $mainImage->setImageSrc($imagePath);
        $mainImage->setImageExtractionType($type);
        $mainImage->setConfidenceScore(100);

        $locallyStoredImage = $this->getLocallyStoredImage($mainImage->getImageSrc());

        // Review note (Duplication): This code seems to be duplicated across your project.
        if (!empty($locallyStoredImage)) {
            $mainImage->setBytes($locallyStoredImage->getBytes());
            $mainImage->setHeight($locallyStoredImage->getHeight());
            $mainImage->setWidth($locallyStoredImage->getWidth());
        }

        return $this->ensureMinimumImageSize($mainImage);
    }

    /**
     * @param Image $mainImage
     *
     * @return Image|null
     */
    private function ensureMinimumImageSize(Image $mainImage) {
        if ($mainImage->getWidth() >= $this->config()->get('image_min_width')
          && $mainImage->getHeight() >= $this->config()->get('image_min_height')) {
            return $mainImage;
        }

        return null;
    }

    /**
     * @param string $imageSrc
     * @param bool $returnAll
     *
     * @return LocallyStoredImage|null
     */
    private function getLocallyStoredImage($imageSrc, $returnAll = false) {
        $locallyStoredImages = ImageUtils::storeImagesToLocalFile([$imageSrc], $returnAll, $this->config());

        return array_shift($locallyStoredImages);
    }

    /**
     * @param string[] $imageSrcs
     * @param bool $returnAll
     *
     * @return LocallyStoredImage[]
     */
    private function getLocallyStoredImages($imageSrcs, $returnAll = false) {
        return ImageUtils::storeImagesToLocalFile($imageSrcs, $returnAll, $this->config());
    }

    /**
     * @return string
     */
    private function getCleanDomain() {
        return implode('.', array_slice(explode('.', $this->article()->getDomain()), -2, 2));
    }

    /**
     * Here we check for known image containers from sites we've checked out (Yahoo, TechCrunch, etc.) that have
     * known places to look for good images.
     *
     * @todo enable this to use a series of settings files so people can define what the image ids/classes are on specific sites
     *
     * @return Image|null
     */
    private function checkForKnownElements() {
        if (!$this->article()->getRawDoc()) {
            return null;
        }

        $knownImgDomNames = self::$KNOWN_IMG_DOM_NAMES;

        $domain = $this->getCleanDomain();

        $customSiteMapping = $this->customSiteMapping();

        if (isset($customSiteMapping[$domain])) {
            foreach (explode('|', $customSiteMapping[$domain]) as $class) {
                $knownImgDomNames[] = $class;
            }
        }

        $knownImage = null;

        foreach ($knownImgDomNames as $knownName) {
            $known = $this->article()->getRawDoc()->find('#' . $knownName);

            if (!$known->count()) {
                $known = $this->article()->getRawDoc()->find('.' . $knownName);
            }

            if ($known->count()) {
                $mainImage = $known->first()->find('img');

                if ($mainImage->count()) {
                    $knownImage = $mainImage->first();
                }
            }
        }

        if (is_null($knownImage)) {
            return null;
        }

        $knownImgSrc = $knownImage->attr('src');

        $mainImage = new Image();
        $mainImage->setImageSrc($this->buildImagePath($knownImgSrc));
        $mainImage->setImageExtractionType('known');
        $mainImage->setConfidenceScore(90);

        $locallyStoredImage = $this->getLocallyStoredImage($mainImage->getImageSrc());

        // Review note (Duplication): This code seems to be duplicated across your project.
        if (!empty($locallyStoredImage)) {
            $mainImage->setBytes($locallyStoredImage->getBytes());
            $mainImage->setHeight($locallyStoredImage->getHeight());
            $mainImage->setWidth($locallyStoredImage->getWidth());
        }

        return $this->ensureMinimumImageSize($mainImage);
    }

    /**
     * This method will take an image path and build out the absolute path to that image
     * using the initial url we crawled so we can find a link to the image if they use relative urls like ../myimage.jpg
     *
     * @param string $imageSrc
     *
     * @return string
     */
    private function buildImagePath($imageSrc) {
        $parts = array(
            'scheme',
            'host',
            'port',
            'path',
            'query',
        );

        $imageUrlParts = parse_url($imageSrc);
        $articleUrlParts = parse_url($this->article()->getFinalUrl());

        foreach ($parts as $part) {
            if (!isset($imageUrlParts[$part]) && isset($articleUrlParts[$part])) {
                $imageUrlParts[$part] = $articleUrlParts[$part];

            } else if (isset($imageUrlParts[$part]) && !isset($articleUrlParts[$part])) {
                break;
            }
        }

        return http_build_url($imageUrlParts, array());
    }

    /**
     * @return string[]
     */
    private function customSiteMapping() {
        if (empty(self::$CUSTOM_SITE_MAPPING)) {
            $file = __DIR__ . '/../../../resources/images/known-image-css.txt';

            $lines = explode("\n", str_replace(["\r\n", "\r"], "\n", file_get_contents($file)));

            foreach ($lines as $line) {
                // Skip blank or malformed lines that lack the "domain^css" separator
                if (strpos($line, '^') === false) {
                    continue;
                }

                list($domain, $css) = explode('^', $line);

                self::$CUSTOM_SITE_MAPPING[$domain] = $css;
            }
        }

        return self::$CUSTOM_SITE_MAPPING;
    }
}
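
The two mirrored branches flagged in isBannerDimensions() could be collapsed by comparing the longer side to the shorter side once. The following is a sketch of that idea as a standalone function, not the project's code: the $maxRatio parameter and the handling of zero or negative dimensions are assumptions added for illustration (the listed method would divide by zero in that case).

```php
<?php
// Hypothetical rewrite: one ratio check instead of two mirrored branches.
function isBannerDimensions(int $width, int $height, float $maxRatio = 5.0): bool {
    $shorter = min($width, $height);

    if ($shorter <= 0) {
        // Assumption: unknown or zero dimensions are not treated as banners.
        return false;
    }

    // Equal sides give a ratio of 1, so the explicit equality check is no longer needed.
    return max($width, $height) / $shorter > $maxRatio;
}

$wide = isBannerDimensions(600, 100);   // ratio 6
$tall = isBannerDimensions(100, 600);   // ratio 6
$edge = isBannerDimensions(500, 100);   // ratio exactly 5
$square = isBannerDimensions(300, 300); // ratio 1
```

As in the original, a ratio of exactly 5 does not count as a banner because the comparison is strict.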