Completed
Push — master ( 140b40...7cb46a )
by Andrew
03:24
created

PublishDateExtractor   B

Complexity

Total Complexity 40

Size/Duplication

Total Lines 245
Duplicated Lines 24.49 %

Coupling/Cohesion

Components 1
Dependencies 4

Importance

Changes 2
Bugs 0 Features 1
Metric Value
wmc 40
c 2
b 0
f 1
lcom 1
cbo 4
dl 60
loc 245
rs 8.2608

6 Methods

Rating   Name   Duplication   Size   Complexity  
B run() 0 25 5
A getDateFromURL() 0 17 3
C getDateFromSchemaOrg() 12 46 9
B getDateFromDublinCore() 11 23 5
B getDateFromOpenGraph() 0 19 5
C getDateFromParsely() 37 65 13

How to fix   Duplicated Code    Complexity   

Duplicated Code

Duplicate code is one of the most pungent code smells. A rule that is often used is to re-structure code once it is duplicated in three or more places.

Common duplication problems and their corresponding solutions are:
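For the loops flagged in this class, one possible solution is to move the repeated "iterate nodes, try to parse an attribute, swallow parse errors" pattern into a single operation. The sketch below is only an illustration: `firstDateFromAttributes()` is a hypothetical helper that is not part of Goose, and it uses PHP's built-in DOM extension instead of DOMWrap so that it runs on its own (it assumes the nodes expose `hasAttribute()` and `getAttribute()`, as `DOMElement` does). Unlike the original loops, it keeps trying the remaining attributes of a node after one of them fails to parse.

```php
<?php

/**
 * Hypothetical helper (not part of Goose): return the first parsable date
 * found in any of the given attributes across the given nodes, or null.
 */
function firstDateFromAttributes(iterable $nodes, array $attributes): ?\DateTime
{
    foreach ($nodes as $node) {
        foreach ($attributes as $attribute) {
            if (!$node->hasAttribute($attribute)) {
                continue;
            }
            try {
                return new \DateTime($node->getAttribute($attribute));
            } catch (\Exception $e) {
                // Unrecognizable date information: try the next candidate.
            }
        }
    }

    return null;
}

// Usage with the built-in DOM extension; the markup is made up for the demo.
libxml_use_internal_errors(true); // silence warnings about HTML5 tags such as <time>
$doc = new \DOMDocument();
$doc->loadHTML(
    '<html><head><meta name="dc.date" content="2016-03-01"></head>'
    . '<body><time itemprop="datePublished" datetime="2016-02-29T10:00:00Z"></time></body></html>'
);
$xpath = new \DOMXPath($doc);

$schemaDate = firstDateFromAttributes(
    $xpath->query('//*[@itemprop="datePublished"]'),
    ['datetime', 'content']
);
$dublinDate = firstDateFromAttributes($xpath->query('//meta[@name="dc.date"]'), ['content']);
$noDate = firstDateFromAttributes([], ['content']);
```

With a helper along these lines, the Dublin Core and Parsely `<meta>` loops would each shrink to a single call, which should also lower the reported duplication.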

Complex Class

 Tip: Before tackling complexity, make sure that you eliminate any duplication first. This can often reduce the size of classes significantly.

Complex classes like PublishDateExtractor often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to finding such a component is to look for fields/methods that share the same prefixes or suffixes. You can also look at the cohesion graph to spot any unconnected or weakly connected components.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

While breaking up the class, it is a good idea to analyze how other classes use PublishDateExtractor, and based on these observations, apply Extract Interface, too.
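As a rough illustration of Extract Class combined with Extract Interface, the sketch below turns two of the `getDateFrom…()` methods into small classes behind a common interface and tries them in order. All names here (`DateSource`, `OpenGraphDateSource`, `UrlDateSource`, `ChainedDateExtractor`) and the plain-array `$page` input are assumptions made for the example; they are not part of Goose, and the URL pattern is a simplified variant of the one in the class.

```php
<?php

/** Hypothetical interface: one strategy for locating a publish date. */
interface DateSource
{
    public function find(array $page): ?\DateTime;
}

/** Reads the date from already-parsed OpenGraph data. */
final class OpenGraphDateSource implements DateSource
{
    public function find(array $page): ?\DateTime
    {
        foreach (['published_time', 'pubdate'] as $key) {
            if (!isset($page['openGraph'][$key])) {
                continue;
            }
            try {
                return new \DateTime($page['openGraph'][$key]);
            } catch (\Exception $e) {
                // Unrecognizable date information: try the next key.
            }
        }

        return null;
    }
}

/** Reads a YYYY/MM/DD or YYYY-MM-DD date from the URL. */
final class UrlDateSource implements DateSource
{
    public function find(array $page): ?\DateTime
    {
        $pattern = '@(\d{4})(?<delimiter>[/-])(\d{2})\k<delimiter>(\d{2})@';
        if (!preg_match($pattern, $page['url'] ?? '', $m)) {
            return null;
        }

        // Groups: 1 = year, 2 = delimiter, 3 = month, 4 = day.
        // The leading "!" resets the time of day to 00:00:00.
        $dt = \DateTime::createFromFormat('!Y-m-d', "{$m[1]}-{$m[3]}-{$m[4]}");

        return $dt === false ? null : $dt;
    }
}

/** Tries each source in order and returns the first date found. */
final class ChainedDateExtractor
{
    /** @var DateSource[] */
    private $sources;

    public function __construct(array $sources)
    {
        $this->sources = $sources;
    }

    public function extract(array $page): ?\DateTime
    {
        foreach ($this->sources as $source) {
            $dt = $source->find($page);
            if ($dt !== null) {
                return $dt;
            }
        }

        return null;
    }
}

$extractor = new ChainedDateExtractor([new OpenGraphDateSource(), new UrlDateSource()]);

$fromUrl = $extractor->extract(['url' => 'http://example.com/2016/03/01/story', 'openGraph' => []]);
$fromOg = $extractor->extract([
    'url' => 'http://example.com/story',
    'openGraph' => ['published_time' => '2015-12-31T08:00:00+00:00'],
]);
$none = $extractor->extract(['url' => 'http://example.com/story', 'openGraph' => []]);
```

In a design like this, each source can be tested in isolation, and the fallback order in `run()` becomes a matter of how the list is configured.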

<?php

namespace Goose\Modules\Extractors;

use Goose\Article;
use Goose\Traits\ArticleMutatorTrait;
use Goose\Modules\AbstractModule;
use Goose\Modules\ModuleInterface;
use DOMWrap\Element;

/**
 * Publish Date Extractor
 *
 * @package Goose\Modules\Extractors
 * @license http://www.apache.org/licenses/LICENSE-2.0 Apache License 2.0
 */
class PublishDateExtractor extends AbstractModule implements ModuleInterface {
    use ArticleMutatorTrait;

    /**
     * @param Article $article
     *
     * @return \DateTime
     */
    public function run(Article $article) {
        $this->article($article);

        $dt = null;
Unused Code introduced by
$dt is not used, you could remove the assignment.

This check looks for variable assignments that are either overwritten by other assignments or where the variable is not used subsequently.

$myVar = 'Value';
$higher = false;

if (rand(1, 6) > 3) {
    $higher = true;
} else {
    $higher = false;
}

Both the $myVar assignment in line 1 and the $higher assignment in line 2 are dead: the first because $myVar is never used, and the second because $higher is always overwritten on every possible execution path.

        $dt = $this->getDateFromSchemaOrg();

        if (is_null($dt)) {
            $dt = $this->getDateFromOpenGraph();
        }

        if (is_null($dt)) {
            $dt = $this->getDateFromURL();
        }

        if (is_null($dt)) {
            $dt = $this->getDateFromDublinCore();
        }

        if (is_null($dt)) {
            $dt = $this->getDateFromParsely();
        }

        $article->setPublishDate($dt);
    }

    private function getDateFromURL() {
        // Determine date based on URL
        if (preg_match('@(?:[\d]{4})(?<delimiter>[/-])(?:[\d]{2})\k<delimiter>(?:[\d]{2})@U', $this->article()->getFinalUrl(), $matches)) {
            $dt = \DateTime::createFromFormat('Y' . $matches['delimiter'] . 'm' . $matches['delimiter'] . 'd', $matches[0]);

            // createFromFormat() returns false on failure, so check before calling setTime().
            if ($dt === false) {
                return null;
            }

            $dt->setTime(0, 0, 0);

            return $dt;
        }

        /** @todo Add more date detection methods */

        return null;
    }

    /**
     * Check for and determine dates from Schema.org's datePublished property.
     *
     * Checks HTML tags (e.g. <meta>, <time>, etc.) and JSON-LD.
     *
     * @return \DateTime|null
     *
     * @see https://schema.org/datePublished
     */
    private function getDateFromSchemaOrg() {
        $dt = null;

        // Check for HTML tags (<meta>, <time>, etc.)
        $nodes = $this->article()->getRawDoc()->find('*[itemprop="datePublished"]');

        /* @var $node Element */
        foreach ($nodes as $node) {
            try {
                if ($node->hasAttribute('datetime')) {
                    $dt = new \DateTime($node->getAttribute('datetime'));
                    break;
                }
                if ($node->hasAttribute('content')) {
                    $dt = new \DateTime($node->getAttribute('content'));
                    break;
                }
            }
            catch (\Exception $e) {
                // Do nothing here in case the node has unrecognizable date information.
            }
        }

        if (!is_null($dt)) {
            return $dt;
        }

        // Check for JSON-LD
        $nodes = $this->article()->getRawDoc()->find('script[type="application/ld+json"]');

        /* @var $node Element */
        foreach ($nodes as $node) {
Duplication introduced by
This code seems to be duplicated across your project.

Duplicated code is one of the most pungent code smells. If you need to duplicate the same code in three or more different places, we strongly encourage you to look into extracting the code into a single class or operation.

You can also find more detailed suggestions in the “Code” section of your repository.

            try {
                $json = json_decode($node->text());
                if (isset($json->datePublished)) {
                    $dt = new \DateTime($json->datePublished);
                    break;
                }
            }
            catch (\Exception $e) {
                // Do nothing here in case the node has unrecognizable date information.
            }
        }

        return $dt;
    }

    /**
     * Check for and determine dates based on Dublin Core standards.
     *
     * @return \DateTime|null
     *
     * @see http://dublincore.org/documents/dcmi-terms/#elements-date
     * @see http://dublincore.org/documents/2000/07/16/usageguide/qualified-html.shtml
     */
    private function getDateFromDublinCore() {
        $dt = null;
        $nodes = $this->article()->getRawDoc()->find('*[name="dc.date"], *[name="dc.date.issued"], *[name="DC.date.issued"]');

        /* @var $node Element */
        foreach ($nodes as $node) {
Duplication introduced by
This code seems to be duplicated across your project.

            try {
                if ($node->hasAttribute('content')) {
                    $dt = new \DateTime($node->getAttribute('content'));
                    break;
                }
            }
            catch (\Exception $e) {
                // Do nothing here in case the node has unrecognizable date information.
            }
        }

        if (!is_null($dt)) {
            return $dt;
        }

        return $dt;
    }

    /**
     * Check for and determine dates based on OpenGraph standards.
     *
     * @return \DateTime|null
     *
     * @see http://ogp.me/
     * @see http://ogp.me/#type_article
     */
    private function getDateFromOpenGraph() {
        $dt = null;

        $og_data = $this->article()->getOpenGraph();

        try {
            if (isset($og_data['published_time'])) {
                $dt = new \DateTime($og_data['published_time']);
            }
            if (is_null($dt) && isset($og_data['pubdate'])) {
                $dt = new \DateTime($og_data['pubdate']);
            }
        }
        catch (\Exception $e) {
            // Do nothing here in case the node has unrecognizable date information.
        }

        return $dt;
    }

    /**
     * Check for and determine dates based on Parsely metadata.
     *
     * Checks JSON-LD, <meta> tags and parsely-page.
     *
     * @return \DateTime|null
     *
     * @see https://www.parsely.com/help/integration/jsonld/
     * @see https://www.parsely.com/help/integration/metatags/
     * @see https://www.parsely.com/help/integration/ppage/
     */
    private function getDateFromParsely() {
        $dt = null;

        // JSON-LD
        $nodes = $this->article()->getRawDoc()->find('script[type="application/ld+json"]');

        /* @var $node Element */
        foreach ($nodes as $node) {
Duplication introduced by
This code seems to be duplicated across your project.

            try {
                $json = json_decode($node->text());
                if (isset($json->dateCreated)) {
                    $dt = new \DateTime($json->dateCreated);
                    break;
                }
            }
            catch (\Exception $e) {
                // Do nothing here in case the node has unrecognizable date information.
            }
        }

        if (!is_null($dt)) {
            return $dt;
        }

        // <meta> tags
        $nodes = $this->article()->getRawDoc()->find('meta[name="parsely-pub-date"]');

        /* @var $node Element */
        foreach ($nodes as $node) {
Duplication introduced by
This code seems to be duplicated across your project.

            try {
                if ($node->hasAttribute('content')) {
                    $dt = new \DateTime($node->getAttribute('content'));
                    break;
                }
            }
            catch (\Exception $e) {
                // Do nothing here in case the node has unrecognizable date information.
            }
        }

        if (!is_null($dt)) {
            return $dt;
        }

        // parsely-page
        $nodes = $this->article()->getRawDoc()->find('meta[name="parsely-page"]');

        /* @var $node Element */
        foreach ($nodes as $node) {
Duplication introduced by
This code seems to be duplicated across your project.

            try {
                if ($node->hasAttribute('content')) {
                    $json = json_decode($node->getAttribute('content'));
                    if (isset($json->pub_date)) {
                        $dt = new \DateTime($json->pub_date);
                        break;
                    }
                }
            }
            catch (\Exception $e) {
                // Do nothing here in case the node has unrecognizable date information.
            }
        }

        return $dt;
    }
}