PublishDateExtractor::getDateFromParsely()   F

Complexity

Conditions 14
Paths 328

Size

Total Lines 68
Code Lines 33

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric Value
eloc 33
dl 0
loc 68
c 0
b 0
f 0
rs 3.8333
cc 14
nc 328
nop 0

How to fix

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
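
As a rough illustration of that advice, the three commented sections of getDateFromParsely() ("JSON-LD", "<meta> tags", "parsely-page") could each become a small named function. The sketch below is not the library's code: the function names firstValidDate, jsonLdDateCandidates, parselyPageDateCandidates and dateFromParsely are hypothetical, and the inputs are assumed to be plain strings already pulled out of the document, so that the example runs without Goose or DOMWrap.

```php
<?php declare(strict_types=1);

/**
 * Hypothetical helper: return the first candidate that parses as a date.
 * Non-string and unparseable candidates are skipped.
 */
function firstValidDate(iterable $candidates): ?\DateTime {
    foreach ($candidates as $candidate) {
        if (!is_string($candidate) || $candidate === '') {
            continue;
        }
        try {
            return new \DateTime($candidate);
        } catch (\Exception $e) {
            // Unrecognizable date information; try the next candidate.
        }
    }
    return null;
}

/** Hypothetical: yield dateCreated values from raw JSON-LD script bodies. */
function jsonLdDateCandidates(array $scriptBodies): \Generator {
    foreach ($scriptBodies as $body) {
        $json = json_decode($body);
        if (!isset($json->dateCreated)) {
            continue;
        }
        // dateCreated may be a single value or a list; take the first entry.
        yield is_array($json->dateCreated)
            ? ($json->dateCreated[0] ?? null)
            : $json->dateCreated;
    }
}

/** Hypothetical: yield pub_date values from parsely-page content attributes. */
function parselyPageDateCandidates(array $contents): \Generator {
    foreach ($contents as $content) {
        $json = json_decode($content);
        if (isset($json->pub_date)) {
            yield $json->pub_date;
        }
    }
}

/** The former long method, reduced to one line per source. */
function dateFromParsely(array $jsonLd, array $pubDateMetas, array $parselyPages): ?\DateTime {
    return firstValidDate(jsonLdDateCandidates($jsonLd))
        ?? firstValidDate($pubDateMetas)
        ?? firstValidDate(parselyPageDateCandidates($parselyPages));
}

// Usage: the JSON-LD value is unparseable, so the <meta> value is used instead.
$dt = dateFromParsely(['{"dateCreated": "???"}'], ['2019-03-04T10:00:00Z'], []);
echo $dt !== null ? $dt->format('Y-m-d') : 'no date', PHP_EOL;
```

Each comment from the original body has turned into a function name, and the repeated loop, try/catch and null check now live in one place, which is where most of the conditions and paths counted above come from.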

Commonly applied refactorings include:

<?php declare(strict_types=1);

namespace Goose\Modules\Extractors;

use Goose\Article;
use Goose\Traits\ArticleMutatorTrait;
use Goose\Modules\{AbstractModule, ModuleInterface};
use DOMWrap\Element;

/**
 * Publish Date Extractor
 *
 * @package Goose\Modules\Extractors
 * @license http://www.apache.org/licenses/LICENSE-2.0 Apache License 2.0
 */
class PublishDateExtractor extends AbstractModule implements ModuleInterface {
    use ArticleMutatorTrait;

    /** @inheritdoc  */
    public function run(Article $article): self {
        $this->article($article);

        $dt = $this->getDateFromSchemaOrg();

        if (is_null($dt)) {
            $dt = $this->getDateFromOpenGraph();
        }

        if (is_null($dt)) {
            $dt = $this->getDateFromURL();
        }

        if (is_null($dt)) {
            $dt = $this->getDateFromDublinCore();
        }

        if (is_null($dt)) {
            $dt = $this->getDateFromParsely();
        }

        $article->setPublishDate($dt);

        return $this;
    }

    /**
     * @return \DateTime|null
     */
    private function getDateFromURL(): ?\DateTime {
        // Determine date based on URL
        if (preg_match('@(?:[\d]{4})(?<delimiter>[/-])(?:[\d]{2})\k<delimiter>(?:[\d]{2})@U', $this->article()->getFinalUrl(), $matches)) {

Bug: It seems like $this->article()->getFinalUrl() can also be of type null; however, parameter $subject of preg_match() seems to accept only string. Maybe add an additional type check?

If this is a false positive, you can also ignore this issue in your code via the ignore-type annotation:

        if (preg_match('@(?:[\d]{4})(?<delimiter>[/-])(?:[\d]{2})\k<delimiter>(?:[\d]{2})@U', /** @scrutinizer ignore-type */ $this->article()->getFinalUrl(), $matches)) {

            $dt = \DateTime::createFromFormat('Y' . $matches['delimiter'] . 'm' . $matches['delimiter'] . 'd', $matches[0]);

            // createFromFormat() returns false on failure, so check before calling setTime().
            if ($dt === false) {
                return null;
            }

            $dt->setTime(0, 0, 0);

            return $dt;
        }

        /** @todo Add more date detection methods */

        return null;
    }

    /**
     * Check for and determine dates from Schema.org's datePublished property.
     *
     * Checks HTML tags (e.g. <meta>, <time>, etc.) and JSON-LD.
     *
     * @return \DateTime|null
     *
     * @see https://schema.org/datePublished
     */
    private function getDateFromSchemaOrg(): ?\DateTime {
        $dt = null;

        // Check for HTML tags (<meta>, <time>, etc.)
        $nodes = $this->article()->getRawDoc()->find('*[itemprop="datePublished"]');

        /* @var $node Element */

Unused Code / Comprehensibility: 43% of this comment could be valid code. Did you maybe forget this after debugging?

Sometimes obsolete code just ends up commented out instead of removed. In this case it is better to remove the code once you have checked you do not need it.

The code might also have been commented out for debugging purposes. In this case it is vital that someone uncomments it again or your project may behave in very unexpected ways in production.

This check looks for comments that seem to be mostly valid code and reports them.

        foreach ($nodes as $node) {
            try {
                if ($node->hasAttribute('datetime')) {
                    $dt = new \DateTime($node->getAttribute('datetime'));
                    break;
                }
                if ($node->hasAttribute('content')) {
                    $dt = new \DateTime($node->getAttribute('content'));
                    break;
                }
            }
            catch (\Exception $e) {
                // Do nothing here in case the node has unrecognizable date information.
            }
        }

        if (!is_null($dt)) {
            return $dt;
        }

        // Check for JSON-LD
        $nodes = $this->article()->getRawDoc()->find('script[type="application/ld+json"]');

        /* @var $node Element */
        foreach ($nodes as $node) {
            try {
                $json = json_decode($node->text());

                // Extract the published date from the Schema.org meta data
                if (isset($json->{'@graph'}) && is_array($json->{'@graph'})) {
                    foreach ($json->{'@graph'} as $graphData) {
                        $graphData = (array)$graphData;

                        if (!isset($graphData['datePublished'])) {
                            continue;
                        }

                        $date = @$graphData['datePublished'];

                        try {
                            $dt = new \DateTime($date);
                        } catch (\Error $ex) {
                            // Do nothing here in case the node has unrecognizable date information.
                        }
                    }
                }

                if (isset($json->datePublished)) {
                    $date = is_array($json->datePublished)
                        ? array_shift($json->datePublished)
                        : $json->datePublished;

                    try {
                        $dt = new \DateTime($date);
                    } catch (\Error $ex) {
                        // Do nothing here in case the node has unrecognizable date information.
                    }

                    break;
                }
            }
            catch (\Exception $e) {
                // Do nothing here in case the node has unrecognizable date information.
            }
        }

        return $dt;
    }

    /**
     * Check for and determine dates based on Dublin Core standards.
     *
     * @return \DateTime|null
     *
     * @see http://dublincore.org/documents/dcmi-terms/#elements-date
     * @see http://dublincore.org/documents/2000/07/16/usageguide/qualified-html.shtml
     */
    private function getDateFromDublinCore(): ?\DateTime {
        $dt = null;
        $nodes = $this->article()->getRawDoc()->find('*[name="dc.date"], *[name="dc.date.issued"], *[name="DC.date.issued"]');

        /* @var $node Element */
        foreach ($nodes as $node) {
            try {
                if ($node->hasAttribute('content')) {
                    $dt = new \DateTime($node->getAttribute('content'));
                    break;
                }
            }
            catch (\Exception $e) {
                // Do nothing here in case the node has unrecognizable date information.
            }
        }

        return $dt;
    }

    /**
     * Check for and determine dates based on OpenGraph standards.
     *
     * @return \DateTime|null
     *
     * @see http://ogp.me/
     * @see http://ogp.me/#type_article
     */
    private function getDateFromOpenGraph(): ?\DateTime {
        $dt = null;

        $og_data = $this->article()->getOpenGraph();

        try {
            if (isset($og_data['published_time'])) {
                $dt = new \DateTime($og_data['published_time']);
            }
            if (is_null($dt) && isset($og_data['pubdate'])) {
                $dt = new \DateTime($og_data['pubdate']);
            }
        }
        catch (\Exception $e) {
            // Do nothing here in case the node has unrecognizable date information.
        }

        return $dt;
    }

    /**
     * Check for and determine dates based on Parsely metadata.
     *
     * Checks JSON-LD, <meta> tags and parsely-page.
     *
     * @return \DateTime|null
     *
     * @see https://www.parsely.com/help/integration/jsonld/
     * @see https://www.parsely.com/help/integration/metatags/
     * @see https://www.parsely.com/help/integration/ppage/
     */
    private function getDateFromParsely(): ?\DateTime {
        $dt = null;

        // JSON-LD
        $nodes = $this->article()->getRawDoc()->find('script[type="application/ld+json"]');

        /* @var $node Element */
        foreach ($nodes as $node) {
            try {
                $json = json_decode($node->text());
                if (isset($json->dateCreated)) {
                    $date = is_array($json->dateCreated)
                        ? array_shift($json->dateCreated)
                        : $json->dateCreated;

                    $dt = new \DateTime($date);
                    break;
                }
            }
            catch (\Exception $e) {
                // Do nothing here in case the node has unrecognizable date information.
            }
        }

        if (!is_null($dt)) {
            return $dt;
        }

        // <meta> tags
        $nodes = $this->article()->getRawDoc()->find('meta[name="parsely-pub-date"]');

        /* @var $node Element */
        foreach ($nodes as $node) {
            try {
                if ($node->hasAttribute('content')) {
                    $dt = new \DateTime($node->getAttribute('content'));
                    break;
                }
            }
            catch (\Exception $e) {
                // Do nothing here in case the node has unrecognizable date information.
            }
        }

        if (!is_null($dt)) {
            return $dt;
        }

        // parsely-page
        $nodes = $this->article()->getRawDoc()->find('meta[name="parsely-page"]');

        /* @var $node Element */
        foreach ($nodes as $node) {
            try {
                if ($node->hasAttribute('content')) {
                    $json = json_decode($node->getAttribute('content'));
                    if (isset($json->pub_date)) {
                        $dt = new \DateTime($json->pub_date);
                        break;
                    }
                }
            }
            catch (\Exception $e) {
                // Do nothing here in case the node has unrecognizable date information.
            }
        }

        return $dt;
    }
}