Completed
Push — master ( 693ab0...8bffac )
by Lars
01:43
created

HtmlDomParser   F

Complexity

Total Complexity 122

Size/Duplication

Total Lines 1047
Duplicated Lines 16.05 %

Coupling/Cohesion

Components 1
Dependencies 7

Test Coverage

Coverage 94.22%

Importance

Changes 0
Metric Value
dl 168
loc 1047
ccs 310
cts 329
cp 0.9422
rs 1.412
c 0
b 0
f 0
wmc 122
lcom 1
cbo 7
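
The ratio rows above can be reproduced from the count rows. A minimal sketch, assuming `ccs` and `cts` stand for covered and total statements and `dl` for duplicated lines (the report does not define these abbreviations, and the `ratio()` helper is hypothetical):

```php
<?php

declare(strict_types=1);

// Assumption: the abbreviations are read as plain counts; the report itself
// does not spell them out.
$metrics = [
    'dl'  => 168,  // duplicated lines (assumed meaning)
    'loc' => 1047, // total lines
    'ccs' => 310,  // covered statements (assumed meaning)
    'cts' => 329,  // total statements (assumed meaning)
];

/**
 * Hypothetical helper: ratio of two counts, rounded to four decimals.
 */
function ratio(int $part, int $whole): float
{
    return $whole === 0 ? 0.0 : \round($part / $whole, 4);
}

$coverage    = ratio($metrics['ccs'], $metrics['cts']); // compare with "cp 0.9422"
$duplication = ratio($metrics['dl'], $metrics['loc']);  // compare with "Duplicated Lines"

\printf("Coverage: %.2f %%\n", $coverage * 100);
\printf("Duplicated lines: %.2f %%\n", $duplication * 100);
```

Under these assumptions the two computed values agree with the 94.22% coverage and 16.05 % duplication shown in the summary.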

35 Methods

Rating   Name   Duplication   Size   Complexity  
A findOne() 0 4 1
B __get() 0 18 7
A __construct() 28 28 5
A __call() 0 10 2
A __callStatic() 20 20 3
A __toString() 0 4 1
A clear() 0 4 1
F createDOMDocument() 34 163 37
B find() 31 31 6
A findMulti() 0 4 1
A findMultiOrFalse() 0 10 2
A findOneOrFalse() 0 10 2
C fixHtmlOutput() 0 134 8
A getElementByClass() 0 4 1
A getElementById() 0 4 1
A getElementByTagName() 0 10 2
A getElementsById() 0 4 1
A getElementsByTagName() 0 27 5
A html() 0 18 4
A loadHtml() 0 9 1
B loadHtmlFile() 30 30 6
A xml() 25 25 4
A __invoke() 0 4 1
A getIsDOMDocumentCreatedWithoutHeadWrapper() 0 4 1
A getIsDOMDocumentCreatedWithoutPTagWrapper() 0 4 1
A getIsDOMDocumentCreatedWithoutHtml() 0 4 1
A getIsDOMDocumentCreatedWithoutBodyWrapper() 0 4 1
A getIsDOMDocumentCreatedWithoutHtmlWrapper() 0 4 1
A getIsDOMDocumentCreatedWithoutWrapper() 0 4 1
A getIsDOMDocumentCreatedWithFakeEndScript() 0 4 1
A keepBrokenHtml() 0 45 3
A keepSpecialScriptTags() 0 37 3
A useKeepBrokenHtml() 0 6 1
A overwriteTemplateLogicSyntaxInSpecialScriptTags() 0 12 3
A overwriteSpecialScriptTags() 0 12 3

How to fix

Duplicated Code

Duplicate code is one of the most pungent code smells. A rule that is often used is to re-structure code once it is duplicated in three or more places.

Common duplication problems, and corresponding solutions are:
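
One instance in this class, sketched under assumptions: `createDOMDocument()` repeats the same pair of `strpos()` checks for `<html>`, `<body>`, `<head>` and `<p>`. The helper name `containsOpeningTag()` is hypothetical and not part of the library; it only illustrates how the repeated condition could be pulled into one place.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of HtmlDomParser): true if $html contains an
 * opening $tag, written either with attributes ("<p ") or without ("<p>").
 */
function containsOpeningTag(string $html, string $tag): bool
{
    return \strpos($html, '<' . $tag . ' ') !== false
        || \strpos($html, '<' . $tag . '>') !== false;
}

$html = '<body class="x"><p>Hello</p></body>';

// Each flag mirrors one of the four duplicated if-blocks in the listing.
$withoutHtmlWrapper = !containsOpeningTag($html, 'html');
$withoutBodyWrapper = !containsOpeningTag($html, 'body');
$withoutHeadWrapper = !containsOpeningTag($html, 'head');
$withoutPTagWrapper = !containsOpeningTag($html, 'p');

\var_dump($withoutHtmlWrapper, $withoutBodyWrapper, $withoutHeadWrapper, $withoutPTagWrapper);
```

The helper keeps the case-sensitive `strpos()` behaviour of the original checks, so it changes structure only, not results.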

Complex Class

Tip: Before tackling complexity, make sure that you eliminate any duplication first. This can often reduce the size of classes significantly.

Complex classes like HtmlDomParser often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to finding such a component is to look for fields or methods that share the same prefix or suffix. You can also look at the cohesion graph to spot any unconnected or weakly connected components.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
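
As an illustration of Extract Class, the flags that share the `isDOMDocumentCreatedWithout...` prefix could move into a small state object. This is only a sketch: the class name `DocumentWrapperState` and its methods are hypothetical and do not exist in the library.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical extracted class: holds the wrapper flags that share the
 * "isDOMDocumentCreatedWithout..." prefix in HtmlDomParser.
 */
final class DocumentWrapperState
{
    /** @var array<string, bool> */
    private $missing = [];

    /** Record that the input HTML had no wrapper of the given kind. */
    public function markMissing(string $wrapper): void
    {
        $this->missing[$wrapper] = true;
    }

    /** Unknown wrappers count as present, like the "false" defaults in the class. */
    public function isMissing(string $wrapper): bool
    {
        return $this->missing[$wrapper] ?? false;
    }
}

$state = new DocumentWrapperState();
$state->markMissing('html');

\var_dump($state->isMissing('html'));
\var_dump($state->isMissing('body'));
```

The parser would then hold one `DocumentWrapperState` field instead of several booleans and their getters.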

While breaking up the class, it is a good idea to analyze how other classes use HtmlDomParser, and based on these observations, apply Extract Interface, too.
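
A minimal sketch of Extract Interface, assuming callers only need a lookup method. The names `ElementFinder`, `ArrayElementFinder` and `countMatches()` are hypothetical, and the return type is simplified to an array of strings rather than the library's node types.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical interface: only the lookup method a caller might need.
 */
interface ElementFinder
{
    /** @return string[] */
    public function findMulti(string $selector): array;
}

/** Toy implementation over a fixed map; it stands in for the real parser. */
final class ArrayElementFinder implements ElementFinder
{
    /** @var array<string, string[]> */
    private $elements;

    public function __construct(array $elements)
    {
        $this->elements = $elements;
    }

    public function findMulti(string $selector): array
    {
        return $this->elements[$selector] ?? [];
    }
}

/** Client code depends on the interface, not on a concrete parser. */
function countMatches(ElementFinder $finder, string $selector): int
{
    return \count($finder->findMulti($selector));
}

$finder = new ArrayElementFinder(['p' => ['<p>a</p>', '<p>b</p>']]);

echo countMatches($finder, 'p'), "\n";
```

Because `countMatches()` only type-hints the interface, a test double such as `ArrayElementFinder` can replace the real parser without touching the caller.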

<?php

declare(strict_types=1);

namespace voku\helper;

/**
 * @property-read string $outerText
 *                                 <p>Get dom node's outer html (alias for "outerHtml").</p>
 * @property-read string $outerHtml
 *                                 <p>Get dom node's outer html.</p>
 * @property-read string $innerText
 *                                 <p>Get dom node's inner html (alias for "innerHtml").</p>
 * @property-read string $innerHtml
 *                                 <p>Get dom node's inner html.</p>
 * @property-read string $plaintext
 *                                 <p>Get dom node's plain text.</p>
 *
 * @method string outerText()
 *                                 <p>Get dom node's outer html (alias for "outerHtml()").</p>
 * @method string outerHtml()
 *                                 <p>Get dom node's outer html.</p>
 * @method string innerText()
 *                                 <p>Get dom node's inner html (alias for "innerHtml()").</p>
 * @method HtmlDomParser load(string $html)
 *                                 <p>Load HTML from string.</p>
 * @method HtmlDomParser load_file(string $html)
 *                                 <p>Load HTML from file.</p>
 * @method static HtmlDomParser file_get_html($filePath, $libXMLExtraOptions = null)
 *                                 <p>Load HTML from file.</p>
 * @method static HtmlDomParser str_get_html($html, $libXMLExtraOptions = null)
 *                                 <p>Load HTML from string.</p>
 */
class HtmlDomParser extends AbstractDomParser
{
    /**
     * @var string[]
     */
    protected static $functionAliases = [
        'outertext' => 'html',
        'outerhtml' => 'html',
        'innertext' => 'innerHtml',
        'innerhtml' => 'innerHtml',
        'load'      => 'loadHtml',
        'load_file' => 'loadHtmlFile',
    ];

    /**
     * @var string[]
     */
    protected $templateLogicSyntaxInSpecialScriptTags = [
        '+',
        '<%',
        '{%',
        '{{',
    ];

    /**
     * The properties specified for each special script tag is an array.
     *
     * ```php
     * protected $specialScriptTags = [
     *     'text/html',
     *     'text/x-custom-template',
     *     'text/x-handlebars-template'
     * ]
     * ```
     *
     * @var string[]
     */
    protected $specialScriptTags = [
        'text/html',
        'text/x-custom-template',
        'text/x-handlebars-template',
    ];

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutHtml = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithCommentWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutHeadWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutPTagWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutHtmlWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutBodyWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithFakeEndScript = false;

    /**
     * @var bool
     */
    protected $keepBrokenHtml;

Duplication: This method seems to be duplicated in your project.

Duplicated code is one of the most pungent code smells. If you need to duplicate the same code in three or more different places, we strongly encourage you to look into extracting the code into a single class or operation.

You can also find more detailed suggestions in the “Code” section of your repository.

    /**
     * @param \DOMNode|SimpleHtmlDomInterface|string $element HTML code or SimpleHtmlDomInterface, \DOMNode
     */
    public function __construct($element = null)
    {
        $this->document = new \DOMDocument('1.0', $this->getEncoding());

        // DOMDocument settings
        $this->document->preserveWhiteSpace = true;
        $this->document->formatOutput = true;

        if ($element instanceof SimpleHtmlDomInterface) {
            $element = $element->getNode();
        }

        if ($element instanceof \DOMNode) {
            $domNode = $this->document->importNode($element, true);

            if ($domNode instanceof \DOMNode) {
                /** @noinspection UnusedFunctionResultInspection */
                $this->document->appendChild($domNode);
            }

            return;
        }

        if ($element !== null) {
            /** @noinspection UnusedFunctionResultInspection */
            $this->loadHtml($element);
        }
    }

    /**
     * @param string $name
     * @param array  $arguments
     *
     * @return bool|mixed
     */
    public function __call($name, $arguments)
    {
        $name = \strtolower($name);

        if (isset(self::$functionAliases[$name])) {
            return \call_user_func_array([$this, self::$functionAliases[$name]], $arguments);
        }

        throw new \BadMethodCallException('Method does not exist: ' . $name);
    }

Duplication: This method seems to be duplicated in your project.

    /**
     * @param string $name
     * @param array  $arguments
     *
     * @throws \BadMethodCallException
     * @throws \RuntimeException
     *
     * @return HtmlDomParser
     */
    public static function __callStatic($name, $arguments)
    {
        $arguments0 = $arguments[0] ?? '';

        $arguments1 = $arguments[1] ?? null;

        if ($name === 'str_get_html') {
            $parser = new static();

            return $parser->loadHtml($arguments0, $arguments1);
        }

        if ($name === 'file_get_html') {
            $parser = new static();

            return $parser->loadHtmlFile($arguments0, $arguments1);
        }

        throw new \BadMethodCallException('Method does not exist');
    }

    /** @noinspection MagicMethodsValidityInspection */

    /**
     * @param string $name
     *
     * @return string|null
     */
    public function __get($name)
    {
        $name = \strtolower($name);

        switch ($name) {
            case 'outerhtml':
            case 'outertext':
                return $this->html();
            case 'innerhtml':
            case 'innertext':
                return $this->innerHtml();
            case 'text':
            case 'plaintext':
                return $this->text();
        }

        return null;
    }

    /**
     * @return string
     */
    public function __toString()
    {
        return $this->html();
    }

    /**
     * does nothing (only for api-compatibility-reasons)
     *
     * @return bool
     *
     * @deprecated
     */
    public function clear(): bool
    {
        return true;
    }

    /**
     * Create DOMDocument from HTML.
     *
     * @param string   $html
     * @param int|null $libXMLExtraOptions
     *
     * @return \DOMDocument
     */
    protected function createDOMDocument(string $html, $libXMLExtraOptions = null): \DOMDocument
    {
        if ($this->keepBrokenHtml) {
            $html = $this->keepBrokenHtml(\trim($html));
        }

        if (\strpos($html, '<') === false) {
            $this->isDOMDocumentCreatedWithoutHtml = true;
        } elseif (\strpos(\ltrim($html), '<') !== 0) {
            $this->isDOMDocumentCreatedWithoutWrapper = true;
        }

        if (\strpos(\ltrim($html), '<!--') === 0) {
            $this->isDOMDocumentCreatedWithCommentWrapper = true;
        }

        /** @noinspection HtmlRequiredLangAttribute */
        if (
            \strpos($html, '<html ') === false
            &&
            \strpos($html, '<html>') === false
        ) {
            $this->isDOMDocumentCreatedWithoutHtmlWrapper = true;
        }

        if (
            \strpos($html, '<body ') === false
            &&
            \strpos($html, '<body>') === false
        ) {
            $this->isDOMDocumentCreatedWithoutBodyWrapper = true;
        }

        /** @noinspection HtmlRequiredTitleElement */
        if (
            \strpos($html, '<head ') === false
            &&
            \strpos($html, '<head>') === false
        ) {
            $this->isDOMDocumentCreatedWithoutHeadWrapper = true;
        }

        if (
            \strpos($html, '<p ') === false
            &&
            \strpos($html, '<p>') === false
        ) {
            $this->isDOMDocumentCreatedWithoutPTagWrapper = true;
        }

        if (
            \strpos($html, '</script>') === false
            &&
            \strpos($html, '<\/script>') !== false
        ) {
            $this->isDOMDocumentCreatedWithFakeEndScript = true;
        }

        if (\stripos($html, '</html>') !== false) {
            /** @noinspection NestedPositiveIfStatementsInspection */
            if (
                \preg_match('/<\/html>(.*?)/suiU', $html, $matches_after_html)
                &&
                \trim($matches_after_html[1])
            ) {
                $html = \str_replace($matches_after_html[0], $matches_after_html[1] . '</html>', $html);
            }
        }

        if (\strpos($html, '<script') !== false) {
            $this->html5FallbackForScriptTags($html);

            foreach ($this->specialScriptTags as $tag) {
                if (\strpos($html, $tag) !== false) {
                    $this->keepSpecialScriptTags($html);
                }
            }
        }

        // set error level
        $internalErrors = \libxml_use_internal_errors(true);
        $disableEntityLoader = \libxml_disable_entity_loader(true);
        \libxml_clear_errors();

        $optionsXml = \LIBXML_DTDLOAD | \LIBXML_DTDATTR | \LIBXML_NONET;

        if (\defined('LIBXML_BIGLINES')) {
            $optionsXml |= \LIBXML_BIGLINES;
        }

        if (\defined('LIBXML_COMPACT')) {
            $optionsXml |= \LIBXML_COMPACT;
        }

        if (\defined('LIBXML_HTML_NODEFDTD')) {
            $optionsXml |= \LIBXML_HTML_NODEFDTD;
        }

        if ($libXMLExtraOptions !== null) {
            $optionsXml |= $libXMLExtraOptions;
        }

        if (
            $this->isDOMDocumentCreatedWithoutWrapper
            ||
            $this->isDOMDocumentCreatedWithCommentWrapper
            ||
            $this->keepBrokenHtml
        ) {
            $html = '<' . self::$domHtmlWrapperHelper . '>' . $html . '</' . self::$domHtmlWrapperHelper . '>';
        }

        $html = self::replaceToPreserveHtmlEntities($html);

        $documentFound = false;
        $sxe = \simplexml_load_string($html, \SimpleXMLElement::class, $optionsXml);

Duplication: This code seems to be duplicated across your project.

        if ($sxe !== false && \count(\libxml_get_errors()) === 0) {
            $domElementTmp = \dom_import_simplexml($sxe);
            if (
                $domElementTmp
                &&
                $domElementTmp->ownerDocument
            ) {
                $documentFound = true;
                $this->document = $domElementTmp->ownerDocument;
            }
        }

Duplication: This code seems to be duplicated across your project.

        if ($documentFound === false) {

            // UTF-8 hack: http://php.net/manual/en/domdocument.loadhtml.php#95251
            $xmlHackUsed = false;
            /** @noinspection StringFragmentMisplacedInspection */
            if (\stripos('<?xml', $html) !== 0) {
                $xmlHackUsed = true;
                $html = '<?xml encoding="' . $this->getEncoding() . '" ?>' . $html;
            }

            $this->document->loadHTML($html, $optionsXml);

            // remove the "xml-encoding" hack
            if ($xmlHackUsed) {
                foreach ($this->document->childNodes as $child) {
                    if ($child->nodeType === \XML_PI_NODE) {
                        /** @noinspection UnusedFunctionResultInspection */
                        $this->document->removeChild($child);

                        break;
                    }
                }
            }
        }

        // set encoding
        $this->document->encoding = $this->getEncoding();

        // restore lib-xml settings
        \libxml_clear_errors();
        \libxml_use_internal_errors($internalErrors);
        \libxml_disable_entity_loader($disableEntityLoader);

        return $this->document;
    }

Documentation: The doc-type SimpleHtmlDomInterface|S...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 74.

This check marks PHPDoc comments that could not be parsed by our parser. To see which comment annotations we can parse, please refer to our documentation on supported doc-types.

Duplication: This method seems to be duplicated in your project.

    /**
     * Find list of nodes with a CSS selector.
     *
     * @param string   $selector
     * @param int|null $idx
     *
     * @return SimpleHtmlDomInterface|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
     */
    public function find(string $selector, $idx = null)
    {
        $xPathQuery = SelectorConverter::toXPath($selector);

        $xPath = new \DOMXPath($this->document);
        $nodesList = $xPath->query($xPathQuery);
        $elements = new SimpleHtmlDomNode();

        if ($nodesList) {
            foreach ($nodesList as $node) {
                $elements[] = new SimpleHtmlDom($node);
            }
        }

        // return all elements
        if ($idx === null) {
            if (\count($elements) === 0) {
                return new SimpleHtmlDomNodeBlank();
            }

            return $elements;
        }

        // handle negative values
        if ($idx < 0) {
            $idx = \count($elements) + $idx;
        }

        // return one element
        return $elements[$idx] ?? new SimpleHtmlDomBlank();
    }

Documentation: The doc-type SimpleHtmlDomInterface[]...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 51.

    /**
     * Find nodes with a CSS selector.
     *
     * @param string $selector
     *
     * @return SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
     */
    public function findMulti(string $selector): SimpleHtmlDomNodeInterface
    {
        return $this->find($selector, null);
    }

Documentation: The doc-type false|SimpleHtmlDomInter...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 57.

    /**
     * Find nodes with a CSS selector or false, if no element is found.
     *
     * @param string $selector
     *
     * @return false|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
     */
    public function findMultiOrFalse(string $selector)
    {
        $return = $this->find($selector, null);

        if ($return instanceof SimpleHtmlDomNodeBlank) {
            return false;
        }

        return $return;
    }

    /**
     * Find one node with a CSS selector.
     *
     * @param string $selector
     *
     * @return SimpleHtmlDomInterface
     */
    public function findOne(string $selector): SimpleHtmlDomInterface
    {
        return $this->find($selector, 0);
    }

    /**
     * Find one node with a CSS selector or false, if no element is found.
     *
     * @param string $selector
     *
     * @return false|SimpleHtmlDomInterface
     */
    public function findOneOrFalse(string $selector)
    {
        $return = $this->find($selector, 0);

        if ($return instanceof SimpleHtmlDomBlank) {
            return false;
        }

        return $return;
    }

    /**
     * @param string $content
     * @param bool   $multiDecodeNewHtmlEntity
     *
     * @return string
     */
    public function fixHtmlOutput(
        string $content,
        bool $multiDecodeNewHtmlEntity = false
    ): string {
        // INFO: DOMDocument will encapsulate plaintext into a e.g. paragraph tag (<p>),
        //          so we try to remove it here again ...

        if ($this->getIsDOMDocumentCreatedWithoutHtmlWrapper()) {
            /** @noinspection HtmlRequiredLangAttribute */
            $content = \str_replace(
                [
                    '<html>',
                    '</html>',
                ],
                '',
                $content
            );
        }

        if ($this->getIsDOMDocumentCreatedWithoutHeadWrapper()) {
            /** @noinspection HtmlRequiredTitleElement */
            $content = \str_replace(
                [
                    '<head>',
                    '</head>',
                ],
                '',
                $content
            );
        }

        if ($this->getIsDOMDocumentCreatedWithoutBodyWrapper()) {
            $content = \str_replace(
                [
                    '<body>',
                    '</body>',
                ],
                '',
                $content
            );
        }

        if ($this->getIsDOMDocumentCreatedWithFakeEndScript()) {
            $content = \str_replace(
                '</script>',
                '',
                $content
            );
        }

        if ($this->getIsDOMDocumentCreatedWithoutWrapper()) {
            $content = (string) \preg_replace('/^<p>/', '', $content);
            $content = (string) \preg_replace('/<\/p>/', '', $content);
        }

        if ($this->getIsDOMDocumentCreatedWithoutPTagWrapper()) {
            $content = \str_replace(
                [
                    '<p>',
                    '</p>',
                ],
                '',
                $content
            );
        }

        if ($this->getIsDOMDocumentCreatedWithoutHtml()) {
            $content = \str_replace(
                '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">',
                '',
                $content
            );
        }

        // https://bugs.php.net/bug.php?id=73175
        /** @noinspection HtmlRequiredTitleElement */
        $content = \trim(
            \str_replace(
                [
                    '</area>',
                    '</base>',
                    '</br>',
                    '</col>',
                    '</command>',
                    '</embed>',
                    '</hr>',
                    '</img>',
                    '</input>',
                    '</keygen>',
                    '</link>',
                    '</meta>',
                    '</param>',
                    '</source>',
                    '</track>',
                    '</wbr>',
                    '<simpleHtmlDomHtml>',
                    '</simpleHtmlDomHtml>',
                    '<simpleHtmlDomP>',
                    '</simpleHtmlDomP>',
                    '<head><head>',
                    '</head></head>',
                ],
                [
                    '',
                    '',
                    '',
                    '',
                    '',
                    '',
                    '',
                    '',
                    '',
                    '',
                    '',
                    '',
                    '',
                    '',
                    '',
                    '',
                    '',
                    '',
                    '',
                    '',
                    '<head>',
                    '</head>',
                ],
                $content
            )
        );

        $content = $this->decodeHtmlEntity($content, $multiDecodeNewHtmlEntity);

        return self::putReplacedBackToPreserveHtmlEntities($content);
    }

Documentation: The doc-type SimpleHtmlDomInterface[]...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 51.

    /**
     * Return elements by ".class".
     *
     * @param string $class
     *
     * @return SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
     */
    public function getElementByClass(string $class): SimpleHtmlDomNodeInterface
    {
        return $this->findMulti(".${class}");
    }

    /**
     * Return element by #id.
     *
     * @param string $id
     *
     * @return SimpleHtmlDomInterface
     */
    public function getElementById(string $id): SimpleHtmlDomInterface
    {
        return $this->findOne("#${id}");
    }

    /**
     * Return element by tag name.
     *
     * @param string $name
     *
     * @return SimpleHtmlDomInterface
     */
    public function getElementByTagName(string $name): SimpleHtmlDomInterface
    {
        $node = $this->document->getElementsByTagName($name)->item(0);

        if ($node === null) {
            return new SimpleHtmlDomBlank();
        }

        return new SimpleHtmlDom($node);
    }

Documentation: The doc-type SimpleHtmlDomInterface|S...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 74.

    /**
     * Returns elements by "#id".
     *
     * @param string   $id
     * @param int|null $idx
     *
     * @return SimpleHtmlDomInterface|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
     */
    public function getElementsById(string $id, $idx = null)
    {
        return $this->find("#${id}", $idx);
    }

Documentation: The doc-type SimpleHtmlDomInterface|S...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 74.

    /**
     * Returns elements by tag name.
     *
     * @param string   $name
     * @param int|null $idx
     *
     * @return SimpleHtmlDomInterface|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
     */
    public function getElementsByTagName(string $name, $idx = null)
    {
        $nodesList = $this->document->getElementsByTagName($name);

        $elements = new SimpleHtmlDomNode();

        foreach ($nodesList as $node) {
            $elements[] = new SimpleHtmlDom($node);
        }

        // return all elements
        if ($idx === null) {
            if (\count($elements) === 0) {
                return new SimpleHtmlDomNodeBlank();
            }

            return $elements;
        }

        // handle negative values
        if ($idx < 0) {
            $idx = \count($elements) + $idx;
        }

        // return one element
        return $elements[$idx] ?? new SimpleHtmlDomNodeBlank();
    }

    /**
     * Get dom node's outer html.
     *
     * @param bool $multiDecodeNewHtmlEntity
     *
     * @return string
     */
    public function html(bool $multiDecodeNewHtmlEntity = false): string
    {
        if (static::$callback !== null) {
            \call_user_func(static::$callback, [$this]);
        }

        if ($this->getIsDOMDocumentCreatedWithoutHtmlWrapper()) {
            $content = $this->document->saveHTML($this->document->documentElement);
        } else {
            $content = $this->document->saveHTML();
        }

        if ($content === false) {
            return '';
        }

        return $this->fixHtmlOutput($content, $multiDecodeNewHtmlEntity);
    }
776
777
    /**
     * Load HTML from string.
     *
     * @param string   $html
     * @param int|null $libXMLExtraOptions
     *
     * @return HtmlDomParser
     */
    public function loadHtml(string $html, $libXMLExtraOptions = null): DomParserInterface
    {
        // reset
        self::$domBrokenReplaceHelper = [];

        $this->document = $this->createDOMDocument($html, $libXMLExtraOptions);

        return $this;
Bug Best Practice introduced by
The return type of return $this; (voku\helper\HtmlDomParser) is incompatible with the return type declared by the interface voku\helper\DomParserInterface::loadHtml of type self.

If you return a value from a function or method, it should be a sub-type of the type given by the parent type, e.g. an interface or abstract method. This is more formally defined by the Liskov substitution principle, and guarantees that classes that depend on the parent type can use any instance of a child type interchangeably. This principle is also one of the SOLID principles of object-oriented design.

Let’s take a look at an example:

class Author {
    private $name;

    public function __construct($name) {
        $this->name = $name;
    }

    public function getName() {
        return $this->name;
    }
}

abstract class Post {
    public function getAuthor() {
        return 'Johannes';
    }
}

class BlogPost extends Post {
    public function getAuthor() {
        return new Author('Johannes');
    }
}

class ForumPost extends Post { /* ... */ }

function my_function(Post $post) {
    echo strtoupper($post->getAuthor());
}

Our function my_function expects a Post object, and outputs the author of the post. The base class Post returns a simple string and outputting a simple string will work just fine. However, the child class BlogPost which is a sub-type of Post instead decided to return an object, and is therefore violating the SOLID principles. If a BlogPost were passed to my_function, PHP would not complain, but ultimately fail when executing the strtoupper call in its body.
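
For the fluent-interface case flagged here, a minimal sketch of the pattern may help. It assumes a hypothetical `ParserInterface` whose method returns `self`, and a hypothetical `Parser` that implements it; neither is the library's code. Inside an interface, `self` refers to the interface itself, so an implementation that declares the interface as its return type and returns `$this` stays within the declared contract. The flagged method appears to follow the same shape, so the warning may be stricter than the runtime behaviour requires.

```php
<?php

declare(strict_types=1);

// Hypothetical miniature of an interface/implementation pair (illustration only).
interface ParserInterface
{
    /** @return self */
    public function load(string $html): self;
}

final class Parser implements ParserInterface
{
    /** @var string */
    private $html = '';

    // In the interface, `self` means ParserInterface. Declaring that type here
    // and returning $this is compatible, because Parser implements ParserInterface.
    public function load(string $html): ParserInterface
    {
        $this->html = $html;

        return $this;
    }

    public function length(): int
    {
        return \strlen($this->html);
    }
}

$parser = (new Parser())->load('<p>hi</p>');
\var_dump($parser instanceof ParserInterface);
```

Callers that only know `ParserInterface` can chain `load()` safely, which is the substitutability the principle asks for.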

    }

    /**
     * Load HTML from file.
     *
     * @param string   $filePath
     * @param int|null $libXMLExtraOptions
     *
     * @throws \RuntimeException
     *
     * @return HtmlDomParser
     */
    public function loadHtmlFile(string $filePath, $libXMLExtraOptions = null): DomParserInterface
Duplication introduced by
This method seems to be duplicated in your project.

Duplicated code is one of the most pungent code smells. If you need to duplicate the same code in three or more different places, we strongly encourage you to look into extracting the code into a single class or operation.

You can also find more detailed suggestions in the “Code” section of your repository.
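
As a rough illustration of such an extraction, the file-reading part of a method like this could move into a shared trait. The trait name `ReadsFileTrait`, its method `readFileOrFail()`, and the two loader classes are assumptions made for this sketch; they are not part of the project, and the optional UTF8 helper branch is left out for brevity.

```php
<?php

declare(strict_types=1);

// Hypothetical helper trait; name and method are assumptions, not library code.
trait ReadsFileTrait
{
    /**
     * @throws \RuntimeException
     */
    private function readFileOrFail(string $filePath): string
    {
        // Local paths must exist; http(s) URLs are passed through unchecked.
        if (!\preg_match('#^https?://#i', $filePath) && !\file_exists($filePath)) {
            throw new \RuntimeException("File {$filePath} not found");
        }

        // file_get_contents() returns false on failure; the warning is suppressed
        // here because the failure is reported through the exception instead.
        $content = @\file_get_contents($filePath);
        if ($content === false) {
            throw new \RuntimeException("Could not load file {$filePath}");
        }

        return $content;
    }
}

// Two hypothetical loaders sharing the same reading logic.
final class HtmlLoader
{
    use ReadsFileTrait;

    public function load(string $filePath): string
    {
        return $this->readFileOrFail($filePath);
    }
}

final class XmlLoader
{
    use ReadsFileTrait;

    public function load(string $filePath): string
    {
        return $this->readFileOrFail($filePath);
    }
}

$tmp = (string) \tempnam(\sys_get_temp_dir(), 'dom');
\file_put_contents($tmp, '<p>ok</p>');
$loaded = (new HtmlLoader())->load($tmp);
\unlink($tmp);

echo $loaded, "\n";
```

Each class keeps its own parsing step and only the shared existence check and read move into one place, which is usually the smallest change that removes this kind of duplication.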

    {
        // reset
        self::$domBrokenReplaceHelper = [];

        if (
            !\preg_match("/^https?:\/\//i", $filePath)
            &&
            !\file_exists($filePath)
        ) {
            throw new \RuntimeException("File {$filePath} not found");
        }

        try {
            if (\class_exists('\voku\helper\UTF8')) {
                /** @noinspection PhpUndefinedClassInspection */
                $html = UTF8::file_get_contents($filePath);
            } else {
                $html = \file_get_contents($filePath);
            }
        } catch (\Exception $e) {
            throw new \RuntimeException("Could not load file {$filePath}");
        }

        if ($html === false) {
            throw new \RuntimeException("Could not load file {$filePath}");
        }

        return $this->loadHtml($html, $libXMLExtraOptions);
    }

    /**
     * Get the HTML as XML or plain XML if needed.
     *
     * @param bool $multiDecodeNewHtmlEntity
     * @param bool $htmlToXml
     * @param bool $removeXmlHeader
     * @param int  $options
     *
     * @return string
     */
    public function xml(
Duplication introduced by
This method seems to be duplicated in your project.

        bool $multiDecodeNewHtmlEntity = false,
        bool $htmlToXml = true,
        bool $removeXmlHeader = true,
        int $options = \LIBXML_NOEMPTYTAG
    ): string {
        $xml = $this->document->saveXML(null, $options);
        if ($xml === false) {
            return '';
        }

        if ($removeXmlHeader) {
            $xml = \ltrim((string) \preg_replace('/<\?xml.*\?>/', '', $xml));
        }

        if ($htmlToXml) {
            $return = $this->fixHtmlOutput($xml, $multiDecodeNewHtmlEntity);
        } else {
            $xml = $this->decodeHtmlEntity($xml, $multiDecodeNewHtmlEntity);

            $return = self::putReplacedBackToPreserveHtmlEntities($xml);
        }

        return $return;
    }

    /**
     * @param string   $selector
     * @param int|null $idx
     *
     * @return SimpleHtmlDomInterface|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
Documentation introduced by
The doc-type SimpleHtmlDomInterface|S...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 74. (view supported doc-types)

     */
    public function __invoke($selector, $idx = null)
    {
        return $this->find($selector, $idx);
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutHeadWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutHeadWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutPTagWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutPTagWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutHtml(): bool
    {
        return $this->isDOMDocumentCreatedWithoutHtml;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutBodyWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutBodyWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutHtmlWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutHtmlWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithFakeEndScript(): bool
    {
        return $this->isDOMDocumentCreatedWithFakeEndScript;
    }

    /**
     * @param string $html
     *
     * @return string
     */
    protected function keepBrokenHtml(string $html): string
    {
        do {
            $original = $html;

            $html = (string) \preg_replace_callback(
                '/(?<start>.*)<(?<element_start>[a-z]+)(?<element_start_addon> [^>]*)?>(?<value>.*?)<\/(?<element_end>\2)>(?<end>.*)/sui',
                static function ($matches) {
                    return $matches['start'] .
                        '°lt_simple_html_dom__voku_°' . $matches['element_start'] . $matches['element_start_addon'] . '°gt_simple_html_dom__voku_°' .
                        $matches['value'] .
                        '°lt/_simple_html_dom__voku_°' . $matches['element_end'] . '°gt_simple_html_dom__voku_°' .
                        $matches['end'];
                },
                $html
            );
        } while ($original !== $html);

        do {
            $original = $html;

            $html = (string) \preg_replace_callback(
                '/(?<start>[^<]*)?(?<broken>(?:(?:<\/\w+(?:\s+\w+=\\"[^\"]+\\")*+)(?:[^<]+)>)+)(?<end>.*)/u',
                static function ($matches) {
                    $matches['broken'] = \str_replace(
                        ['°lt/_simple_html_dom__voku_°', '°lt_simple_html_dom__voku_°', '°gt_simple_html_dom__voku_°'],
                        ['</', '<', '>'],
                        $matches['broken']
                    );

                    self::$domBrokenReplaceHelper['orig'][] = $matches['broken'];
                    self::$domBrokenReplaceHelper['tmp'][] = $matchesHash = self::$domHtmlBrokenHtmlHelper . \crc32($matches['broken']);

                    return $matches['start'] . $matchesHash . $matches['end'];
                },
                $html
            );
        } while ($original !== $html);

        return \str_replace(
            ['°lt/_simple_html_dom__voku_°', '°lt_simple_html_dom__voku_°', '°gt_simple_html_dom__voku_°'],
            ['</', '<', '>'],
            $html
        );
    }

    /**
     * @param string $html
     *
     * @return void
     */
    protected function keepSpecialScriptTags(string &$html)
    {
        // regEx for e.g.: [<script id="elements-image-1" type="text/html">...</script>]
        $tags = \implode('|', \array_map(
            static function ($value) {
                return \preg_quote($value, '/');
            },
            $this->specialScriptTags
        ));
        $html = (string) \preg_replace_callback(
            '/(?<start>((?:<script) [^>]*type=(?:["\'])?(?:' . $tags . ')+(?:[^>]*)>))(?<innerContent>.*)(?<end><\/script>)/isU',
            function ($matches) {

                // Check for logic in special script tags, like [<% _.each(tierPrices, function(item, key) { %>],
                // because often this looks like non valid html in the template itself.
                foreach ($this->templateLogicSyntaxInSpecialScriptTags as $logicSyntaxInSpecialScriptTag) {
                    if (\strpos($matches['innerContent'], $logicSyntaxInSpecialScriptTag) !== false) {
                        // remove the html5 fallback
                        $matches['innerContent'] = \str_replace('<\/', '</', $matches['innerContent']);

                        self::$domBrokenReplaceHelper['orig'][] = $matches['innerContent'];
                        self::$domBrokenReplaceHelper['tmp'][] = $matchesHash = '' . self::$domHtmlBrokenHtmlHelper . '' . \crc32($matches['innerContent']);

                        return $matches['start'] . $matchesHash . $matches['end'];
                    }
                }

                // remove the html5 fallback
                $matches[0] = \str_replace('<\/', '</', $matches[0]);

                $specialNonScript = '<' . self::$domHtmlSpecialScriptHelper . \substr($matches[0], \strlen('<script'));

                return \substr($specialNonScript, 0, -\strlen('</script>')) . '</' . self::$domHtmlSpecialScriptHelper . '>';
            },
            $html
        );
    }

    /**
     * @param bool $keepBrokenHtml
     *
     * @return HtmlDomParser
     */
    public function useKeepBrokenHtml(bool $keepBrokenHtml): DomParserInterface
    {
        $this->keepBrokenHtml = $keepBrokenHtml;

        return $this;
    }

    /**
     * @param string[] $templateLogicSyntaxInSpecialScriptTags
     *
     * @return HtmlDomParser
     */
    public function overwriteTemplateLogicSyntaxInSpecialScriptTags(array $templateLogicSyntaxInSpecialScriptTags): DomParserInterface
    {
        foreach ($templateLogicSyntaxInSpecialScriptTags as $tmp) {
            if (!\is_string($tmp)) {
                throw new \InvalidArgumentException('overwriteTemplateLogicSyntaxInSpecialScriptTags only allows string[]');
            }
        }

        $this->templateLogicSyntaxInSpecialScriptTags = $templateLogicSyntaxInSpecialScriptTags;

        return $this;
    }

    /**
     * @param string[] $specialScriptTags
     *
     * @return HtmlDomParser
     */
    public function overwriteSpecialScriptTags(array $specialScriptTags): DomParserInterface
    {
        foreach ($specialScriptTags as $tag) {
            if (!\is_string($tag)) {
                throw new \InvalidArgumentException('SpecialScriptTags only allows string[]');
            }
        }

        $this->specialScriptTags = $specialScriptTags;

        return $this;
    }
}