Completed
Push — master ( 5b9105...a863ce )
by Lars
04:32
created

HtmlDomParser   F

Complexity

Total Complexity 117

Size/Duplication

Total Lines 970
Duplicated Lines 17.32 %

Coupling/Cohesion

Components 1
Dependencies 7

Test Coverage

Coverage 95.93%

Importance

Changes 0
Metric Value
dl 168
loc 970
ccs 283
cts 295
cp 0.9593
rs 1.6299
c 0
b 0
f 0
wmc 117
lcom 1
cbo 7

34 Methods

Rating   Name   Duplication   Size   Complexity  
A __construct() 28 28 5
A __call() 0 10 2
A __callStatic() 20 20 3
A __toString() 0 4 1
A clear() 0 4 1
B __get() 0 18 7
F createDOMDocument() 34 157 35
B find() 31 31 6
A findMulti() 0 4 1
A findMultiOrFalse() 0 10 2
A findOne() 0 4 1
A findOneOrFalse() 0 10 2
C fixHtmlOutput() 0 106 8
A getElementByClass() 0 4 1
A getElementById() 0 4 1
A getElementByTagName() 0 10 2
A getElementsById() 0 4 1
A getElementsByTagName() 0 27 5
A html() 0 18 4
A loadHtml() 0 9 1
B loadHtmlFile() 30 30 6
A xml() 25 25 4
A __invoke() 0 4 1
A getIsDOMDocumentCreatedWithoutHeadWrapper() 0 4 1
A getIsDOMDocumentCreatedWithoutPTagWrapper() 0 4 1
A getIsDOMDocumentCreatedWithoutHtml() 0 4 1
A getIsDOMDocumentCreatedWithoutBodyWrapper() 0 4 1
A getIsDOMDocumentCreatedWithoutHtmlWrapper() 0 4 1
A getIsDOMDocumentCreatedWithoutWrapper() 0 4 1
A getIsDOMDocumentCreatedWithFakeEndScript() 0 4 1
A keepBrokenHtml() 0 45 3
A keepSpecialScriptTags() 0 31 3
A useKeepBrokenHtml() 0 6 1
A overwriteTemplateLogicSyntaxInSpecialScriptTags() 0 12 3

How to fix   Duplicated Code    Complexity   

Duplicated Code

Duplicated code is one of the most pungent code smells. A common rule of thumb is to restructure code once it is duplicated in three or more places.
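As a minimal sketch of such a restructuring: `createDOMDocument()` in the listing below repeats the same pair of `strpos()` checks for the `html`, `body`, `head`, and `p` tags. Assuming a hypothetical helper named `containsOpeningTag()` (it is not part of the analyzed class), the four checks could share one implementation:

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of HtmlDomParser): reports whether $html
 * contains an opening tag such as "<body>" or "<body class=...>".
 */
function containsOpeningTag(string $html, string $tag): bool
{
    return \strpos($html, '<' . $tag . ' ') !== false
        || \strpos($html, '<' . $tag . '>') !== false;
}

$html = '<p class="intro">Hello</p>';

// Each flag mirrors one of the duplicated if-blocks in createDOMDocument().
$withoutHtmlWrapper = !containsOpeningTag($html, 'html');
$withoutBodyWrapper = !containsOpeningTag($html, 'body');
$withoutHeadWrapper = !containsOpeningTag($html, 'head');
$withoutPTagWrapper = !containsOpeningTag($html, 'p');
```

The helper keeps the original matching rule (tag name followed by a space or by `>`), so a tag like `<pre>` is not mistaken for `<p>`.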

Common duplication problems, and corresponding solutions are:

Complex Class

Tip: Before tackling complexity, make sure that you eliminate any duplication first. This can often reduce the size of classes significantly.

Complex classes like HtmlDomParser often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to finding such a component is to look for fields and methods that share the same prefixes or suffixes. You can also look at the cohesion graph to spot any unconnected or weakly connected components.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

While breaking up the class, it is a good idea to analyze how other classes use HtmlDomParser, and based on these observations, apply Extract Interface, too.
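To illustrate the prefix heuristic on this class: the fields named `isDOMDocumentCreatedWith...` form one such group. The sketch below shows what an Extract Class step might look like for three of them. The class name `DocumentCreationFlags`, its property names, and the `fromHtml()` factory are assumptions made for illustration; nothing like this exists in the analyzed project.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical value object that groups some of the
 * "isDOMDocumentCreatedWith..." flags currently held by HtmlDomParser.
 */
final class DocumentCreationFlags
{
    /** @var bool */
    public $withoutHtml;

    /** @var bool */
    public $withoutHtmlWrapper;

    /** @var bool */
    public $withoutBodyWrapper;

    private function __construct(bool $withoutHtml, bool $withoutHtmlWrapper, bool $withoutBodyWrapper)
    {
        $this->withoutHtml = $withoutHtml;
        $this->withoutHtmlWrapper = $withoutHtmlWrapper;
        $this->withoutBodyWrapper = $withoutBodyWrapper;
    }

    /**
     * Derive all flags from the raw input in one place.
     */
    public static function fromHtml(string $html): self
    {
        return new self(
            \strpos($html, '<') === false,
            !self::hasOpeningTag($html, 'html'),
            !self::hasOpeningTag($html, 'body')
        );
    }

    private static function hasOpeningTag(string $html, string $tag): bool
    {
        return \strpos($html, '<' . $tag . ' ') !== false
            || \strpos($html, '<' . $tag . '>') !== false;
    }
}

$flags = DocumentCreationFlags::fromHtml('<div>foo</div>');
```

With the flags in their own class, the parser would hold a single `DocumentCreationFlags` instance instead of one boolean field and one getter per flag.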

<?php

declare(strict_types=1);

namespace voku\helper;

/**
 * @property-read string $outerText
 *                                 <p>Get dom node's outer html (alias for "outerHtml").</p>
 * @property-read string $outerHtml
 *                                 <p>Get dom node's outer html.</p>
 * @property-read string $innerText
 *                                 <p>Get dom node's inner html (alias for "innerHtml").</p>
 * @property-read string $innerHtml
 *                                 <p>Get dom node's inner html.</p>
 * @property-read string $plaintext
 *                                 <p>Get dom node's plain text.</p>
 *
 * @method string outerText()
 *                                 <p>Get dom node's outer html (alias for "outerHtml()").</p>
 * @method string outerHtml()
 *                                 <p>Get dom node's outer html.</p>
 * @method string innerText()
 *                                 <p>Get dom node's inner html (alias for "innerHtml()").</p>
 * @method HtmlDomParser load(string $html)
 *                                 <p>Load HTML from string.</p>
 * @method HtmlDomParser load_file(string $html)
 *                                 <p>Load HTML from file.</p>
 * @method static HtmlDomParser file_get_html($filePath, $libXMLExtraOptions = null)
 *                                 <p>Load HTML from file.</p>
 * @method static HtmlDomParser str_get_html($html, $libXMLExtraOptions = null)
 *                                 <p>Load HTML from string.</p>
 */
class HtmlDomParser extends AbstractDomParser
{
    /**
     * @var string[]
     */
    protected static $functionAliases = [
        'outertext' => 'html',
        'outerhtml' => 'html',
        'innertext' => 'innerHtml',
        'innerhtml' => 'innerHtml',
        'load'      => 'loadHtml',
        'load_file' => 'loadHtmlFile',
    ];

    /**
     * @var string[]
     */
    protected $templateLogicSyntaxInSpecialScriptTags = [
        '+',
        '<%',
        '{%',
        '{{',
    ];

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutHtml = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithCommentWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutHeadWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutPTagWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutHtmlWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutBodyWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithFakeEndScript = false;

    /**
     * @var bool
     */
    protected $keepBrokenHtml;

    /**
     * @param \DOMNode|SimpleHtmlDomInterface|string $element HTML code or SimpleHtmlDomInterface, \DOMNode
     */
    public function __construct($element = null)
Duplication: This method seems to be duplicated in your project.

Duplicated code is one of the most pungent code smells. If you need to duplicate the same code in three or more different places, we strongly encourage you to look into extracting the code into a single class or operation.

You can also find more detailed suggestions in the “Code” section of your repository.

    {
        $this->document = new \DOMDocument('1.0', $this->getEncoding());

        // DOMDocument settings
        $this->document->preserveWhiteSpace = true;
        $this->document->formatOutput = true;

        if ($element instanceof SimpleHtmlDomInterface) {
            $element = $element->getNode();
        }

        if ($element instanceof \DOMNode) {
            $domNode = $this->document->importNode($element, true);

            if ($domNode instanceof \DOMNode) {
                /** @noinspection UnusedFunctionResultInspection */
                $this->document->appendChild($domNode);
            }

            return;
        }

        if ($element !== null) {
            /** @noinspection UnusedFunctionResultInspection */
            $this->loadHtml($element);
        }
    }

    /**
     * @param string $name
     * @param array  $arguments
     *
     * @return bool|mixed
     */
    public function __call($name, $arguments)
    {
        $name = \strtolower($name);

        if (isset(self::$functionAliases[$name])) {
            return \call_user_func_array([$this, self::$functionAliases[$name]], $arguments);
        }

        throw new \BadMethodCallException('Method does not exist: ' . $name);
    }

    /**
     * @param string $name
     * @param array  $arguments
     *
     * @throws \BadMethodCallException
     * @throws \RuntimeException
     *
     * @return HtmlDomParser
     */
    public static function __callStatic($name, $arguments)
Duplication: This method seems to be duplicated in your project.
    {
        $arguments0 = $arguments[0] ?? '';

        $arguments1 = $arguments[1] ?? null;

        if ($name === 'str_get_html') {
            $parser = new static();

            return $parser->loadHtml($arguments0, $arguments1);
        }

        if ($name === 'file_get_html') {
            $parser = new static();

            return $parser->loadHtmlFile($arguments0, $arguments1);
        }

        throw new \BadMethodCallException('Method does not exist');
    }

    /** @noinspection MagicMethodsValidityInspection */

    /**
     * @param string $name
     *
     * @return string|null
     */
    public function __get($name)
    {
        $name = \strtolower($name);

        switch ($name) {
            case 'outerhtml':
            case 'outertext':
                return $this->html();
            case 'innerhtml':
            case 'innertext':
                return $this->innerHtml();
            case 'text':
            case 'plaintext':
                return $this->text();
        }

        return null;
    }

    /**
     * @return string
     */
    public function __toString()
    {
        return $this->html();
    }

    /**
     * does nothing (only for api-compatibility-reasons)
     *
     * @return bool
     *
     * @deprecated
     */
    public function clear(): bool
    {
        return true;
    }

    /**
     * Create DOMDocument from HTML.
     *
     * @param string   $html
     * @param int|null $libXMLExtraOptions
     *
     * @return \DOMDocument
     */
    protected function createDOMDocument(string $html, $libXMLExtraOptions = null): \DOMDocument
    {
        if ($this->keepBrokenHtml) {
            $html = $this->keepBrokenHtml(\trim($html));
        }

        if (\strpos($html, '<') === false) {
            $this->isDOMDocumentCreatedWithoutHtml = true;
        } elseif (\strpos(\ltrim($html), '<') !== 0) {
            $this->isDOMDocumentCreatedWithoutWrapper = true;
        }

        if (\strpos(\ltrim($html), '<!--') === 0) {
            $this->isDOMDocumentCreatedWithCommentWrapper = true;
        }

        /** @noinspection HtmlRequiredLangAttribute */
        if (
            \strpos($html, '<html ') === false
            &&
            \strpos($html, '<html>') === false
        ) {
            $this->isDOMDocumentCreatedWithoutHtmlWrapper = true;
        }

        if (
            \strpos($html, '<body ') === false
            &&
            \strpos($html, '<body>') === false
        ) {
            $this->isDOMDocumentCreatedWithoutBodyWrapper = true;
        }

        /** @noinspection HtmlRequiredTitleElement */
        if (
            \strpos($html, '<head ') === false
            &&
            \strpos($html, '<head>') === false
        ) {
            $this->isDOMDocumentCreatedWithoutHeadWrapper = true;
        }

        /** @noinspection HtmlRequiredTitleElement */
        if (
            \strpos($html, '<p ') === false
            &&
            \strpos($html, '<p>') === false
        ) {
            $this->isDOMDocumentCreatedWithoutPTagWrapper = true;
        }

        if (
            \strpos($html, '</script>') === false
            &&
            \strpos($html, '<\/script>') !== false
        ) {
            $this->isDOMDocumentCreatedWithFakeEndScript = true;
        }

        if (\strpos($html, '<script') !== false) {
            $this->html5FallbackForScriptTags($html);

            if (
                \strpos($html, 'text/html') !== false
                ||
                \strpos($html, 'text/x-custom-template') !== false
                ||
                \strpos($html, 'text/x-handlebars-template') !== false
            ) {
                $this->keepSpecialScriptTags($html);
            }
        }

        // set error level
        $internalErrors = \libxml_use_internal_errors(true);
        $disableEntityLoader = \libxml_disable_entity_loader(true);
        \libxml_clear_errors();

        $optionsXml = \LIBXML_DTDLOAD | \LIBXML_DTDATTR | \LIBXML_NONET;

        if (\defined('LIBXML_BIGLINES')) {
            $optionsXml |= \LIBXML_BIGLINES;
        }

        if (\defined('LIBXML_COMPACT')) {
            $optionsXml |= \LIBXML_COMPACT;
        }

        if (\defined('LIBXML_HTML_NODEFDTD')) {
            $optionsXml |= \LIBXML_HTML_NODEFDTD;
        }

        if ($libXMLExtraOptions !== null) {
            $optionsXml |= $libXMLExtraOptions;
        }

        if (
            $this->isDOMDocumentCreatedWithoutWrapper
            ||
            $this->isDOMDocumentCreatedWithCommentWrapper
            ||
            $this->keepBrokenHtml
        ) {
            $html = '<' . self::$domHtmlWrapperHelper . '>' . $html . '</' . self::$domHtmlWrapperHelper . '>';
        }

        $html = self::replaceToPreserveHtmlEntities($html);

        $documentFound = false;
        $sxe = \simplexml_load_string($html, \SimpleXMLElement::class, $optionsXml);
        if ($sxe !== false && \count(\libxml_get_errors()) === 0) {
Duplication: This code seems to be duplicated across your project.
            $domElementTmp = \dom_import_simplexml($sxe);
            if (
                $domElementTmp
                &&
                $domElementTmp->ownerDocument !== null
            ) {
                $documentFound = true;
                $this->document = $domElementTmp->ownerDocument;
            }
        }

        if ($documentFound === false) {
Duplication: This code seems to be duplicated across your project.

            // UTF-8 hack: http://php.net/manual/en/domdocument.loadhtml.php#95251
            $xmlHackUsed = false;
            /** @noinspection StringFragmentMisplacedInspection */
            if (\stripos('<?xml', $html) !== 0) {
                $xmlHackUsed = true;
                $html = '<?xml encoding="' . $this->getEncoding() . '" ?>' . $html;
            }

            $this->document->loadHTML($html, $optionsXml);

            // remove the "xml-encoding" hack
            if ($xmlHackUsed) {
                foreach ($this->document->childNodes as $child) {
                    if ($child->nodeType === \XML_PI_NODE) {
                        /** @noinspection UnusedFunctionResultInspection */
                        $this->document->removeChild($child);

                        break;
                    }
                }
            }
        }

        // set encoding
        $this->document->encoding = $this->getEncoding();

        // restore lib-xml settings
        \libxml_clear_errors();
        \libxml_use_internal_errors($internalErrors);
        \libxml_disable_entity_loader($disableEntityLoader);

        return $this->document;
    }

    /**
     * Find list of nodes with a CSS selector.
     *
     * @param string   $selector
     * @param int|null $idx
     *
     * @return SimpleHtmlDomInterface|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
Documentation: The doc-type SimpleHtmlDomInterface|S...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 74.

This check marks PHPDoc comments that could not be parsed by our parser. To see which comment annotations we can parse, please refer to our documentation on supported doc-types.

     */
    public function find(string $selector, $idx = null)
Duplication: This method seems to be duplicated in your project.
    {
        $xPathQuery = SelectorConverter::toXPath($selector);

        $xPath = new \DOMXPath($this->document);
        $nodesList = $xPath->query($xPathQuery);
        $elements = new SimpleHtmlDomNode();

        if ($nodesList) {
            foreach ($nodesList as $node) {
                $elements[] = new SimpleHtmlDom($node);
            }
        }

        // return all elements
        if ($idx === null) {
            if (\count($elements) === 0) {
                return new SimpleHtmlDomNodeBlank();
            }

            return $elements;
        }

        // handle negative values
        if ($idx < 0) {
            $idx = \count($elements) + $idx;
        }

        // return one element
        return $elements[$idx] ?? new SimpleHtmlDomBlank();
    }

    /**
     * Find nodes with a CSS selector.
     *
     * @param string $selector
     *
     * @return SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
Documentation: The doc-type SimpleHtmlDomInterface[]...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 51.
     */
    public function findMulti(string $selector): SimpleHtmlDomNodeInterface
    {
        return $this->find($selector, null);
    }

    /**
     * Find nodes with a CSS selector or false, if no element is found.
     *
     * @param string $selector
     *
     * @return false|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
Documentation: The doc-type false|SimpleHtmlDomInter...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 57.
     */
    public function findMultiOrFalse(string $selector)
    {
        $return = $this->find($selector, null);

        if ($return instanceof SimpleHtmlDomNodeBlank) {
            return false;
        }

        return $return;
    }

    /**
     * Find one node with a CSS selector.
     *
     * @param string $selector
     *
     * @return SimpleHtmlDomInterface
     */
    public function findOne(string $selector): SimpleHtmlDomInterface
    {
        return $this->find($selector, 0);
    }

    /**
     * Find one node with a CSS selector or false, if no element is found.
     *
     * @param string $selector
     *
     * @return false|SimpleHtmlDomInterface
     */
    public function findOneOrFalse(string $selector)
    {
        $return = $this->find($selector, 0);

        if ($return instanceof SimpleHtmlDomBlank) {
            return false;
        }

        return $return;
    }

    /**
     * @param string $content
     * @param bool   $multiDecodeNewHtmlEntity
     *
     * @return string
     */
    public function fixHtmlOutput(
        string $content,
        bool $multiDecodeNewHtmlEntity = false
    ): string {
        // INFO: DOMDocument will encapsulate plaintext into a e.g. paragraph tag (<p>),
        //          so we try to remove it here again ...

        if ($this->getIsDOMDocumentCreatedWithoutHtmlWrapper()) {
            /** @noinspection HtmlRequiredLangAttribute */
            $content = \str_replace(
                [
                    '<html>',
                    '</html>',
                ],
                '',
                $content
            );
        }

        if ($this->getIsDOMDocumentCreatedWithoutHeadWrapper()) {
            /** @noinspection HtmlRequiredTitleElement */
            $content = \str_replace(
                [
                    '<head>',
                    '</head>',
                ],
                '',
                $content
            );
        }

        if ($this->getIsDOMDocumentCreatedWithoutBodyWrapper()) {
            /** @noinspection HtmlRequiredLangAttribute */
            $content = \str_replace(
                [
                    '<body>',
                    '</body>',
                ],
                '',
                $content
            );
        }

        if ($this->getIsDOMDocumentCreatedWithFakeEndScript()) {
            $content = \str_replace(
                '</script>',
                '',
                $content
            );
        }

        if ($this->getIsDOMDocumentCreatedWithoutWrapper()) {
            $content = (string) \preg_replace('/^<p>/', '', $content);
            $content = (string) \preg_replace('/<\/p>/', '', $content);
        }

        if ($this->getIsDOMDocumentCreatedWithoutPTagWrapper()) {
            $content = \str_replace(
                [
                    '<p>',
                    '</p>',
                ],
                '',
                $content
            );
        }

        if ($this->getIsDOMDocumentCreatedWithoutHtml()) {
            $content = \str_replace(
                '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">',
                '',
                $content
            );
        }

        /** @noinspection CheckTagEmptyBody */
        /** @noinspection HtmlExtraClosingTag */
        /** @noinspection HtmlRequiredTitleElement */
        $content = \trim(
            \str_replace(
                [
                    '<simpleHtmlDomHtml>',
                    '</simpleHtmlDomHtml>',
                    '<simpleHtmlDomP>',
                    '</simpleHtmlDomP>',
                    '<head><head>',
                    '</head></head>',
                    '<br></br>',
                ],
                [
                    '',
                    '',
                    '',
                    '',
                    '<head>',
                    '</head>',
                    '<br>',
                ],
                $content
            )
        );

        $content = $this->decodeHtmlEntity($content, $multiDecodeNewHtmlEntity);

        return self::putReplacedBackToPreserveHtmlEntities($content);
    }

    /**
     * Return elements by ".class".
     *
     * @param string $class
     *
     * @return SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
Documentation: The doc-type SimpleHtmlDomInterface[]...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 51.
     */
    public function getElementByClass(string $class): SimpleHtmlDomNodeInterface
    {
        return $this->findMulti(".${class}");
    }

    /**
     * Return element by #id.
     *
     * @param string $id
     *
     * @return SimpleHtmlDomInterface
     */
    public function getElementById(string $id): SimpleHtmlDomInterface
    {
        return $this->findOne("#${id}");
    }

    /**
     * Return element by tag name.
     *
     * @param string $name
     *
     * @return SimpleHtmlDomInterface
     */
    public function getElementByTagName(string $name): SimpleHtmlDomInterface
    {
        $node = $this->document->getElementsByTagName($name)->item(0);

        if ($node === null) {
            return new SimpleHtmlDomBlank();
        }

        return new SimpleHtmlDom($node);
    }

    /**
     * Returns elements by "#id".
     *
     * @param string   $id
     * @param int|null $idx
     *
     * @return SimpleHtmlDomInterface|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
Documentation: The doc-type SimpleHtmlDomInterface|S...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 74.
     */
    public function getElementsById(string $id, $idx = null)
    {
        return $this->find("#${id}", $idx);
    }

    /**
     * Returns elements by tag name.
     *
     * @param string   $name
     * @param int|null $idx
     *
     * @return SimpleHtmlDomInterface|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
Documentation: The doc-type SimpleHtmlDomInterface|S...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 74.
     */
    public function getElementsByTagName(string $name, $idx = null)
    {
        $nodesList = $this->document->getElementsByTagName($name);

        $elements = new SimpleHtmlDomNode();

        foreach ($nodesList as $node) {
            $elements[] = new SimpleHtmlDom($node);
        }

        // return all elements
        if ($idx === null) {
            if (\count($elements) === 0) {
                return new SimpleHtmlDomNodeBlank();
            }

            return $elements;
        }

        // handle negative values
        if ($idx < 0) {
            $idx = \count($elements) + $idx;
        }

        // return one element
        return $elements[$idx] ?? new SimpleHtmlDomNodeBlank();
    }

    /**
     * Get dom node's outer html.
     *
     * @param bool $multiDecodeNewHtmlEntity
     *
     * @return string
     */
    public function html(bool $multiDecodeNewHtmlEntity = false): string
    {
        if (static::$callback !== null) {
            \call_user_func(static::$callback, [$this]);
        }

        if ($this->getIsDOMDocumentCreatedWithoutHtmlWrapper()) {
            $content = $this->document->saveHTML($this->document->documentElement);
        } else {
            $content = $this->document->saveHTML();
        }

        if ($content === false) {
            return '';
        }

        return $this->fixHtmlOutput($content, $multiDecodeNewHtmlEntity);
    }

    /**
     * Load HTML from string.
     *
     * @param string   $html
     * @param int|null $libXMLExtraOptions
     *
     * @return HtmlDomParser
     */
    public function loadHtml(string $html, $libXMLExtraOptions = null): DomParserInterface
    {
        // reset
        self::$domBrokenReplaceHelper = [];

        $this->document = $this->createDOMDocument($html, $libXMLExtraOptions);

        return $this;
Bug Best Practice: The return type of return $this; (voku\helper\HtmlDomParser) is incompatible with the return type declared by the interface voku\helper\DomParserInterface::loadHtml of type self.

If you return a value from a function or method, it should be a sub-type of the type that is given by the parent type, for example an interface or abstract method. This is more formally defined by the Liskov substitution principle, and guarantees that classes that depend on the parent type can use any instance of a child type interchangeably. This principle also belongs to the SOLID principles for object-oriented design.

Let’s take a look at an example:

class Author {
    private $name;

    public function __construct($name) {
        $this->name = $name;
    }

    public function getName() {
        return $this->name;
    }
}

abstract class Post {
    public function getAuthor() {
        return 'Johannes';
    }
}

class BlogPost extends Post {
    public function getAuthor() {
        return new Author('Johannes');
    }
}

class ForumPost extends Post { /* ... */ }

function my_function(Post $post) {
    echo strtoupper($post->getAuthor());
}

Our function my_function expects a Post object and outputs the author of the post. The base class Post returns a simple string, and outputting a simple string works just fine. However, the child class BlogPost, which is a sub-type of Post, returns an object instead and therefore violates the Liskov substitution principle. If a BlogPost were passed to my_function, PHP would not complain at the call site, but the strtoupper call in the function body would ultimately fail.
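One way to repair the example above is to declare the return type on the parent and have every child honor it. The following is a minimal sketch that reuses the class names from the example; the typed properties and the constructor on BlogPost are assumptions added for illustration and require PHP 7.4 or later.

```php
<?php
declare(strict_types=1);

final class Author
{
    private string $name;

    public function __construct(string $name)
    {
        $this->name = $name;
    }

    public function getName(): string
    {
        return $this->name;
    }
}

abstract class Post
{
    // The declared return type is the contract every subclass must honor.
    public function getAuthor(): string
    {
        return 'Johannes';
    }
}

class BlogPost extends Post
{
    private Author $author;

    public function __construct(Author $author)
    {
        $this->author = $author;
    }

    // Still returns a string, so callers written against Post keep working.
    public function getAuthor(): string
    {
        return $this->author->getName();
    }
}

function my_function(Post $post): string
{
    return strtoupper($post->getAuthor());
}

echo my_function(new BlogPost(new Author('Johannes'))), PHP_EOL;
```

With the return type declared on Post, PHP rejects an incompatible override when the class is declared, so the mismatch surfaces before my_function is ever called.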

    }

    /**
     * Load HTML from file.
     *
     * @param string   $filePath
     * @param int|null $libXMLExtraOptions
     *
     * @throws \RuntimeException
     *
     * @return HtmlDomParser
     */
    public function loadHtmlFile(string $filePath, $libXMLExtraOptions = null): DomParserInterface
Duplication
This method seems to be duplicated in your project.

Duplicated code is one of the most pungent code smells. If you need to duplicate the same code in three or more different places, we strongly encourage you to look into extracting the code into a single class or operation.

You can also find more detailed suggestions in the “Code” section of your repository.
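As a rough illustration of "extracting the code into a single operation", the file-reading part of this method could live in a shared trait. This is only a sketch: the report does not say where the duplicate lives, so the trait name ReadsMarkupFile, the method readMarkupFile, and the ExampleHtmlLoader class are all hypothetical, and the optional UTF8 helper branch is left out.

```php
<?php
declare(strict_types=1);

// Hypothetical trait: a single home for the "read file or fail" logic.
trait ReadsMarkupFile
{
    protected function readMarkupFile(string $filePath): string
    {
        $isUrl = \preg_match('#^https?://#i', $filePath) === 1;
        if (!$isUrl && !\file_exists($filePath)) {
            throw new \RuntimeException("File {$filePath} not found");
        }

        // Suppress the warning; a false result becomes an exception below.
        $content = @\file_get_contents($filePath);
        if ($content === false) {
            throw new \RuntimeException("Could not load file {$filePath}");
        }

        return $content;
    }
}

// Hypothetical consumer standing in for one of the parser classes.
final class ExampleHtmlLoader
{
    use ReadsMarkupFile;

    public function load(string $filePath): string
    {
        return $this->readMarkupFile($filePath);
    }
}

// Usage: write a temporary file, read it back through the shared helper.
$tmp = \tempnam(\sys_get_temp_dir(), 'dom');
\file_put_contents($tmp, '<p>hi</p>');
echo (new ExampleHtmlLoader())->load($tmp), PHP_EOL;
\unlink($tmp);
```

Each parser would then keep only its own parsing call and delegate the existence check and error handling to the trait.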

    {
        // reset
        self::$domBrokenReplaceHelper = [];

        if (
            !\preg_match("/^https?:\/\//i", $filePath)
            &&
            !\file_exists($filePath)
        ) {
            // "{$var}" replaces the "${var}" form, which is deprecated as of PHP 8.2
            throw new \RuntimeException("File {$filePath} not found");
        }

        try {
            if (\class_exists('\voku\helper\UTF8')) {
                /** @noinspection PhpUndefinedClassInspection */
                $html = UTF8::file_get_contents($filePath);
            } else {
                $html = \file_get_contents($filePath);
            }
        } catch (\Exception $e) {
            throw new \RuntimeException("Could not load file {$filePath}");
        }

        if ($html === false) {
            throw new \RuntimeException("Could not load file {$filePath}");
        }

        return $this->loadHtml($html, $libXMLExtraOptions);
    }

    /**
     * Get the HTML as XML or plain XML if needed.
     *
     * @param bool $multiDecodeNewHtmlEntity
     * @param bool $htmlToXml
     * @param bool $removeXmlHeader
     * @param int  $options
     *
     * @return string
     */
    public function xml(
Duplication
This method seems to be duplicated in your project.
        bool $multiDecodeNewHtmlEntity = false,
        bool $htmlToXml = true,
        bool $removeXmlHeader = true,
        int $options = \LIBXML_NOEMPTYTAG
    ): string {
        $xml = $this->document->saveXML(null, $options);
        if ($xml === false) {
            return '';
        }

        if ($removeXmlHeader) {
            $xml = \ltrim((string) \preg_replace('/<\?xml.*\?>/', '', $xml));
        }

        if ($htmlToXml) {
            $return = $this->fixHtmlOutput($xml, $multiDecodeNewHtmlEntity);
        } else {
            $xml = $this->decodeHtmlEntity($xml, $multiDecodeNewHtmlEntity);

            $return = self::putReplacedBackToPreserveHtmlEntities($xml);
        }

        return $return;
    }

    /**
     * @param string   $selector
     * @param int|null $idx
     *
     * @return SimpleHtmlDomInterface|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
Documentation
The doc-type SimpleHtmlDomInterface|S...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 74.

This check marks PHPDoc comments that could not be parsed by our parser. To see which comment annotations we can parse, please refer to our documentation on supported doc-types.

     */
    public function __invoke($selector, $idx = null)
    {
        return $this->find($selector, $idx);
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutHeadWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutHeadWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutPTagWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutPTagWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutHtml(): bool
    {
        return $this->isDOMDocumentCreatedWithoutHtml;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutBodyWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutBodyWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutHtmlWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutHtmlWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithFakeEndScript(): bool
    {
        return $this->isDOMDocumentCreatedWithFakeEndScript;
    }

    /**
     * @param string $html
     *
     * @return string
     */
    protected function keepBrokenHtml(string $html): string
    {
        // Pass 1: mask each well-formed <tag>...</tag> pair with placeholder
        // tokens, repeating until the input no longer changes.
        do {
            $original = $html;

            $html = (string) \preg_replace_callback(
                '/(?<start>.*)<(?<element_start>[a-z]+)(?<element_start_addon> [^>]*)?>(?<value>.*?)<\/(?<element_end>\2)>(?<end>.*)/sui',
                static function ($matches) {
                    return $matches['start'] .
                           '°lt_simple_html_dom__voku_°' . $matches['element_start'] . $matches['element_start_addon'] . '°gt_simple_html_dom__voku_°' .
                           $matches['value'] .
                           '°lt/_simple_html_dom__voku_°' . $matches['element_end'] . '°gt_simple_html_dom__voku_°' .
                           $matches['end'];
                },
                $html
            );
        } while ($original !== $html);

        // Pass 2: closing-tag fragments that are still unmasked are treated as
        // broken; remember each original and swap in a hash token.
        do {
            $original = $html;

            $html = (string) \preg_replace_callback(
                '/(?<start>[^<]*)?(?<broken>(?:(?:<\/\w+(?:\s+\w+=\\"[^\"]+\\")*+)(?:[^<]+)>)+)(?<end>.*)/u',
                static function ($matches) {
                    $matches['broken'] = \str_replace(
                        ['°lt/_simple_html_dom__voku_°', '°lt_simple_html_dom__voku_°', '°gt_simple_html_dom__voku_°'],
                        ['</', '<', '>'],
                        $matches['broken']
                    );

                    self::$domBrokenReplaceHelper['orig'][] = $matches['broken'];
                    self::$domBrokenReplaceHelper['tmp'][] = $matchesHash = self::$domHtmlBrokenHtmlHelper . \crc32($matches['broken']);

                    return $matches['start'] . $matchesHash . $matches['end'];
                },
                $html
            );
        } while ($original !== $html);

        // Restore the masked well-formed tags.
        return \str_replace(
            ['°lt/_simple_html_dom__voku_°', '°lt_simple_html_dom__voku_°', '°gt_simple_html_dom__voku_°'],
            ['</', '<', '>'],
            $html
        );
    }

    /**
     * @param string $html
     *
     * @return void
     */
    protected function keepSpecialScriptTags(string &$html)
    {
        // regEx for e.g.: [<script id="elements-image-1" type="text/html">...</script>]
        $html = (string) \preg_replace_callback(
            '/(?<start>((?:<script) [^>]*type=(?:["\'])?(?:text\/html|text\/x-custom-template|text\/x-handlebars-template)+(?:[^>]*)>))(?<innerContent>.*)(?<end><\/script>)/isU',
            function ($matches) {

                // Check for logic in special script tags, like [<% _.each(tierPrices, function(item, key) { %>],
                // because often this looks like non valid html in the template itself.
                foreach ($this->templateLogicSyntaxInSpecialScriptTags as $logicSyntaxInSpecialScriptTag) {
                    if (\strpos($matches['innerContent'], $logicSyntaxInSpecialScriptTag) !== false) {
                        // remove the html5 fallback
                        $matches['innerContent'] = \str_replace('<\/', '</', $matches['innerContent']);

                        self::$domBrokenReplaceHelper['orig'][] = $matches['innerContent'];
                        self::$domBrokenReplaceHelper['tmp'][] = $matchesHash = self::$domHtmlBrokenHtmlHelper . \crc32($matches['innerContent']);

                        return $matches['start'] . $matchesHash . $matches['end'];
                    }
                }

                // remove the html5 fallback
                $matches[0] = \str_replace('<\/', '</', $matches[0]);

                $specialNonScript = '<' . self::$domHtmlSpecialScriptHelper . \substr($matches[0], \strlen('<script'));

                return \substr($specialNonScript, 0, -\strlen('</script>')) . '</' . self::$domHtmlSpecialScriptHelper . '>';
            },
            $html
        );
    }

    /**
     * @param bool $keepBrokenHtml
     *
     * @return HtmlDomParser
     */
    public function useKeepBrokenHtml(bool $keepBrokenHtml): DomParserInterface
    {
        $this->keepBrokenHtml = $keepBrokenHtml;

        return $this;
    }

    /**
     * @param string[] $templateLogicSyntaxInSpecialScriptTags
     *
     * @return HtmlDomParser
     */
    public function overwriteTemplateLogicSyntaxInSpecialScriptTags(array $templateLogicSyntaxInSpecialScriptTags): DomParserInterface
    {
        foreach ($templateLogicSyntaxInSpecialScriptTags as $tmp) {
            if (!\is_string($tmp)) {
                throw new \InvalidArgumentException('overwriteTemplateLogicSyntaxInSpecialScriptTags only allows string[]');
            }
        }

        $this->templateLogicSyntaxInSpecialScriptTags = $templateLogicSyntaxInSpecialScriptTags;

        return $this;
    }
}