Completed
Push — master ( eb8cc8...612387 )
by Lars
01:52
created

HtmlDomParser::fixHtmlOutput()   C

Complexity

Conditions 8
Paths 128

Size

Total Lines 110

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 46
CRAP Score 8

Importance

Changes 0
Metric Value
cc 8
nc 128
nop 2
dl 0
loc 110
ccs 46
cts 46
cp 1
crap 8
rs 6.5688
c 0
b 0
f 0
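
The CRAP score above can be reproduced from two other rows of the table. A minimal sketch, assuming the commonly cited Change Risk Anti-Patterns formula, CRAP = cc^2 * (1 - coverage)^3 + cc. The report itself does not say which formula it uses, and the function name `crapScore` is hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: CRAP score from cyclomatic complexity and coverage.
 *
 * @param int   $complexity cyclomatic complexity ("cc" in the table)
 * @param float $coverage   covered fraction between 0.0 and 1.0 ("cp" in the table)
 */
function crapScore(int $complexity, float $coverage): float
{
    // Assumed formula: cc^2 * (1 - coverage)^3 + cc
    return ($complexity ** 2) * ((1.0 - $coverage) ** 3) + $complexity;
}

// Values reported for fixHtmlOutput(): cc = 8, cp = 1 (46 of 46 statements covered).
echo crapScore(8, 1.0), \PHP_EOL; // prints 8
```

With full coverage the risk term vanishes and the score equals the complexity, which matches the reported value of 8.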

How to fix: Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

A commonly applied refactoring for long methods is Extract Method.
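
As an illustration of Extract Method, here is a minimal sketch based on `fixHtmlOutput()` in the listing below, which repeats the same "if the wrapper was missing, strip the tag pair" branch for `html`, `head`, `body` and `p`. The function names `stripTagPair` and `stripMissingWrappers` are hypothetical and not part of the library; the sketch also ignores the other clean-up steps and their ordering in the real method.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical extracted helper: remove the opening and closing form of one tag.
 */
function stripTagPair(string $content, string $tag): string
{
    return \str_replace(['<' . $tag . '>', '</' . $tag . '>'], '', $content);
}

/**
 * Hypothetical caller: strip every wrapper tag that the original input lacked.
 *
 * @param array<string, bool> $wrapperWasMissing tag name => true if the input had no such tag
 */
function stripMissingWrappers(string $content, array $wrapperWasMissing): string
{
    foreach ($wrapperWasMissing as $tag => $wasMissing) {
        if ($wasMissing) {
            $content = stripTagPair($content, $tag);
        }
    }

    return $content;
}

$html = '<html><body><p>foo</p></body></html>';

echo stripMissingWrappers($html, ['html' => true, 'head' => true, 'body' => true, 'p' => false]), \PHP_EOL;
// prints <p>foo</p>
```

Each extracted function has one job and a name that could replace an inline comment, which is the point the paragraph above makes.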

<?php

declare(strict_types=1);

namespace voku\helper;

/**
 * @property-read string $outerText
 *                                 <p>Get dom node's outer html (alias for "outerHtml").</p>
 * @property-read string $outerHtml
 *                                 <p>Get dom node's outer html.</p>
 * @property-read string $innerText
 *                                 <p>Get dom node's inner html (alias for "innerHtml").</p>
 * @property-read string $innerHtml
 *                                 <p>Get dom node's inner html.</p>
 * @property-read string $plaintext
 *                                 <p>Get dom node's plain text.</p>
 *
 * @method string outerText()
 *                                 <p>Get dom node's outer html (alias for "outerHtml()").</p>
 * @method string outerHtml()
 *                                 <p>Get dom node's outer html.</p>
 * @method string innerText()
 *                                 <p>Get dom node's inner html (alias for "innerHtml()").</p>
 * @method HtmlDomParser load(string $html)
 *                                 <p>Load HTML from string.</p>
 * @method HtmlDomParser load_file(string $html)
 *                                 <p>Load HTML from file.</p>
 * @method static HtmlDomParser file_get_html($filePath, $libXMLExtraOptions = null)
 *                                 <p>Load HTML from file.</p>
 * @method static HtmlDomParser str_get_html($html, $libXMLExtraOptions = null)
 *                                 <p>Load HTML from string.</p>
 */
class HtmlDomParser extends AbstractDomParser
{
    /**
     * @var string[]
     */
    protected static $functionAliases = [
        'outertext' => 'html',
        'outerhtml' => 'html',
        'innertext' => 'innerHtml',
        'innerhtml' => 'innerHtml',
        'load'      => 'loadHtml',
        'load_file' => 'loadHtmlFile',
    ];

    /**
     * @var string[]
     */
    protected $templateLogicSyntaxInSpecialScriptTags = [
        '+',
        '<%',
        '{%',
        '{{',
    ];

    /**
     * The properties specified for each special script tag is an array.
     *
     * ```php
     * protected $specialScriptTags = [
     *     'text/html',
     *     'text/x-custom-template',
     *     'text/x-handlebars-template'
     * ]
     * ```
     *
     * @var string[]
     */
    protected $specialScriptTags = [
        'text/html',
        'text/x-custom-template',
        'text/x-handlebars-template',
    ];

    /**
     * @var string[]
     */
    protected $selfClosingTags = [
        'area',
        'base',
        'br',
        'col',
        'command',
        'embed',
        'hr',
        'img',
        'input',
        'keygen',
        'link',
        'meta',
        'param',
        'source',
        'track',
        'wbr',
    ];

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutHtml = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithCommentWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutHeadWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutPTagWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutHtmlWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutBodyWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithFakeEndScript = false;

    /**
     * @var bool
     */
    protected $keepBrokenHtml;

    // Issue (Duplication): This method seems to be duplicated in your project.
    //
    // Duplicated code is one of the most pungent code smells. If you need to duplicate the same code in three or more different places, we strongly encourage you to look into extracting the code into a single class or operation.
    //
    // You can also find more detailed suggestions in the “Code” section of your repository.
    /**
     * @param \DOMNode|SimpleHtmlDomInterface|string $element HTML code or SimpleHtmlDomInterface, \DOMNode
     */
    public function __construct($element = null)
    {
        $this->document = new \DOMDocument('1.0', $this->getEncoding());

        // DOMDocument settings
        $this->document->preserveWhiteSpace = true;
        $this->document->formatOutput = true;

        if ($element instanceof SimpleHtmlDomInterface) {
            $element = $element->getNode();
        }

        if ($element instanceof \DOMNode) {
            $domNode = $this->document->importNode($element, true);

            if ($domNode instanceof \DOMNode) {
                /** @noinspection UnusedFunctionResultInspection */
                $this->document->appendChild($domNode);
            }

            return;
        }

        if ($element !== null) {
            /** @noinspection UnusedFunctionResultInspection */
            $this->loadHtml($element);
        }
    }

    /**
     * @param string $name
     * @param array  $arguments
     *
     * @return bool|mixed
     */
    public function __call($name, $arguments)
    {
        $name = \strtolower($name);

        if (isset(self::$functionAliases[$name])) {
            return \call_user_func_array([$this, self::$functionAliases[$name]], $arguments);
        }

        throw new \BadMethodCallException('Method does not exist: ' . $name);
    }

    // Issue (Duplication): This method seems to be duplicated in your project.
    /**
     * @param string $name
     * @param array  $arguments
     *
     * @throws \BadMethodCallException
     * @throws \RuntimeException
     *
     * @return HtmlDomParser
     */
    public static function __callStatic($name, $arguments)
    {
        $arguments0 = $arguments[0] ?? '';

        $arguments1 = $arguments[1] ?? null;

        if ($name === 'str_get_html') {
            $parser = new static();

            return $parser->loadHtml($arguments0, $arguments1);
        }

        if ($name === 'file_get_html') {
            $parser = new static();

            return $parser->loadHtmlFile($arguments0, $arguments1);
        }

        throw new \BadMethodCallException('Method does not exist');
    }

    /** @noinspection MagicMethodsValidityInspection */

    /**
     * @param string $name
     *
     * @return string|null
     */
    public function __get($name)
    {
        $name = \strtolower($name);

        switch ($name) {
            case 'outerhtml':
            case 'outertext':
                return $this->html();
            case 'innerhtml':
            case 'innertext':
                return $this->innerHtml();
            case 'text':
            case 'plaintext':
                return $this->text();
        }

        return null;
    }

    /**
     * @return string
     */
    public function __toString()
    {
        return $this->html();
    }

    /**
     * does nothing (only for api-compatibility-reasons)
     *
     * @return bool
     *
     * @deprecated
     */
    public function clear(): bool
    {
        return true;
    }

    /**
     * Create DOMDocument from HTML.
     *
     * @param string   $html
     * @param int|null $libXMLExtraOptions
     *
     * @return \DOMDocument
     */
    protected function createDOMDocument(string $html, $libXMLExtraOptions = null): \DOMDocument
    {
        if ($this->keepBrokenHtml) {
            $html = $this->keepBrokenHtml(\trim($html));
        }

        if (\strpos($html, '<') === false) {
            $this->isDOMDocumentCreatedWithoutHtml = true;
        } elseif (\strpos(\ltrim($html), '<') !== 0) {
            $this->isDOMDocumentCreatedWithoutWrapper = true;
        }

        if (\strpos(\ltrim($html), '<!--') === 0) {
            $this->isDOMDocumentCreatedWithCommentWrapper = true;
        }

        /** @noinspection HtmlRequiredLangAttribute */
        if (
            \strpos($html, '<html ') === false
            &&
            \strpos($html, '<html>') === false
        ) {
            $this->isDOMDocumentCreatedWithoutHtmlWrapper = true;
        }

        if (
            \strpos($html, '<body ') === false
            &&
            \strpos($html, '<body>') === false
        ) {
            $this->isDOMDocumentCreatedWithoutBodyWrapper = true;
        }

        /** @noinspection HtmlRequiredTitleElement */
        if (
            \strpos($html, '<head ') === false
            &&
            \strpos($html, '<head>') === false
        ) {
            $this->isDOMDocumentCreatedWithoutHeadWrapper = true;
        }

        if (
            \strpos($html, '<p ') === false
            &&
            \strpos($html, '<p>') === false
        ) {
            $this->isDOMDocumentCreatedWithoutPTagWrapper = true;
        }

        if (
            \strpos($html, '</script>') === false
            &&
            \strpos($html, '<\/script>') !== false
        ) {
            $this->isDOMDocumentCreatedWithFakeEndScript = true;
        }

        if (\stripos($html, '</html>') !== false) {
            /** @noinspection NestedPositiveIfStatementsInspection */
            if (
                \preg_match('/<\/html>(.*?)/suiU', $html, $matches_after_html)
                &&
                \trim($matches_after_html[1])
            ) {
                $html = \str_replace($matches_after_html[0], $matches_after_html[1] . '</html>', $html);
            }
        }

        if (\strpos($html, '<script') !== false) {
            $this->html5FallbackForScriptTags($html);

            foreach ($this->specialScriptTags as $tag) {
                if (\strpos($html, $tag) !== false) {
                    $this->keepSpecialScriptTags($html);
                }
            }
        }

        $html = \str_replace(
            \array_map(static function ($e) {
                return '<' . $e . '>';
            }, $this->selfClosingTags),
            \array_map(static function ($e) {
                return '<' . $e . '/>';
            }, $this->selfClosingTags),
            $html
        );

        // set error level
        $internalErrors = \libxml_use_internal_errors(true);
        $disableEntityLoader = \libxml_disable_entity_loader(true);
        \libxml_clear_errors();

        $optionsXml = \LIBXML_DTDLOAD | \LIBXML_DTDATTR | \LIBXML_NONET;

        if (\defined('LIBXML_BIGLINES')) {
            $optionsXml |= \LIBXML_BIGLINES;
        }

        if (\defined('LIBXML_COMPACT')) {
            $optionsXml |= \LIBXML_COMPACT;
        }

        if (\defined('LIBXML_HTML_NODEFDTD')) {
            $optionsXml |= \LIBXML_HTML_NODEFDTD;
        }

        if ($libXMLExtraOptions !== null) {
            $optionsXml |= $libXMLExtraOptions;
        }

        if (
            $this->isDOMDocumentCreatedWithoutWrapper
            ||
            $this->isDOMDocumentCreatedWithCommentWrapper
            ||
            $this->keepBrokenHtml
        ) {
            $html = '<' . self::$domHtmlWrapperHelper . '>' . $html . '</' . self::$domHtmlWrapperHelper . '>';
        }

        $html = self::replaceToPreserveHtmlEntities($html);

        $documentFound = false;
        $sxe = \simplexml_load_string($html, \SimpleXMLElement::class, $optionsXml);
        // Issue (Duplication): This code seems to be duplicated across your project.
        if ($sxe !== false && \count(\libxml_get_errors()) === 0) {
            $domElementTmp = \dom_import_simplexml($sxe);
            if (
                $domElementTmp
                &&
                $domElementTmp->ownerDocument
            ) {
                $documentFound = true;
                $this->document = $domElementTmp->ownerDocument;
            }
        }

        // Issue (Duplication): This code seems to be duplicated across your project.
        if ($documentFound === false) {

            // UTF-8 hack: http://php.net/manual/en/domdocument.loadhtml.php#95251
            $xmlHackUsed = false;
            /** @noinspection StringFragmentMisplacedInspection */
            if (\stripos('<?xml', $html) !== 0) {
                $xmlHackUsed = true;
                $html = '<?xml encoding="' . $this->getEncoding() . '" ?>' . $html;
            }

            $this->document->loadHTML($html, $optionsXml);

            // remove the "xml-encoding" hack
            if ($xmlHackUsed) {
                foreach ($this->document->childNodes as $child) {
                    if ($child->nodeType === \XML_PI_NODE) {
                        /** @noinspection UnusedFunctionResultInspection */
                        $this->document->removeChild($child);

                        break;
                    }
                }
            }
        }

        // set encoding
        $this->document->encoding = $this->getEncoding();

        // restore lib-xml settings
        \libxml_clear_errors();
        \libxml_use_internal_errors($internalErrors);
        \libxml_disable_entity_loader($disableEntityLoader);

        return $this->document;
    }

    // Issue (Documentation): The doc-type SimpleHtmlDomInterface|S...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 74.
    //
    // This check marks PHPDoc comments that could not be parsed by our parser. To see which comment annotations we can parse, please refer to our documentation on supported doc-types.
    //
    // Issue (Duplication): This method seems to be duplicated in your project.
    /**
     * Find list of nodes with a CSS selector.
     *
     * @param string   $selector
     * @param int|null $idx
     *
     * @return SimpleHtmlDomInterface|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
     */
    public function find(string $selector, $idx = null)
    {
        $xPathQuery = SelectorConverter::toXPath($selector);

        $xPath = new \DOMXPath($this->document);
        $nodesList = $xPath->query($xPathQuery);
        $elements = new SimpleHtmlDomNode();

        if ($nodesList) {
            foreach ($nodesList as $node) {
                $elements[] = new SimpleHtmlDom($node);
            }
        }

        // return all elements
        if ($idx === null) {
            if (\count($elements) === 0) {
                return new SimpleHtmlDomNodeBlank();
            }

            return $elements;
        }

        // handle negative values
        if ($idx < 0) {
            $idx = \count($elements) + $idx;
        }

        // return one element
        return $elements[$idx] ?? new SimpleHtmlDomBlank();
    }

    // Issue (Documentation): The doc-type SimpleHtmlDomInterface[]...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 51.
    /**
     * Find nodes with a CSS selector.
     *
     * @param string $selector
     *
     * @return SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
     */
    public function findMulti(string $selector): SimpleHtmlDomNodeInterface
    {
        return $this->find($selector, null);
    }

    // Issue (Documentation): The doc-type false|SimpleHtmlDomInter...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 57.
    /**
     * Find nodes with a CSS selector or false, if no element is found.
     *
     * @param string $selector
     *
     * @return false|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
     */
    public function findMultiOrFalse(string $selector)
    {
        $return = $this->find($selector, null);

        if ($return instanceof SimpleHtmlDomNodeBlank) {
            return false;
        }

        return $return;
    }

    /**
     * Find one node with a CSS selector.
     *
     * @param string $selector
     *
     * @return SimpleHtmlDomInterface
     */
    public function findOne(string $selector): SimpleHtmlDomInterface
    {
        return $this->find($selector, 0);
    }

    /**
     * Find one node with a CSS selector or false, if no element is found.
     *
     * @param string $selector
     *
     * @return false|SimpleHtmlDomInterface
     */
    public function findOneOrFalse(string $selector)
    {
        $return = $this->find($selector, 0);

        if ($return instanceof SimpleHtmlDomBlank) {
            return false;
        }

        return $return;
    }

    /**
     * @param string $content
     * @param bool   $multiDecodeNewHtmlEntity
     *
     * @return string
     */
    public function fixHtmlOutput(
        string $content,
        bool $multiDecodeNewHtmlEntity = false
    ): string {
        // INFO: DOMDocument will encapsulate plaintext into a e.g. paragraph tag (<p>),
        //          so we try to remove it here again ...

        if ($this->getIsDOMDocumentCreatedWithoutHtmlWrapper()) {
            /** @noinspection HtmlRequiredLangAttribute */
            $content = \str_replace(
                [
                    '<html>',
                    '</html>',
                ],
                '',
                $content
            );
        }

        if ($this->getIsDOMDocumentCreatedWithoutHeadWrapper()) {
            /** @noinspection HtmlRequiredTitleElement */
            $content = \str_replace(
                [
                    '<head>',
                    '</head>',
                ],
                '',
                $content
            );
        }

        if ($this->getIsDOMDocumentCreatedWithoutBodyWrapper()) {
            $content = \str_replace(
                [
                    '<body>',
                    '</body>',
                ],
                '',
                $content
            );
        }

        if ($this->getIsDOMDocumentCreatedWithFakeEndScript()) {
            $content = \str_replace(
                '</script>',
                '',
                $content
            );
        }

        if ($this->getIsDOMDocumentCreatedWithoutWrapper()) {
            $content = (string) \preg_replace('/^<p>/', '', $content);
            $content = (string) \preg_replace('/<\/p>/', '', $content);
        }

        if ($this->getIsDOMDocumentCreatedWithoutPTagWrapper()) {
            $content = \str_replace(
                [
                    '<p>',
                    '</p>',
                ],
                '',
                $content
            );
        }

        if ($this->getIsDOMDocumentCreatedWithoutHtml()) {
            $content = \str_replace(
                '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">',
                '',
                $content
            );
        }

        // https://bugs.php.net/bug.php?id=73175
        $content = \str_replace(
            \array_map(static function ($e) {
                return '</' . $e . '>';
            }, $this->selfClosingTags),
            '',
            $content
        );

        /** @noinspection HtmlRequiredTitleElement */
        $content = \trim(
            \str_replace(
                [
                    '<simpleHtmlDomHtml>',
                    '</simpleHtmlDomHtml>',
                    '<simpleHtmlDomP>',
                    '</simpleHtmlDomP>',
                    '<head><head>',
                    '</head></head>',
                ],
                [
                    '',
                    '',
                    '',
                    '',
                    '<head>',
                    '</head>',
                ],
                $content
            )
        );

        $content = $this->decodeHtmlEntity($content, $multiDecodeNewHtmlEntity);

        return self::putReplacedBackToPreserveHtmlEntities($content);
    }

    // Issue (Documentation): The doc-type SimpleHtmlDomInterface[]...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 51.
    /**
     * Return elements by ".class".
     *
     * @param string $class
     *
     * @return SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
     */
    public function getElementByClass(string $class): SimpleHtmlDomNodeInterface
    {
        return $this->findMulti(".${class}");
    }

    /**
     * Return element by #id.
     *
     * @param string $id
     *
     * @return SimpleHtmlDomInterface
     */
    public function getElementById(string $id): SimpleHtmlDomInterface
    {
        return $this->findOne("#${id}");
    }

    /**
     * Return element by tag name.
     *
     * @param string $name
     *
     * @return SimpleHtmlDomInterface
     */
    public function getElementByTagName(string $name): SimpleHtmlDomInterface
    {
        $node = $this->document->getElementsByTagName($name)->item(0);

        if ($node === null) {
            return new SimpleHtmlDomBlank();
        }

        return new SimpleHtmlDom($node);
    }

    // Issue (Documentation): The doc-type SimpleHtmlDomInterface|S...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 74.
    /**
     * Returns elements by "#id".
     *
     * @param string   $id
     * @param int|null $idx
     *
     * @return SimpleHtmlDomInterface|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
     */
    public function getElementsById(string $id, $idx = null)
    {
        return $this->find("#${id}", $idx);
    }

    // Issue (Documentation): The doc-type SimpleHtmlDomInterface|S...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 74.
    /**
     * Returns elements by tag name.
     *
     * @param string   $name
     * @param int|null $idx
     *
     * @return SimpleHtmlDomInterface|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
     */
    public function getElementsByTagName(string $name, $idx = null)
    {
        $nodesList = $this->document->getElementsByTagName($name);

        $elements = new SimpleHtmlDomNode();

        foreach ($nodesList as $node) {
            $elements[] = new SimpleHtmlDom($node);
        }

        // return all elements
        if ($idx === null) {
            if (\count($elements) === 0) {
                return new SimpleHtmlDomNodeBlank();
            }

            return $elements;
        }

        // handle negative values
        if ($idx < 0) {
            $idx = \count($elements) + $idx;
        }

        // return one element
        return $elements[$idx] ?? new SimpleHtmlDomNodeBlank();
    }

    /**
     * Get dom node's outer html.
     *
     * @param bool $multiDecodeNewHtmlEntity
     *
     * @return string
     */
    public function html(bool $multiDecodeNewHtmlEntity = false): string
    {
        if (static::$callback !== null) {
            \call_user_func(static::$callback, [$this]);
        }

        if ($this->getIsDOMDocumentCreatedWithoutHtmlWrapper()) {
            $content = $this->document->saveHTML($this->document->documentElement);
        } else {
            $content = $this->document->saveHTML();
        }

        if ($content === false) {
            return '';
        }

        return $this->fixHtmlOutput($content, $multiDecodeNewHtmlEntity);
    }
784
785
    /**
786
     * Load HTML from string.
787
     *
788
     * @param string   $html
789
     * @param int|null $libXMLExtraOptions
790
     *
791
     * @return HtmlDomParser
792
     */
793 198
    public function loadHtml(string $html, $libXMLExtraOptions = null): DomParserInterface
794
    {
795
        // reset
796 198
        self::$domBrokenReplaceHelper = [];
797
798 198
        $this->document = $this->createDOMDocument($html, $libXMLExtraOptions);
799
800 198
        return $this;
0 ignored issues
show
Bug Best Practice introduced by
The return type of return $this; (voku\helper\HtmlDomParser) is incompatible with the return type declared by the interface voku\helper\DomParserInterface::loadHtml of type self.

If you return a value from a function or method, it should be a sub-type of the type that is given by the parent type f.e. an interface, or abstract method. This is more formally defined by the Lizkov substitution principle, and guarantees that classes that depend on the parent type can use any instance of a child type interchangably. This principle also belongs to the SOLID principles for object oriented design.

Let’s take a look at an example:

class Author {
    private $name;

    public function __construct($name) {
        $this->name = $name;
    }

    public function getName() {
        return $this->name;
    }
}

abstract class Post {
    public function getAuthor() {
        return 'Johannes';
    }
}

class BlogPost extends Post {
    public function getAuthor() {
        return new Author('Johannes');
    }
}

class ForumPost extends Post { /* ... */ }

function my_function(Post $post) {
    echo strtoupper($post->getAuthor());
}

Our function my_function expects a Post object, and outputs the author of the post. The base class Post returns a simple string and outputting a simple string will work just fine. However, the child class BlogPost which is a sub-type of Post instead decided to return an object, and is therefore violating the SOLID principles. If a BlogPost were passed to my_function, PHP would not complain, but ultimately fail when executing the strtoupper call in its body.
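One way to remove the violation in the example above is to declare the return type on the parent method, so that PHP itself rejects an incompatible override. The following is a minimal sketch that reuses the class names from the example; the declared return types and the BlogPost constructor are additions made for illustration, not part of the original example.

```php
<?php

class Author
{
    /** @var string */
    private $name;

    public function __construct(string $name)
    {
        $this->name = $name;
    }

    public function getName(): string
    {
        return $this->name;
    }
}

abstract class Post
{
    // Declaring the return type makes the contract explicit and enforced by PHP.
    public function getAuthor(): string
    {
        return 'Johannes';
    }
}

class BlogPost extends Post
{
    /** @var Author */
    private $author;

    public function __construct(Author $author)
    {
        $this->author = $author;
    }

    // Still returns a string, so every caller of Post::getAuthor() keeps working.
    public function getAuthor(): string
    {
        return $this->author->getName();
    }
}

function my_function(Post $post): string
{
    return strtoupper($post->getAuthor());
}
```

With the return type declared on Post::getAuthor(), an override that returns an Author object would be rejected when the class is loaded instead of failing later inside strtoupper().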

    }

    /**
     * Load HTML from file.
     *
     * @param string   $filePath
     * @param int|null $libXMLExtraOptions
     *
     * @throws \RuntimeException
     *
     * @return HtmlDomParser
     */
    public function loadHtmlFile(string $filePath, $libXMLExtraOptions = null): DomParserInterface
Issue (Duplication):
This method seems to be duplicated in your project.

Duplicated code is one of the most pungent code smells. If you need to duplicate the same code in three or more different places, we strongly encourage you to look into extracting the code into a single class or operation.

You can also find more detailed suggestions in the “Code” section of your repository.
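The extraction suggested above could look like the following sketch. It assumes the duplicate lives in a sibling parser class that loads files the same way, which the report does not confirm. The trait name ReadsFileOrFails, its method readFileOrFail() and the ExampleParser class are hypothetical and not part of the library; the optional UTF8 helper branch is left out to keep the sketch self-contained.

```php
<?php

trait ReadsFileOrFails
{
    /**
     * Read a local file or an http(s) URL and throw if it cannot be loaded.
     */
    protected function readFileOrFail(string $filePath): string
    {
        $isUrl = \preg_match('/^https?:\/\//i', $filePath) === 1;

        if (!$isUrl && !\file_exists($filePath)) {
            throw new \RuntimeException("File {$filePath} not found");
        }

        // file_get_contents() reports failure by returning false, so check for it.
        $content = @\file_get_contents($filePath);
        if ($content === false) {
            throw new \RuntimeException("Could not load file {$filePath}");
        }

        return $content;
    }
}

// Hypothetical consumer, only to show how a parser class could reuse the trait.
final class ExampleParser
{
    use ReadsFileOrFails;

    /** @var string */
    public $html = '';

    public function loadHtmlFile(string $filePath): self
    {
        $this->html = $this->readFileOrFail($filePath);

        return $this;
    }
}
```

Each parser would then keep only its own parsing step, while the file checks and error messages live in one place.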

    {
        // reset
        self::$domBrokenReplaceHelper = [];

        if (
            !\preg_match("/^https?:\/\//i", $filePath)
            &&
            !\file_exists($filePath)
        ) {
            throw new \RuntimeException("File {$filePath} not found");
        }

        try {
            if (\class_exists('\voku\helper\UTF8')) {
                /** @noinspection PhpUndefinedClassInspection */
                $html = UTF8::file_get_contents($filePath);
            } else {
                $html = \file_get_contents($filePath);
            }
        } catch (\Exception $e) {
            throw new \RuntimeException("Could not load file {$filePath}");
        }

        if ($html === false) {
            throw new \RuntimeException("Could not load file {$filePath}");
        }

        return $this->loadHtml($html, $libXMLExtraOptions);
    }

    /**
     * Get the HTML as XML or plain XML if needed.
     *
     * @param bool $multiDecodeNewHtmlEntity
     * @param bool $htmlToXml
     * @param bool $removeXmlHeader
     * @param int  $options
     *
     * @return string
     */
    public function xml(
Issue (Duplication):
This method seems to be duplicated in your project.
        bool $multiDecodeNewHtmlEntity = false,
        bool $htmlToXml = true,
        bool $removeXmlHeader = true,
        int $options = \LIBXML_NOEMPTYTAG
    ): string {
        $xml = $this->document->saveXML(null, $options);
        if ($xml === false) {
            return '';
        }

        if ($removeXmlHeader) {
            $xml = \ltrim((string) \preg_replace('/<\?xml.*\?>/', '', $xml));
        }

        if ($htmlToXml) {
            $return = $this->fixHtmlOutput($xml, $multiDecodeNewHtmlEntity);
        } else {
            $xml = $this->decodeHtmlEntity($xml, $multiDecodeNewHtmlEntity);

            $return = self::putReplacedBackToPreserveHtmlEntities($xml);
        }

        return $return;
    }

    /**
     * @param string $selector
     * @param int    $idx
     *
     * @return SimpleHtmlDomInterface|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
Issue (Documentation):
The doc-type SimpleHtmlDomInterface|S...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 74.

This check marks PHPDoc comments that could not be parsed by our parser. To see which comment annotations we can parse, please refer to our documentation on supported doc-types.

     */
    public function __invoke($selector, $idx = null)
    {
        return $this->find($selector, $idx);
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutHeadWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutHeadWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutPTagWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutPTagWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutHtml(): bool
    {
        return $this->isDOMDocumentCreatedWithoutHtml;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutBodyWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutBodyWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutHtmlWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutHtmlWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithFakeEndScript(): bool
    {
        return $this->isDOMDocumentCreatedWithFakeEndScript;
    }

    /**
     * @param string $html
     *
     * @return string
     */
    protected function keepBrokenHtml(string $html): string
    {
        do {
            $original = $html;

            $html = (string) \preg_replace_callback(
                '/(?<start>.*)<(?<element_start>[a-z]+)(?<element_start_addon> [^>]*)?>(?<value>.*?)<\/(?<element_end>\2)>(?<end>.*)/sui',
                static function ($matches) {
                    return $matches['start'] .
                        '°lt_simple_html_dom__voku_°' . $matches['element_start'] . $matches['element_start_addon'] . '°gt_simple_html_dom__voku_°' .
                        $matches['value'] .
                        '°lt/_simple_html_dom__voku_°' . $matches['element_end'] . '°gt_simple_html_dom__voku_°' .
                        $matches['end'];
                },
                $html
            );
        } while ($original !== $html);

        do {
            $original = $html;

            $html = (string) \preg_replace_callback(
                '/(?<start>[^<]*)?(?<broken>(?:(?:<\/\w+(?:\s+\w+=\\"[^\"]+\\")*+)(?:[^<]+)>)+)(?<end>.*)/u',
                static function ($matches) {
                    $matches['broken'] = \str_replace(
                        ['°lt/_simple_html_dom__voku_°', '°lt_simple_html_dom__voku_°', '°gt_simple_html_dom__voku_°'],
                        ['</', '<', '>'],
                        $matches['broken']
                    );

                    self::$domBrokenReplaceHelper['orig'][] = $matches['broken'];
                    self::$domBrokenReplaceHelper['tmp'][] = $matchesHash = self::$domHtmlBrokenHtmlHelper . \crc32($matches['broken']);

                    return $matches['start'] . $matchesHash . $matches['end'];
                },
                $html
            );
        } while ($original !== $html);

        return \str_replace(
            ['°lt/_simple_html_dom__voku_°', '°lt_simple_html_dom__voku_°', '°gt_simple_html_dom__voku_°'],
            ['</', '<', '>'],
            $html
        );
    }

    /**
     * @param string $html
     *
     * @return void
     */
    protected function keepSpecialScriptTags(string &$html)
    {
        // regEx for e.g.: [<script id="elements-image-1" type="text/html">...</script>]
        $tags = \implode('|', \array_map(
            static function ($value) {
                return \preg_quote($value, '/');
            },
            $this->specialScriptTags
        ));
        $html = (string) \preg_replace_callback(
            '/(?<start>((?:<script) [^>]*type=(?:["\'])?(?:' . $tags . ')+(?:[^>]*)>))(?<innerContent>.*)(?<end><\/script>)/isU',
            function ($matches) {

                // Check for logic in special script tags, like [<% _.each(tierPrices, function(item, key) { %>],
                // because often this looks like non valid html in the template itself.
                foreach ($this->templateLogicSyntaxInSpecialScriptTags as $logicSyntaxInSpecialScriptTag) {
                    if (\strpos($matches['innerContent'], $logicSyntaxInSpecialScriptTag) !== false) {
                        // remove the html5 fallback
                        $matches['innerContent'] = \str_replace('<\/', '</', $matches['innerContent']);

                        self::$domBrokenReplaceHelper['orig'][] = $matches['innerContent'];
                        self::$domBrokenReplaceHelper['tmp'][] = $matchesHash = '' . self::$domHtmlBrokenHtmlHelper . '' . \crc32($matches['innerContent']);

                        return $matches['start'] . $matchesHash . $matches['end'];
                    }
                }

                // remove the html5 fallback
                $matches[0] = \str_replace('<\/', '</', $matches[0]);

                $specialNonScript = '<' . self::$domHtmlSpecialScriptHelper . \substr($matches[0], \strlen('<script'));

                return \substr($specialNonScript, 0, -\strlen('</script>')) . '</' . self::$domHtmlSpecialScriptHelper . '>';
            },
            $html
        );
    }

    /**
     * @param bool $keepBrokenHtml
     *
     * @return HtmlDomParser
     */
    public function useKeepBrokenHtml(bool $keepBrokenHtml): DomParserInterface
    {
        $this->keepBrokenHtml = $keepBrokenHtml;

        return $this;
    }

    /**
     * @param string[] $templateLogicSyntaxInSpecialScriptTags
     *
     * @return HtmlDomParser
     */
    public function overwriteTemplateLogicSyntaxInSpecialScriptTags(array $templateLogicSyntaxInSpecialScriptTags): DomParserInterface
    {
        foreach ($templateLogicSyntaxInSpecialScriptTags as $tmp) {
            if (!\is_string($tmp)) {
                throw new \InvalidArgumentException('setTemplateLogicSyntaxInSpecialScriptTags only allows string[]');
            }
        }

        $this->templateLogicSyntaxInSpecialScriptTags = $templateLogicSyntaxInSpecialScriptTags;

        return $this;
    }

    /**
     * @param string[] $specialScriptTags
     *
     * @return HtmlDomParser
     */
    public function overwriteSpecialScriptTags(array $specialScriptTags): DomParserInterface
    {
        foreach ($specialScriptTags as $tag) {
            if (!\is_string($tag)) {
                throw new \InvalidArgumentException('SpecialScriptTags only allows string[]');
            }
        }

        $this->specialScriptTags = $specialScriptTags;

        return $this;
    }
}
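
The placeholder approach that keepBrokenHtml() and keepSpecialScriptTags() appear to rely on can be sketched in isolation: fragments that a DOM parser would mangle are swapped for a crc32-based token before parsing, and the recorded mapping is used to swap them back afterwards. The function names, the token prefix and the explicit $map parameter below are hypothetical; the class above keeps the equivalent 'orig' and 'tmp' lists in self::$domBrokenReplaceHelper.

```php
<?php

/**
 * Replace each protected fragment with a token and remember the mapping.
 *
 * @param string   $html
 * @param string[] $fragments
 * @param array    $map       filled with ['orig' => string[], 'tmp' => string[]]
 *
 * @return string
 */
function protectFragments(string $html, array $fragments, array &$map): string
{
    $map = ['orig' => [], 'tmp' => []];

    foreach ($fragments as $fragment) {
        // crc32() yields a short token that is stable for the same fragment;
        // the trailing delimiter keeps one token from being a prefix of another.
        $token = '____placeholder____' . \crc32($fragment) . '____';

        $map['orig'][] = $fragment;
        $map['tmp'][] = $token;

        $html = \str_replace($fragment, $token, $html);
    }

    return $html;
}

/**
 * Put the original fragments back in place of their tokens.
 */
function restoreFragments(string $html, array $map): string
{
    return \str_replace($map['tmp'], $map['orig'], $html);
}
```

A caller would run protectFragments() before handing the markup to the parser and restoreFragments() on the serialized output, so template logic such as <% ... %> survives the round trip unchanged.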