Completed
Push — master ( fb5227...2dd329 )
by Lars
01:34
created

HtmlDomParser::createDOMDocument()   F

Complexity

Conditions 35
Paths > 20000

Size

Total Lines 153

Duplication

Lines 30
Ratio 19.61 %

Code Coverage

Tests 70
CRAP Score 35

Importance

Changes 0
Metric Value
dl 30
loc 153
ccs 70
cts 70
cp 1
rs 0
c 0
b 0
f 0
cc 35
nc 497664
nop 2
crap 35

How to fix   Long Method    Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
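
As a minimal sketch of this advice, the repeated strpos() pairs in createDOMDocument() could be extracted into one small, named helper. The function names containsOpeningTag() and detectMissingWrappers() are hypothetical and are not part of the library. Returning the flags as an array instead of setting class properties is also an assumption, made only to keep the sketch self-contained.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): reports whether $html
 * contains an opening tag such as "<body>" or "<body class=...>".
 */
function containsOpeningTag(string $html, string $tagName): bool
{
    return \strpos($html, '<' . $tagName . ' ') !== false
        || \strpos($html, '<' . $tagName . '>') !== false;
}

/**
 * Hypothetical helper: gathers the "wrapper is missing" checks in one loop
 * instead of four near-identical if-blocks.
 *
 * @return array<string, bool> tag name => true when the tag is missing
 */
function detectMissingWrappers(string $html): array
{
    $missing = [];
    foreach (['html', 'body', 'head', 'p'] as $tagName) {
        $missing[$tagName] = !containsOpeningTag($html, $tagName);
    }

    return $missing;
}

// Usage: only <body> and <p> are present in this fragment.
$flags = detectMissingWrappers('<body><p>Hello</p></body>');
```

Each extracted function removes several conditions from the long method, and its name takes over the role of an explanatory comment.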

Commonly applied refactorings include Extract Method, the technique described in the previous paragraph.

<?php

declare(strict_types=1);

namespace voku\helper;

/**
 * @property-read string $outerText
 *                                 <p>Get dom node's outer html (alias for "outerHtml").</p>
 * @property-read string $outerHtml
 *                                 <p>Get dom node's outer html.</p>
 * @property-read string $innerText
 *                                 <p>Get dom node's inner html (alias for "innerHtml").</p>
 * @property-read string $innerHtml
 *                                 <p>Get dom node's inner html.</p>
 * @property-read string $plaintext
 *                                 <p>Get dom node's plain text.</p>
 *
 * @method string outerText()
 *                                 <p>Get dom node's outer html (alias for "outerHtml()").</p>
 * @method string outerHtml()
 *                                 <p>Get dom node's outer html.</p>
 * @method string innerText()
 *                                 <p>Get dom node's inner html (alias for "innerHtml()").</p>
 * @method HtmlDomParser load(string $html)
 *                                 <p>Load HTML from string.</p>
 * @method HtmlDomParser load_file(string $html)
 *                                 <p>Load HTML from file.</p>
 * @method static HtmlDomParser file_get_html($filePath, $libXMLExtraOptions = null)
 *                                 <p>Load HTML from file.</p>
 * @method static HtmlDomParser str_get_html($html, $libXMLExtraOptions = null)
 *                                 <p>Load HTML from string.</p>
 */
class HtmlDomParser extends AbstractDomParser
{
    /**
     * @var string[]
     */
    protected static $functionAliases = [
        'outertext' => 'html',
        'outerhtml' => 'html',
        'innertext' => 'innerHtml',
        'innerhtml' => 'innerHtml',
        'load'      => 'loadHtml',
        'load_file' => 'loadHtmlFile',
    ];

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutHtml = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutHeadWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutPTagWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutHtmlWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutBodyWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithFakeEndScript = false;

    /**
     * @var bool
     */
    protected $keepBrokenHtml;

    /**
     * @param \DOMNode|SimpleHtmlDomInterface|string $element HTML code or SimpleHtmlDomInterface, \DOMNode
     */
    public function __construct($element = null)
Duplication: This method seems to be duplicated in your project.

Duplicated code is one of the most pungent code smells. If you need to duplicate the same code in three or more different places, we strongly encourage you to look into extracting the code into a single class or operation.

You can also find more detailed suggestions in the “Code” section of your repository.
    {
        $this->document = new \DOMDocument('1.0', $this->getEncoding());

        // DOMDocument settings
        $this->document->preserveWhiteSpace = true;
        $this->document->formatOutput = true;

        if ($element instanceof SimpleHtmlDomInterface) {
            $element = $element->getNode();
        }

        if ($element instanceof \DOMNode) {
            $domNode = $this->document->importNode($element, true);

            if ($domNode instanceof \DOMNode) {
                /** @noinspection UnusedFunctionResultInspection */
                $this->document->appendChild($domNode);
            }

            return;
        }

        if ($element !== null) {
            /** @noinspection UnusedFunctionResultInspection */
            $this->loadHtml($element);
        }
    }

    /**
     * @param string $name
     * @param array  $arguments
     *
     * @return bool|mixed
     */
    public function __call($name, $arguments)
    {
        $name = \strtolower($name);

        if (isset(self::$functionAliases[$name])) {
            return \call_user_func_array([$this, self::$functionAliases[$name]], $arguments);
        }

        throw new \BadMethodCallException('Method does not exist: ' . $name);
    }

    /**
     * @param string $name
     * @param array  $arguments
     *
     * @throws \BadMethodCallException
     * @throws \RuntimeException
     *
     * @return HtmlDomParser
     */
    public static function __callStatic($name, $arguments)
Duplication: This method seems to be duplicated in your project.
    {
        $arguments0 = $arguments[0] ?? '';

        $arguments1 = $arguments[1] ?? null;

        if ($name === 'str_get_html') {
            $parser = new static();

            return $parser->loadHtml($arguments0, $arguments1);
        }

        if ($name === 'file_get_html') {
            $parser = new static();

            return $parser->loadHtmlFile($arguments0, $arguments1);
        }

        throw new \BadMethodCallException('Method does not exist');
    }

    /** @noinspection MagicMethodsValidityInspection */

    /**
     * @param string $name
     *
     * @return string|null
     */
    public function __get($name)
    {
        $name = \strtolower($name);

        switch ($name) {
            case 'outerhtml':
            case 'outertext':
                return $this->html();
            case 'innerhtml':
            case 'innertext':
                return $this->innerHtml();
            case 'text':
            case 'plaintext':
                return $this->text();
        }

        return null;
    }

    /**
     * @return string
     */
    public function __toString()
    {
        return $this->html();
    }

    /**
     * does nothing (only for api-compatibility-reasons)
     *
     * @return bool
     *
     * @deprecated
     */
    public function clear(): bool
    {
        return true;
    }

    /**
     * Create DOMDocument from HTML.
     *
     * @param string   $html
     * @param int|null $libXMLExtraOptions
     *
     * @return \DOMDocument
     */
    protected function createDOMDocument(string $html, $libXMLExtraOptions = null): \DOMDocument
    {
        if ($this->keepBrokenHtml) {
            $html = $this->keepBrokenHtml(\trim($html));
        }

        if (\strpos($html, '<') === false) {
            $this->isDOMDocumentCreatedWithoutHtml = true;
        } elseif (\strpos(\ltrim($html), '<') !== 0) {
            $this->isDOMDocumentCreatedWithoutWrapper = true;
        }

        /** @noinspection HtmlRequiredLangAttribute */
        if (
            \strpos($html, '<html ') === false
            &&
            \strpos($html, '<html>') === false
        ) {
            $this->isDOMDocumentCreatedWithoutHtmlWrapper = true;
        }

        if (
            \strpos($html, '<body ') === false
            &&
            \strpos($html, '<body>') === false
        ) {
            $this->isDOMDocumentCreatedWithoutBodyWrapper = true;
        }

        /** @noinspection HtmlRequiredTitleElement */
        if (
            \strpos($html, '<head ') === false
            &&
            \strpos($html, '<head>') === false
        ) {
            $this->isDOMDocumentCreatedWithoutHeadWrapper = true;
        }

        /** @noinspection HtmlRequiredTitleElement */
        if (
            \strpos($html, '<p ') === false
            &&
            \strpos($html, '<p>') === false
        ) {
            $this->isDOMDocumentCreatedWithoutPTagWrapper = true;
        }

        if (
            \strpos($html, '</script>') === false
            &&
            \strpos($html, '<\/script>') !== false
        ) {
            $this->isDOMDocumentCreatedWithFakeEndScript = true;
        }

        if (\strpos($html, '<script') !== false) {
            $this->html5FallbackForScriptTags($html);

            if (
                \strpos($html, 'type="text/html"') !== false
                ||
                \strpos($html, 'type=\'text/html\'') !== false
                ||
                \strpos($html, 'type=text/html') !== false
                ||
                \strpos($html, 'type="text/x-custom-template"') !== false
                ||
                \strpos($html, 'type=\'text/x-custom-template\'') !== false
                ||
                \strpos($html, 'type=text/x-custom-template') !== false
            ) {
                $this->keepSpecialScriptTags($html);
            }
        }

        // set error level
        $internalErrors = \libxml_use_internal_errors(true);
        $disableEntityLoader = \libxml_disable_entity_loader(true);
        \libxml_clear_errors();

        $optionsXml = \LIBXML_DTDLOAD | \LIBXML_DTDATTR | \LIBXML_NONET;

        if (\defined('LIBXML_BIGLINES')) {
            $optionsXml |= \LIBXML_BIGLINES;
        }

        if (\defined('LIBXML_COMPACT')) {
            $optionsXml |= \LIBXML_COMPACT;
        }

        if (\defined('LIBXML_HTML_NODEFDTD')) {
            $optionsXml |= \LIBXML_HTML_NODEFDTD;
        }

        if ($libXMLExtraOptions !== null) {
            $optionsXml |= $libXMLExtraOptions;
        }

        if (
            $this->isDOMDocumentCreatedWithoutWrapper
            ||
            $this->keepBrokenHtml
        ) {
            $html = '<' . self::$domHtmlWrapperHelper . '>' . $html . '</' . self::$domHtmlWrapperHelper . '>';
        }

        $html = self::replaceToPreserveHtmlEntities($html);

        $documentFound = false;
        $sxe = \simplexml_load_string($html, \SimpleXMLElement::class, $optionsXml);
        if ($sxe !== false && \count(\libxml_get_errors()) === 0) {
Duplication: This code seems to be duplicated across your project.
            $domElementTmp = \dom_import_simplexml($sxe);
            if ($domElementTmp) {
                $documentFound = true;
                $this->document = $domElementTmp->ownerDocument;
            }
        }

        if ($documentFound === false) {
Duplication: This code seems to be duplicated across your project.

            // UTF-8 hack: http://php.net/manual/en/domdocument.loadhtml.php#95251
            $xmlHackUsed = false;
            // $html is the haystack and '<?xml' the needle: skip the hack only
            // when the input already starts with an XML declaration
            if (\stripos($html, '<?xml') !== 0) {
                $xmlHackUsed = true;
                $html = '<?xml encoding="' . $this->getEncoding() . '" ?>' . $html;
            }

            $this->document->loadHTML($html, $optionsXml);

            // remove the "xml-encoding" hack
            if ($xmlHackUsed) {
                foreach ($this->document->childNodes as $child) {
                    if ($child->nodeType === \XML_PI_NODE) {
                        /** @noinspection UnusedFunctionResultInspection */
                        $this->document->removeChild($child);

                        break;
                    }
                }
            }
        }

        // set encoding
        $this->document->encoding = $this->getEncoding();

        // restore lib-xml settings
        \libxml_clear_errors();
        \libxml_use_internal_errors($internalErrors);
        \libxml_disable_entity_loader($disableEntityLoader);

        return $this->document;
    }

    /**
     * Find list of nodes with a CSS selector.
     *
     * @param string   $selector
     * @param int|null $idx
     *
     * @return SimpleHtmlDomInterface|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
Documentation: The doc-type SimpleHtmlDomInterface|S...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 74.

This check marks PHPDoc comments that could not be parsed by our parser. To see which comment annotations we can parse, please refer to our documentation on supported doc-types.
     */
    public function find(string $selector, $idx = null)
Duplication: This method seems to be duplicated in your project.
    {
        $xPathQuery = SelectorConverter::toXPath($selector);

        $xPath = new \DOMXPath($this->document);
        $nodesList = $xPath->query($xPathQuery);
        $elements = new SimpleHtmlDomNode();

        if ($nodesList) {
            foreach ($nodesList as $node) {
                $elements[] = new SimpleHtmlDom($node);
            }
        }

        // return all elements
        if ($idx === null) {
            if (\count($elements) === 0) {
                return new SimpleHtmlDomNodeBlank();
            }

            return $elements;
        }

        // handle negative values
        if ($idx < 0) {
            $idx = \count($elements) + $idx;
        }

        // return one element
        return $elements[$idx] ?? new SimpleHtmlDomBlank();
    }

    /**
     * Find nodes with a CSS selector.
     *
     * @param string $selector
     *
     * @return SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
Documentation: The doc-type SimpleHtmlDomInterface[]...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 51.
     */
    public function findMulti(string $selector): SimpleHtmlDomNodeInterface
    {
        return $this->find($selector, null);
    }

    /**
     * Find nodes with a CSS selector or false, if no element is found.
     *
     * @param string $selector
     *
     * @return false|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
Documentation: The doc-type false|SimpleHtmlDomInter...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 57.
     */
    public function findMultiOrFalse(string $selector)
    {
        $return = $this->find($selector, null);

        if ($return instanceof SimpleHtmlDomNodeBlank) {
            return false;
        }

        return $return;
    }

    /**
     * Find one node with a CSS selector.
     *
     * @param string $selector
     *
     * @return SimpleHtmlDomInterface
     */
    public function findOne(string $selector): SimpleHtmlDomInterface
    {
        return $this->find($selector, 0);
    }

    /**
     * Find one node with a CSS selector or false, if no element is found.
     *
     * @param string $selector
     *
     * @return false|SimpleHtmlDomInterface
     */
    public function findOneOrFalse(string $selector)
    {
        $return = $this->find($selector, 0);

        if ($return instanceof SimpleHtmlDomBlank) {
            return false;
        }

        return $return;
    }

    /**
     * @param string $content
     * @param bool   $multiDecodeNewHtmlEntity
     *
     * @return string
     */
    public function fixHtmlOutput(
        string $content,
        bool $multiDecodeNewHtmlEntity = false
    ): string {
        // INFO: DOMDocument will encapsulate plaintext into a e.g. paragraph tag (<p>),
        //          so we try to remove it here again ...

        if ($this->getIsDOMDocumentCreatedWithoutHtmlWrapper()) {
            /** @noinspection HtmlRequiredLangAttribute */
            $content = \str_replace(
                [
                    '<html>',
                    '</html>',
                ],
                '',
                $content
            );
        }

        if ($this->getIsDOMDocumentCreatedWithoutHeadWrapper()) {
            /** @noinspection HtmlRequiredTitleElement */
            $content = \str_replace(
                [
                    '<head>',
                    '</head>',
                ],
                '',
                $content
            );
        }

        if ($this->getIsDOMDocumentCreatedWithoutBodyWrapper()) {
            /** @noinspection HtmlRequiredLangAttribute */
            $content = \str_replace(
                [
                    '<body>',
                    '</body>',
                ],
                '',
                $content
            );
        }

        if ($this->getIsDOMDocumentCreatedWithFakeEndScript()) {
            $content = \str_replace(
                '</script>',
                '',
                $content
            );
        }

        if ($this->getIsDOMDocumentCreatedWithoutWrapper()) {
            $content = (string) \preg_replace('/^<p>/', '', $content);
            $content = (string) \preg_replace('/<\/p>/', '', $content);
        }

        if ($this->getIsDOMDocumentCreatedWithoutPTagWrapper()) {
            $content = \str_replace(
                [
                    '<p>',
                    '</p>',
                ],
                '',
                $content
            );
        }

        if ($this->getIsDOMDocumentCreatedWithoutHtml()) {
            $content = \str_replace(
                '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">',
                '',
                $content
            );
        }

        /** @noinspection CheckTagEmptyBody */
        /** @noinspection HtmlExtraClosingTag */
        /** @noinspection HtmlRequiredTitleElement */
        $content = \trim(
            \str_replace(
                [
                    '<simpleHtmlDomHtml>',
                    '</simpleHtmlDomHtml>',
                    '<simpleHtmlDomP>',
                    '</simpleHtmlDomP>',
                    '<head><head>',
                    '</head></head>',
                    '<br></br>',
                ],
                [
                    '',
                    '',
                    '',
                    '',
                    '<head>',
                    '</head>',
                    '<br>',
                ],
                $content
            )
        );

        $content = $this->decodeHtmlEntity($content, $multiDecodeNewHtmlEntity);

        return self::putReplacedBackToPreserveHtmlEntities($content);
    }

    /**
     * Return elements by ".class".
     *
     * @param string $class
     *
     * @return SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
Documentation: The doc-type SimpleHtmlDomInterface[]...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 51.
     */
    public function getElementByClass(string $class): SimpleHtmlDomNodeInterface
    {
        return $this->findMulti(".${class}");
    }

    /**
     * Return element by #id.
     *
     * @param string $id
     *
     * @return SimpleHtmlDomInterface
     */
    public function getElementById(string $id): SimpleHtmlDomInterface
    {
        return $this->findOne("#${id}");
    }

    /**
     * Return element by tag name.
     *
     * @param string $name
     *
     * @return SimpleHtmlDomInterface
     */
    public function getElementByTagName(string $name): SimpleHtmlDomInterface
    {
        $node = $this->document->getElementsByTagName($name)->item(0);

        if ($node === null) {
            return new SimpleHtmlDomBlank();
        }

        return new SimpleHtmlDom($node);
    }

    /**
     * Returns elements by "#id".
     *
     * @param string   $id
     * @param int|null $idx
     *
     * @return SimpleHtmlDomInterface|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
Documentation: The doc-type SimpleHtmlDomInterface|S...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 74.
     */
    public function getElementsById(string $id, $idx = null)
    {
        return $this->find("#${id}", $idx);
    }

    /**
     * Returns elements by tag name.
     *
     * @param string   $name
     * @param int|null $idx
     *
     * @return SimpleHtmlDomInterface|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
Documentation: The doc-type SimpleHtmlDomInterface|S...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 74.
     */
    public function getElementsByTagName(string $name, $idx = null)
    {
        $nodesList = $this->document->getElementsByTagName($name);

        $elements = new SimpleHtmlDomNode();

        foreach ($nodesList as $node) {
            $elements[] = new SimpleHtmlDom($node);
        }

        // return all elements
        if ($idx === null) {
            if (\count($elements) === 0) {
                return new SimpleHtmlDomNodeBlank();
            }

            return $elements;
        }

        // handle negative values
        if ($idx < 0) {
            $idx = \count($elements) + $idx;
        }

        // return one element
        return $elements[$idx] ?? new SimpleHtmlDomNodeBlank();
    }

    /**
     * Get dom node's outer html.
     *
     * @param bool $multiDecodeNewHtmlEntity
     *
     * @return string
     */
    public function html(bool $multiDecodeNewHtmlEntity = false): string
    {
        if (static::$callback !== null) {
            \call_user_func(static::$callback, [$this]);
        }

        if ($this->getIsDOMDocumentCreatedWithoutHtmlWrapper()) {
            $content = $this->document->saveHTML($this->document->documentElement);
        } else {
            $content = $this->document->saveHTML();
        }

        if ($content === false) {
            return '';
        }

        return $this->fixHtmlOutput($content, $multiDecodeNewHtmlEntity);
    }

    /**
     * Load HTML from string.
     *
     * @param string   $html
     * @param int|null $libXMLExtraOptions
     *
     * @return HtmlDomParser
     */
    public function loadHtml(string $html, $libXMLExtraOptions = null): DomParserInterface
    {
        // reset
        self::$domBrokenReplaceHelper = [];

        $this->document = $this->createDOMDocument($html, $libXMLExtraOptions);

        return $this;
Bug Best Practice: The return type of return $this; (voku\helper\HtmlDomParser) is incompatible with the return type declared by the interface voku\helper\DomParserInterface::loadHtml of type self.

If you return a value from a function or method, it should be a sub-type of the type that is given by the parent type, e.g. an interface or abstract method. This is more formally defined by the Liskov substitution principle, and guarantees that classes that depend on the parent type can use any instance of a child type interchangeably. This principle also belongs to the SOLID principles for object-oriented design.

Let’s take a look at an example:

class Author {
    private $name;

    public function __construct($name) {
        $this->name = $name;
    }

    public function getName() {
        return $this->name;
    }
}

abstract class Post {
    public function getAuthor() {
        return 'Johannes';
    }
}

class BlogPost extends Post {
    public function getAuthor() {
        return new Author('Johannes');
    }
}

class ForumPost extends Post { /* ... */ }

function my_function(Post $post) {
    echo strtoupper($post->getAuthor());
}

Our function my_function expects a Post object and outputs the author of the post. The base class Post returns a simple string, and outputting a simple string works just fine. However, the child class BlogPost, which is a sub-type of Post, returns an object instead and therefore violates the SOLID principles. If a BlogPost were passed to my_function, PHP would not complain at the call, but would ultimately fail when executing the strtoupper call in its body.
    }
722
723
    /**
724
     * Load HTML from file.
725
     *
726
     * @param string   $filePath
727
     * @param int|null $libXMLExtraOptions
728
     *
729
     * @throws \RuntimeException
730
     *
731
     * @return HtmlDomParser
732
     */
733 11 View Code Duplication
    public function loadHtmlFile(string $filePath, $libXMLExtraOptions = null): DomParserInterface
0 ignored issues
show
Duplication introduced by
This method seems to be duplicated in your project.

Duplicated code is one of the most pungent code smells. If you need to duplicate the same code in three or more different places, we strongly encourage you to look into extracting the code into a single class or operation.

You can also find more detailed suggestions in the “Code” section of your repository.

Loading history...
734
    {
        // reset
        self::$domBrokenReplaceHelper = [];

        if (
            !\preg_match("/^https?:\/\//i", $filePath)
            &&
            !\file_exists($filePath)
        ) {
            throw new \RuntimeException("File {$filePath} not found");
        }

        try {
            if (\class_exists('\voku\helper\UTF8')) {
                /** @noinspection PhpUndefinedClassInspection */
                $html = UTF8::file_get_contents($filePath);
            } else {
                $html = \file_get_contents($filePath);
            }
        } catch (\Exception $e) {
            throw new \RuntimeException("Could not load file {$filePath}");
        }

        if ($html === false) {
            throw new \RuntimeException("Could not load file {$filePath}");
        }

        return $this->loadHtml($html, $libXMLExtraOptions);
    }
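
The duplication note on loadHtmlFile suggests extracting the shared code into a single class or operation. A minimal sketch of that idea follows, assuming the duplicated part is the "check the path, read the file, throw on failure" sequence. The trait ReadsFileContents, its method readFileOrFail() and the class ExampleHtmlLoader are hypothetical names used only for illustration; they are not part of the library.

```php
<?php

declare(strict_types=1);

// Hypothetical helper trait: the name and method are assumptions for this sketch.
trait ReadsFileContents
{
    /**
     * @throws \RuntimeException if the file is missing or cannot be read
     */
    protected function readFileOrFail(string $filePath): string
    {
        $isRemote = \preg_match('/^https?:\/\//i', $filePath) === 1;

        if (!$isRemote && !\file_exists($filePath)) {
            throw new \RuntimeException("File {$filePath} not found");
        }

        // Suppress the warning so that a failure is reported only via the exception.
        $content = @\file_get_contents($filePath);

        if ($content === false) {
            throw new \RuntimeException("Could not load file {$filePath}");
        }

        return $content;
    }
}

// Hypothetical consumer: any parser class could reuse the trait the same way.
final class ExampleHtmlLoader
{
    use ReadsFileContents;

    public function load(string $filePath): string
    {
        return $this->readFileOrFail($filePath);
    }
}
```

Each class that currently repeats the loading logic could then call readFileOrFail() and pass the result on to its own loadHtml() equivalent, so the existence check and error messages live in one place.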

    /**
     * Get the HTML as XML or plain XML if needed.
     *
     * @param bool $multiDecodeNewHtmlEntity
     * @param bool $htmlToXml
     * @param bool $removeXmlHeader
     * @param int  $options
     *
     * @return string
     */
    public function xml(
Duplication introduced by
This method seems to be duplicated in your project.

Duplicated code is one of the most pungent code smells. If you need to duplicate the same code in three or more different places, we strongly encourage you to look into extracting the code into a single class or operation.

You can also find more detailed suggestions in the “Code” section of your repository.

        bool $multiDecodeNewHtmlEntity = false,
        bool $htmlToXml = true,
        bool $removeXmlHeader = true,
        int $options = \LIBXML_NOEMPTYTAG
    ): string {
        $xml = $this->document->saveXML(null, $options);

        if ($removeXmlHeader) {
            $xml = \ltrim((string) \preg_replace('/<\?xml.*\?>/', '', $xml));
        }

        if ($htmlToXml) {
            $return = $this->fixHtmlOutput($xml, $multiDecodeNewHtmlEntity);
        } else {
            $xml = $this->decodeHtmlEntity($xml, $multiDecodeNewHtmlEntity);

            $return = self::putReplacedBackToPreserveHtmlEntities($xml);
        }

        return $return;
    }

    /**
     * @param string $selector
     * @param int    $idx
     *
     * @return SimpleHtmlDomInterface|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface<SimpleHtmlDomInterface>
Documentation introduced by
The doc-type SimpleHtmlDomInterface|S...SimpleHtmlDomInterface> could not be parsed: Expected "|" or "end of type", but got "<" at position 74. (view supported doc-types)

This check marks PHPDoc comments that could not be parsed by our parser. To see which comment annotations we can parse, please refer to our documentation on supported doc-types.

     */
    public function __invoke($selector, $idx = null)
    {
        return $this->find($selector, $idx);
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutHeadWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutHeadWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutPTagWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutPTagWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutHtml(): bool
    {
        return $this->isDOMDocumentCreatedWithoutHtml;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutBodyWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutBodyWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutHtmlWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutHtmlWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithFakeEndScript(): bool
    {
        return $this->isDOMDocumentCreatedWithFakeEndScript;
    }

    /**
     * @param string $html
     *
     * @return string
     */
    protected function keepBrokenHtml(string $html): string
    {
        do {
            $original = $html;

            $html = (string) \preg_replace_callback(
                '/(?<start>.*)<(?<element_start>[a-z]+)(?<element_start_addon> [^>]*)?>(?<value>.*?)<\/(?<element_end>\2)>(?<end>.*)/sui',
                static function ($matches) {
                    return $matches['start'] .
                           '°lt_simple_html_dom__voku_°' . $matches['element_start'] . $matches['element_start_addon'] . '°gt_simple_html_dom__voku_°' .
                           $matches['value'] .
                           '°lt/_simple_html_dom__voku_°' . $matches['element_end'] . '°gt_simple_html_dom__voku_°' .
                           $matches['end'];
                },
                $html
            );
        } while ($original !== $html);

        do {
            $original = $html;

            $html = (string) \preg_replace_callback(
                '/(?<start>[^<]*)?(?<broken>(?:(?:<\/\w+(?:\s+\w+=\\"[^\"]+\\")*+)(?:[^<]+)>)+)(?<end>.*)/u',
                static function ($matches) {
                    $matches['broken'] = \str_replace(
                        ['°lt/_simple_html_dom__voku_°', '°lt_simple_html_dom__voku_°', '°gt_simple_html_dom__voku_°'],
                        ['</', '<', '>'],
                        $matches['broken']
                    );

                    self::$domBrokenReplaceHelper['orig'][] = $matches['broken'];
                    self::$domBrokenReplaceHelper['tmp'][] = $matchesHash = self::$domHtmlBrokenHtmlHelper . \crc32($matches['broken']);

                    return $matches['start'] . $matchesHash . $matches['end'];
                },
                $html
            );
        } while ($original !== $html);

        return \str_replace(
            ['°lt/_simple_html_dom__voku_°', '°lt_simple_html_dom__voku_°', '°gt_simple_html_dom__voku_°'],
            ['</', '<', '>'],
            $html
        );
    }
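
keepBrokenHtml records each broken fragment under the 'orig' key and a crc32-based token under the 'tmp' key, so the fragment can be swapped back in after parsing. The general protect-and-restore pattern can be sketched in isolation as follows. This is a simplified illustration, not the library's implementation: the function names, the token prefix and the much narrower regular expression (closing tags that carry attributes) are all assumptions.

```php
<?php

declare(strict_types=1);

/**
 * Replace closing tags that carry attributes (e.g. </div class="x">) with a
 * token and remember the original text. Names and token prefix are hypothetical.
 *
 * @param array<string, string> $map token => original fragment, filled by reference
 */
function protectFragments(string $html, array &$map): string
{
    return (string) \preg_replace_callback(
        '/<\/[a-z]+\s[^>]*>/i',
        static function (array $m) use (&$map): string {
            // crc32 yields a compact token suffix derived from the fragment itself.
            $token = '____placeholder____' . \crc32($m[0]);
            $map[$token] = $m[0];

            return $token;
        },
        $html
    );
}

/**
 * Put the remembered fragments back in place of their tokens.
 *
 * @param array<string, string> $map
 */
function restoreFragments(string $html, array $map): string
{
    return \str_replace(\array_keys($map), \array_values($map), $html);
}

$map = [];
$input = '<div>text</div class="broken">';

$protected = protectFragments($input, $map); // broken closing tag is now a token
$restored = restoreFragments($protected, $map); // identical to $input again
```

Keeping the token-to-fragment map separate from the markup means a parser that would otherwise drop or rewrite the malformed tag only ever sees the harmless token.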

    /**
     * @param string $html
     *
     * @return void
     */
    protected function keepSpecialScriptTags(string &$html)
    {
        // regEx for e.g.: [<script id="elements-image-1" type="text/html">...</script>]
        $html = (string) \preg_replace_callback(
            '/(?<start>((?:<script) [^>]*type=(?:["\'])?(?:text\/html|text\/x-custom-template)+(?:[^>]*)>))(?<innerContent>.*)(?<end><\/script>)/isU',
            static function ($matches) {
                if (
                    \strpos($matches['innerContent'], '+') === false
                    &&
                    \strpos($matches['innerContent'], '<%') === false
                    &&
                    \strpos($matches['innerContent'], '{%') === false
                    &&
                    \strpos($matches['innerContent'], '{{') === false
                ) {
                    // remove the html5 fallback
                    $matches[0] = \str_replace('<\/', '</', $matches[0]);

                    $specialNonScript = '<' . self::$domHtmlSpecialScriptHelper . \substr($matches[0], \strlen('<script'));

                    return \substr($specialNonScript, 0, -\strlen('</script>')) . '</' . self::$domHtmlSpecialScriptHelper . '>';
                }

                // remove the html5 fallback
                $matches['innerContent'] = \str_replace('<\/', '</', $matches['innerContent']);

                self::$domBrokenReplaceHelper['orig'][] = $matches['innerContent'];
                self::$domBrokenReplaceHelper['tmp'][] = $matchesHash = '' . self::$domHtmlBrokenHtmlHelper . '' . \crc32($matches['innerContent']);

                return $matches['start'] . $matchesHash . $matches['end'];
            },
            $html
        );
    }

    /**
     * @param bool $keepBrokenHtml
     *
     * @return HtmlDomParser
     */
    public function useKeepBrokenHtml(bool $keepBrokenHtml): DomParserInterface
    {
        $this->keepBrokenHtml = $keepBrokenHtml;

        return $this;
    }
}