Completed
Push — master ( 3206d0...a4cc25 )
by Lars
01:53
created

HtmlDomParser   F

Complexity

Total Complexity 111

Size/Duplication

Total Lines 910
Duplicated Lines 25.38 %

Coupling/Cohesion

Components 1
Dependencies 7

Test Coverage

Coverage 95.28%

Importance

Changes 0
Metric Value
wmc 111
lcom 1
cbo 7
dl 231
loc 910
ccs 303
cts 318
cp 0.9528
rs 1.69
c 0
b 0
f 0

32 Methods

Rating   Name   Duplication   Size   Complexity  
A __construct() 31 31 5
A __call() 0 10 2
A __callStatic() 20 20 3
A __toString() 0 4 1
A clear() 0 4 1
F createDOMDocument() 30 126 29
A find() 29 29 5
A findMulti() 0 4 1
B __get() 0 18 7
A findMultiOrFalse() 0 10 2
A findOne() 0 4 1
A findOneOrFalse() 0 10 2
B fixHtmlOutput() 0 83 6
A getElementByClass() 0 4 1
A getElementById() 0 4 1
A getElementByTagName() 0 10 2
A getElementsById() 0 4 1
A getElementsByTagName() 0 27 5
A html() 0 18 4
A loadHtml() 0 6 1
B loadHtmlFile() 27 27 6
A putReplacedBackToPreserveHtmlEntities() 37 37 4
B replaceToPreserveHtmlEntities() 35 35 6
A xml() 22 22 3
A __invoke() 0 4 1
A getIsDOMDocumentCreatedWithoutHeadWrapper() 0 4 1
A getIsDOMDocumentCreatedWithoutHtml() 0 4 1
A getIsDOMDocumentCreatedWithoutHtmlWrapper() 0 4 1
A getIsDOMDocumentCreatedWithoutWrapper() 0 4 1
A keepBrokenHtml() 0 45 3
A keepSpecialScriptTags() 0 18 3
A useKeepBrokenHtml() 0 6 1

How to fix

Duplicated Code

Duplicated code is one of the most pungent code smells. A common rule of thumb is to restructure code once it is duplicated in three or more places.

Common duplication problems, and corresponding solutions are:
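One duplication visible in this class is the tail of find() and getElementsByTagName(): both return all elements when no index is given, resolve negative indexes from the end, and fall back to a blank placeholder. A minimal sketch of pulling that tail into one helper is shown here. The function name `selectByIndex` and its placeholder parameters are hypothetical, and a plain array stands in for the library's node list.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): the shared
 * "all elements / negative index / single element" logic.
 *
 * @param array    $elements  matched elements, 0-indexed
 * @param int|null $idx       null = return all, negative = count from the end
 * @param mixed    $blankList placeholder returned when nothing matched
 * @param mixed    $blankOne  placeholder returned when $idx is out of range
 *
 * @return mixed
 */
function selectByIndex(array $elements, ?int $idx, $blankList, $blankOne)
{
    // return all elements
    if ($idx === null) {
        return \count($elements) === 0 ? $blankList : $elements;
    }

    // handle negative values
    if ($idx < 0) {
        $idx = \count($elements) + $idx;
    }

    // return one element
    return $elements[$idx] ?? $blankOne;
}

// Usage: the last of three matches.
$last = selectByIndex(['a', 'b', 'c'], -1, [], null);
```

With a helper of this shape, each caller would only build its element list and pass in the blank objects it wants to return.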

Complex Class

Tip: Before tackling complexity, make sure that you eliminate any duplication first. This can often reduce the size of classes significantly.

Complex classes like HtmlDomParser often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to finding such a component is to look for fields and methods that share the same prefixes or suffixes. You can also look at the cohesion graph to spot any unconnected or weakly connected components.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a subclass, Extract Subclass is also a candidate, and is often faster.
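In this class, the `isDOMDocumentCreatedWithout...` fields share a prefix and are all set from the same input checks in createDOMDocument(), which makes them a plausible Extract Class candidate. The sketch below is only an illustration under that assumption: the class name `InputShape`, its property names, and the `detect()` factory are hypothetical, while the `strpos` checks mirror the ones in the listing.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical value object (not part of the library) that groups the
 * "created without ..." flags and the checks that set them.
 */
final class InputShape
{
    /** @var bool */
    public $withoutHtml = false;

    /** @var bool */
    public $withoutWrapper = false;

    /** @var bool */
    public $withoutHtmlWrapper = false;

    /** @var bool */
    public $withoutHeadWrapper = false;

    public static function detect(string $html): self
    {
        $shape = new self();

        if (\strpos($html, '<') === false) {
            // plain text, no markup at all
            $shape->withoutHtml = true;
        } elseif (\strpos(\ltrim($html), '<') !== 0) {
            // markup that starts with text instead of a tag
            $shape->withoutWrapper = true;
        }

        $shape->withoutHtmlWrapper = \strpos($html, '<html') === false;
        $shape->withoutHeadWrapper = \strpos($html, '<head>') === false;

        return $shape;
    }
}

// Usage: a fragment without <html> or <head>.
$shape = InputShape::detect('<p>foo</p>');
```

The parser could then hold one such object instead of several booleans, and fixHtmlOutput() would read the flags from it.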

While breaking up the class, it is a good idea to analyze how other classes use HtmlDomParser, and based on these observations, apply Extract Interface, too.
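As a rough illustration of Extract Interface: if callers only ever ask the parser for its HTML output, they can depend on a narrow interface instead of the whole class. The names `HtmlRenderable`, `StaticHtml`, and `renderedLength` are hypothetical and exist only for this sketch; the `html()` signature is the one from the listing.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical narrow interface: only what an output-consuming caller needs.
 */
interface HtmlRenderable
{
    public function html(bool $multiDecodeNewHtmlEntity = false): string;
}

/**
 * Stand-in implementation so the sketch runs without the library.
 */
final class StaticHtml implements HtmlRenderable
{
    /** @var string */
    private $markup;

    public function __construct(string $markup)
    {
        $this->markup = $markup;
    }

    public function html(bool $multiDecodeNewHtmlEntity = false): string
    {
        return $this->markup;
    }
}

/**
 * A caller typed against the interface, not against a concrete parser.
 */
function renderedLength(HtmlRenderable $source): int
{
    return \strlen($source->html());
}

$length = renderedLength(new StaticHtml('<p>x</p>'));
```

A class such as HtmlDomParser could implement the same interface, so callers like `renderedLength()` would not need to change when the parser is split up.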

<?php

declare(strict_types=1);

namespace voku\helper;

/**
 * @property-read string $outerText
 *                                 <p>Get dom node's outer html (alias for "outerHtml").</p>
 * @property-read string $outerHtml
 *                                 <p>Get dom node's outer html.</p>
 * @property-read string $innerText
 *                                 <p>Get dom node's inner html (alias for "innerHtml").</p>
 * @property-read string $innerHtml
 *                                 <p>Get dom node's inner html.</p>
 * @property-read string $plaintext
 *                                 <p>Get dom node's plain text.</p>
 *
 * @method string outerText()
 *                                 <p>Get dom node's outer html (alias for "outerHtml()").</p>
 * @method string outerHtml()
 *                                 <p>Get dom node's outer html.</p>
 * @method string innerText()
 *                                 <p>Get dom node's inner html (alias for "innerHtml()").</p>
 * @method HtmlDomParser load(string $html)
 *                                 <p>Load HTML from string.</p>
 * @method HtmlDomParser load_file(string $html)
 *                                 <p>Load HTML from file.</p>
 * @method static HtmlDomParser file_get_html($html, $libXMLExtraOptions = null)
 *                                 <p>Load HTML from file.</p>
 * @method static HtmlDomParser str_get_html($html, $libXMLExtraOptions = null)
 *                                 <p>Load HTML from string.</p>
 */
class HtmlDomParser extends AbstractDomParser
{
    /**
     * @var string[]
     */
    protected static $functionAliases = [
        'outertext' => 'html',
        'outerhtml' => 'html',
        'innertext' => 'innerHtml',
        'innerhtml' => 'innerHtml',
        'load'      => 'loadHtml',
        'load_file' => 'loadHtmlFile',
    ];

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutHtml = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutHeadWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutHtmlWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithFakeEndScript = false;

    /**
     * @var bool
     */
    protected $keepBrokenHtml;

    // Review note (Duplication): This method seems to be duplicated in your project.
    // Duplicated code is one of the most pungent code smells. If you need to duplicate the same
    // code in three or more different places, we strongly encourage you to look into extracting
    // the code into a single class or operation.
    /**
     * @param \DOMNode|SimpleHtmlDomInterface|string $element HTML code or SimpleHtmlDomInterface, \DOMNode
     */
    public function __construct($element = null)
    {
        $this->document = new \DOMDocument('1.0', $this->getEncoding());

        // reset
        self::$domBrokenReplaceHelper = [];

        // DOMDocument settings
        $this->document->preserveWhiteSpace = true;
        $this->document->formatOutput = true;

        if ($element instanceof SimpleHtmlDomInterface) {
            $element = $element->getNode();
        }

        if ($element instanceof \DOMNode) {
            $domNode = $this->document->importNode($element, true);

            if ($domNode instanceof \DOMNode) {
                /** @noinspection UnusedFunctionResultInspection */
                $this->document->appendChild($domNode);
            }

            return;
        }

        if ($element !== null) {
            /** @noinspection UnusedFunctionResultInspection */
            $this->loadHtml($element);
        }
    }

    /**
     * @param string $name
     * @param array  $arguments
     *
     * @return bool|mixed
     */
    public function __call($name, $arguments)
    {
        $name = \strtolower($name);

        if (isset(self::$functionAliases[$name])) {
            return \call_user_func_array([$this, self::$functionAliases[$name]], $arguments);
        }

        throw new \BadMethodCallException('Method does not exist: ' . $name);
    }

    // Review note (Duplication): This method seems to be duplicated in your project.
    /**
     * @param string $name
     * @param array  $arguments
     *
     * @throws \BadMethodCallException
     * @throws \RuntimeException
     *
     * @return HtmlDomParser
     */
    public static function __callStatic($name, $arguments)
    {
        $arguments0 = $arguments[0] ?? '';

        $arguments1 = $arguments[1] ?? null;

        if ($name === 'str_get_html') {
            $parser = new static();

            return $parser->loadHtml($arguments0, $arguments1);
        }

        if ($name === 'file_get_html') {
            $parser = new static();

            return $parser->loadHtmlFile($arguments0, $arguments1);
        }

        throw new \BadMethodCallException('Method does not exist');
    }

    /** @noinspection MagicMethodsValidityInspection */

    /**
     * @param string $name
     *
     * @return string|null
     */
    public function __get($name)
    {
        $name = \strtolower($name);

        switch ($name) {
            case 'outerhtml':
            case 'outertext':
                return $this->html();
            case 'innerhtml':
            case 'innertext':
                return $this->innerHtml();
            case 'text':
            case 'plaintext':
                return $this->text();
        }

        return null;
    }

    /**
     * @return string
     */
    public function __toString()
    {
        return $this->html();
    }

    /**
     * does nothing (only for api-compatibility-reasons)
     *
     * @return bool
     *
     * @deprecated
     */
    public function clear(): bool
    {
        return true;
    }

    /**
     * Create DOMDocument from HTML.
     *
     * @param string   $html
     * @param int|null $libXMLExtraOptions
     *
     * @return \DOMDocument
     */
    protected function createDOMDocument(string $html, $libXMLExtraOptions = null): \DOMDocument
    {
        if ($this->keepBrokenHtml) {
            $html = $this->keepBrokenHtml(\trim($html));
        }

        if (\strpos($html, '<') === false) {
            $this->isDOMDocumentCreatedWithoutHtml = true;
        } elseif (\strpos(\ltrim($html), '<') !== 0) {
            $this->isDOMDocumentCreatedWithoutWrapper = true;
        }

        if (\strpos($html, '<html') === false) {
            $this->isDOMDocumentCreatedWithoutHtmlWrapper = true;
        }

        /** @noinspection HtmlRequiredTitleElement */
        if (\strpos($html, '<head>') === false) {
            $this->isDOMDocumentCreatedWithoutHeadWrapper = true;
        }

        if (
            \strpos($html, '</script>') === false
            &&
            \strpos($html, '<\/script>') !== false
        ) {
            $this->isDOMDocumentCreatedWithFakeEndScript = true;
        }

        if (\strpos($html, '<script') !== false) {
            $this->html5FallbackForScriptTags($html);

            if (
                \strpos($html, 'type="text/html"') !== false
                ||
                \strpos($html, 'type=\'text/html\'') !== false
                ||
                \strpos($html, 'type=text/html') !== false
                ||
                \strpos($html, 'type="text/x-custom-template"') !== false
                ||
                \strpos($html, 'type=\'text/x-custom-template\'') !== false
                ||
                \strpos($html, 'type=text/x-custom-template') !== false
            ) {
                $this->keepSpecialScriptTags($html);
            }
        }

        // set error level
        $internalErrors = \libxml_use_internal_errors(true);
        $disableEntityLoader = \libxml_disable_entity_loader(true);
        \libxml_clear_errors();

        $optionsXml = \LIBXML_DTDLOAD | \LIBXML_DTDATTR | \LIBXML_NONET;

        if (\defined('LIBXML_BIGLINES')) {
            $optionsXml |= \LIBXML_BIGLINES;
        }

        if (\defined('LIBXML_COMPACT')) {
            $optionsXml |= \LIBXML_COMPACT;
        }

        if (\defined('LIBXML_HTML_NODEFDTD')) {
            $optionsXml |= \LIBXML_HTML_NODEFDTD;
        }

        if ($libXMLExtraOptions !== null) {
            $optionsXml |= $libXMLExtraOptions;
        }

        if (
            $this->isDOMDocumentCreatedWithoutWrapper
            ||
            $this->keepBrokenHtml
        ) {
            $html = '<' . self::$domHtmlWrapperHelper . '>' . $html . '</' . self::$domHtmlWrapperHelper . '>';
        }

        $html = self::replaceToPreserveHtmlEntities($html);

        $documentFound = false;
        $sxe = \simplexml_load_string($html, \SimpleXMLElement::class, $optionsXml);
        // Review note (Duplication): This code seems to be duplicated across your project.
        if ($sxe !== false && \count(\libxml_get_errors()) === 0) {
            $domElementTmp = \dom_import_simplexml($sxe);
            if ($domElementTmp) {
                $documentFound = true;
                $this->document = $domElementTmp->ownerDocument;
            }
        }

        // Review note (Duplication): This code seems to be duplicated across your project.
        if ($documentFound === false) {

            // UTF-8 hack: http://php.net/manual/en/domdocument.loadhtml.php#95251
            $xmlHackUsed = false;
            // stripos() takes (haystack, needle); the reviewed code passed them in reverse order
            // (\stripos('<?xml', $html)), which made this check true for almost every input.
            if (\stripos($html, '<?xml') !== 0) {
                $xmlHackUsed = true;
                $html = '<?xml encoding="' . $this->getEncoding() . '" ?>' . $html;
            }

            $this->document->loadHTML($html, $optionsXml);

            // remove the "xml-encoding" hack
            if ($xmlHackUsed) {
                foreach ($this->document->childNodes as $child) {
                    if ($child->nodeType === \XML_PI_NODE) {
                        /** @noinspection UnusedFunctionResultInspection */
                        $this->document->removeChild($child);

                        break;
                    }
                }
            }
        }

        // set encoding
        $this->document->encoding = $this->getEncoding();

        // restore lib-xml settings
        \libxml_clear_errors();
        \libxml_use_internal_errors($internalErrors);
        \libxml_disable_entity_loader($disableEntityLoader);

        return $this->document;
    }

    // Review note (Duplication): This method seems to be duplicated in your project.
    /**
     * Find list of nodes with a CSS selector.
     *
     * @param string   $selector
     * @param int|null $idx
     *
     * @return SimpleHtmlDomInterface|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface
     */
    public function find(string $selector, $idx = null)
    {
        $xPathQuery = SelectorConverter::toXPath($selector);

        $xPath = new \DOMXPath($this->document);
        $nodesList = $xPath->query($xPathQuery);
        $elements = new SimpleHtmlDomNode();

        foreach ($nodesList as $node) {
            $elements[] = new SimpleHtmlDom($node);
        }

        // return all elements
        if ($idx === null) {
            if (\count($elements) === 0) {
                return new SimpleHtmlDomNodeBlank();
            }

            return $elements;
        }

        // handle negative values
        if ($idx < 0) {
            $idx = \count($elements) + $idx;
        }

        // return one element
        return $elements[$idx] ?? new SimpleHtmlDomBlank();
    }

    /**
     * Find nodes with a CSS selector.
     *
     * @param string $selector
     *
     * @return SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface
     */
    public function findMulti(string $selector): SimpleHtmlDomNodeInterface
    {
        return $this->find($selector, null);
    }

    /**
     * Find nodes with a CSS selector or false, if no element is found.
     *
     * @param string $selector
     *
     * @return SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface|false
     */
    public function findMultiOrFalse(string $selector)
    {
        $return = $this->find($selector, null);

        if ($return instanceof SimpleHtmlDomNodeBlank) {
            return false;
        }

        return $return;
    }

    /**
     * Find one node with a CSS selector.
     *
     * @param string $selector
     *
     * @return SimpleHtmlDomInterface
     */
    public function findOne(string $selector): SimpleHtmlDomInterface
    {
        return $this->find($selector, 0);
    }

    /**
     * Find one node with a CSS selector or false, if no element is found.
     *
     * @param string $selector
     *
     * @return SimpleHtmlDomInterface|false
     */
    public function findOneOrFalse(string $selector)
    {
        $return = $this->find($selector, 0);

        if ($return instanceof SimpleHtmlDomBlank) {
            return false;
        }

        return $return;
    }

    /**
     * @param string $content
     * @param bool   $multiDecodeNewHtmlEntity
     *
     * @return string
     */
    public function fixHtmlOutput(string $content, bool $multiDecodeNewHtmlEntity = false): string
    {
        // INFO: DOMDocument will encapsulate plaintext into a e.g. paragraph tag (<p>),
        //          so we try to remove it here again ...

        if ($this->isDOMDocumentCreatedWithoutHtmlWrapper) {
            /** @noinspection HtmlRequiredLangAttribute */
            $content = \str_replace(
                [
                    '<body>',
                    '</body>',
                    '<html>',
                    '</html>',
                ],
                '',
                $content
            );
        }

        if ($this->isDOMDocumentCreatedWithoutHeadWrapper) {
            /** @noinspection HtmlRequiredTitleElement */
            $content = \str_replace(
                [
                    '<head>',
                    '</head>',
                ],
                '',
                $content
            );
        }

        if ($this->isDOMDocumentCreatedWithFakeEndScript) {
            $content = \str_replace(
                '</script>',
                '',
                $content
            );
        }

        if ($this->isDOMDocumentCreatedWithoutWrapper) {
            $content = (string) \preg_replace('/^<p>/', '', $content);
            $content = (string) \preg_replace('/<\/p>/', '', $content);
        }

        if ($this->isDOMDocumentCreatedWithoutHtml) {
            $content = \str_replace(
                [
                    '<p>',
                    '</p>',
                    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">',
                ],
                '',
                $content
            );
        }

        /** @noinspection CheckTagEmptyBody */
        /** @noinspection HtmlExtraClosingTag */
        /** @noinspection HtmlRequiredTitleElement */
        $content = \trim(
            \str_replace(
                [
                    '<simpleHtmlDomP>',
                    '</simpleHtmlDomP>',
                    '<head><head>',
                    '</head></head>',
                    '<br></br>',
                ],
                [
                    '',
                    '',
                    '<head>',
                    '</head>',
                    '<br>',
                ],
                $content
            )
        );

        $content = $this->decodeHtmlEntity($content, $multiDecodeNewHtmlEntity);

        return self::putReplacedBackToPreserveHtmlEntities($content);
    }

    /**
     * Return elements by .class.
     *
     * @param string $class
     *
     * @return SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface
     */
    public function getElementByClass(string $class): SimpleHtmlDomNodeInterface
    {
        return $this->findMulti(".${class}");
    }

    /**
     * Return element by #id.
     *
     * @param string $id
     *
     * @return SimpleHtmlDomInterface
     */
    public function getElementById(string $id): SimpleHtmlDomInterface
    {
        return $this->findOne("#${id}");
    }

    /**
     * Return element by tag name.
     *
     * @param string $name
     *
     * @return SimpleHtmlDomInterface
     */
    public function getElementByTagName(string $name): SimpleHtmlDomInterface
    {
        $node = $this->document->getElementsByTagName($name)->item(0);

        if ($node === null) {
            return new SimpleHtmlDomBlank();
        }

        return new SimpleHtmlDom($node);
    }

    /**
     * Returns elements by #id.
     *
     * @param string   $id
     * @param int|null $idx
     *
     * @return SimpleHtmlDomInterface|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface
     */
    public function getElementsById(string $id, $idx = null)
    {
        return $this->find("#${id}", $idx);
    }

    /**
     * Returns elements by tag name.
     *
     * @param string   $name
     * @param int|null $idx
     *
     * @return SimpleHtmlDomInterface|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface
     */
    public function getElementsByTagName(string $name, $idx = null)
    {
        $nodesList = $this->document->getElementsByTagName($name);

        $elements = new SimpleHtmlDomNode();

        foreach ($nodesList as $node) {
            $elements[] = new SimpleHtmlDom($node);
        }

        // return all elements
        if ($idx === null) {
            if (\count($elements) === 0) {
                return new SimpleHtmlDomNodeBlank();
            }

            return $elements;
        }

        // handle negative values
        if ($idx < 0) {
            $idx = \count($elements) + $idx;
        }

        // return one element; a single blank element (as in find()) rather than the
        // blank node list the reviewed code returned here
        return $elements[$idx] ?? new SimpleHtmlDomBlank();
    }

    /**
     * Get dom node's outer html.
     *
     * @param bool $multiDecodeNewHtmlEntity
     *
     * @return string
     */
    public function html(bool $multiDecodeNewHtmlEntity = false): string
    {
        if ($this::$callback !== null) {
            \call_user_func($this::$callback, [$this]);
        }

        if ($this->getIsDOMDocumentCreatedWithoutHtmlWrapper()) {
            $content = $this->document->saveHTML($this->document->documentElement);
        } else {
            $content = $this->document->saveHTML();
        }

        if ($content === false) {
            return '';
        }

        return $this->fixHtmlOutput($content, $multiDecodeNewHtmlEntity);
    }

    // Review note (Bug, Best Practice) on "return $this;" in loadHtml():
    // The return type of "return $this;" (voku\helper\HtmlDomParser) is incompatible with the
    // return type declared by the interface voku\helper\DomParserInterface::loadHtml of type self.
    //
    // If you return a value from a function or method, it should be a sub-type of the type that
    // is given by the parent type, e.g. an interface or abstract method. This is more formally
    // defined by the Liskov substitution principle, and guarantees that classes that depend on
    // the parent type can use any instance of a child type interchangeably. This principle also
    // belongs to the SOLID principles for object-oriented design.
    //
    // Let's take a look at an example:
    //
    //     class Author {
    //         private $name;
    //
    //         public function __construct($name) {
    //             $this->name = $name;
    //         }
    //
    //         public function getName() {
    //             return $this->name;
    //         }
    //     }
    //
    //     abstract class Post {
    //         public function getAuthor() {
    //             return 'Johannes';
    //         }
    //     }
    //
    //     class BlogPost extends Post {
    //         public function getAuthor() {
    //             return new Author('Johannes');
    //         }
    //     }
    //
    //     class ForumPost extends Post { /* ... */ }
    //
    //     function my_function(Post $post) {
    //         echo strtoupper($post->getAuthor());
    //     }
    //
    // Our function my_function expects a Post object and outputs the author of the post. The
    // base class Post returns a simple string, and outputting a simple string works just fine.
    // However, the child class BlogPost, which is a sub-type of Post, returns an object instead
    // and therefore violates the SOLID principles. If a BlogPost were passed to my_function,
    // PHP would not complain, but would ultimately fail when executing the strtoupper call in
    // its body.
    /**
     * Load HTML from string.
     *
     * @param string   $html
     * @param int|null $libXMLExtraOptions
     *
     * @return HtmlDomParser
     */
    public function loadHtml(string $html, $libXMLExtraOptions = null): DomParserInterface
    {
        $this->document = $this->createDOMDocument($html, $libXMLExtraOptions);

        return $this;
    }

    // Review note (Duplication): This method seems to be duplicated in your project.
    /**
     * Load HTML from file.
     *
     * @param string   $filePath
     * @param int|null $libXMLExtraOptions
     *
     * @throws \RuntimeException
     *
     * @return HtmlDomParser
     */
    public function loadHtmlFile(string $filePath, $libXMLExtraOptions = null): DomParserInterface
    {
        if (
            !\preg_match("/^https?:\/\//i", $filePath)
            &&
            !\file_exists($filePath)
        ) {
            throw new \RuntimeException("File ${filePath} not found");
        }

        try {
            if (\class_exists('\voku\helper\UTF8')) {
                /** @noinspection PhpUndefinedClassInspection */
                $html = UTF8::file_get_contents($filePath);
            } else {
                $html = \file_get_contents($filePath);
            }
        } catch (\Exception $e) {
            throw new \RuntimeException("Could not load file ${filePath}");
        }

        if ($html === false) {
            throw new \RuntimeException("Could not load file ${filePath}");
        }

        return $this->loadHtml($html, $libXMLExtraOptions);
    }

    // Review note (Duplication): This method seems to be duplicated in your project.
    /**
     * @param string $html
     *
     * @return string
     */
    public static function putReplacedBackToPreserveHtmlEntities(string $html): string
    {
        static $DOM_REPLACE__HELPER_CACHE = null;

        if ($DOM_REPLACE__HELPER_CACHE === null) {
            $DOM_REPLACE__HELPER_CACHE['tmp'] = \array_merge(
                self::$domLinkReplaceHelper['tmp'],
                self::$domReplaceHelper['tmp']
            );
            $DOM_REPLACE__HELPER_CACHE['orig'] = \array_merge(
                self::$domLinkReplaceHelper['orig'],
                self::$domReplaceHelper['orig']
            );

            $DOM_REPLACE__HELPER_CACHE['tmp']['html_wrapper__start'] = '<' . self::$domHtmlWrapperHelper . '>';
            $DOM_REPLACE__HELPER_CACHE['tmp']['html_wrapper__end'] = '</' . self::$domHtmlWrapperHelper . '>';

            $DOM_REPLACE__HELPER_CACHE['orig']['html_wrapper__start'] = '';
            $DOM_REPLACE__HELPER_CACHE['orig']['html_wrapper__end'] = '';

            $DOM_REPLACE__HELPER_CACHE['tmp']['html_special_script__start'] = '<' . self::$domHtmlSpecialScriptHelper;
            $DOM_REPLACE__HELPER_CACHE['tmp']['html_special_script__end'] = '</' . self::$domHtmlSpecialScriptHelper . '>';

            $DOM_REPLACE__HELPER_CACHE['orig']['html_special_script__start'] = '<script';
            $DOM_REPLACE__HELPER_CACHE['orig']['html_special_script__end'] = '</script>';
        }

        if (
            isset(self::$domBrokenReplaceHelper['tmp'])
            &&
            \count(self::$domBrokenReplaceHelper['tmp']) > 0
        ) {
            $html = \str_replace(self::$domBrokenReplaceHelper['tmp'], self::$domBrokenReplaceHelper['orig'], $html);
        }

        return \str_replace($DOM_REPLACE__HELPER_CACHE['tmp'], $DOM_REPLACE__HELPER_CACHE['orig'], $html);
    }

742
    /**
743
     * @param string $html
744
     *
745
     * @return string
746
     */
747 134 View Code Duplication
    public static function replaceToPreserveHtmlEntities(string $html): string
    {
        // init
        $linksNew = [];
        $linksOld = [];

        if (\strpos($html, 'http') !== false) {

            // regEx for e.g.: [https://www.domain.de/foo.php?foobar=1&email=lars%40moelleken.org&guid=test1233312&{{foo}}#foo]
            $regExUrl = '/(\[?\bhttps?:\/\/[^\s<>]+(?:\([\w]+\)|[^[:punct:]\s]|\/|\}|\]))/i';
            \preg_match_all($regExUrl, $html, $linksOld);

            if (!empty($linksOld[1])) {
                $linksOld = $linksOld[1];
                foreach ((array) $linksOld as $linkKey => $linkOld) {
                    $linksNew[$linkKey] = \str_replace(
                        self::$domLinkReplaceHelper['orig'],
                        self::$domLinkReplaceHelper['tmp'],
                        $linkOld
                    );
                }
            }
        }

        $linksNewCount = \count($linksNew);
        if ($linksNewCount > 0 && \count($linksOld) === $linksNewCount) {
            $search = \array_merge($linksOld, self::$domReplaceHelper['orig']);
            $replace = \array_merge($linksNew, self::$domReplaceHelper['tmp']);
        } else {
            $search = self::$domReplaceHelper['orig'];
            $replace = self::$domReplaceHelper['tmp'];
        }

        return \str_replace($search, $replace, $html);
    }

    /**
     * Get the HTML as XML or plain XML if needed.
     *
     * @param bool $multiDecodeNewHtmlEntity
     * @param bool $htmlToXml
     * @param bool $removeXmlHeader
     * @param int  $options
     *
     * @return string
     */
    public function xml(
        bool $multiDecodeNewHtmlEntity = false,
        bool $htmlToXml = true,
        bool $removeXmlHeader = true,
        int $options = \LIBXML_NOEMPTYTAG
    ): string {
        $xml = $this->document->saveXML(null, $options);

        if ($removeXmlHeader) {
            $xml = \ltrim((string) \preg_replace('/<\?xml.*\?>/', '', $xml));
        }

        if ($htmlToXml) {
            $return = $this->fixHtmlOutput($xml, $multiDecodeNewHtmlEntity);
        } else {
            $xml = $this->decodeHtmlEntity($xml, $multiDecodeNewHtmlEntity);

            $return = self::putReplacedBackToPreserveHtmlEntities($xml);
        }

        return $return;
    }

    /**
     * @param string   $selector
     * @param int|null $idx
     *
     * @return SimpleHtmlDomInterface|SimpleHtmlDomInterface[]|SimpleHtmlDomNodeInterface
     */
    public function __invoke($selector, $idx = null)
    {
        return $this->find($selector, $idx);
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutHeadWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutHeadWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutHtml(): bool
    {
        return $this->isDOMDocumentCreatedWithoutHtml;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutHtmlWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutHtmlWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutWrapper;
    }

    /**
     * @param string $html
     *
     * @return string
     */
    protected function keepBrokenHtml(string $html): string
    {
        // first pass: mask the brackets of elements that have a matching closing tag,
        // repeating until the string no longer changes
        do {
            $original = $html;

            $html = (string) \preg_replace_callback(
                '/(?<start>.*)<(?<element_start>[a-z]+)(?<element_start_addon> [^>]*)?>(?<value>.*?)<\/(?<element_end>\2)>(?<end>.*)/sui',
                static function ($matches) {
                    return $matches['start'] .
                           '°lt_simple_html_dom__voku_°' . $matches['element_start'] . $matches['element_start_addon'] . '°gt_simple_html_dom__voku_°' .
                           $matches['value'] .
                           '°lt/_simple_html_dom__voku_°' . $matches['element_end'] . '°gt_simple_html_dom__voku_°' .
                           $matches['end'];
                },
                $html
            );
        } while ($original !== $html);

        // second pass: closing-tag fragments that are still unmasked are treated as
        // broken HTML and swapped for a hash placeholder
        do {
            $original = $html;

            $html = (string) \preg_replace_callback(
                '/(?<start>[^<]*)?(?<broken>(?:(?:<\/\w+(?:\s+\w+=\\"[^\"]+\\")*+)(?:[^<]+)>)+)(?<end>.*)/u',
                static function ($matches) {
                    $matches['broken'] = \str_replace(
                        ['°lt/_simple_html_dom__voku_°', '°lt_simple_html_dom__voku_°', '°gt_simple_html_dom__voku_°'],
                        ['</', '<', '>'],
                        $matches['broken']
                    );

                    self::$domBrokenReplaceHelper['orig'][] = $matches['broken'];
                    self::$domBrokenReplaceHelper['tmp'][] = $matchesHash = '____simple_html_dom__voku__broken_html____' . \crc32($matches['broken']);

                    return $matches['start'] . $matchesHash . $matches['end'];
                },
                $html
            );
        } while ($original !== $html);

        // turn the masked brackets back into real ones
        return \str_replace(
            ['°lt/_simple_html_dom__voku_°', '°lt_simple_html_dom__voku_°', '°gt_simple_html_dom__voku_°'],
            ['</', '<', '>'],
            $html
        );
    }

    /**
     * @param string $html passed by reference and modified in place
     */
    protected function keepSpecialScriptTags(string &$html)
    {
        $specialScripts = [];
        // regEx for e.g.: [<script id="elements-image-1" type="text/html">...</script>]
        $regExSpecialScript = '/<(script) [^>]*type=(["\']){0,1}(text\/html|text\/x-custom-template)\2{0,1}([^>]*)>.*<\/\1>/isU';
        \preg_match_all($regExSpecialScript, $html, $specialScripts);

        if (isset($specialScripts[0])) {
            foreach ($specialScripts[0] as $specialScript) {
                // rename the opening and closing <script> tag to the helper tag name
                $specialNonScript = '<' . self::$domHtmlSpecialScriptHelper . \substr($specialScript, \strlen('<script'));
                $specialNonScript = \substr($specialNonScript, 0, -\strlen('</script>')) . '</' . self::$domHtmlSpecialScriptHelper . '>';
                // remove the html5 fallback
                $specialNonScript = \str_replace('<\/', '</', $specialNonScript);

                $html = \str_replace($specialScript, $specialNonScript, $html);
            }
        }
    }

    /**
     * @param bool $keepBrokenHtml
     *
     * @return HtmlDomParser
     */
    public function useKeepBrokenHtml(bool $keepBrokenHtml): DomParserInterface
    {
        $this->keepBrokenHtml = $keepBrokenHtml;

        return $this;
    }
}
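
The two static helpers in this listing, `replaceToPreserveHtmlEntities()` and `putReplacedBackToPreserveHtmlEntities()`, both rely on a round trip through `str_replace()` with parallel search and replace arrays. The sketch below shows that idea in isolation. It assumes a made-up two-entry table: the names `$orig`, `$tmp`, `maskEntities()` and `unmaskEntities()` and the placeholder strings are illustrative only and are not the library's actual values.

```php
<?php

declare(strict_types=1);

// Hypothetical lookup table (assumption): each original token has one unique placeholder.
$orig = ['&', '%'];
$tmp  = ['__sketch_amp__', '__sketch_percent__'];

/**
 * Swap sensitive characters for placeholders before the HTML reaches a parser.
 */
function maskEntities(string $html, array $orig, array $tmp): string
{
    return \str_replace($orig, $tmp, $html);
}

/**
 * Reverse the mapping on the parser output by swapping the two arrays.
 */
function unmaskEntities(string $html, array $orig, array $tmp): string
{
    return \str_replace($tmp, $orig, $html);
}

$input  = '<a href="https://example.com/?a=1&b=50%25">x</a>';
$masked = maskEntities($input, $orig, $tmp);
$output = unmaskEntities($masked, $orig, $tmp);
```

The round trip is only lossless when no placeholder contains one of the original tokens and the input does not already contain a placeholder, which is presumably why the class uses long, unusual marker strings.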