Completed
Push — master ( 7c5d47...308454 )
by Lars
01:48 queued 12s
created

HtmlDomParser   F

Complexity

Total Complexity 108

Size/Duplication

Total Lines 986
Duplicated Lines 3.45 %

Coupling/Cohesion

Components 1
Dependencies 4

Test Coverage

Coverage 94.74%

Importance

Changes 0
Metric Value
wmc 108
lcom 1
cbo 4
dl 34
loc 986
ccs 306
cts 323
cp 0.9474
rs 1.614
c 0
b 0
f 0

36 Methods

Rating   Name   Duplication   Size   Complexity  
A __construct() 0 31 5
A __call() 10 10 2
A __callStatic() 0 20 3
B __get() 0 18 7
A __invoke() 0 4 1
A __toString() 0 4 1
A clear() 0 4 1
B replaceToPreserveHtmlEntities() 0 35 6
A putReplacedBackToPreserveHtmlEntities() 0 37 4
F createDOMDocument() 0 113 24
A html5FallbackForScriptTags() 0 8 1
A keepSpecialScriptTags() 0 19 3
A keepBrokenHtml() 0 45 3
A getElementById() 0 4 1
A getElementByTagName() 0 10 2
A getElementsById() 0 4 1
A getElementsByTagName() 23 23 4
A findOne() 0 4 1
A find() 0 25 4
C fixHtmlOutput() 0 107 9
A getDocument() 0 4 1
A getEncoding() 0 4 1
A getIsDOMDocumentCreatedWithoutHtml() 0 4 1
A getIsDOMDocumentCreatedWithoutHtmlWrapper() 0 4 1
A getIsDOMDocumentCreatedWithoutHeadWrapper() 0 4 1
A getIsDOMDocumentCreatedWithoutWrapper() 0 4 1
A html() 0 14 3
A useKeepBrokenHtml() 0 6 1
A xml() 0 9 1
A innerHtml() 0 11 2
A loadHtml() 0 6 1
B loadHtmlFile() 0 27 6
A save() 0 9 2
A set_callback() 0 4 1
A text() 0 4 1
A __clone() 0 4 1

How to fix

Duplicated Code

Duplicated code is one of the most pungent code smells. A common rule of thumb is to restructure code once it is duplicated in three or more places.

Common duplication problems, and corresponding solutions are:
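One duplication pattern that shows up in listings like this one is a shared tail: two lookup methods build their result list differently and then apply the same index handling (return everything, count negative indexes from the end, fall back when nothing is found). The following is a minimal sketch of moving that tail into one place, assuming plain PHP arrays. The function name `pick_by_index` and its `$fallback` parameter are hypothetical and not part of the library.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): return the whole list when
 * no index is given, otherwise a single item or the fallback.
 *
 * @param array    $items
 * @param int|null $idx      null = all items; negative values count from the end
 * @param mixed    $fallback returned when the index does not exist
 *
 * @return mixed
 */
function pick_by_index(array $items, ?int $idx = null, $fallback = null)
{
    // return all elements
    if ($idx === null) {
        return $items;
    }

    // handle negative values
    if ($idx < 0) {
        $idx = \count($items) + $idx;
    }

    // return one element, or the fallback when nothing is at that position
    return $items[$idx] ?? $fallback;
}

$tags = ['div', 'p', 'span'];

$all  = pick_by_index($tags);
$last = pick_by_index($tags, -1);
$none = pick_by_index($tags, 7, 'blank');
```

Each caller would then only be responsible for building its own list and could delegate the index handling to the shared helper.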

Complex Class

Tip: Before tackling complexity, make sure that you eliminate any duplication first. This can often reduce the size of classes significantly.

Complex classes like HtmlDomParser often do a lot of different things. To break such a class down, we need to identify a cohesive component within it. A common approach is to look for fields and methods that share the same prefixes or suffixes. You can also look at the cohesion graph to spot unconnected or weakly connected components.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a subclass, Extract Subclass is also a candidate and is often faster.

While breaking up the class, it is a good idea to analyze how other classes use HtmlDomParser, and based on these observations, apply Extract Interface, too.
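As an illustration of Extract Class combined with Extract Interface, the sketch below pulls a placeholder-substitution responsibility into its own small class behind an interface. It is only a sketch under assumptions: the names `PlaceholderCodec` and `EntityPlaceholderCodec` are hypothetical, the placeholder tokens are shortened stand-ins, and nothing here is the library's actual design.

```php
<?php

declare(strict_types=1);

// Extract Interface: callers depend on this contract, not on the parser class.
interface PlaceholderCodec
{
    public function encode(string $html): string;

    public function decode(string $html): string;
}

// Extract Class: the substitution tables and both directions live together.
final class EntityPlaceholderCodec implements PlaceholderCodec
{
    /** @var string[] characters that must survive parsing untouched */
    private array $orig = ['&', '|', '+', '%', '@'];

    /** @var string[] illustrative placeholder tokens (shortened, hypothetical) */
    private array $tmp = [
        '____AMP____',
        '____PIPE____',
        '____PLUS____',
        '____PERCENT____',
        '____AT____',
    ];

    public function encode(string $html): string
    {
        return \str_replace($this->orig, $this->tmp, $html);
    }

    public function decode(string $html): string
    {
        return \str_replace($this->tmp, $this->orig, $html);
    }
}

$codec   = new EntityPlaceholderCodec();
$encoded = $codec->encode('a&b@c');
$decoded = $codec->decode($encoded);
```

The parser would then hold a `PlaceholderCodec` instead of several static arrays, which would make the substitution logic testable on its own.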

<?php

declare(strict_types=1);

namespace voku\helper;

/**
 * @property-read string outerText <p>Get dom node's outer html (alias for "outerHtml").</p>
 * @property-read string outerHtml <p>Get dom node's outer html.</p>
 * @property-read string innerText <p>Get dom node's inner html (alias for "innerHtml").</p>
 * @property-read string innerHtml <p>Get dom node's inner html.</p>
 * @property-read string plaintext <p>Get dom node's plain text.</p>
 *
 * @method string outerText() <p>Get dom node's outer html (alias for "outerHtml()").</p>
 * @method string outerHtml() <p>Get dom node's outer html.</p>
 * @method string innerText() <p>Get dom node's inner html (alias for "innerHtml()").</p>
 * @method HtmlDomParser load() load($html) <p>Load HTML from string.</p>
 * @method HtmlDomParser load_file() load_file($html) <p>Load HTML from file.</p>
 * @method static HtmlDomParser file_get_html() file_get_html($html, $libXMLExtraOptions = null) <p>Load HTML from file.</p>
 * @method static HtmlDomParser str_get_html() str_get_html($html, $libXMLExtraOptions = null) <p>Load HTML from string.</p>
 */
class HtmlDomParser
{
    /**
     * @var array
     */
    protected static $functionAliases = [
        'outertext' => 'html',
        'outerhtml' => 'html',
        'innertext' => 'innerHtml',
        'innerhtml' => 'innerHtml',
        'load'      => 'loadHtml',
        'load_file' => 'loadHtmlFile',
    ];

    /**
     * @var string[][]
     */
    protected static $domLinkReplaceHelper = [
        'orig' => ['[', ']', '{', '}'],
        'tmp'  => [
            '____SIMPLE_HTML_DOM__VOKU__SQUARE_BRACKET_LEFT____',
            '____SIMPLE_HTML_DOM__VOKU__SQUARE_BRACKET_RIGHT____',
            '____SIMPLE_HTML_DOM__VOKU__BRACKET_LEFT____',
            '____SIMPLE_HTML_DOM__VOKU__BRACKET_RIGHT____',
        ],
    ];

    /**
     * @var array
     */
    protected static $domReplaceHelper = [
        'orig' => ['&', '|', '+', '%', '@'],
        'tmp'  => [
            '____SIMPLE_HTML_DOM__VOKU__AMP____',
            '____SIMPLE_HTML_DOM__VOKU__PIPE____',
            '____SIMPLE_HTML_DOM__VOKU__PLUS____',
            '____SIMPLE_HTML_DOM__VOKU__PERCENT____',
            '____SIMPLE_HTML_DOM__VOKU__AT____',
        ],
    ];

    protected static $domHtmlWrapperHelper = '____simple_html_dom__voku__html_wrapper____';

    protected static $domHtmlSpecialScriptHelper = '____simple_html_dom__voku__html_special_sctipt____';

    /**
     * @var array
     */
    protected static $domBrokenReplaceHelper = [];

    /**
     * @var callable
     */
    protected static $callback;

    /**
     * @var \DOMDocument
     */
    protected $document;

    /**
     * @var string
     */
    protected $encoding = 'UTF-8';

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutHtml = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutHeadWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithoutHtmlWrapper = false;

    /**
     * @var bool
     */
    protected $isDOMDocumentCreatedWithFakeEndScript = false;

    /**
     * @var bool
     */
    protected $keepBrokenHtml;

    /**
     * Constructor
     *
     * @param \DOMNode|SimpleHtmlDom|string $element HTML code or SimpleHtmlDom, \DOMNode
     *
     * @throws \InvalidArgumentException
     */
    public function __construct($element = null)
    {
        $this->document = new \DOMDocument('1.0', $this->getEncoding());

        // reset
        self::$domBrokenReplaceHelper = [];

        // DOMDocument settings
        $this->document->preserveWhiteSpace = true;
        $this->document->formatOutput = true;

        if ($element instanceof SimpleHtmlDom) {
            $element = $element->getNode();
        }

        if ($element instanceof \DOMNode) {
            $domNode = $this->document->importNode($element, true);

            if ($domNode instanceof \DOMNode) {
                /** @noinspection UnusedFunctionResultInspection */
                $this->document->appendChild($domNode);
            }

            return;
        }

        if ($element !== null) {
            /** @noinspection UnusedFunctionResultInspection */
            $this->loadHtml($element);
        }
    }

    /**
     * @param $name
     * @param $arguments
     *
     * @return bool|mixed
     */
    public function __call($name, $arguments)
Duplication: This method seems to be duplicated in your project.

Duplicated code is one of the most pungent code smells. If you need to duplicate the same code in three or more different places, we strongly encourage you to look into extracting the code into a single class or operation.
    {
        $name = \strtolower($name);

        if (isset(self::$functionAliases[$name])) {
            return \call_user_func_array([$this, self::$functionAliases[$name]], $arguments);
        }

        throw new \BadMethodCallException('Method does not exist: ' . $name);
    }

    /**
     * @param $name
     * @param $arguments
     *
     * @throws \BadMethodCallException
     * @throws \RuntimeException
     * @throws \InvalidArgumentException
     *
     * @return HtmlDomParser
     */
    public static function __callStatic($name, $arguments)
    {
        $arguments0 = $arguments[0] ?? '';

        $arguments1 = $arguments[1] ?? null;

        if ($name === 'str_get_html') {
            $parser = new self();

            return $parser->loadHtml($arguments0, $arguments1);
        }

        if ($name === 'file_get_html') {
            $parser = new self();

            return $parser->loadHtmlFile($arguments0, $arguments1);
Bug / Best Practice: The return type of return $parser->loadHtml...guments0, $arguments1); (self) is incompatible with the return type documented by voku\helper\HtmlDomParser::__callStatic of type voku\helper\HtmlDomParser.

If you return a value from a function or method, it should be a sub-type of the type declared by the parent type, e.g. an interface or abstract method. This is formally defined by the Liskov substitution principle, and it guarantees that classes depending on the parent type can use any instance of a child type interchangeably. The principle is also one of the SOLID principles of object-oriented design.

Let’s take a look at an example:

class Author {
    private $name;

    public function __construct($name) {
        $this->name = $name;
    }

    public function getName() {
        return $this->name;
    }
}

abstract class Post {
    public function getAuthor() {
        return 'Johannes';
    }
}

class BlogPost extends Post {
    public function getAuthor() {
        return new Author('Johannes');
    }
}

class ForumPost extends Post { /* ... */ }

function my_function(Post $post) {
    echo strtoupper($post->getAuthor());
}

Our function my_function expects a Post object and outputs the author of the post. The base class Post returns a plain string, so passing it to strtoupper works fine. However, the child class BlogPost, which is a sub-type of Post, returns an Author object instead and therefore violates the Liskov substitution principle. If a BlogPost were passed to my_function, PHP would accept the argument, but the strtoupper call in the function body would fail at runtime.
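One possible way to repair the example, sketched here under the assumption of PHP 7.4 or later, is to declare the return type on the parent method so that every sub-type has to honour it. The constructor on BlogPost and the function name `shout_author` are hypothetical additions for illustration only.

```php
<?php

declare(strict_types=1);

final class Author
{
    private string $name;

    public function __construct(string $name)
    {
        $this->name = $name;
    }

    public function getName(): string
    {
        return $this->name;
    }
}

abstract class Post
{
    // The declared return type is the contract every sub-type must keep.
    public function getAuthor(): string
    {
        return 'Johannes';
    }
}

class BlogPost extends Post
{
    private Author $author;

    public function __construct(Author $author)
    {
        $this->author = $author;
    }

    // Still works with an Author object internally, but returns a string.
    public function getAuthor(): string
    {
        return $this->author->getName();
    }
}

function shout_author(Post $post): string
{
    return \strtoupper($post->getAuthor());
}

$shouted = shout_author(new BlogPost(new Author('Johannes')));
```

With the return type declared, a child method that tried to return an Author object would be rejected by PHP instead of failing later inside the caller.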

        }

        throw new \BadMethodCallException('Method does not exist');
    }

    /** @noinspection MagicMethodsValidityInspection */

    /**
     * @param $name
     *
     * @return string|null
     */
    public function __get($name)
    {
        $name = \strtolower($name);

        switch ($name) {
            case 'outerhtml':
            case 'outertext':
                return $this->html();
            case 'innerhtml':
            case 'innertext':
                return $this->innerHtml();
            case 'text':
            case 'plaintext':
                return $this->text();
        }

        return null;
    }

    /**
     * @param string $selector
     * @param int    $idx
     *
     * @return SimpleHtmlDom|SimpleHtmlDom[]|SimpleHtmlDomNodeInterface
     */
    public function __invoke($selector, $idx = null)
    {
        return $this->find($selector, $idx);
    }

    /**
     * @return string
     */
    public function __toString()
    {
        return $this->html();
    }

    /**
     * does nothing (only for api-compatibility-reasons)
     *
     * @deprecated
     *
     * @return bool
     */
    public function clear(): bool
    {
        return true;
    }

    /**
     * @param string $html
     *
     * @return string
     */
    public static function replaceToPreserveHtmlEntities(string $html): string
    {
        // init
        $linksNew = [];
        $linksOld = [];

        if (\strpos($html, 'http') !== false) {

            // regEx for e.g.: [https://www.domain.de/foo.php?foobar=1&email=lars%40moelleken.org&guid=test1233312&{{foo}}#foo]
            $regExUrl = '/(\[?\bhttps?:\/\/[^\s<>]+(?:\([\w]+\)|[^[:punct:]\s]|\/|\}|\]))/i';
            \preg_match_all($regExUrl, $html, $linksOld);

            if (!empty($linksOld[1])) {
                $linksOld = $linksOld[1];
                foreach ((array) $linksOld as $linkKey => $linkOld) {
                    $linksNew[$linkKey] = \str_replace(
                        self::$domLinkReplaceHelper['orig'],
                        self::$domLinkReplaceHelper['tmp'],
                        $linkOld
                    );
                }
            }
        }

        $linksNewCount = \count($linksNew);
        if ($linksNewCount > 0 && \count($linksOld) === $linksNewCount) {
            $search = \array_merge($linksOld, self::$domReplaceHelper['orig']);
            $replace = \array_merge($linksNew, self::$domReplaceHelper['tmp']);
        } else {
            $search = self::$domReplaceHelper['orig'];
            $replace = self::$domReplaceHelper['tmp'];
        }

        return \str_replace($search, $replace, $html);
    }

    /**
     * @param string $html
     *
     * @return string
     */
    public static function putReplacedBackToPreserveHtmlEntities(string $html): string
    {
        static $DOM_REPLACE__HELPER_CACHE = null;

        if ($DOM_REPLACE__HELPER_CACHE === null) {
            $DOM_REPLACE__HELPER_CACHE['tmp'] = \array_merge(
                self::$domLinkReplaceHelper['tmp'],
                self::$domReplaceHelper['tmp']
            );
            $DOM_REPLACE__HELPER_CACHE['orig'] = \array_merge(
                self::$domLinkReplaceHelper['orig'],
                self::$domReplaceHelper['orig']
            );

            $DOM_REPLACE__HELPER_CACHE['tmp']['html_wrapper__start'] = '<' . self::$domHtmlWrapperHelper . '>';
            $DOM_REPLACE__HELPER_CACHE['tmp']['html_wrapper__end'] = '</' . self::$domHtmlWrapperHelper . '>';

            $DOM_REPLACE__HELPER_CACHE['orig']['html_wrapper__start'] = '';
            $DOM_REPLACE__HELPER_CACHE['orig']['html_wrapper__end'] = '';

            $DOM_REPLACE__HELPER_CACHE['tmp']['html_special_script__start'] = '<' . self::$domHtmlSpecialScriptHelper;
            $DOM_REPLACE__HELPER_CACHE['tmp']['html_special_script__end'] = '</' . self::$domHtmlSpecialScriptHelper . '>';

            $DOM_REPLACE__HELPER_CACHE['orig']['html_special_script__start'] = '<script';
            $DOM_REPLACE__HELPER_CACHE['orig']['html_special_script__end'] = '</script>';
        }

        if (
            isset(self::$domBrokenReplaceHelper['tmp'])
            &&
            \count(self::$domBrokenReplaceHelper['tmp']) > 0
        ) {
            $html = \str_replace(self::$domBrokenReplaceHelper['tmp'], self::$domBrokenReplaceHelper['orig'], $html);
        }

        return \str_replace($DOM_REPLACE__HELPER_CACHE['tmp'], $DOM_REPLACE__HELPER_CACHE['orig'], $html);
    }

    /**
     * Create DOMDocument from HTML.
     *
     * @param string   $html
     * @param int|null $libXMLExtraOptions
     *
     * @return \DOMDocument
     */
    private function createDOMDocument(string $html, $libXMLExtraOptions = null): \DOMDocument
    {
        if ($this->keepBrokenHtml) {
            $html = $this->keepBrokenHtml(\trim($html));
        }

        if (\strpos($html, '<') === false) {
            $this->isDOMDocumentCreatedWithoutHtml = true;
        } elseif (\strpos(\ltrim($html), '<') !== 0) {
            $this->isDOMDocumentCreatedWithoutWrapper = true;
        }

        if (\strpos($html, '<html') === false) {
            $this->isDOMDocumentCreatedWithoutHtmlWrapper = true;
        }

        /** @noinspection HtmlRequiredTitleElement */
        if (\strpos($html, '<head>') === false) {
            $this->isDOMDocumentCreatedWithoutHeadWrapper = true;
        }

        if (
            \strpos($html, '</script>') === false
            &&
            \strpos($html, '<\/script>') !== false
        ) {
            $this->isDOMDocumentCreatedWithFakeEndScript = true;
        }

        if (\strpos($html, '<script') !== false) {
            $this->html5FallbackForScriptTags($html);

            if (
                \strpos($html, 'type="text/html"') !== false
                ||
                \strpos($html, 'type=\'text/html\'') !== false
                ||
                \strpos($html, 'type=text/html') !== false
            ) {
                $this->keepSpecialScriptTags($html);
            }
        }

        // set error level
        $internalErrors = \libxml_use_internal_errors(true);
        $disableEntityLoader = \libxml_disable_entity_loader(true);
        \libxml_clear_errors();

        $optionsXml = \LIBXML_DTDLOAD | \LIBXML_DTDATTR | \LIBXML_NONET;

        if (\defined('LIBXML_BIGLINES')) {
            $optionsXml |= \LIBXML_BIGLINES;
        }

        if (\defined('LIBXML_COMPACT')) {
            $optionsXml |= \LIBXML_COMPACT;
        }

        if (\defined('LIBXML_HTML_NODEFDTD')) {
            $optionsXml |= \LIBXML_HTML_NODEFDTD;
        }

        if ($libXMLExtraOptions !== null) {
            $optionsXml |= $libXMLExtraOptions;
        }

        if (
            $this->isDOMDocumentCreatedWithoutWrapper
            ||
            $this->keepBrokenHtml
        ) {
            $html = '<' . self::$domHtmlWrapperHelper . '>' . $html . '</' . self::$domHtmlWrapperHelper . '>';
        }

        $html = self::replaceToPreserveHtmlEntities($html);

        $sxe = \simplexml_load_string($html, \SimpleXMLElement::class, $optionsXml);
        if ($sxe !== false && \count(\libxml_get_errors()) === 0) {
            $this->document = \dom_import_simplexml($sxe)->ownerDocument;
        } else {

            // UTF-8 hack: http://php.net/manual/en/domdocument.loadhtml.php#95251
            $xmlHackUsed = false;
            // stripos() takes the haystack first: only add the declaration when $html does not already start with one
            if (\stripos($html, '<?xml') !== 0) {
                $xmlHackUsed = true;
                $html = '<?xml encoding="' . $this->getEncoding() . '" ?>' . $html;
            }

            $this->document->loadHTML($html, $optionsXml);

            // remove the "xml-encoding" hack
            if ($xmlHackUsed) {
                foreach ($this->document->childNodes as $child) {
                    if ($child->nodeType === \XML_PI_NODE) {
                        /** @noinspection UnusedFunctionResultInspection */
                        $this->document->removeChild($child);

                        break;
                    }
                }
            }
        }

        // set encoding
        $this->document->encoding = $this->getEncoding();

        // restore lib-xml settings
        \libxml_clear_errors();
        \libxml_use_internal_errors($internalErrors);
        \libxml_disable_entity_loader($disableEntityLoader);

        return $this->document;
    }

    /**
     * workaround for bug: https://bugs.php.net/bug.php?id=74628
     *
     * @param string $html
     */
    protected function html5FallbackForScriptTags(string &$html)
    {
        // regEx for e.g.: [<script id="elements-image-2">...<script>]
        $regExSpecialScript = '/<(script)(?<attr>[^>]*)>(?<content>.*)<\/\1>/isU';
        $html = \preg_replace_callback($regExSpecialScript, function ($scripts) {
            return '<script' . $scripts['attr'] . '>' . \str_replace('</', '<\/', $scripts['content']) . '</script>';
        }, $html);
    }

    /**
     * @param string $html
     */
    protected function keepSpecialScriptTags(string &$html)
    {
        $specialScripts = [];
        // regEx for e.g.: [<script id="elements-image-1" type="text/html">...</script>]
        $regExSpecialScript = '/<(script) [^>]*type=(["\']){0,1}text\/html\2{0,1}([^>]*)>.*<\/\1>/isU';
        \preg_match_all($regExSpecialScript, $html, $specialScripts);

        if (isset($specialScripts[0])) {
            foreach ($specialScripts[0] as $specialScript) {

                $specialNonScript = '<' . self::$domHtmlSpecialScriptHelper . \substr($specialScript, \strlen('<script'));
                $specialNonScript = \substr($specialNonScript, 0, -\strlen('</script>')) . '</' . self::$domHtmlSpecialScriptHelper . '>';
                // remove the html5 fallback
                $specialNonScript = \str_replace('<\/', '</', $specialNonScript);

                $html = \str_replace($specialScript, $specialNonScript, $html);
            }
        }
    }

    /**
     * @param string $html
     *
     * @return string
     */
    protected function keepBrokenHtml(string $html): string
    {
        do {
            $original = $html;

            $html = (string) \preg_replace_callback(
                '/(?<start>.*)<(?<element_start>[a-z]+)(?<element_start_addon> [^>]*)?>(?<value>.*?)<\/(?<element_end>\2)>(?<end>.*)/sui',
                function ($matches) {
                    return $matches['start'] .
                           '°lt_simple_html_dom__voku_°' . $matches['element_start'] . $matches['element_start_addon'] . '°gt_simple_html_dom__voku_°' .
                           $matches['value'] .
                           '°lt/_simple_html_dom__voku_°' . $matches['element_end'] . '°gt_simple_html_dom__voku_°' .
                           $matches['end'];
                },
                $html
            );
        } while ($original !== $html);

        do {
            $original = $html;

            $html = (string) \preg_replace_callback(
                '/(?<start>[^<]*)?(?<broken>(?:(?:<\/\w+(?:\s+\w+=\\"[^\"]+\\")*+)(?:[^<]+)>)+)(?<end>.*)/u',
                function ($matches) {
                    $matches['broken'] = \str_replace(
                        ['°lt/_simple_html_dom__voku_°', '°lt_simple_html_dom__voku_°', '°gt_simple_html_dom__voku_°'],
                        ['</', '<', '>'],
                        $matches['broken']
                    );

                    self::$domBrokenReplaceHelper['orig'][] = $matches['broken'];
                    self::$domBrokenReplaceHelper['tmp'][] = $matchesHash = '____simple_html_dom__voku__broken_html____' . \crc32($matches['broken']);

                    return $matches['start'] . $matchesHash . $matches['end'];
                },
                $html
            );
        } while ($original !== $html);

        return \str_replace(
            ['°lt/_simple_html_dom__voku_°', '°lt_simple_html_dom__voku_°', '°gt_simple_html_dom__voku_°'],
            ['</', '<', '>'],
            $html
        );
    }

    /**
     * Return element by #id.
     *
     * @param string $id
     *
     * @return SimpleHtmlDom|SimpleHtmlDomNodeBlank
     */
    public function getElementById(string $id)
    {
        return $this->find("#${id}", 0);
    }

    /**
     * Return element by tag name.
     *
     * @param string $name
     *
     * @return SimpleHtmlDom|SimpleHtmlDomNodeBlank
     */
    public function getElementByTagName(string $name)
    {
        $node = $this->document->getElementsByTagName($name)->item(0);

        if ($node === null) {
            return new SimpleHtmlDomNodeBlank();
        }

        return new SimpleHtmlDom($node);
    }

    /**
     * Returns elements by #id.
     *
     * @param string   $id
     * @param int|null $idx
     *
     * @return SimpleHtmlDom|SimpleHtmlDom[]|SimpleHtmlDomNodeInterface
     */
    public function getElementsById(string $id, $idx = null)
    {
        return $this->find("#${id}", $idx);
    }

    /**
     * Returns elements by tag name.
     *
     * @param string   $name
     * @param int|null $idx
     *
     * @return SimpleHtmlDom|SimpleHtmlDom[]|SimpleHtmlDomNode|SimpleHtmlDomNodeBlank
     */
    public function getElementsByTagName(string $name, $idx = null)
Duplication: This method seems to be duplicated in your project.
    {
        $nodesList = $this->document->getElementsByTagName($name);

        $elements = new SimpleHtmlDomNode();

        foreach ($nodesList as $node) {
            $elements[] = new SimpleHtmlDom($node);
        }

        // return all elements
        if ($idx === null) {
            return $elements;
        }

        // handle negative values
        if ($idx < 0) {
            $idx = \count($elements) + $idx;
        }

        // return one element
        return $elements[$idx] ?? new SimpleHtmlDomNodeBlank();
    }

    /**
     * Find one node with a CSS selector.
     *
     * @param string $selector
     *
     * @return SimpleHtmlDom|SimpleHtmlDomNodeInterface
     */
    public function findOne(string $selector)
    {
        return $this->find($selector, 0);
    }

    /**
     * Find list of nodes with a CSS selector.
     *
     * @param string $selector
     * @param int    $idx
     *
     * @return SimpleHtmlDom|SimpleHtmlDom[]|SimpleHtmlDomNodeInterface
     */
    public function find(string $selector, $idx = null)
    {
        $xPathQuery = SelectorConverter::toXPath($selector);

        $xPath = new \DOMXPath($this->document);
        $nodesList = $xPath->query($xPathQuery);
        $elements = new SimpleHtmlDomNode();

        foreach ($nodesList as $node) {
            $elements[] = new SimpleHtmlDom($node);
        }

        // return all elements
        if ($idx === null) {
            return $elements;
        }

        // handle negative values
        if ($idx < 0) {
            $idx = \count($elements) + $idx;
        }

        // return one element
        return $elements[$idx] ?? new SimpleHtmlDomNodeBlank();
    }

    /**
     * @param string $content
     * @param bool   $multiDecodeNewHtmlEntity
     *
     * @return string
     */
    public function fixHtmlOutput(string $content, bool $multiDecodeNewHtmlEntity = false): string
    {
        // INFO: DOMDocument will encapsulate plaintext into a e.g. paragraph tag (<p>),
        //          so we try to remove it here again ...

        if ($this->isDOMDocumentCreatedWithoutHtmlWrapper) {
            /** @noinspection HtmlRequiredLangAttribute */
            $content = \str_replace(
                [
                    '<body>',
                    '</body>',
                    '<html>',
                    '</html>',
                ],
                '',
                $content
            );
        }

        if ($this->isDOMDocumentCreatedWithoutHeadWrapper) {
            /** @noinspection HtmlRequiredTitleElement */
            $content = \str_replace(
                [
                    '<head>',
                    '</head>',
                ],
                '',
                $content
            );
        }

        if ($this->isDOMDocumentCreatedWithFakeEndScript) {
            $content = \str_replace(
                '</script>',
                '',
                $content
            );
        }

        if ($this->isDOMDocumentCreatedWithoutWrapper) {
            $content = (string) \preg_replace('/^<p>/', '', $content);
            $content = (string) \preg_replace('/<\/p>/', '', $content);
        }

        if ($this->isDOMDocumentCreatedWithoutHtml) {
            $content = \str_replace(
                [
                    '<p>',
                    '</p>',
                    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">',
                ],
                '',
                $content
            );
        }

        /** @noinspection CheckTagEmptyBody */
        /** @noinspection HtmlExtraClosingTag */
        /** @noinspection HtmlRequiredTitleElement */
        $content = \trim(
            \str_replace(
                [
                    '<simpleHtmlDomP>',
                    '</simpleHtmlDomP>',
                    '<head><head>',
                    '</head></head>',
                    '<br></br>',
                ],
                [
                    '',
                    '',
                    '<head>',
                    '</head>',
                    '<br>',
                ],
                $content
            )
        );

        if ($multiDecodeNewHtmlEntity) {
            if (\class_exists('\voku\helper\UTF8')) {

                /** @noinspection PhpUndefinedClassInspection */
                $content = UTF8::rawurldecode($content);
            } else {
                do {
                    $content_compare = $content;

                    $content = \rawurldecode(
                        \html_entity_decode(
                            $content,
                            \ENT_QUOTES | \ENT_HTML5
                        )
                    );
                } while ($content_compare !== $content);
            }
        } else {
            $content = \rawurldecode(
                \html_entity_decode(
                    $content,
                    \ENT_QUOTES | \ENT_HTML5
                )
            );
        }

        return self::putReplacedBackToPreserveHtmlEntities($content);
    }

    /**
     * @return \DOMDocument
     */
    public function getDocument(): \DOMDocument
    {
        return $this->document;
    }

    /**
     * Get the encoding to use.
     *
     * @return string
     */
    private function getEncoding(): string
    {
        return $this->encoding;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutHtml(): bool
    {
        return $this->isDOMDocumentCreatedWithoutHtml;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutHtmlWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutHtmlWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutHeadWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutHeadWrapper;
    }

    /**
     * @return bool
     */
    public function getIsDOMDocumentCreatedWithoutWrapper(): bool
    {
        return $this->isDOMDocumentCreatedWithoutWrapper;
    }

    /**
     * Get the DOM node's outer HTML.
     *
     * @param bool $multiDecodeNewHtmlEntity
     *
     * @return string
     */
    public function html(bool $multiDecodeNewHtmlEntity = false): string
    {
        // run the registered callback (if any) before serializing
        if ($this::$callback !== null) {
            \call_user_func($this::$callback, [$this]);
        }

        if ($this->getIsDOMDocumentCreatedWithoutHtmlWrapper()) {
            // serialize only the root element
            $content = $this->document->saveHTML($this->document->documentElement);
        } else {
            $content = $this->document->saveHTML();
        }

        return $this->fixHtmlOutput($content, $multiDecodeNewHtmlEntity);
    }

    /**
     * @param bool $keepBrokenHtml
     *
     * @return HtmlDomParser
     */
    public function useKeepBrokenHtml(bool $keepBrokenHtml): self
    {
        $this->keepBrokenHtml = $keepBrokenHtml;

        return $this;
    }

    /**
     * Get the HTML as XML.
     *
     * @param bool $multiDecodeNewHtmlEntity
     *
     * @return string
     */
    public function xml(bool $multiDecodeNewHtmlEntity = false): string
    {
        $xml = $this->document->saveXML(null, \LIBXML_NOEMPTYTAG);

        // remove the XML-header
        $xml = \ltrim((string) \preg_replace('/<\?xml.*\?>/', '', $xml));

        return $this->fixHtmlOutput($xml, $multiDecodeNewHtmlEntity);
    }

    /**
     * Get the DOM node's inner HTML.
     *
     * @param bool $multiDecodeNewHtmlEntity
     *
     * @return string
     */
    public function innerHtml(bool $multiDecodeNewHtmlEntity = false): string
    {
        // init
        $text = '';

        // concatenate the serialized HTML of each child of the root element
        foreach ($this->document->documentElement->childNodes as $node) {
            $text .= $this->document->saveHTML($node);
        }

        return $this->fixHtmlOutput($text, $multiDecodeNewHtmlEntity);
    }

    /**
     * Load HTML from string.
     *
     * @param string   $html
     * @param int|null $libXMLExtraOptions
     *
     * @throws \InvalidArgumentException if argument is not string
     *
     * @return HtmlDomParser
     */
    public function loadHtml(string $html, $libXMLExtraOptions = null): self
    {
        $this->document = $this->createDOMDocument($html, $libXMLExtraOptions);

        return $this;
    }

    /**
     * Load HTML from file.
     *
     * @param string   $filePath
     * @param int|null $libXMLExtraOptions
     *
     * @throws \RuntimeException
     * @throws \InvalidArgumentException
     *
     * @return HtmlDomParser
     */
    public function loadHtmlFile(string $filePath, $libXMLExtraOptions = null): self
    {
        // http(s) URLs skip the local file-existence check
        if (
            !\preg_match("/^https?:\/\//i", $filePath)
            &&
            !\file_exists($filePath)
        ) {
            throw new \RuntimeException("File {$filePath} not found");
        }

        try {
            if (\class_exists('\voku\helper\UTF8')) {
                /** @noinspection PhpUndefinedClassInspection */
                $html = UTF8::file_get_contents($filePath);
            } else {
                $html = \file_get_contents($filePath);
            }
        } catch (\Exception $e) {
            throw new \RuntimeException("Could not load file {$filePath}");
        }

        if ($html === false) {
            throw new \RuntimeException("Could not load file {$filePath}");
        }

        return $this->loadHtml($html, $libXMLExtraOptions);
    }

    /**
     * Return the inner HTML as a string and, if a file path is given, also write it to that file.
     *
     * @param string $filepath
     *
     * @return string
     */
    public function save(string $filepath = ''): string
    {
        $string = $this->innerHtml();
        if ($filepath !== '') {
            \file_put_contents($filepath, $string, \LOCK_EX);
        }

        return $string;
    }

    /**
     * @param callable $functionName
     */
    public function set_callback($functionName)
    {
        $this::$callback = $functionName;
    }

    /**
     * Get the DOM node's plain text.
     *
     * @param bool $multiDecodeNewHtmlEntity
     *
     * @return string
     */
    public function text(bool $multiDecodeNewHtmlEntity = false): string
    {
        return $this->fixHtmlOutput($this->document->textContent, $multiDecodeNewHtmlEntity);
    }

    public function __clone()
    {
        $this->document = clone $this->document;
    }
}
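
The difference between the single decoding pass and the repeated pass that `fixHtmlOutput()` runs when `$multiDecodeNewHtmlEntity` is true can be sketched in isolation. This is a minimal illustration using only PHP built-ins, not the library's code: the function name `decodeUntilStable` is hypothetical, and the `UTF8::rawurldecode()` branch is left out.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: decode HTML entities and URL-encoding
 * repeatedly until another pass no longer changes the string.
 */
function decodeUntilStable(string $content): string
{
    do {
        $previous = $content;

        $content = \rawurldecode(
            \html_entity_decode($content, \ENT_QUOTES | \ENT_HTML5)
        );
    } while ($previous !== $content);

    return $content;
}

// A double-encoded input: one pass only removes the outer layer.
$input = '&amp;lt;b&amp;gt;';

// Single pass, as in the non-multi-decode branch.
$once = \rawurldecode(\html_entity_decode($input, \ENT_QUOTES | \ENT_HTML5));

// Repeated passes, as in the do/while branch.
$full = decodeUntilStable($input);
```

With this input, the single pass leaves `&lt;b&gt;` in `$once`, while the loop continues until `$full` is `<b>`. The loop always performs one extra pass to confirm that nothing changes any more.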