Passed
Push — master ( 7b193d...4f1a57 )
by axel
02:03
created

simple_html_dom::parse_charset()   F

Complexity

Conditions 20
Paths 3120

Size

Total Lines 70
Code Lines 38

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric Value
cc 20
eloc 38
nc 3120
nop 0
dl 0
loc 70
rs 0
c 0
b 0
f 0

How to fix: Long Method, Complexity

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
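
As a minimal sketch of this advice, a commented step inside a long charset-detection routine could be pulled out into its own function, with the comment text supplying the name. The function names `charset_from_content_type` and `detect_charset` are hypothetical and are not taken from the reviewed library; the regular expression is likewise only an assumption about what such a step might look like.

```php
<?php

// Hypothetical helper extracted from a longer method. The former inline
// comment "look for a charset in the Content-Type header" became its name.
function charset_from_content_type($header)
{
    // Match "charset=<value>", stopping at whitespace, quotes, or a semicolon.
    if (preg_match('/charset=["\']?([^\s;"\']+)/i', $header, $matches) === 1) {
        return strtoupper($matches[1]);
    }

    return null;
}

// Hypothetical caller: the long method now reads as a sequence of named steps.
function detect_charset($header, $fallback = 'UTF-8')
{
    $charset = charset_from_content_type($header);

    return $charset !== null ? $charset : $fallback;
}
```

Each extracted function has a single condition, which also lowers the condition and path counts reported for the original method.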

Commonly applied refactorings include:

1
<?php
2
3
namespace DomParser;
4
5
/*
6
 * Website: http://sourceforge.net/projects/simplehtmldom/
7
 * Acknowledge: Jose Solorzano (https://sourceforge.net/projects/php-html/)
8
 * Contributions by:
9
 *     Yousuke Kumakura (Attribute filters)
10
 *     Vadim Voituk (Negative indexes supports of "find" method)
11
 *     Antcs (Constructor with automatically load contents either text or file/url)
12
 *
13
 * all affected sections have comments starting with "PaperG"
14
 *
15
 * Paperg - Added case insensitive testing of the value of the selector.
16
 * Paperg - Added tag_start for the starting index of tags - NOTE: This works but not accurately.
17
 *  This tag_start gets counted AFTER \r\n have been crushed out, and after the remove_noise calls, so it will not reflect the REAL position of the tag in the source,
18
 *  it will almost always be smaller by some amount.
19
 *  We use this to determine how far into the file the tag in question is.  This "percentage" will never be accurate, as $dom->size is the "real" number of bytes the dom was created from,
20
 *  but for most purposes, it's a really good estimation.
21
 * Paperg - Added the forceTagsClosed to the dom constructor.  Forcing tags closed is great for malformed html, but it CAN lead to parsing errors.
22
 * Allow the user to tell us how much they trust the html.
23
 * Paperg add the text and plaintext to the selectors for the find syntax.  plaintext implies text in the innertext of a node.  text implies that the tag is a text node.
24
 * This allows for us to find tags based on the text they contain.
25
 * Create find_ancestor_tag to see if a tag is - at any level - inside of another specific tag.
26
 * Paperg: added parse_charset so that we know about the character set of the source document.
27
 *  NOTE:  If the user's system has a routine called get_last_retrieve_url_contents_content_type available, we will assume it's returning the content-type header from the
28
 *  last transfer or curl_exec, and we will parse that and use it in preference to any other method of charset detection.
29
 *
30
 * Found infinite loop in the case of broken html in restore_noise.  Rewrote to protect from that.
31
 * PaperG (John Schlick) Added get_display_size for "IMG" tags.
32
 *
33
 * Licensed under The MIT License
34
 * Redistributions of files must retain the above copyright notice.
35
 *
36
 * @author S.C. Chen <[email protected]>
37
 * @author John Schlick
38
 * @author Rus Carroll
39
 * @version 1.5 ($Rev: 196 $)
40
 * @package PlaceLocalInclude
41
 * @subpackage simple_html_dom
42
 */
43
44
/*
45
 * All of the Defines for the classes below.
46
 * @author S.C. Chen <[email protected]>
47
 */
48
define('HDOM_TYPE_ELEMENT', 1);
49
define('HDOM_TYPE_COMMENT', 2);
50
define('HDOM_TYPE_TEXT', 3);
51
define('HDOM_TYPE_ENDTAG', 4);
52
define('HDOM_TYPE_ROOT', 5);
53
define('HDOM_TYPE_UNKNOWN', 6);
54
define('HDOM_QUOTE_DOUBLE', 0);
55
define('HDOM_QUOTE_SINGLE', 1);
56
define('HDOM_QUOTE_NO', 3);
57
define('HDOM_INFO_BEGIN', 0);
58
define('HDOM_INFO_END', 1);
59
define('HDOM_INFO_QUOTE', 2);
60
define('HDOM_INFO_SPACE', 3);
61
define('HDOM_INFO_TEXT', 4);
62
define('HDOM_INFO_INNER', 5);
63
define('HDOM_INFO_OUTER', 6);
64
define('HDOM_INFO_ENDSPACE', 7);
65
define('DEFAULT_TARGET_CHARSET', 'UTF-8');
66
define('DEFAULT_BR_TEXT', "\r\n");
67
define('DEFAULT_SPAN_TEXT', ' ');
68
if (!defined('MAX_FILE_SIZE')) {
69
    define('MAX_FILE_SIZE', 600000);
70
}
71
// helper functions
72
// -----------------------------------------------------------------------------
73
// get html dom from file
74
// $maxlen is defined in the code as PHP_STREAM_COPY_ALL which is defined as -1.
75
function file_get_html($url, $use_include_path = false, $context = null, $offset = 0, $maxLen = -1, $lowercase = true, $forceTagsClosed = true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN = true, $defaultBRText = DEFAULT_BR_TEXT, $defaultSpanText = DEFAULT_SPAN_TEXT)
Unused Code
The parameter $maxLen is not used and could be removed. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-unused annotation:

75
function file_get_html($url, $use_include_path = false, $context = null, $offset = 0, /** @scrutinizer ignore-unused */ $maxLen = -1, $lowercase = true, $forceTagsClosed = true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN = true, $defaultBRText = DEFAULT_BR_TEXT, $defaultSpanText = DEFAULT_SPAN_TEXT)

This check looks for parameters that have been defined for a function or method, but which are not used in the method body.

76
{
77
    // We DO force the tags to be terminated.
78
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
79
    // For sourceforge users: uncomment the next line and comment the retrieve_url_contents line 2 lines down if it is not already done.
80
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
81
    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
82
    //$contents = retrieve_url_contents($url);
83
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE) {
84
        return false;
85
    }
86
    // The second parameter can force the selectors to all be lowercase.
87
    $dom->load($contents, $lowercase, $stripRN);
88
89
    return $dom;
90
}
91
92
// get html dom from string
93
function str_get_html($str, $lowercase = true, $forceTagsClosed = true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN = true, $defaultBRText = DEFAULT_BR_TEXT, $defaultSpanText = DEFAULT_SPAN_TEXT)
94
{
95
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
96
    if (empty($str) || strlen($str) > MAX_FILE_SIZE) {
97
        $dom->clear();
98
99
        return false;
100
    }
101
    $dom->load($str, $lowercase, $stripRN);
102
103
    return $dom;
104
}
105
106
// dump html dom tree
107
function dump_html_tree($node, $show_attr = true, $deep = 0)
Unused Code
The parameter $show_attr is not used and could be removed. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-unused annotation:

107
function dump_html_tree($node, /** @scrutinizer ignore-unused */ $show_attr = true, $deep = 0)

This check looks for parameters that have been defined for a function or method, but which are not used in the method body.

Unused Code
The parameter $deep is not used and could be removed. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-unused annotation:

107
function dump_html_tree($node, $show_attr = true, /** @scrutinizer ignore-unused */ $deep = 0)

This check looks for parameters that have been defined for a function or method, but which are not used in the method body.

108
{
109
    $node->dump($node);
110
}
111
112
/**
113
 * simple html dom node
114
 * PaperG - added ability for "find" routine to lowercase the value of the selector.
115
 * PaperG - added $tag_start to track the start position of the tag in the total byte index.
116
 */
117
class simple_html_dom_node
118
{
119
    public $nodetype = HDOM_TYPE_TEXT;
120
    public $tag = 'text';
121
    public $attr = [];
122
    /** @var simple_html_dom_node[] $children */
123
    public $children = [];
124
    public $nodes = [];
125
    public $parent = null;
126
    // The "info" array - see HDOM_INFO_... for what each element contains.
127
    public $_ = [];
128
    public $tag_start = 0;
129
    private $dom = null;
130
131
    public function __construct(simple_html_dom $dom)
132
    {
133
        $this->dom = $dom;
134
        $dom->nodes[] = $this;
135
    }
136
137
    public function __destruct()
138
    {
139
        $this->clear();
140
    }
141
142
    public function __toString()
143
    {
144
        return $this->outertext();
145
    }
146
147
    // clean up memory due to php5 circular references memory leak...
148
    public function clear()
149
    {
150
        $this->dom = null;
151
        $this->nodes = null;
152
        $this->parent = null;
153
        $this->children = null;
154
    }
155
156
    // dump node's tree
157
    public function dump($show_attr = true, $deep = 0)
158
    {
159
        $lead = str_repeat('    ', $deep);
160
161
        echo $lead.$this->tag;
162
        if ($show_attr && count($this->attr) > 0) {
163
            echo '(';
164
            foreach ($this->attr as $k=>$v) {
165
                echo "[$k]=>\"".$this->$k.'", ';
166
            }
167
            echo ')';
168
        }
169
        echo "\n";
170
171
        if ($this->nodes) {
Bug Best Practice
The expression $this->nodes of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using ! empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(...) or !empty(...) instead.
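
A minimal sketch of the explicit form, assuming a plain array of child nodes. The function `describe_nodes` is hypothetical and exists only to illustrate the check; it is not part of the reviewed class.

```php
<?php

// Hypothetical stand-in for code that branches on a node's child list.
function describe_nodes(array $nodes)
{
    // Explicit emptiness check instead of relying on `if ($nodes)`.
    if (!empty($nodes)) {
        return count($nodes).' node(s)';
    }

    return 'no nodes';
}
```

The behavior is the same as the implicit conversion for arrays; the difference is that the intent (checking for an array without elements) is stated in the code.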

172
            foreach ($this->nodes as $c) {
173
                $c->dump($show_attr, $deep + 1);
174
            }
175
        }
176
    }
177
178
    // Debugging function to dump a single dom node with a bunch of information about it.
179
    public function dump_node($echo = true)
180
    {
181
        $string = $this->tag;
182
        if (count($this->attr) > 0) {
183
            $string .= '(';
184
            foreach ($this->attr as $k=>$v) {
185
                $string .= "[$k]=>\"".$this->$k.'", ';
186
            }
187
            $string .= ')';
188
        }
189
        if (count($this->_) > 0) {
190
            $string .= ' $_ (';
191
            foreach ($this->_ as $k=>$v) {
192
                if (is_array($v)) {
193
                    $string .= "[$k]=>(";
194
                    foreach ($v as $k2=>$v2) {
195
                        $string .= "[$k2]=>\"".$v2.'", ';
196
                    }
197
                    $string .= ')';
198
                } else {
199
                    $string .= "[$k]=>\"".$v.'", ';
200
                }
201
            }
202
            $string .= ')';
203
        }
204
205
        if (isset($this->text)) {
Bug Best Practice
The property text does not exist on DomParser\simple_html_dom_node. Since you implemented __get, consider adding a @property annotation.
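
A minimal sketch of such an annotation, assuming a class that resolves `text` through `__get`. The class `example_node` and its `$values` array are hypothetical and only illustrate the docblock; they do not reproduce the reviewed class.

```php
<?php

/**
 * Hypothetical node class, for illustration only.
 *
 * @property string $text Plain text of the node, resolved through __get.
 */
class example_node
{
    // Backing storage for the magic properties (an assumption of this sketch).
    private $values = ['text' => 'hello'];

    public function __get($name)
    {
        return isset($this->values[$name]) ? $this->values[$name] : null;
    }

    public function __isset($name)
    {
        return isset($this->values[$name]);
    }
}
```

With the `@property` tag in place, static analysis tools and IDEs can recognize `$node->text` as a declared (magic) property instead of reporting it as missing.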
206
            $string .= ' text: ('.$this->text.')';
207
        }
208
209
        $string .= " HDOM_INNER_INFO: '";
210
        if (isset($node->_[HDOM_INFO_INNER])) {
Comprehensibility Best Practice
The variable $node seems to be never defined.
211
            $string .= $node->_[HDOM_INFO_INNER]."'";
212
        } else {
213
            $string .= ' NULL ';
214
        }
215
216
        $string .= ' children: '.count($this->children);
217
        $string .= ' nodes: '.count($this->nodes);
218
        $string .= ' tag_start: '.$this->tag_start;
219
        $string .= "\n";
220
221
        if ($echo) {
222
            echo $string;
223
224
            return;
225
        } else {
226
            return $string;
227
        }
228
    }
229
230
    // returns the parent of node
231
    // If a node is passed in, it will reset the parent of the current node to that one.
232
    public function parent($parent = null)
233
    {
234
        // I am SURE that this doesn't work properly.
235
        // It fails to unset the current node from its current parent's nodes or children list first.
236
        if ($parent !== null) {
237
            $this->parent = $parent;
238
            $this->parent->nodes[] = $this;
239
            $this->parent->children[] = $this;
240
        }
241
242
        return $this->parent;
243
    }
244
245
    // verify that node has children
246
    public function has_child()
247
    {
248
        return !empty($this->children);
249
    }
250
251
    // returns children of node
252
    public function children($idx = -1)
253
    {
254
        if ($idx === -1) {
255
            return $this->children;
256
        }
257
        if (isset($this->children[$idx])) {
258
            return $this->children[$idx];
259
        }
260
    }
261
262
    // returns the first child of node
263
    public function first_child()
264
    {
265
        if (count($this->children) > 0) {
266
            return $this->children[0];
267
        }
268
    }
269
270
    // returns the last child of node
271
    public function last_child()
272
    {
273
        if (($count = count($this->children)) > 0) {
274
            return $this->children[$count - 1];
275
        }
276
    }
277
278
    // returns the next sibling of node
279
    public function next_sibling()
280
    {
281
        if ($this->parent === null) {
282
            return;
283
        }
284
285
        $idx = 0;
286
        $count = count($this->parent->children);
287
        while ($idx < $count && $this !== $this->parent->children[$idx]) {
288
            $idx++;
289
        }
290
        if (++$idx >= $count) {
291
            return;
292
        }
293
294
        return $this->parent->children[$idx];
295
    }
296
297
    // returns the previous sibling of node
298
    public function prev_sibling()
299
    {
300
        if ($this->parent === null) {
301
            return;
302
        }
303
        $idx = 0;
304
        $count = count($this->parent->children);
305
        while ($idx < $count && $this !== $this->parent->children[$idx]) {
306
            ++$idx;
307
        }
308
        if (--$idx < 0) {
309
            return;
310
        }
311
312
        return $this->parent->children[$idx];
313
    }
314
315
    // function to locate a specific ancestor tag in the path to the root.
316
    public function find_ancestor_tag($tag)
317
    {
318
        global $debugObject;
319
        if (is_object($debugObject)) {
320
            $debugObject->debugLogEntry(1);
321
        }
322
323
        // Start by including ourselves in the comparison.
324
        $returnDom = $this;
325
326
        while (!is_null($returnDom)) {
327
            if (is_object($debugObject)) {
328
                $debugObject->debugLog(2, 'Current tag is: '.$returnDom->tag);
329
            }
330
331
            if ($returnDom->tag == $tag) {
332
                break;
333
            }
334
            $returnDom = $returnDom->parent;
335
        }
336
337
        return $returnDom;
338
    }
339
340
    // get dom node's inner html
341
    public function innertext()
342
    {
343
        if (isset($this->_[HDOM_INFO_INNER])) {
344
            return $this->_[HDOM_INFO_INNER];
345
        }
346
        if (isset($this->_[HDOM_INFO_TEXT])) {
347
            return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);
348
        }
349
350
        $ret = '';
351
        foreach ($this->nodes as $n) {
352
            $ret .= $n->outertext();
353
        }
354
355
        return $ret;
356
    }
357
358
    // get dom node's outer text (with tag)
359
    public function outertext()
360
    {
361
        global $debugObject;
362
        if (is_object($debugObject)) {
363
            $text = '';
364
            if ($this->tag == 'text') {
365
                if (!empty($this->text)) {
Bug Best Practice
The property text does not exist on DomParser\simple_html_dom_node. Since you implemented __get, consider adding a @property annotation.
366
                    $text = ' with text: '.$this->text;
367
                }
368
            }
369
            $debugObject->debugLog(1, 'Innertext of tag: '.$this->tag.$text);
370
        }
371
372
        if ($this->tag === 'root') {
373
            return $this->innertext();
374
        }
375
376
        // trigger callback
377
        if ($this->dom && $this->dom->callback !== null) {
378
            call_user_func_array($this->dom->callback, [$this]);
379
        }
380
381
        if (isset($this->_[HDOM_INFO_OUTER])) {
382
            return $this->_[HDOM_INFO_OUTER];
383
        }
384
        if (isset($this->_[HDOM_INFO_TEXT])) {
385
            return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);
386
        }
387
388
        // render begin tag
389
        if ($this->dom && $this->dom->nodes[$this->_[HDOM_INFO_BEGIN]]) {
390
            $ret = $this->dom->nodes[$this->_[HDOM_INFO_BEGIN]]->makeup();
391
        } else {
392
            $ret = '';
393
        }
394
395
        // render inner text
396
        if (isset($this->_[HDOM_INFO_INNER])) {
397
            // If it's a br tag...  don't return the HDOM_INNER_INFO that we may or may not have added.
398
            if ($this->tag != 'br') {
399
                $ret .= $this->_[HDOM_INFO_INNER];
400
            }
401
        } else {
402
            if ($this->nodes) {
Bug Best Practice
The expression $this->nodes of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using ! empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(...) or !empty(...) instead.

403
                foreach ($this->nodes as $n) {
404
                    $ret .= $this->convert_text($n->outertext());
405
                }
406
            }
407
        }
408
409
        // render end tag
410
        if (isset($this->_[HDOM_INFO_END]) && $this->_[HDOM_INFO_END] != 0) {
411
            $ret .= '</'.$this->tag.'>';
412
        }
413
414
        return $ret;
415
    }
416
417
    // get dom node's plain text
418
    public function text()
419
    {
420
        if (isset($this->_[HDOM_INFO_INNER])) {
421
            return $this->_[HDOM_INFO_INNER];
422
        }
423
        switch ($this->nodetype) {
424
            case HDOM_TYPE_TEXT: return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);
425
            case HDOM_TYPE_COMMENT: return '';
426
            case HDOM_TYPE_UNKNOWN: return '';
427
        }
428
        if (strcasecmp($this->tag, 'script') === 0) {
429
            return '';
430
        }
431
        if (strcasecmp($this->tag, 'style') === 0) {
432
            return '';
433
        }
434
435
        $ret = '';
436
        // In rare cases, (always node type 1 or HDOM_TYPE_ELEMENT - observed for some span tags, and some p tags) $this->nodes is set to NULL.
437
        // NOTE: This indicates that there is a problem where it's set to NULL without a clear() call happening.
438
        // WHY is this happening?
439
        if (!is_null($this->nodes)) {
The condition is_null($this->nodes) is always false.
440
            foreach ($this->nodes as $n) {
441
                $ret .= $this->convert_text($n->text());
442
            }
443
444
            // If this node is a span... add a space at the end of it so multiple spans don't run into each other.  This is plaintext after all.
445
            if ($this->tag == 'span') {
446
                $ret .= $this->dom->default_span_text;
447
            }
448
        }
449
450
        return $ret;
451
    }
452
453
    public function xmltext()
454
    {
455
        $ret = $this->innertext();
456
        $ret = str_ireplace('<![CDATA[', '', $ret);
457
        $ret = str_replace(']]>', '', $ret);
458
459
        return $ret;
460
    }
461
462
    // build node's text with tag
463
    public function makeup()
464
    {
465
        // text, comment, unknown
466
        if (isset($this->_[HDOM_INFO_TEXT])) {
467
            return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);
468
        }
469
470
        $ret = '<'.$this->tag;
471
        $i = -1;
472
473
        foreach ($this->attr as $key=>$val) {
474
            $i++;
475
476
            // skip removed attribute
477
            if ($val === null || $val === false) {
478
                continue;
479
            }
480
481
            $ret .= $this->_[HDOM_INFO_SPACE][$i][0];
482
            //no value attr: nowrap, checked selected...
483
            if ($val === true) {
484
                $ret .= $key;
485
            } else {
486
                switch ($this->_[HDOM_INFO_QUOTE][$i]) {
487
                    case HDOM_QUOTE_DOUBLE: $quote = '"'; break;
488
                    case HDOM_QUOTE_SINGLE: $quote = '\''; break;
489
                    default: $quote = '';
490
                }
491
                $ret .= $key.$this->_[HDOM_INFO_SPACE][$i][1].'='.$this->_[HDOM_INFO_SPACE][$i][2].$quote.$val.$quote;
492
            }
493
        }
494
        $ret = $this->dom->restore_noise($ret);
495
496
        return $ret.$this->_[HDOM_INFO_ENDSPACE].'>';
497
    }
498
499
    /**
500
     * find elements by css selector
501
     * PaperG - added ability for find to lowercase the value of the selector.
502
     *
503
     * @param string   $selector
504
     * @param int|null $idx
505
     * @param bool     $lowercase
506
     *
507
     * @return simple_html_dom_node[]|simple_html_dom_node|null
508
     */
509
    public function find($selector, $idx = null, $lowercase = false)
510
    {
511
        $selectors = $this->parse_selector($selector);
512
        if (($count = count($selectors)) === 0) {
513
            return [];
514
        }
515
        $found_keys = [];
516
517
        // find each selector
518
        for ($c = 0; $c < $count; $c++) {
519
            // The change on the below line was documented on the sourceforge code tracker id 2788009
520
            // used to be: if (($levle=count($selectors[0]))===0) return array();
521
            if (($levle = count($selectors[$c])) === 0) {
522
                return [];
523
            }
524
            if (!isset($this->_[HDOM_INFO_BEGIN])) {
525
                return [];
526
            }
527
528
            $head = [$this->_[HDOM_INFO_BEGIN]=>1];
529
530
            // handle descendant selectors, no recursive!
531
            for ($l = 0; $l < $levle; $l++) {
532
                $ret = [];
533
                foreach ($head as $k=>$v) {
534
                    $n = ($k === -1) ? $this->dom->root : $this->dom->nodes[$k];
535
                    //PaperG - Pass this optional parameter on to the seek function.
536
                    $n->seek($selectors[$c][$l], $ret, $lowercase);
537
                }
538
                $head = $ret;
539
            }
540
541
            foreach ($head as $k=>$v) {
542
                if (!isset($found_keys[$k])) {
543
                    $found_keys[$k] = 1;
544
                }
545
            }
546
        }
547
548
        // sort keys
549
        ksort($found_keys);
550
551
        $found = [];
552
        foreach ($found_keys as $k=>$v) {
553
            $found[] = $this->dom->nodes[$k];
554
        }
555
556
        // return nth-element or array
557
        if (is_null($idx)) {
558
            return $found;
559
        } elseif ($idx < 0) {
560
            $idx = count($found) + $idx;
561
        }
562
563
        return (isset($found[$idx])) ? $found[$idx] : null;
564
    }
565
566
    // seek for given conditions
567
    // PaperG - added parameter to allow for case insensitive testing of the value of a selector.
568
    protected function seek($selector, &$ret, $lowercase = false)
569
    {
570
        global $debugObject;
571
        if (is_object($debugObject)) {
572
            $debugObject->debugLogEntry(1);
573
        }
574
575
        list($tag, $key, $val, $exp, $no_key) = $selector;
576
577
        // xpath index
578
        if ($tag && $key && is_numeric($key)) {
579
            $count = 0;
580
            foreach ($this->children as $c) {
581
                if ($tag === '*' || $tag === $c->tag) {
582
                    if (++$count == $key) {
583
                        $ret[$c->_[HDOM_INFO_BEGIN]] = 1;
584
585
                        return;
586
                    }
587
                }
588
            }
589
590
            return;
591
        }
592
593
        $end = (!empty($this->_[HDOM_INFO_END])) ? $this->_[HDOM_INFO_END] : 0;
594
        if ($end == 0) {
595
            $parent = $this->parent;
596
            while (!isset($parent->_[HDOM_INFO_END]) && $parent !== null) {
597
                $end -= 1;
598
                $parent = $parent->parent;
599
            }
600
            $end += $parent->_[HDOM_INFO_END];
601
        }
602
603
        for ($i = $this->_[HDOM_INFO_BEGIN] + 1; $i < $end; $i++) {
604
            $node = $this->dom->nodes[$i];
605
606
            $pass = true;
607
608
            if ($tag === '*' && !$key) {
609
                if (in_array($node, $this->children, true)) {
610
                    $ret[$i] = 1;
611
                }
612
                continue;
613
            }
614
615
            // compare tag
616
            if ($tag && $tag != $node->tag && $tag !== '*') {
617
                $pass = false;
618
            }
619
            // compare key
620
            if ($pass && $key) {
621
                if ($no_key) {
622
                    if (isset($node->attr[$key])) {
623
                        $pass = false;
624
                    }
625
                } else {
626
                    if (($key != 'plaintext') && !isset($node->attr[$key])) {
627
                        $pass = false;
628
                    }
629
                }
630
            }
631
            // compare value
632
            if ($pass && $key && $val && $val !== '*') {
633
                // If they have told us that this is a "plaintext" search then we want the plaintext of the node - right?
634
                if ($key == 'plaintext') {
635
                    // $node->plaintext actually returns $node->text();
636
                    $nodeKeyValue = $node->text();
637
                } else {
638
                    // this is a normal search, we want the value of that attribute of the tag.
639
                    $nodeKeyValue = $node->attr[$key];
640
                }
641
                if (is_object($debugObject)) {
642
                    $debugObject->debugLog(2, 'testing node: '.$node->tag.' for attribute: '.$key.$exp.$val.' where nodes value is: '.$nodeKeyValue);
643
                }
644
645
                //PaperG - If lowercase is set, do a case insensitive test of the value of the selector.
646
                if ($lowercase) {
647
                    $check = $this->match($exp, strtolower($val), strtolower($nodeKeyValue));
648
                } else {
649
                    $check = $this->match($exp, $val, $nodeKeyValue);
650
                }
651
                if (is_object($debugObject)) {
652
                    $debugObject->debugLog(2, 'after match: '.($check ? 'true' : 'false'));
653
                }
654
655
                // handle multiple class
656
                if (!$check && strcasecmp($key, 'class') === 0) {
657
                    foreach (explode(' ', $node->attr[$key]) as $k) {
658
                        // Without this, there were cases where leading, trailing, or double spaces lead to our comparing blanks - bad form.
659
                        if (!empty($k)) {
660
                            if ($lowercase) {
661
                                $check = $this->match($exp, strtolower($val), strtolower($k));
662
                            } else {
663
                                $check = $this->match($exp, $val, $k);
664
                            }
665
                            if ($check) {
666
                                break;
667
                            }
668
                        }
669
                    }
670
                }
671
                if (!$check) {
672
                    $pass = false;
673
                }
674
            }
675
            if ($pass) {
676
                $ret[$i] = 1;
677
            }
678
            unset($node);
679
        }
680
        // It's passed by reference so this is actually what this function returns.
681
        if (is_object($debugObject)) {
682
            $debugObject->debugLog(1, 'EXIT - ret: ', $ret);
683
        }
684
    }
685
686
    protected function match($exp, $pattern, $value)
687
    {
688
        global $debugObject;
689
        if (is_object($debugObject)) {
690
            $debugObject->debugLogEntry(1);
691
        }
692
693
        switch ($exp) {
694
            case '=':
695
                return $value === $pattern;
696
            case '!=':
697
                return $value !== $pattern;
698
            case '^=':
699
                return preg_match('/^'.preg_quote($pattern, '/').'/', $value);
700
            case '$=':
701
                return preg_match('/'.preg_quote($pattern, '/').'$/', $value);
702
            case '*=':
703
                if ($pattern[0] == '/') {
704
                    return preg_match($pattern, $value);
705
                }
706
707
                return preg_match('/'.$pattern.'/i', $value);
708
        }
709
710
        return false;
711
    }
712
713
    protected function parse_selector($selector_string)
714
    {
715
        global $debugObject;
716
        if (is_object($debugObject)) {
717
            $debugObject->debugLogEntry(1);
718
        }
719
720
        // pattern of CSS selectors, modified from mootools
721
        // Paperg: Add the colon to the attribute, so that it properly finds <tag attr:ibute="something" > like google does.
722
        // Note: if you try to look at this attribute, you MUST use getAttribute since $dom->x:y will fail the php syntax check.
723
        // Notice the \[ starting the attribute?  and the @? following?  This implies that an attribute can begin with an @ sign that is not captured.
724
        // This implies that an html attribute specifier may start with an @ sign that is NOT captured by the expression.
725
        // Further study is required to determine if this should be documented or removed.
726
//        $pattern = "/([\w-:\*]*)(?:\#([\w-]+)|\.([\w-]+))?(?:\[@?(!?[\w-]+)(?:([!*^$]?=)[\"']?(.*?)[\"']?)?\])?([\/, ]+)/is";
727
        $pattern = "/([\w-:\*]*)(?:\#([\w-]+)|\.([\w-]+))?(?:\[@?(!?[\w-:]+)(?:([!*^$]?=)[\"']?(.*?)[\"']?)?\])?([\/, ]+)/is";
728
        preg_match_all($pattern, trim($selector_string).' ', $matches, PREG_SET_ORDER);
729
        if (is_object($debugObject)) {
730
            $debugObject->debugLog(2, 'Matches Array: ', $matches);
731
        }
732
733
        $selectors = [];
734
        $result = [];
735
        //print_r($matches);
736
737
        foreach ($matches as $m) {
738
            $m[0] = trim($m[0]);
739
            if ($m[0] === '' || $m[0] === '/' || $m[0] === '//') {
740
                continue;
741
            }
742
            // for browser generated xpath
743
            if ($m[1] === 'tbody') {
744
                continue;
745
            }
746
747
            list($tag, $key, $val, $exp, $no_key) = [$m[1], null, null, '=', false];
748
            if (!empty($m[2])) {
749
                $key = 'id';
750
                $val = $m[2];
751
            }
752
            if (!empty($m[3])) {
753
                $key = 'class';
754
                $val = $m[3];
755
            }
756
            if (!empty($m[4])) {
757
                $key = $m[4];
758
            }
759
            if (!empty($m[5])) {
760
                $exp = $m[5];
761
            }
762
            if (!empty($m[6])) {
763
                $val = $m[6];
764
            }
765
766
            // convert to lowercase
767
            if ($this->dom->lowercase) {
768
                $tag = strtolower($tag);
769
                $key = strtolower($key);
770
            }
771
            //elements that do NOT have the specified attribute
772
            if (isset($key[0]) && $key[0] === '!') {
773
                $key = substr($key, 1);
774
                $no_key = true;
775
            }
776
777
            $result[] = [$tag, $key, $val, $exp, $no_key];
778
            if (trim($m[7]) === ',') {
779
                $selectors[] = $result;
780
                $result = [];
781
            }
782
        }
783
        if (count($result) > 0) {
784
            $selectors[] = $result;
785
        }
786
787
        return $selectors;
788
    }
789
790
    public function __get($name)
791
    {
792
        if (isset($this->attr[$name])) {
793
            return $this->convert_text($this->attr[$name]);
794
        }
795
        switch ($name) {
796
            case 'outertext': return $this->outertext();
797
            case 'innertext': return $this->innertext();
798
            case 'plaintext': return $this->text();
799
            case 'xmltext': return $this->xmltext();
800
            default: return array_key_exists($name, $this->attr);
801
        }
802
    }
803
804
    public function __set($name, $value)
805
    {
806
        switch ($name) {
807
            case 'outertext': return $this->_[HDOM_INFO_OUTER] = $value;
808
            case 'innertext':
809
                if (isset($this->_[HDOM_INFO_TEXT])) {
810
                    return $this->_[HDOM_INFO_TEXT] = $value;
811
                }
812
813
                return $this->_[HDOM_INFO_INNER] = $value;
814
        }
815
        if (!isset($this->attr[$name])) {
816
            $this->_[HDOM_INFO_SPACE][] = [' ', '', ''];
817
            $this->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_DOUBLE;
818
        }
819
        $this->attr[$name] = $value;
820
    }
821
822
    public function __isset($name)
823
    {
824
        switch ($name) {
825
            case 'outertext': return true;
826
            case 'innertext': return true;
827
            case 'plaintext': return true;
828
        }
829
        //no value attr: nowrap, checked selected...
830
        return (array_key_exists($name, $this->attr)) ? true : isset($this->attr[$name]);
831
    }
832
833
    public function __unset($name)
834
    {
835
        if (isset($this->attr[$name])) {
836
            unset($this->attr[$name]);
837
        }
838
    }
839
840
    // PaperG - Function to convert the text from one character set to another if the two sets are not the same.
841
    public function convert_text($text)
842
    {
843
        global $debugObject;
844
        if (is_object($debugObject)) {
845
            $debugObject->debugLogEntry(1);
846
        }
847
848
        $converted_text = $text;
849
850
        $sourceCharset = '';
851
        $targetCharset = '';
852
853
        if ($this->dom) {
854
            $sourceCharset = strtoupper($this->dom->_charset);
855
            $targetCharset = strtoupper($this->dom->_target_charset);
856
        }
857
        if (is_object($debugObject)) {
858
            $debugObject->debugLog(3, 'source charset: '.$sourceCharset.' target charset: '.$targetCharset);
859
        }
860
861
        if (!empty($sourceCharset) && !empty($targetCharset) && (strcasecmp($sourceCharset, $targetCharset) != 0)) {
862
            // Check if the reported encoding could have been incorrect and the text is actually already UTF-8
863
            if ((strcasecmp($targetCharset, 'UTF-8') == 0) && ($this->is_utf8($text))) {
864
                $converted_text = $text;
865
            } else {
866
                $converted_text = iconv($sourceCharset, $targetCharset, $text);
867
            }
868
        }
869
870
        // Strip a UTF-8 byte order mark (BOM) from either end of any UTF-8 text we output.
871
        if ($targetCharset == 'UTF-8') {
872
            if (substr($converted_text, 0, 3) == "\xef\xbb\xbf") {
873
                $converted_text = substr($converted_text, 3);
874
            }
875
            if (substr($converted_text, -3) == "\xef\xbb\xbf") {
876
                $converted_text = substr($converted_text, 0, -3);
877
            }
878
        }
879
880
        return $converted_text;
881
    }
882
883
    /**
884
     * Returns true if $str is valid UTF-8 and false otherwise.
885
     *
886
     * @param mixed $str String to be tested
887
     *
888
     * @return bool
889
     */
890
    public static function is_utf8($str)
891
    {
892
        $bits = 0;
        $len = strlen($str);
        for ($i = 0; $i < $len; $i++) {
            $c = ord($str[$i]);
            // Bytes with the high bit set (128 and above) must begin a valid multi-byte sequence.
            if ($c >= 128) {
899
                if (($c >= 254)) {
900
                    return false;
901
                } elseif ($c >= 252) {
902
                    $bits = 6;
903
                } elseif ($c >= 248) {
904
                    $bits = 5;
905
                } elseif ($c >= 240) {
906
                    $bits = 4;
907
                } elseif ($c >= 224) {
908
                    $bits = 3;
909
                } elseif ($c >= 192) {
910
                    $bits = 2;
911
                } else {
912
                    return false;
913
                }
914
                if (($i + $bits) > $len) {
915
                    return false;
916
                }
917
                while ($bits > 1) {
918
                    $i++;
919
                    $b = ord($str[$i]);
920
                    if ($b < 128 || $b > 191) {
921
                        return false;
922
                    }
923
                    $bits--;
924
                }
925
            }
926
        }
927
928
        return true;
929
    }
930
931
    /*
932
    function is_utf8($string)
933
    {
934
        //this is buggy
935
        return (utf8_encode(utf8_decode($string)) == $string);
936
    }
937
    */
938
939
    /**
940
     * Function to try a few tricks to determine the displayed size of an img on the page.
941
     * NOTE: This will ONLY work on an IMG tag. Returns FALSE on all other tag types.
942
     *
943
     * @author John Schlick
944
     *
945
     * @version April 19 2012
946
     *
947
     * @return array|false an array containing the 'height' and 'width' of the image on the page (each -1 if it cannot be determined), or false if this node is not an IMG tag.
948
     */
949
    public function get_display_size()
950
    {
951
        global $debugObject;
952
953
        $width = -1;
954
        $height = -1;
955
956
        if ($this->tag !== 'img') {
957
            return false;
958
        }
959
960
        // See if there is a height or width attribute in the tag itself.
961
        if (isset($this->attr['width'])) {
962
            $width = $this->attr['width'];
963
        }
964
965
        if (isset($this->attr['height'])) {
966
            $height = $this->attr['height'];
967
        }
968
969
        // Now look for an inline style.
970
        if (isset($this->attr['style'])) {
971
            // Thanks to user gnarf from stackoverflow for this regular expression.
972
            $attributes = [];
973
            preg_match_all("/([\w-]+)\s*:\s*([^;]+)\s*;?/", $this->attr['style'], $matches, PREG_SET_ORDER);
974
            foreach ($matches as $match) {
975
                $attributes[$match[1]] = $match[2];
976
            }
977
978
            // If there is a width in the style attributes:
979
            if (isset($attributes['width']) && $width == -1) {
980
                // check that the last two characters are px (pixels)
981
                if (strtolower(substr($attributes['width'], -2)) == 'px') {
982
                    $proposed_width = substr($attributes['width'], 0, -2);
983
                    // Now make sure that it's an integer and not something stupid.
984
                    if (filter_var($proposed_width, FILTER_VALIDATE_INT) !== false) {
985
                        $width = $proposed_width;
986
                    }
987
                }
988
            }
989
990
            // If there is a height in the style attributes:
991
            if (isset($attributes['height']) && $height == -1) {
992
                // check that the last two characters are px (pixels)
993
                if (strtolower(substr($attributes['height'], -2)) == 'px') {
994
                    $proposed_height = substr($attributes['height'], 0, -2);
995
                    // Now make sure that it's an integer and not something stupid.
996
                    if (filter_var($proposed_height, FILTER_VALIDATE_INT) !== false) {
997
                        $height = $proposed_height;
998
                    }
999
                }
1000
            }
1001
        }
1002
1003
        // Future enhancement:
1004
        // Look in the tag to see if there is a class or id specified that has a height or width attribute to it.
1005
1006
        // Far future enhancement
1007
        // Look at all the parent tags of this image to see if they specify a class or id that has an img selector that specifies a height or width
1008
        // Note that in this case, the class or id will have the img subselector for it to apply to the image.
1009
1010
        // ridiculously far future development
1011
        // If the class or id is specified in a SEPARATE css file thats not on the page, go get it and do what we were just doing for the ones on the page.
1012
1013
        $result = [
            'height' => $height,
            'width'  => $width,
        ];
1015
1016
        return $result;
1017
    }
1018
1019
    // camel naming conventions
1020
    public function getAllAttributes()
1021
    {
1022
        return array_map('html_entity_decode', $this->attr);
1023
    }
1024
1025
    public function getAttribute($name)
1026
    {
1027
        return html_entity_decode($this->__get($name));
1028
    }
1029
1030
    public function setAttribute($name, $value)
1031
    {
1032
        $this->__set($name, $value);
1033
    }
1034
1035
    public function hasAttribute($name)
1036
    {
1037
        return $this->__isset($name);
1038
    }
1039
1040
    public function removeAttribute($name)
1041
    {
1042
        $this->__set($name, null);
1043
    }
1044
1045
    public function getElementById($id)
1046
    {
1047
        return $this->find("#$id", 0);
1048
    }
1049
1050
    public function getElementsById($id, $idx = null)
1051
    {
1052
        return $this->find("#$id", $idx);
1053
    }
1054
1055
    public function getElementByTagName($name)
1056
    {
1057
        return $this->find($name, 0);
1058
    }
1059
1060
    public function getElementsByTagName($name, $idx = null)
1061
    {
1062
        return $this->find($name, $idx);
1063
    }
1064
1065
    public function parentNode()
1066
    {
1067
        return $this->parent();
1068
    }
1069
1070
    public function childNodes($idx = -1)
1071
    {
1072
        return $this->children($idx);
1073
    }
1074
1075
    public function firstChild()
1076
    {
1077
        return $this->first_child();
1078
    }
1079
1080
    public function lastChild()
1081
    {
1082
        return $this->last_child();
1083
    }
1084
1085
    public function nextSibling()
1086
    {
1087
        return $this->next_sibling();
1088
    }
1089
1090
    public function previousSibling()
1091
    {
1092
        return $this->prev_sibling();
1093
    }
1094
1095
    public function hasChildNodes()
1096
    {
1097
        return $this->has_child();
1098
    }
1099
1100
    public function nodeName()
1101
    {
1102
        return $this->tag;
1103
    }
1104
1105
    public function appendChild($node)
1106
    {
1107
        $node->parent($this);
1108
1109
        return $node;
1110
    }
1111
}
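
// Illustrative sketch, not part of the original library: a stricter UTF-8 check that can be
// compared against simple_html_dom_node::is_utf8() above. The hand-written loop only looks at
// lead-byte and continuation-byte ranges, so it accepts some sequences that UTF-8 forbids
// (for example overlong encodings). This sketch assumes the PCRE extension is available and
// relies on preg_match() returning false when the /u modifier meets invalid UTF-8.
// The function name example_is_strict_utf8 is hypothetical.
function example_is_strict_utf8($str)
{
    // An empty pattern always matches at offset 0, so the result is 1 for any valid UTF-8
    // subject (including the empty string) and false for an invalid one.
    return preg_match('//u', (string) $str) === 1;
}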
1112
1113
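
// Illustrative sketch, not part of the original library: extracting only the charset token
// from a Content-Type value. simple_html_dom::parse_charset() below matches '/charset=(.+)/',
// which also captures anything that follows the charset (quotes, further parameters).
// This sketch assumes the value looks like 'text/html; charset="utf-8"; other=param'.
// The function name example_charset_from_content_type is hypothetical.
function example_charset_from_content_type($header)
{
    // Stop at whitespace, a semicolon, or a quote so that trailing parameters are not included.
    if (preg_match('/charset\s*=\s*["\']?([^\s;"\']+)/i', (string) $header, $m)) {
        return $m[1];
    }

    // No charset parameter present.
    return null;
}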
/**
1114
 * simple html dom parser
1115
 * Paperg - in the find routine: allow us to specify that we want case insensitive testing of the value of the selector.
1116
 * Paperg - change $size from protected to public so we can easily access it
1117
 * Paperg - added ForceTagsClosed in the constructor which tells us whether we trust the html or not.  Default is to NOT trust it.
1118
 */
1119
class simple_html_dom
1120
{
1121
    /** @var simple_html_dom_node $root */
1122
    public $root = null;
1123
    public $nodes = [];
1124
    public $callback = null;
1125
    public $lowercase = false;
1126
    // Used to keep track of how large the text was when we started.
1127
    public $original_size;
1128
    public $size;
1129
    protected $pos;
1130
    protected $doc;
1131
    protected $char;
1132
    protected $cursor;
1133
    protected $parent;
1134
    protected $noise = [];
1135
    protected $token_blank = " \t\r\n";
1136
    protected $token_equal = ' =/>';
1137
    protected $token_slash = " />\r\n\t";
1138
    protected $token_attr = ' >';
1139
    // Note that this is referenced by a child node, and so it needs to be public for that node to see this information.
1140
    public $_charset = '';
1141
    public $_target_charset = '';
1142
    protected $default_br_text = '';
1143
    public $default_span_text = '';
1144
1145
    // use isset instead of in_array, performance boost about 30%...
1146
    protected $self_closing_tags = ['img'=>1, 'br'=>1, 'input'=>1, 'meta'=>1, 'link'=>1, 'hr'=>1, 'base'=>1, 'embed'=>1, 'spacer'=>1];
1147
    protected $block_tags = ['root'=>1, 'body'=>1, 'form'=>1, 'div'=>1, 'span'=>1, 'table'=>1];
1148
    // Known sourceforge issue #2977341
1149
    // B tags that are not closed cause us to return everything to the end of the document.
1150
    protected $optional_closing_tags = [
1151
        'tr'    => ['tr'=>1, 'td'=>1, 'th'=>1],
1152
        'th'    => ['th'=>1],
1153
        'td'    => ['td'=>1],
1154
        'li'    => ['li'=>1],
1155
        'dt'    => ['dt'=>1, 'dd'=>1],
1156
        'dd'    => ['dd'=>1, 'dt'=>1],
1157
        'dl'    => ['dd'=>1, 'dt'=>1],
1158
        'p'     => ['p'=>1],
1159
        'nobr'  => ['nobr'=>1],
1160
        'b'     => ['b'=>1],
1161
        'option'=> ['option'=>1],
1162
    ];
1163
1164
    public function __construct($str = null, $lowercase = true, $forceTagsClosed = true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN = true, $defaultBRText = DEFAULT_BR_TEXT, $defaultSpanText = DEFAULT_SPAN_TEXT)
1165
    {
1166
        if ($str) {
1167
            if (preg_match("/^https?:\/\//i", $str) || is_file($str)) {
1168
                $this->load_file($str);
1169
            } else {
1170
                $this->load($str, $lowercase, $stripRN, $defaultBRText, $defaultSpanText);
1171
            }
1172
        }
1173
        // Forcing tags to be closed implies that we don't trust the html, but it can lead to parsing errors if we SHOULD trust the html.
1174
        if (!$forceTagsClosed) {
1175
            $this->optional_closing_tags = [];
1176
        }
1177
        $this->_target_charset = $target_charset;
1178
    }
1179
1180
    public function __destruct()
1181
    {
1182
        $this->clear();
1183
    }
1184
1185
    // load html from string
1186
    public function load($str, $lowercase = true, $stripRN = true, $defaultBRText = DEFAULT_BR_TEXT, $defaultSpanText = DEFAULT_SPAN_TEXT)
1187
    {
1188
        global $debugObject;
1189
1190
        // prepare
1191
        $this->prepare($str, $lowercase, $stripRN, $defaultBRText, $defaultSpanText);
1192
        // strip out comments
1193
        $this->remove_noise("'<!--(.*?)-->'is");
1194
        // strip out cdata
1195
        $this->remove_noise("'<!\[CDATA\[(.*?)\]\]>'is", true);
1196
        // Per sourceforge http://sourceforge.net/tracker/?func=detail&aid=2949097&group_id=218559&atid=1044037
1197
        // Script tag removal now precedes style tag removal.
1198
        // strip out <script> tags
1199
        $this->remove_noise("'<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>'is");
1200
        $this->remove_noise("'<\s*script\s*>(.*?)<\s*/\s*script\s*>'is");
1201
        // strip out <style> tags
1202
        $this->remove_noise("'<\s*style[^>]*[^/]>(.*?)<\s*/\s*style\s*>'is");
1203
        $this->remove_noise("'<\s*style\s*>(.*?)<\s*/\s*style\s*>'is");
1204
        // strip out preformatted tags
1205
        $this->remove_noise("'<\s*(?:code)[^>]*>(.*?)<\s*/\s*(?:code)\s*>'is");
1206
        // strip out server side scripts
1207
        $this->remove_noise("'(<\?)(.*?)(\?>)'s", true);
1208
        // strip smarty scripts
1209
        $this->remove_noise("'(\{\w)(.*?)(\})'s", true);
1210
1211
        // parsing
1212
        while ($this->parse());
1213
        // end
1214
        $this->root->_[HDOM_INFO_END] = $this->cursor;
1215
        $this->parse_charset();
1216
1217
        // make load function chainable
1218
        return $this;
1219
    }
1220
1221
    // load html from file
1222
    public function load_file()
1223
    {
1224
        $args = func_get_args();
1225
        $this->load(call_user_func_array('file_get_contents', $args), true);
1226
        // Throw an error if we can't properly load the dom.
1227
        if (error_get_last() !== null) {
1228
            $this->clear();
1229
1230
            return false;
1231
        }
1232
    }
1233
1234
    // set callback function
1235
    public function set_callback($function_name)
1236
    {
1237
        $this->callback = $function_name;
1238
    }
1239
1240
    // remove callback function
1241
    public function remove_callback()
1242
    {
1243
        $this->callback = null;
1244
    }
1245
1246
    // save dom as string
1247
    public function save($filepath = '')
1248
    {
1249
        $ret = $this->root->innertext();
1250
        if ($filepath !== '') {
1251
            file_put_contents($filepath, $ret, LOCK_EX);
1252
        }
1253
1254
        return $ret;
1255
    }
1256
1257
    // find dom node by css selector
1258
    // Paperg - allow us to specify that we want case insensitive testing of the value of the selector.
1259
    public function find($selector, $idx = null, $lowercase = false)
1260
    {
1261
        return $this->root->find($selector, $idx, $lowercase);
1262
    }
1263
1264
    // clean up memory due to php5 circular references memory leak...
1265
    public function clear()
1266
    {
1267
        foreach ($this->nodes as $n) {
1268
            $n->clear();
1269
            $n = null;
1270
        }
1271
        // The next block is documented in the sourceforge repository (issue 2977248) as a fix for ongoing memory leaks that occur even with the use of clear().
1272
        if (isset($this->children)) {
1273
            foreach ($this->children as $n) {
1274
                $n->clear();
1275
                $n = null;
1276
            }
1277
        }
1278
        if (isset($this->parent)) {
1279
            $this->parent->clear();
1280
            unset($this->parent);
1281
        }
1282
        if (isset($this->root)) {
1283
            $this->root->clear();
1284
            unset($this->root);
1285
        }
1286
        unset($this->doc);
1287
        unset($this->noise);
1288
    }
1289
1290
    public function dump($show_attr = true)
1291
    {
1292
        $this->root->dump($show_attr);
1293
    }
1294
1295
    // prepare HTML data and init everything
1296
    protected function prepare($str, $lowercase = true, $stripRN = true, $defaultBRText = DEFAULT_BR_TEXT, $defaultSpanText = DEFAULT_SPAN_TEXT)
1297
    {
1298
        $this->clear();
1299
1300
        // set the length of content before we do anything to it.
1301
        $this->size = strlen($str);
1302
        // Save the original size of the html that we got in.  It might be useful to someone.
1303
        $this->original_size = $this->size;
1304
1305
        //before we save the string as the doc...  strip out the \r \n's if we are told to.
1306
        if ($stripRN) {
1307
            $str = str_replace("\r", ' ', $str);
1308
            $str = str_replace("\n", ' ', $str);
1309
1310
            // set the length of content since we have changed it.
1311
            $this->size = strlen($str);
1312
        }
1313
1314
        $this->doc = $str;
1315
        $this->pos = 0;
1316
        $this->cursor = 1;
1317
        $this->noise = [];
1318
        $this->nodes = [];
1319
        $this->lowercase = $lowercase;
1320
        $this->default_br_text = $defaultBRText;
1321
        $this->default_span_text = $defaultSpanText;
1322
        $this->root = new simple_html_dom_node($this);
1323
        $this->root->tag = 'root';
1324
        $this->root->_[HDOM_INFO_BEGIN] = -1;
1325
        $this->root->nodetype = HDOM_TYPE_ROOT;
1326
        $this->parent = $this->root;
1327
        if ($this->size > 0) {
1328
            $this->char = $this->doc[0];
1329
        }
1330
    }
1331
1332
    // parse html content
1333
    protected function parse()
1334
    {
1335
        if (($s = $this->copy_until_char('<')) === '') {
1336
            return $this->read_tag();
1337
        }
1338
1339
        // text
1340
        $node = new simple_html_dom_node($this);
1341
        $this->cursor++;
1342
        $node->_[HDOM_INFO_TEXT] = $s;
1343
        $this->link_nodes($node, false);
1344
1345
        return true;
1346
    }
1347
1348
    // PAPERG - dkchou - added this to try to identify the character set of the page we have just parsed so we know better how to spit it out later.
1349
    // NOTE:  IF you provide a routine called get_last_retrieve_url_contents_content_type which returns the CURLINFO_CONTENT_TYPE from the last curl_exec
1350
    // (or the content_type header from the last transfer), we will parse THAT, and if a charset is specified, we will use it over any other mechanism.
1351
    protected function parse_charset()
1352
    {
1353
        global $debugObject;
1354
1355
        $charset = null;
1356
1357
        if (function_exists('get_last_retrieve_url_contents_content_type')) {
1358
            $contentTypeHeader = get_last_retrieve_url_contents_content_type();
1359
            $success = preg_match('/charset=(.+)/', $contentTypeHeader, $matches);
1360
            if ($success) {
1361
                $charset = $matches[1];
1362
                if (is_object($debugObject)) {
1363
                    $debugObject->debugLog(2, 'header content-type found charset of: '.$charset);
1364
                }
1365
            }
1366
        }
1367
1368
        if (empty($charset)) {
1369
            $el = $this->root->find('meta[http-equiv=Content-Type]', 0);
1370
            if (!empty($el)) {
1371
                $fullvalue = $el->content;
1372
                if (is_object($debugObject)) {
1373
                    $debugObject->debugLog(2, 'meta content-type tag found'.$fullvalue);
1374
                }
1375
1376
                if (!empty($fullvalue)) {
1377
                    $success = preg_match('/charset=(.+)/', $fullvalue, $matches);
1378
                    if ($success) {
1379
                        $charset = $matches[1];
1380
                    } else {
1381
                        // If there is a meta tag, and they don't specify the character set, research says that it's typically ISO-8859-1
1382
                        if (is_object($debugObject)) {
1383
                            $debugObject->debugLog(2, 'meta content-type tag couldn\'t be parsed. using iso-8859 default.');
1384
                        }
1385
                        $charset = 'ISO-8859-1';
1386
                    }
1387
                }
1388
            }
1389
        }
1390
1391
        // If we couldn't find a charset above, then let's try to detect one based on the text we got...
1392
        if (empty($charset)) {
1393
            // Have php try to detect the encoding from the text given to us.
1394
            $charset = (function_exists('mb_detect_encoding')) ? mb_detect_encoding($this->root->plaintext.'ascii', $encoding_list = ['UTF-8', 'CP1252']) : false;
1395
            if (is_object($debugObject)) {
1396
                $debugObject->debugLog(2, 'mb_detect found: '.$charset);
1397
            }
1398
1399
            // and if this doesn't work...  then we need to just wrongheadedly assume it's UTF-8 so that we can move on - cause this will usually give us most of what we need...
1400
            if ($charset === false) {
1401
                if (is_object($debugObject)) {
1402
                    $debugObject->debugLog(2, 'since mb_detect failed - using default of utf-8');
1403
                }
1404
                $charset = 'UTF-8';
1405
            }
1406
        }
1407
1408
        // Since CP1252 is a superset, if we get one of its subsets, we want it instead.
1409
        if ((strtolower($charset) == strtolower('ISO-8859-1')) || (strtolower($charset) == strtolower('Latin1')) || (strtolower($charset) == strtolower('Latin-1'))) {
1410
            if (is_object($debugObject)) {
1411
                $debugObject->debugLog(2, 'replacing '.$charset.' with CP1252 as it is a superset');
1412
            }
1413
            $charset = 'CP1252';
1414
        }
1415
1416
        if (is_object($debugObject)) {
1417
            $debugObject->debugLog(1, 'EXIT - '.$charset);
1418
        }
1419
1420
        return $this->_charset = $charset;
1421
    }
1422
1423
    // read tag info
1424
    protected function read_tag()
1425
    {
1426
        if ($this->char !== '<') {
1427
            $this->root->_[HDOM_INFO_END] = $this->cursor;
1428
1429
            return false;
1430
        }
1431
        $begin_tag_pos = $this->pos;
1432
        $this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next
1433
1434
        // end tag
1435
        if ($this->char === '/') {
1436
            $this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next
1437
            // This represents the change in the simple_html_dom trunk from revision 180 to 181.
1438
            // $this->skip($this->token_blank_t);
1439
            $this->skip($this->token_blank);
1440
            $tag = $this->copy_until_char('>');
1441
1442
            // skip attributes in end tag
1443
            if (($pos = strpos($tag, ' ')) !== false) {
1444
                $tag = substr($tag, 0, $pos);
1445
            }
1446
1447
            $parent_lower = strtolower($this->parent->tag);
1448
            $tag_lower = strtolower($tag);
1449
1450
            if ($parent_lower !== $tag_lower) {
1451
                if (isset($this->optional_closing_tags[$parent_lower]) && isset($this->block_tags[$tag_lower])) {
1452
                    $this->parent->_[HDOM_INFO_END] = 0;
1453
                    $org_parent = $this->parent;
1454
1455
                    while (($this->parent->parent) && strtolower($this->parent->tag) !== $tag_lower) {
1456
                        $this->parent = $this->parent->parent;
1457
                    }
1458
1459
                    if (strtolower($this->parent->tag) !== $tag_lower) {
1460
                        $this->parent = $org_parent; // restore original parent
1461
                        if ($this->parent->parent) {
1462
                            $this->parent = $this->parent->parent;
1463
                        }
1464
                        $this->parent->_[HDOM_INFO_END] = $this->cursor;
1465
1466
                        return $this->as_text_node($tag);
1467
                    }
1468
                } elseif (($this->parent->parent) && isset($this->block_tags[$tag_lower])) {
1469
                    $this->parent->_[HDOM_INFO_END] = 0;
1470
                    $org_parent = $this->parent;
1471
1472
                    while (($this->parent->parent) && strtolower($this->parent->tag) !== $tag_lower) {
1473
                        $this->parent = $this->parent->parent;
1474
                    }
1475
1476
                    if (strtolower($this->parent->tag) !== $tag_lower) {
1477
                        $this->parent = $org_parent; // restore original parent
1478
                        $this->parent->_[HDOM_INFO_END] = $this->cursor;
1479
1480
                        return $this->as_text_node($tag);
1481
                    }
1482
                } elseif (($this->parent->parent) && strtolower($this->parent->parent->tag) === $tag_lower) {
1483
                    $this->parent->_[HDOM_INFO_END] = 0;
1484
                    $this->parent = $this->parent->parent;
1485
                } else {
1486
                    return $this->as_text_node($tag);
1487
                }
1488
            }
1489
1490
            $this->parent->_[HDOM_INFO_END] = $this->cursor;
1491
            if ($this->parent->parent) {
1492
                $this->parent = $this->parent->parent;
1493
            }
1494
1495
            $this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next
1496
            return true;
1497
        }
1498
1499
        $node = new simple_html_dom_node($this);
1500
        $node->_[HDOM_INFO_BEGIN] = $this->cursor;
1501
        $this->cursor++;
1502
        $tag = $this->copy_until($this->token_slash);
1503
        $node->tag_start = $begin_tag_pos;
1504
1505
        // doctype, cdata & comments...
1506
        if (isset($tag[0]) && $tag[0] === '!') {
1507
            $node->_[HDOM_INFO_TEXT] = '<'.$tag.$this->copy_until_char('>');
1508
1509
            if (isset($tag[2]) && $tag[1] === '-' && $tag[2] === '-') {
1510
                $node->nodetype = HDOM_TYPE_COMMENT;
1511
                $node->tag = 'comment';
1512
            } else {
1513
                $node->nodetype = HDOM_TYPE_UNKNOWN;
1514
                $node->tag = 'unknown';
1515
            }
1516
            if ($this->char === '>') {
1517
                $node->_[HDOM_INFO_TEXT] .= '>';
1518
            }
1519
            $this->link_nodes($node, true);
1520
            $this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next
1521
            return true;
1522
        }
1523
1524
        // text
1525
        if (strpos($tag, '<') !== false) {
1526
            $tag = '<'.substr($tag, 0, -1);
1527
            $node->_[HDOM_INFO_TEXT] = $tag;
1528
            $this->link_nodes($node, false);
1529
            $this->char = $this->doc[--$this->pos]; // prev
1530
            return true;
1531
        }
1532
1533
        if (!preg_match("/^[\w:-]+$/", $tag)) {
1534
            $node->_[HDOM_INFO_TEXT] = '<'.$tag.$this->copy_until('<>');
1535
            if ($this->char === '<') {
1536
                $this->link_nodes($node, false);
1537
1538
                return true;
1539
            }
1540
1541
            if ($this->char === '>') {
1542
                $node->_[HDOM_INFO_TEXT] .= '>';
1543
            }
1544
            $this->link_nodes($node, false);
1545
            $this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next
1546
            return true;
1547
        }
1548
1549
        // begin tag
1550
        $node->nodetype = HDOM_TYPE_ELEMENT;
1551
        $tag_lower = strtolower($tag);
1552
        $node->tag = ($this->lowercase) ? $tag_lower : $tag;
1553
1554
        // handle optional closing tags
1555
        if (isset($this->optional_closing_tags[$tag_lower])) {
1556
            while (isset($this->optional_closing_tags[$tag_lower][strtolower($this->parent->tag)])) {
1557
                $this->parent->_[HDOM_INFO_END] = 0;
1558
                $this->parent = $this->parent->parent;
1559
            }
1560
            $node->parent = $this->parent;
1561
        }
1562
1563
        $guard = 0; // prevent infinite loop
1564
        $space = [$this->copy_skip($this->token_blank), '', ''];
1565
1566
        // attributes
1567
        do {
1568
            if ($this->char !== null && $space[0] === '') {
1569
                break;
1570
            }
1571
            $name = $this->copy_until($this->token_equal);
1572
            if ($guard === $this->pos) {
1573
                $this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next
1574
                continue;
1575
            }
1576
            $guard = $this->pos;
1577
1578
            // handle endless '<'
1579
            if ($this->pos >= $this->size - 1 && $this->char !== '>') {
1580
                $node->nodetype = HDOM_TYPE_TEXT;
1581
                $node->_[HDOM_INFO_END] = 0;
1582
                $node->_[HDOM_INFO_TEXT] = '<'.$tag.$space[0].$name;
1583
                $node->tag = 'text';
1584
                $this->link_nodes($node, false);
1585
1586
                return true;
1587
            }
1588
1589
            // handle mismatch '<'
1590
            if ($this->doc[$this->pos - 1] == '<') {
1591
                $node->nodetype = HDOM_TYPE_TEXT;
1592
                $node->tag = 'text';
1593
                $node->attr = [];
1594
                $node->_[HDOM_INFO_END] = 0;
1595
                $node->_[HDOM_INFO_TEXT] = substr($this->doc, $begin_tag_pos, $this->pos - $begin_tag_pos - 1);
1596
                $this->pos -= 2;
1597
                $this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next
1598
                $this->link_nodes($node, false);
1599
1600
                return true;
1601
            }
1602
1603
            if ($name !== '/' && $name !== '') {
1604
                $space[1] = $this->copy_skip($this->token_blank);
1605
                $name = $this->restore_noise($name);
1606
                if ($this->lowercase) {
1607
                    $name = strtolower($name);
1608
                }
1609
                if ($this->char === '=') {
1610
                    $this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next
1611
                    $this->parse_attr($node, $name, $space);
1612
                } else {
1613
                    // no-value attributes: nowrap, checked, selected...
1614
                    $node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_NO;
1615
                    $node->attr[$name] = true;
1616
                    if ($this->char != '>') {
1617
                        $this->char = $this->doc[--$this->pos];
1618
                    } // prev
1619
                }
1620
                $node->_[HDOM_INFO_SPACE][] = $space;
1621
                $space = [$this->copy_skip($this->token_blank), '', ''];
1622
            } else {
1623
                break;
1624
            }
1625
        } while ($this->char !== '>' && $this->char !== '/');
1626
1627
        $this->link_nodes($node, true);
1628
        $node->_[HDOM_INFO_ENDSPACE] = $space[0];
1629
1630
        // check self closing
1631
        if ($this->copy_until_char_escape('>') === '/') {
1632
            $node->_[HDOM_INFO_ENDSPACE] .= '/';
1633
            $node->_[HDOM_INFO_END] = 0;
1634
        } else {
1635
            // reset parent
1636
            if (!isset($this->self_closing_tags[strtolower($node->tag)])) {
1637
                $this->parent = $node;
1638
            }
1639
        }
1640
        $this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next
1641
1642
        // If it's a BR tag, we need to set its text to the default text.
1643
        // This way when we see it in plaintext, we can generate formatting that the user wants.
1644
        // Since a br tag never has child nodes, this works well.
1645
        if ($node->tag == 'br') {
1646
            $node->_[HDOM_INFO_INNER] = $this->default_br_text;
1647
        }
1648
1649
        return true;
1650
    }
1651
1652
    // parse attributes
1653
    protected function parse_attr($node, $name, &$space)
1654
    {
1655
        // Per sourceforge: http://sourceforge.net/tracker/?func=detail&aid=3061408&group_id=218559&atid=1044037
1656
        // If the attribute is already defined inside a tag, only pay attention to the first one as opposed to the last one.
1657
        if (isset($node->attr[$name])) {
1658
            return;
1659
        }
1660
1661
        $space[2] = $this->copy_skip($this->token_blank);
1662
        switch ($this->char) {
1663
            case '"':
1664
                $node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_DOUBLE;
1665
                $this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next
1666
                $node->attr[$name] = $this->restore_noise($this->copy_until_char_escape('"'));
1667
                $this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next
1668
                break;
1669
            case '\'':
1670
                $node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_SINGLE;
1671
                $this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next
1672
                $node->attr[$name] = $this->restore_noise($this->copy_until_char_escape('\''));
1673
                $this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next
1674
                break;
1675
            default:
1676
                $node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_NO;
1677
                $node->attr[$name] = $this->restore_noise($this->copy_until($this->token_attr));
1678
        }
1679
        // PaperG: Attributes should not have \r or \n in them; those count as HTML whitespace.
1680
        $node->attr[$name] = str_replace("\r", '', $node->attr[$name]);
1681
        $node->attr[$name] = str_replace("\n", '', $node->attr[$name]);
1682
        // PaperG: If this is a "class" attribute, let's get rid of the preceding and trailing space since some people leave it in the multi-class case.
1683
        if ($name == 'class') {
1684
            $node->attr[$name] = trim($node->attr[$name]);
1685
        }
1686
    }
1687
1688
    // link node's parent
1689
    protected function link_nodes(&$node, $is_child)
1690
    {
1691
        $node->parent = $this->parent;
1692
        $this->parent->nodes[] = $node;
1693
        if ($is_child) {
1694
            $this->parent->children[] = $node;
1695
        }
1696
    }
1697
1698
    // as a text node
1699
    protected function as_text_node($tag)
1700
    {
1701
        $node = new simple_html_dom_node($this);
1702
        $this->cursor++;
1703
        $node->_[HDOM_INFO_TEXT] = '</'.$tag.'>';
1704
        $this->link_nodes($node, false);
1705
        $this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next
1706
        return true;
1707
    }
1708
1709
    protected function skip($chars)
1710
    {
1711
        $this->pos += strspn($this->doc, $chars, $this->pos);
1712
        $this->char = ($this->pos < $this->size) ? $this->doc[$this->pos] : null; // next
1713
    }
1714
1715
    protected function copy_skip($chars)
1716
    {
1717
        $pos = $this->pos;
1718
        $len = strspn($this->doc, $chars, $pos);
1719
        $this->pos += $len;
1720
        $this->char = ($this->pos < $this->size) ? $this->doc[$this->pos] : null; // next
1721
        if ($len === 0) {
1722
            return '';
1723
        }
1724
1725
        return substr($this->doc, $pos, $len);
1726
    }
1727
1728
    protected function copy_until($chars)
1729
    {
1730
        $pos = $this->pos;
1731
        $len = strcspn($this->doc, $chars, $pos);
1732
        $this->pos += $len;
1733
        $this->char = ($this->pos < $this->size) ? $this->doc[$this->pos] : null; // next
1734
        return substr($this->doc, $pos, $len);
1735
    }
1736
1737
    protected function copy_until_char($char)
1738
    {
1739
        if ($this->char === null) {
1740
            return '';
1741
        }
1742
1743
        if (($pos = strpos($this->doc, $char, $this->pos)) === false) {
1744
            $ret = substr($this->doc, $this->pos, $this->size - $this->pos);
1745
            $this->char = null;
1746
            $this->pos = $this->size;
1747
1748
            return $ret;
1749
        }
1750
1751
        if ($pos === $this->pos) {
1752
            return '';
1753
        }
1754
        $pos_old = $this->pos;
1755
        $this->char = $this->doc[$pos];
1756
        $this->pos = $pos;
1757
1758
        return substr($this->doc, $pos_old, $pos - $pos_old);
1759
    }
1760
1761
    protected function copy_until_char_escape($char)
1762
    {
1763
        if ($this->char === null) {
1764
            return '';
1765
        }
1766
1767
        $start = $this->pos;
1768
        while (1) {
1769
            if (($pos = strpos($this->doc, $char, $start)) === false) {
1770
                $ret = substr($this->doc, $this->pos, $this->size - $this->pos);
1771
                $this->char = null;
1772
                $this->pos = $this->size;
1773
1774
                return $ret;
1775
            }
1776
1777
            if ($pos === $this->pos) {
1778
                return '';
1779
            }
1780
1781
            if ($this->doc[$pos - 1] === '\\') {
1782
                $start = $pos + 1;
1783
                continue;
1784
            }
1785
1786
            $pos_old = $this->pos;
1787
            $this->char = $this->doc[$pos];
1788
            $this->pos = $pos;
1789
1790
            return substr($this->doc, $pos_old, $pos - $pos_old);
1791
        }
1792
    }
1793
1794
    // remove noise from html content
1795
    // save the noise in the $this->noise array.
1796
    protected function remove_noise($pattern, $remove_tag = false)
1797
    {
1798
        global $debugObject;
1799
        if (is_object($debugObject)) {
1800
            $debugObject->debugLogEntry(1);
1801
        }
1802
1803
        $count = preg_match_all($pattern, $this->doc, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
1804
1805
        for ($i = $count - 1; $i > -1; $i--) {
1806
            $key = '___noise___'.sprintf('% 5d', count($this->noise) + 1000);
1807
            if (is_object($debugObject)) {
1808
                $debugObject->debugLog(2, 'key is: '.$key);
1809
            }
1810
            $idx = ($remove_tag) ? 0 : 1;
1811
            $this->noise[$key] = $matches[$i][$idx][0];
1812
            $this->doc = substr_replace($this->doc, $key, $matches[$i][$idx][1], strlen($matches[$i][$idx][0]));
1813
        }
1814
1815
        // reset the length of content
1816
        $this->size = strlen($this->doc);
1817
        if ($this->size > 0) {
1818
            $this->char = $this->doc[0];
1819
        }
1820
    }
1821
1822
    // restore noise to html content
1823
    public function restore_noise($text)
1824
    {
1825
        global $debugObject;
1826
        if (is_object($debugObject)) {
1827
            $debugObject->debugLogEntry(1);
1828
        }
1829
1830
        while (($pos = strpos($text, '___noise___')) !== false) {
1831
            // Sometimes the markup is broken and the five key characters at pos+11..pos+15 are missing, which indicates a problem outside this parser.
1832
            if (strlen($text) > $pos + 15) {
1833
                $key = '___noise___'.$text[$pos + 11].$text[$pos + 12].$text[$pos + 13].$text[$pos + 14].$text[$pos + 15];
1834
                if (is_object($debugObject)) {
1835
                    $debugObject->debugLog(2, 'located key of: '.$key);
1836
                }
1837
1838
                if (isset($this->noise[$key])) {
1839
                    $text = substr($text, 0, $pos).$this->noise[$key].substr($text, $pos + 16);
1840
                } else {
1841
                    // do this to prevent an infinite loop.
1842
                    $text = substr($text, 0, $pos).'UNDEFINED NOISE FOR KEY: '.$key.substr($text, $pos + 16);
1843
                }
1844
            } else {
1845
                // There is no valid key being given back to us... We must get rid of the ___noise___ or we will have a problem.
1846
                $text = substr($text, 0, $pos).'NO NUMERIC NOISE KEY'.substr($text, $pos + 11);
1847
            }
1848
        }
1849
1850
        return $text;
1851
    }
1852
1853
    // Sometimes we NEED one of the noise elements.
1854
    public function search_noise($text)
1855
    {
1856
        global $debugObject;
1857
        if (is_object($debugObject)) {
1858
            $debugObject->debugLogEntry(1);
1859
        }
1860
1861
        foreach ($this->noise as $noiseElement) {
1862
            if (strpos($noiseElement, $text) !== false) {
1863
                return $noiseElement;
1864
            }
1865
        }
1866
    }
1867
1868
    public function __toString()
1869
    {
1870
        return $this->root->innertext();
1871
    }
1872
1873
    public function __get($name)
1874
    {
1875
        switch ($name) {
1876
            case 'outertext':
1877
                return $this->root->innertext();
1878
            case 'innertext':
1879
                return $this->root->innertext();
1880
            case 'plaintext':
1881
                return $this->root->text();
1882
            case 'charset':
1883
                return $this->_charset;
1884
            case 'target_charset':
1885
                return $this->_target_charset;
1886
        }
1887
    }
1888
1889
    // camelCase aliases
1890
    public function childNodes($idx = -1)
1891
    {
1892
        return $this->root->childNodes($idx);
1893
    }
1894
1895
    public function firstChild()
1896
    {
1897
        return $this->root->first_child();
1898
    }
1899
1900
    public function lastChild()
1901
    {
1902
        return $this->root->last_child();
1903
    }
1904
1905
    public function createElement($name, $value = null)
1906
    {
1907
        return @str_get_html("<$name>$value</$name>")->firstChild();
1908
    }
1909
1910
    public function createTextNode($value)
1911
    {
1912
        return @end(str_get_html($value)->nodes);
1913
    }
1914
1915
    public function getElementById($id)
1916
    {
1917
        return $this->find("#$id", 0);
1918
    }
1919
1920
    public function getElementsById($id, $idx = null)
1921
    {
1922
        return $this->find("#$id", $idx);
1923
    }
1924
1925
    public function getElementByTagName($name)
1926
    {
1927
        return $this->find($name, 0);
1928
    }
1929
1930
    public function getElementsByTagName($name, $idx = -1)
1931
    {
1932
        return $this->find($name, $idx);
1933
    }
1934
1935
    public function loadFile()
1936
    {
1937
        $args = func_get_args();
1938
        // Forward the arguments individually rather than as a single array.
        call_user_func_array([$this, 'load_file'], $args);
1939
    }
1940
}
1941