Completed: Pull Request — master (#106) by unknown, created 09:10

Dom   C

Complexity

Total Complexity 73

Size/Duplication

Total Lines 694
Duplicated Lines 0 %

Coupling/Cohesion

Components 1
Dependencies 11

Importance

Changes 4
Bugs 0 Features 0
Metric   Value
wmc      73
c        4
b        0
f        0
lcom     1
cbo      11
dl       0
loc      694
rs       5.0364

24 Methods

Rating   Name                      Duplication   Size   Complexity
A        __toString()              0             4      1
A        __get()                   0             4      1
A        load()                    0             13     4
A        loadFromFile()            0             4      1
A        loadFromUrl()             0             10     2
A        loadStr()                 0             19     1
A        setOptions()              0             6      1
A        find()                    0             6      1
A        addSelfClosingTag()       0             11     3
A        removeSelfClosingTag()    0             9      2
A        clearSelfClosingTags()    0             6      1
A        addNoSlashTag()           0             11     3
A        removeNoSlashTag()        0             9      2
A        clearNoSlashTags()        0             6      1
A        firstChild()              0             6      1
A        lastChild()               0             6      1
A        getElementById()          0             6      1
A        getElementsByTag()        0             6      1
A        getElementsByClass()      0             6      1
A        isLoaded()                0             6      2
B        clean()                   0             47     5
C        parse()                   0             53     12
D        parseTag()                0             142    20
B        detectCharset()           0             42     5

How to fix: Complexity

Complex Class

Complex classes like Dom often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common way to find such a component is to look for fields/methods that share the same prefixes or suffixes. You can also look at the cohesion graph to spot any unconnected or weakly connected components.
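As a rough illustration of the prefix/suffix heuristic, the sketch below groups some of the method names from the table above by the text that remains after a leading verb. The helper name `groupBySuffix` and the verb list are assumptions made for this example, not part of the analysed code, and the plural handling is deliberately naive.

```php
<?php
// Hypothetical helper: group method names by the suffix left after a leading verb.
function groupBySuffix(array $methods, array $verbs)
{
    $groups = [];
    foreach ($methods as $method) {
        foreach ($verbs as $verb) {
            if (strpos($method, $verb) === 0 && strlen($method) > strlen($verb)) {
                // Drop the verb and trailing "s" characters so "Tag" and "Tags"
                // land in the same group (naive: would also shorten "Class").
                $suffix = rtrim(substr($method, strlen($verb)), 's');
                $groups[$suffix][] = $method;
                break;
            }
        }
    }

    // Keep only suffixes shared by at least two methods.
    return array_filter($groups, function ($names) {
        return count($names) > 1;
    });
}

$methods = [
    'load', 'loadFromFile', 'loadFromUrl', 'loadStr', 'find',
    'addSelfClosingTag', 'removeSelfClosingTag', 'clearSelfClosingTags',
    'addNoSlashTag', 'removeNoSlashTag', 'clearNoSlashTags',
    'parse', 'parseTag',
];

$groups = groupBySuffix($methods, ['add', 'remove', 'clear']);
print_r(array_keys($groups));
```

With this input the two groups that emerge are the self-closing tag methods and the no-slash tag methods, which suggests that tag-list management is one candidate component.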

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a subclass, Extract Subclass is also a candidate, and is often faster.
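A minimal sketch of what Extract Class could look like for the tag lists, assuming a hypothetical `TagList` class that does not exist in the analysed code. It mirrors the add/remove/clear behaviour of the `$selfClosing` and `$noSlash` arrays in Dom; how Dom would delegate to it is left open.

```php
<?php
// Hypothetical class extracted from Dom: owns one list of tag names.
class TagList
{
    /** @var string[] */
    private $tags;

    public function __construct(array $tags = [])
    {
        $this->tags = array_values($tags);
    }

    // Accepts a single tag or an array of tags, as Dom::addSelfClosingTag() does.
    public function add($tag)
    {
        foreach ((array) $tag as $value) {
            $this->tags[] = $value;
        }

        return $this;
    }

    public function remove($tag)
    {
        // array_values() reindexes the list after array_diff() leaves gaps.
        $this->tags = array_values(array_diff($this->tags, (array) $tag));

        return $this;
    }

    public function clear()
    {
        $this->tags = [];

        return $this;
    }

    public function contains($tag)
    {
        return in_array($tag, $this->tags, true);
    }
}

$selfClosing = new TagList(['img', 'br', 'input']);
$selfClosing->add(['hr', 'meta'])->remove('br');

var_dump($selfClosing->contains('hr')); // bool(true)
var_dump($selfClosing->contains('br')); // bool(false)
```

Under this assumption, Dom could hold two `TagList` instances, and its six add/remove/clear methods would become thin delegates or disappear from its public surface.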

While breaking up the class, it is a good idea to analyze how other classes use Dom, and based on these observations, apply Extract Interface, too.
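Extract Interface can be sketched in the same spirit. The interface name `ElementFinder`, the stand-in class `ArrayFinder`, and the function `countMatches` are all hypothetical and chosen only for illustration. The point is that a caller depends on the small set of methods it actually uses, not on Dom as a whole.

```php
<?php
// Hypothetical interface: the query method that callers use on Dom.
interface ElementFinder
{
    public function find($selector, $nth = null);
}

// Stand-in implementation for illustration only. The real Dom would declare
// "implements ElementFinder" and keep its existing find() method.
class ArrayFinder implements ElementFinder
{
    private $elements;

    public function __construct(array $elements)
    {
        $this->elements = $elements;
    }

    public function find($selector, $nth = null)
    {
        $matches = isset($this->elements[$selector]) ? $this->elements[$selector] : [];
        if ($nth === null) {
            return $matches;
        }

        return isset($matches[$nth]) ? $matches[$nth] : null;
    }
}

// A caller that depends only on the interface, not on Dom itself.
function countMatches(ElementFinder $finder, $selector)
{
    return count($finder->find($selector));
}

$finder = new ArrayFinder([
    'a' => ['<a href="/">home</a>', '<a href="/about">about</a>'],
]);

echo countMatches($finder, 'a'), "\n"; // prints 2
```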

1
<?php
2
namespace PHPHtmlParser;
3
4
use PHPHtmlParser\Dom\AbstractNode;
5
use PHPHtmlParser\Dom\HtmlNode;
6
use PHPHtmlParser\Dom\TextNode;
7
use PHPHtmlParser\Exceptions\NotLoadedException;
8
use PHPHtmlParser\Exceptions\StrictException;
9
use stringEncode\Encode;
10
11
/**
12
 * Class Dom
13
 *
14
 * @package PHPHtmlParser
15
 */
16
class Dom
17
{
18
19
    /**
20
     * The charset we would like the output to be in.
21
     *
22
     * @var string
23
     */
24
    protected $defaultCharset = 'UTF-8';
25
26
    /**
27
     * Contains the root node of this dom tree.
28
     *
29
     * @var HtmlNode
30
     */
31
    public $root;
32
33
    /**
34
     * The raw version of the document string.
35
     *
36
     * @var string
37
     */
38
    protected $raw;
39
40
    /**
41
     * The document string.
42
     *
43
     * @var Content
44
     */
45
    protected $content = null;
46
47
    /**
48
     * The original file size of the document.
49
     *
50
     * @var int
51
     */
52
    protected $rawSize;
53
54
    /**
55
     * The size of the document after it is cleaned.
56
     *
57
     * @var int
58
     */
59
    protected $size;
60
61
    /**
62
     * A global options array to be used by all load calls.
63
     *
64
     * @var array
65
     */
66
    protected $globalOptions = [];
67
68
    /**
69
     * A persistent option object to be used for all options in the
70
     * parsing of the file.
71
     *
72
     * @var Options
73
     */
74
    protected $options;
75
76
    /**
77
     * A list of tags which will always be self closing
78
     *
79
     * @var array
80
     */
81
    protected $selfClosing = [
82
        'img',
83
        'br',
84
        'input',
85
        'meta',
86
        'link',
87
        'hr',
88
        'base',
89
        'embed',
90
        'spacer',
91
    ];
92
93
    /**
94
     * A list of tags where there should be no /> at the end (html5 style)
95
     *
96
     * @var array
97
     */
98
    protected $noSlash = [];
99
100
    /**
101
     * Returns the inner html of the root node.
102
     *
103
     * @return string
104
     */
105
    public function __toString()
106
    {
107
        return $this->root->innerHtml();
108
    }
109
110
    /**
111
     * A simple wrapper around the root node.
112
     *
113
     * @param string $name
114
     * @return mixed
115
     */
116
    public function __get($name)
117
    {
118
        return $this->root->$name;
119
    }
120
121
    /**
122
     * Attempts to load the dom from any resource, string, file, or URL.
123
     *
124
     * @param string $str
125
     * @param array $options
126
     * @return $this
127
     */
128
    public function load($str, $options = [])
129
    {
130
        // check if it's a file
131
        if (strpos($str, "\n") === false && is_file($str)) {
132
            return $this->loadFromFile($str, $options);
133
        }
134
        // check if it's a url
135
        if (preg_match("/^https?:\/\//i", $str)) {
136
            return $this->loadFromUrl($str, $options);
137
        }
138
139
        return $this->loadStr($str, $options);
140
    }
141
142
    /**
143
     * Loads the dom from a document file/url
144
     *
145
     * @param string $file
146
     * @param array $options
147
     * @return $this
148
     */
149
    public function loadFromFile($file, $options = [])
150
    {
151
        return $this->loadStr(file_get_contents($file), $options);
152
    }
153
154
    /**
155
     * Use a curl interface implementation to attempt to load
156
     * the content from a url.
157
     *
158
     * @param string $url
159
     * @param array $options
160
     * @param CurlInterface $curl
161
     * @return $this
162
     */
163
    public function loadFromUrl($url, $options = [], CurlInterface $curl = null)
164
    {
165
        if (is_null($curl)) {
166
            // use the default curl interface
167
            $curl = new Curl;
168
        }
169
        $content = $curl->get($url);
170
171
        return $this->loadStr($content, $options);
172
    }
173
174
    /**
175
     * Parses the html of the given string. Used for load(), loadFromFile(),
176
     * and loadFromUrl().
177
     *
178
     * @param string $str
179
     * @param array $option
180
     * @return $this
181
     */
182
    public function loadStr($str, $option)
183
    {
184
        $this->options = new Options;
185
        $this->options->setOptions($this->globalOptions)
186
                      ->setOptions($option);
187
188
        $this->rawSize = strlen($str);
189
        $this->raw     = $str;
190
191
        $html = $this->clean($str);
192
193
        $this->size    = strlen($html);
194
        $this->content = new Content($html);
195
196
        $this->parse();
197
        $this->detectCharset();
198
199
        return $this;
200
    }
201
202
    /**
203
     * Sets a global options array to be used by all load calls.
204
     *
205
     * @param array $options
206
     * @return $this
207
     */
208
    public function setOptions(array $options)
209
    {
210
        $this->globalOptions = $options;
211
212
        return $this;
213
    }
214
215
    /**
216
     * Find elements by css selector on the root node.
217
     *
218
     * @param string $selector
219
     * @param int $nth
220
     * @return array
221
     */
222
    public function find($selector, $nth = null)
223
    {
224
        $this->isLoaded();
225
226
        return $this->root->find($selector, $nth);
227
    }
228
229
    /**
230
     * Adds the tag (or tags in an array) to the list of tags that will always
231
     * be self closing.
232
     *
233
     * @param string|array $tag
234
     * @return $this
235
     */
236
    public function addSelfClosingTag($tag)
237
    {
238
        if ( ! is_array($tag)) {
239
            $tag = [$tag];
240
        }
241
        foreach ($tag as $value) {
242
            $this->selfClosing[] = $value;
243
        }
244
245
        return $this;
246
    }
247
248
    /**
249
     * Removes the tag (or tags in an array) from the list of tags that will
250
     * always be self closing.
251
     *
252
     * @param string|array $tag
253
     * @return $this
254
     */
255
    public function removeSelfClosingTag($tag)
256
    {
257
        if ( ! is_array($tag)) {
258
            $tag = [$tag];
259
        }
260
        $this->selfClosing = array_diff($this->selfClosing, $tag);
261
262
        return $this;
263
    }
264
265
    /**
266
     * Sets the list of self closing tags to empty.
267
     *
268
     * @return $this
269
     */
270
    public function clearSelfClosingTags()
271
    {
272
        $this->selfClosing = [];
273
274
        return $this;
275
    }
276
277
278
    /**
279
     * Adds a tag to the list of self closing tags that should not have a trailing slash
280
     *
281
     * @param $tag
282
     * @return $this
283
     */
284
    public function addNoSlashTag($tag)
285
    {
286
        if ( ! is_array($tag)) {
287
            $tag = [$tag];
288
        }
289
        foreach ($tag as $value) {
290
            $this->noSlash[] = $value;
291
        }
292
293
        return $this;
294
    }
295
296
    /**
297
     * Removes a tag from the list of no-slash tags.
298
     *
299
     * @param $tag
300
     * @return $this
301
     */
302
    public function removeNoSlashTag($tag)
303
    {
304
        if ( ! is_array($tag)) {
305
            $tag = [$tag];
306
        }
307
        $this->noSlash = array_diff($this->noSlash, $tag);
308
309
        return $this;
310
    }
311
312
    /**
313
     * Empties the list of no-slash tags.
314
     *
315
     * @return $this
316
     */
317
    public function clearNoSlashTags()
318
    {
319
        $this->noSlash = [];
320
321
        return $this;
322
    }
323
324
    /**
325
     * Simple wrapper function that returns the first child.
326
     *
327
     * @return \PHPHtmlParser\Dom\AbstractNode
328
     */
329
    public function firstChild()
330
    {
331
        $this->isLoaded();
332
333
        return $this->root->firstChild();
334
    }
335
336
    /**
337
     * Simple wrapper function that returns the last child.
338
     *
339
     * @return \PHPHtmlParser\Dom\AbstractNode
340
     */
341
    public function lastChild()
342
    {
343
        $this->isLoaded();
344
345
        return $this->root->lastChild();
346
    }
347
348
    /**
349
     * Simple wrapper function that returns an element by the
350
     * id.
351
     *
352
     * @param string $id
353
     * @return \PHPHtmlParser\Dom\AbstractNode
354
     */
355
    public function getElementById($id)
356
    {
357
        $this->isLoaded();
358
359
        return $this->find('#'.$id, 0);
Bug (Compatibility): The expression $this->find('#' . $id, 0) has type array|PHPHtmlParser\Dom\AbstractNode, which adds the type array to the return on line 359. This is incompatible with the return type PHPHtmlParser\Dom\AbstractNode documented by PHPHtmlParser\Dom::getElementById.
360
    }
361
362
    /**
363
     * Simple wrapper function that returns all elements by
364
     * tag name.
365
     *
366
     * @param string $name
367
     * @return array
368
     */
369
    public function getElementsByTag($name)
370
    {
371
        $this->isLoaded();
372
373
        return $this->find($name);
374
    }
375
376
    /**
377
     * Simple wrapper function that returns all elements by
378
     * class name.
379
     *
380
     * @param string $class
381
     * @return array
382
     */
383
    public function getElementsByClass($class)
384
    {
385
        $this->isLoaded();
386
387
        return $this->find('.'.$class);
388
    }
389
390
    /**
391
     * Checks if the load methods have been called.
392
     *
393
     * @throws NotLoadedException
394
     */
395
    protected function isLoaded()
396
    {
397
        if (is_null($this->content)) {
398
            throw new NotLoadedException('Content is not loaded!');
399
        }
400
    }
401
402
    /**
403
     * Cleans the html of any non-html information.
404
     *
405
     * @param string $str
406
     * @return string
407
     */
408
    protected function clean($str)
409
    {
410
        if ($this->options->get('cleanupInput') != true) {
411
            // skip entire cleanup step
412
            return $str;
413
        }
414
415
        // remove white space before closing tags
416
        $str = mb_eregi_replace("'\s+>", "'>", $str);
417
        $str = mb_eregi_replace('"\s+>', '">', $str);
418
419
        // clean out the \n\r
420
        $replace = ' ';
421
        if ($this->options->get('preserveLineBreaks')) {
422
            $replace = '&#10;';
423
        }
424
        $str = str_replace(["\r\n", "\r", "\n"], $replace, $str);
425
426
        // strip the doctype
427
        $str = mb_eregi_replace("<!doctype(.*?)>", '', $str);
428
429
        // strip out comments
430
        $str = mb_eregi_replace("<!--(.*?)-->", '', $str);
431
432
        // strip out cdata
433
        $str = mb_eregi_replace("<!\[CDATA\[(.*?)\]\]>", '', $str);
434
435
        // strip out <script> tags
436
        if ($this->options->get('removeScripts') == true) {
437
            $str = mb_eregi_replace("<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>", '', $str);
438
            $str = mb_eregi_replace("<\s*script\s*>(.*?)<\s*/\s*script\s*>", '', $str);
439
        }
440
441
        // strip out <style> tags
442
        if ($this->options->get('removeStyles') == true) {
443
            $str = mb_eregi_replace("<\s*style[^>]*[^/]>(.*?)<\s*/\s*style\s*>", '', $str);
444
            $str = mb_eregi_replace("<\s*style\s*>(.*?)<\s*/\s*style\s*>", '', $str);
445
        }
446
447
        // strip out server side scripts
448
        $str = mb_eregi_replace("(<\?)(.*?)(\?>)", '', $str);
449
450
        // strip smarty scripts
451
        $str = mb_eregi_replace("(\{\w)(.*?)(\})", '', $str);
452
453
        return $str;
454
    }
455
456
    /**
457
     * Attempts to parse the html in content.
458
     */
459
    protected function parse()
460
    {
461
        // add the root node
462
        $this->root = new HtmlNode('root');
463
        $activeNode = $this->root;
464
        while ( ! is_null($activeNode)) {
465
            $str = $this->content->copyUntil('<');
466
            if ($str == '') {
467
                $info = $this->parseTag();
468
                if ( ! $info['status']) {
469
                    // we are done here
470
                    $activeNode = null;
471
                    continue;
472
                }
473
474
                // check if it was a closing tag
475
                if ($info['closing']) {
476
                    $originalNode = $activeNode;
477
                    while ($activeNode->getTag()->name() != $info['tag']) {
478
                        $activeNode = $activeNode->getParent();
479
                        if (is_null($activeNode)) {
480
                            // we could not find opening tag
481
                            $activeNode = $originalNode;
482
                            break;
483
                        }
484
                    }
485
                    if ( ! is_null($activeNode)) {
486
                        $activeNode = $activeNode->getParent();
487
                    }
488
                    continue;
489
                }
490
491
                if ( ! isset($info['node'])) {
492
                    continue;
493
                }
494
495
                /** @var AbstractNode $node */
496
                $node = $info['node'];
497
                $activeNode->addChild($node);
498
499
                // check if node is self closing
500
                if ( ! $node->getTag()->isSelfClosing()) {
501
                    $activeNode = $node;
502
                }
503
            } else if ($this->options->whitespaceTextNode ||
504
                trim($str) != ''
505
            ) {
506
                // we found text we care about
507
                $textNode = new TextNode($str);
508
                $activeNode->addChild($textNode);
509
            }
510
        }
511
    }
512
513
    /**
514
     * Attempt to parse a tag out of the content.
515
     *
516
     * @return array
517
     * @throws StrictException
518
     */
519
    protected function parseTag()
520
    {
521
        $return = [
522
            'status'  => false,
523
            'closing' => false,
524
            'node'    => null,
525
        ];
526
        if ($this->content->char() != '<') {
527
            // we are not at the beginning of a tag
528
            return $return;
529
        }
530
531
        // check if this is a closing tag
532
        if ($this->content->fastForward(1)->char() == '/') {
533
            // end tag
534
            $tag = $this->content->fastForward(1)
535
                                 ->copyByToken('slash', true);
536
            // move to end of tag
537
            $this->content->copyUntil('>');
538
            $this->content->fastForward(1);
539
540
            // check if this closing tag counts
541
            $tag = strtolower($tag);
542
            if (in_array($tag, $this->selfClosing)) {
543
                $return['status'] = true;
544
545
                return $return;
546
            } else {
547
                $return['status']  = true;
548
                $return['closing'] = true;
549
                $return['tag']     = strtolower($tag);
550
            }
551
552
            return $return;
553
        }
554
555
        $tag  = strtolower($this->content->copyByToken('slash', true));
556
        $node = new HtmlNode($tag);
557
558
        // attributes
559
        while ($this->content->char() != '>' &&
560
            $this->content->char() != '/') {
561
            $space = $this->content->skipByToken('blank', true);
562
            if (empty($space)) {
563
                $this->content->fastForward(1);
564
                continue;
565
            }
566
567
            $name = $this->content->copyByToken('equal', true);
568
            if ($name == '/') {
569
                break;
570
            }
571
572
            if (empty($name)) {
573
                $this->content->fastForward(1);
574
                continue;
575
            }
576
577
            $this->content->skipByToken('blank');
578
            if ($this->content->char() == '=') {
579
                $attr = [];
580
                $this->content->fastForward(1)
581
                              ->skipByToken('blank');
582
                switch ($this->content->char()) {
583
                    case '"':
584
                        $attr['doubleQuote'] = true;
585
                        $this->content->fastForward(1);
586
                        $string = $this->content->copyUntil('"', true, true);
587
                        do {
588
                            $moreString = $this->content->copyUntilUnless('"', '=>');
589
                            $string .= $moreString;
590
                        } while ( ! empty($moreString));
591
                        $attr['value'] = $string;
592
                        $this->content->fastForward(1);
593
                        $node->getTag()->$name = $attr;
594
                        break;
595
                    case "'":
596
                        $attr['doubleQuote'] = false;
597
                        $this->content->fastForward(1);
598
                        $string = $this->content->copyUntil("'", true, true);
599
                        do {
600
                            $moreString = $this->content->copyUntilUnless("'", '=>');
601
                            $string .= $moreString;
602
                        } while ( ! empty($moreString));
603
                        $attr['value'] = $string;
604
                        $this->content->fastForward(1);
605
                        $node->getTag()->$name = $attr;
606
                        break;
607
                    default:
608
                        $attr['doubleQuote']   = true;
609
                        $attr['value']         = $this->content->copyByToken('attr', true);
610
                        $node->getTag()->$name = $attr;
611
                        break;
612
                }
613
            } else {
614
                // no value attribute
615
                if ($this->options->strict) {
616
                    // can't have this in strict html
617
                    $character = $this->content->getPosition();
618
                    throw new StrictException("Tag '$tag' has an attribute '$name' without a value! (character #$character)");
619
                }
620
                $node->getTag()->$name = [
621
                    'value'       => null,
622
                    'doubleQuote' => true,
623
                ];
624
                if ($this->content->char() != '>') {
625
                    $this->content->rewind(1);
626
                }
627
            }
628
        }
629
630
        $this->content->skipByToken('blank');
631
        if ($this->content->char() == '/') {
632
            // self closing tag
633
            $node->getTag()->selfClosing();
634
            $this->content->fastForward(1);
635
        } elseif (in_array($tag, $this->selfClosing)) {
636
637
            // Should be a self closing tag, check if we are strict
638
            if ($this->options->strict) {
639
                $character = $this->content->getPosition();
640
                throw new StrictException("Tag '$tag' is not self closing! (character #$character)");
641
            }
642
643
            // We force self closing on this tag.
644
            $node->getTag()->selfClosing();
645
646
            // Should this tag use a trailing slash?
647
            if(in_array($tag, $this->noSlash))
648
            {
649
                $node->getTag()->noTrailingSlash();
650
            }
651
652
        }
653
654
        $this->content->fastForward(1);
655
656
        $return['status'] = true;
657
        $return['node']   = $node;
658
659
        return $return;
660
    }
661
662
    /**
663
     * Attempts to detect the charset that the html was sent in.
664
     *
665
     * @return bool
666
     */
667
    protected function detectCharset()
668
    {
669
        // set the default
670
        $encode = new Encode;
671
        $encode->from($this->defaultCharset);
672
        $encode->to($this->defaultCharset);
673
674
        if ( ! is_null($this->options->enforceEncoding)) {
675
            //  they want to enforce the given encoding
676
            $encode->from($this->options->enforceEncoding);
Documentation: $this->options->enforceEncoding is of type boolean, but the function expects a string.

It seems that the type of the argument is not accepted by the function/method you are calling.

In some cases, in particular if PHP’s automatic type juggling kicks in, this might be fine. In other cases, however, this might be a bug.

We suggest adding an explicit type cast, as in the following example:

function acceptsInteger($int) { }

$x = '123'; // string "123"

// Instead of
acceptsInteger($x);

// we recommend using
acceptsInteger((int) $x);
677
            $encode->to($this->options->enforceEncoding);
Documentation: $this->options->enforceEncoding is of type boolean, but the function expects a string.

678
679
            return false;
680
        }
681
682
        $meta = $this->root->find('meta[http-equiv=Content-Type]', 0);
683
        if (is_null($meta)) {
684
            // could not find meta tag
685
            $this->root->propagateEncoding($encode);
686
687
            return false;
688
        }
689
        $content = $meta->content;
690
        if (empty($content)) {
691
            // could not find content
692
            $this->root->propagateEncoding($encode);
693
694
            return false;
695
        }
696
        $matches = [];
697
        if (preg_match('/charset=(.+)/', $content, $matches)) {
698
            $encode->from(trim($matches[1]));
699
            $this->root->propagateEncoding($encode);
700
701
            return true;
702
        }
703
704
        // no charset found
705
        $this->root->propagateEncoding($encode);
706
707
        return false;
708
    }
709
}
710
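
The two inline issues reported in the listing above (the array|AbstractNode return in getElementById() and the boolean enforceEncoding passed where a string is expected) both come down to loosely typed values. The sketch below shows one possible way to guard against them. The helper names `resolveEncoding` and `firstOrNull` are hypothetical and not part of the analysed code, and falling back to a default encoding is an assumption that differs from what detectCharset() does today.

```php
<?php
// Hypothetical helper: choose the encoding to enforce. Non-string option
// values (for example a boolean default) are ignored instead of being cast,
// because casting true or false to a string yields "1" or "".
function resolveEncoding($enforceEncoding, $default = 'UTF-8')
{
    if (is_string($enforceEncoding) && $enforceEncoding !== '') {
        return $enforceEncoding;
    }

    return $default;
}

// Hypothetical helper: normalise an "array or single node" result to a single
// value or null, so that a method documented to return one node never
// returns an array.
function firstOrNull($result)
{
    if (is_array($result)) {
        return count($result) > 0 ? reset($result) : null;
    }

    return $result;
}

var_dump(resolveEncoding(false));        // string(5) "UTF-8"
var_dump(resolveEncoding('ISO-8859-1')); // string(10) "ISO-8859-1"
var_dump(firstOrNull([]));               // NULL
```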