Completed
Push — master ( 182f34...f24a60 )
by Asmir
13s
created

Tokenizer   F

Complexity

Total Complexity 169

Size/Duplication

Total Lines 1163
Duplicated Lines 0 %

Coupling/Cohesion

Dependencies 5

Test Coverage

Coverage 91.12%

Importance

Changes 0
Metric Value
wmc 169
cbo 5
dl 0
loc 1163
ccs 472
cts 518
cp 0.9112
rs 0.948
c 0
b 0
f 0

30 Methods

Rating   Name   Duplication   Size   Complexity  
B endTag() 0 37 7
A __construct() 0 6 1
A parse() 0 7 2
A setTextMode() 0 5 1
C consumeData() 0 71 13
A characterData() 0 19 6
A text() 0 17 3
A rawText() 0 13 2
B rcdata() 0 32 7
A eof() 0 7 1
B markupDeclaration() 0 21 6
A tagName() 0 29 5
B isTagEnd() 0 36 6
B attribute() 0 54 9
B attributeValue() 0 36 11
A quotedAttributeValue() 0 24 4
C unquotedAttributeValue() 0 38 13
A bogusComment() 0 18 3
A comment() 0 29 4
A isCommentEnd() 0 27 5
C doctype() 0 111 14
A quotedString() 0 18 4
A cdataSection() 0 31 5
B processingInstruction() 0 37 8
A readUntilSequence() 0 22 3
A sequenceMatches() 0 6 1
A flushBuffer() 0 8 2
A buffer() 0 4 1
A parseError() 0 15 2
D decodeCharacterReference() 0 96 20

How to fix   Complexity   

Complex Class

Complex classes like Tokenizer often do a lot of different things. To break such a class down, we need to identify a cohesive component within it. A common way to find such a component is to look for fields and methods that share the same prefixes or suffixes.
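As a rough illustration of the prefix/suffix heuristic, the sketch below groups a few method names from the methods table by their first and last camelCase word. The helpers `camelWords()` and `groupByAffix()` are invented for this example; they are not part of html5-php or of the analysis tool.

```php
<?php

/**
 * Split a camelCase method name into lowercase words.
 *
 * @return string[]
 */
function camelWords(string $name): array
{
    // Split in front of every uppercase letter, e.g. "attributeValue" -> "attribute", "Value".
    $parts = preg_split('/(?=[A-Z])/', $name, -1, PREG_SPLIT_NO_EMPTY);

    return array_map('strtolower', $parts);
}

/**
 * Group method names by the word they start or end with, keeping only
 * words that are shared by at least two methods.
 *
 * @param string[] $methods
 *
 * @return array<string, string[]>
 */
function groupByAffix(array $methods): array
{
    $groups = array();
    foreach ($methods as $method) {
        $words = camelWords($method);
        // The first word acts as the prefix, the last word as the suffix.
        foreach (array_unique(array(reset($words), end($words))) as $word) {
            $groups[$word][] = $method;
        }
    }

    return array_filter($groups, function (array $names) {
        return count($names) > 1;
    });
}

// A sample of method names taken from the table above.
$methods = array(
    'attribute',
    'attributeValue',
    'quotedAttributeValue',
    'unquotedAttributeValue',
    'comment',
    'bogusComment',
    'isCommentEnd',
    'doctype',
);

print_r(groupByAffix($methods));
```

On this sample the attribute-related and comment-related methods cluster together, which hints at where a cohesive component might be found. The heuristic only looks at names, so the result is a starting point for inspection, not a decision.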

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
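One possible shape of an Extract Class step is sketched below, using the text buffer as the component (the `$text` field plus the `buffer()` and `flushBuffer()` methods listed in the table). The class name `TextBuffer`, its method names, and the flush behaviour are assumptions made for illustration; the bodies of `buffer()` and `flushBuffer()` are not shown in this part of the listing.

```php
<?php

/**
 * Hypothetical class extracted from Tokenizer: owns the pending text buffer.
 */
final class TextBuffer
{
    /** @var string Text collected since the last flush. */
    private $text = '';

    /**
     * Append character data to the buffer.
     */
    public function append(string $chars): void
    {
        $this->text .= $chars;
    }

    /**
     * Pass any buffered text to $emit and reset the buffer.
     * Nothing is emitted when the buffer is empty (assumed behaviour).
     */
    public function flush(callable $emit): void
    {
        if ('' === $this->text) {
            return;
        }

        $emit($this->text);
        $this->text = '';
    }
}

// Usage: collect what would otherwise go to the event handler's text() method.
$emitted = array();
$collect = function (string $text) use (&$emitted) {
    $emitted[] = $text;
};

$buffer = new TextBuffer();
$buffer->append('Hello, ');
$buffer->append('world');
$buffer->flush($collect);
$buffer->flush($collect); // The buffer is empty now, so nothing is emitted.
```

Passing the emitter as a callable keeps the extracted class independent of the event handler type, which is one design option among several.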

While breaking up the class, it is a good idea to analyze how other classes use Tokenizer, and based on these observations, apply Extract Interface, too.
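A minimal sketch of Extract Interface, assuming callers only need the two public methods visible in the listing below, `parse()` and `setTextMode()`. The names `TokenizerInterface`, `RecordingTokenizer`, and `runTokenizer()` are hypothetical and exist only for this example.

```php
<?php

/**
 * Hypothetical interface covering the public methods callers rely on.
 */
interface TokenizerInterface
{
    public function parse();

    public function setTextMode($textmode, $untilTag = null);
}

/**
 * Stand-in implementation that records calls, so the example runs
 * without a scanner or an event handler.
 */
final class RecordingTokenizer implements TokenizerInterface
{
    /** @var string[] */
    public $calls = array();

    public function parse()
    {
        $this->calls[] = 'parse';
    }

    public function setTextMode($textmode, $untilTag = null)
    {
        $this->calls[] = 'setTextMode:' . $textmode;
    }
}

/**
 * Client code depends on the interface, not on the concrete class.
 */
function runTokenizer(TokenizerInterface $tokenizer)
{
    $tokenizer->setTextMode(0);
    $tokenizer->parse();
}

$tokenizer = new RecordingTokenizer();
runTokenizer($tokenizer);
```

Once client code is typed against the interface, the concrete class can be split up behind it without touching the callers.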

1
<?php
2
3
namespace Masterminds\HTML5\Parser;
4
5
use Masterminds\HTML5\Elements;
6
7
/**
8
 * The HTML5 tokenizer.
9
 *
10
 * The tokenizer's role is reading data from the scanner and gathering it into
11
 * semantic units. From the tokenizer, data is emitted to an event handler,
12
 * which may (for example) create a DOM tree.
13
 *
14
 * The HTML5 specification has a detailed explanation of tokenizing HTML5. We
15
 * follow that specification to the maximum extent that we can. If you find
16
 * a discrepancy that is not documented, please file a bug and/or submit a
17
 * patch.
18
 *
19
 * This tokenizer is implemented as a recursive descent parser.
20
 *
21
 * Within the API documentation, you may see references to the specific section
22
 * of the HTML5 spec that the code attempts to reproduce. Example: 8.2.4.1.
23
 * This refers to section 8.2.4.1 of the HTML5 CR specification.
24
 *
25
 * @see http://www.w3.org/TR/2012/CR-html5-20121217/
26
 */
27
class Tokenizer
28
{
29
    protected $scanner;
30
31
    protected $events;
32
33
    protected $tok;
34
35
    /**
36
     * Buffer for text.
37
     */
38
    protected $text = '';
39
40
    // When this goes to false, the parser stops.
41
    protected $carryOn = true;
42
43
    protected $textMode = 0; // TEXTMODE_NORMAL;
44
    protected $untilTag = null;
45
46
    const CONFORMANT_XML = 'xml';
47
    const CONFORMANT_HTML = 'html';
48
    protected $mode = self::CONFORMANT_HTML;
49
50
    /**
51
     * Create a new tokenizer.
52
     *
53
     * Typically, parsing a document involves creating a new tokenizer, giving
54
     * it a scanner (input) and an event handler (output), and then calling
55
     * the Tokenizer::parse() method.
56
     *
57
     * @param Scanner      $scanner      A scanner initialized with an input stream.
58
     * @param EventHandler $eventHandler An event handler, initialized and ready to receive events.
59
     * @param string       $mode
60
     */
61 127
    public function __construct($scanner, $eventHandler, $mode = self::CONFORMANT_HTML)
62
    {
63 127
        $this->scanner = $scanner;
64 127
        $this->events = $eventHandler;
65 127
        $this->mode = $mode;
66 127
    }
67
68
    /**
69
     * Begin parsing.
70
     *
71
     * This will begin scanning the document, tokenizing as it goes.
72
     * Tokens are emitted into the event handler.
73
     *
74
     * Tokenizing will continue until the document is completely
75
     * read. Errors are emitted into the event handler, but
76
     * the parser will attempt to continue parsing until the
77
     * entire input stream is read.
78
     */
79 127
    public function parse()
80
    {
81
        do {
82 127
            $this->consumeData();
83
            // FIXME: Add infinite loop protection.
84 127
        } while ($this->carryOn);
85 127
    }
86
87
    /**
88
     * Set the text mode for the character data reader.
89
     *
90
     * HTML5 defines three different modes for reading text:
91
     * - Normal: Read until a tag is encountered.
92
     * - RCDATA: Read until a tag is encountered, but skip a few otherwise-
93
     * special characters.
94
     * - Raw: Read until a special closing tag is encountered (e.g. script, style)
95
     *
96
     * This allows those modes to be set.
97
     *
98
     * Normally, setting is done by the event handler via a special return code on
99
     * startTag(), but it can also be set manually using this function.
100
     *
101
     * @param int    $textmode One of Elements::TEXT_*.
102
     * @param string $untilTag The tag that should stop RAW or RCDATA mode. Normal mode does not
103
     *                         use this indicator.
104
     */
105 108
    public function setTextMode($textmode, $untilTag = null)
106
    {
107 108
        $this->textMode = $textmode & (Elements::TEXT_RAW | Elements::TEXT_RCDATA);
108 108
        $this->untilTag = $untilTag;
109 108
    }
110
111
    /**
112
     * Consume a character and make a move.
113
     * HTML5 8.2.4.1.
114
     */
115 127
    protected function consumeData()
116
    {
117 127
        $tok = $this->scanner->current();
118
119 127
        if ('&' === $tok) {
120
            // Character reference
121 8
            $ref = $this->decodeCharacterReference();
122 8
            $this->buffer($ref);
123
124 8
            $tok = $this->scanner->current();
125 8
        }
126
127
        // Parse tag
128 127
        if ('<' === $tok) {
129
            // Any buffered text data can go out now.
130 123
            $this->flushBuffer();
131
132 123
            $tok = $this->scanner->next();
133
134 123
            if ('!' === $tok) {
135 101
                $this->markupDeclaration();
136 123
            } elseif ('/' === $tok) {
137 111
                $this->endTag();
138 120
            } elseif ('?' === $tok) {
139 7
                $this->processingInstruction();
140 119
            } elseif (ctype_alpha($tok)) {
141 114
                $this->tagName();
142 114
            } else {
143 1
                $this->parseError('Illegal tag opening');
144
                // TODO is this necessary ?
145 1
                $this->characterData();
146
            }
147
148 123
            $tok = $this->scanner->current();
149 123
        }
150
151 127
        if (false === $tok) {
152
            // Handle end of document
153 127
            $this->eof();
154 127
        } else {
155
            // Parse character
156 112
            switch ($this->textMode) {
157 112
                case Elements::TEXT_RAW:
158 8
                    $this->rawText($tok);
159 8
                    break;
160
161 112
                case Elements::TEXT_RCDATA:
162 37
                    $this->rcdata($tok);
163 37
                    break;
164
165 111
                default:
166 111
                    if ('<' === $tok || '&' === $tok) {
167 70
                        break;
168
                    }
169
170
                    // NULL character
171 87
                    if ("\00" === $tok) {
172
                        $this->parseError('Received null character.');
173
174
                        $this->text .= $tok;
175
                        $this->scanner->consume();
176
177
                        break;
178
                    }
179
180 87
                    $this->text .= $this->scanner->charsUntil("<&\0");
181 112
            }
182
        }
183
184 127
        return $this->carryOn;
185
    }
186
187
    /**
188
     * Parse anything that looks like character data.
189
     *
190
     * Different rules apply based on the current text mode.
191
     *
192
     * @see Elements::TEXT_RAW Elements::TEXT_RCDATA.
193
     */
194 1
    protected function characterData()
195
    {
196 1
        $tok = $this->scanner->current();
197 1
        if (false === $tok) {
198
            return false;
199
        }
200 1
        switch ($this->textMode) {
201 1
            case Elements::TEXT_RAW:
202
                return $this->rawText($tok);
203 1
            case Elements::TEXT_RCDATA:
204
                return $this->rcdata($tok);
205 1
            default:
206 1
                if ('<' === $tok || '&' === $tok) {
207
                    return false;
208
                }
209
210 1
                return $this->text($tok);
211 1
        }
212
    }
213
214
    /**
215
     * This buffers the current token as character data.
216
     *
217
     * @param string $tok The current token.
218
     *
219
     * @return bool
220
     */
221 1
    protected function text($tok)
222
    {
223
        // This should never happen...
224 1
        if (false === $tok) {
225
            return false;
226
        }
227
228
        // NULL character
229 1
        if ("\00" === $tok) {
230
            $this->parseError('Received null character.');
231
        }
232
233 1
        $this->buffer($tok);
234 1
        $this->scanner->consume();
235
236 1
        return true;
237
    }
238
239
    /**
240
     * Read text in RAW mode.
241
     *
242
     * @param string $tok The current token.
243
     *
244
     * @return bool
245
     */
246 8
    protected function rawText($tok)
247
    {
248 8
        if (is_null($this->untilTag)) {
249
            return $this->text($tok);
250
        }
251
252 8
        $sequence = '</' . $this->untilTag . '>';
253 8
        $txt = $this->readUntilSequence($sequence);
254 8
        $this->events->text($txt);
255 8
        $this->setTextMode(0);
256
257 8
        return $this->endTag();
258
    }
259
260
    /**
261
     * Read text in RCDATA mode.
262
     *
263
     * @param string $tok The current token.
264
     *
265
     * @return bool
266
     */
267 37
    protected function rcdata($tok)
268
    {
269 37
        if (is_null($this->untilTag)) {
270
            return $this->text($tok);
271
        }
272
273 37
        $sequence = '</' . $this->untilTag;
274 37
        $txt = '';
275
276 37
        $caseSensitive = !Elements::isHtml5Element($this->untilTag);
277 37
        while (false !== $tok && !('<' == $tok && ($this->scanner->sequenceMatches($sequence, $caseSensitive)))) {
278 35
            if ('&' == $tok) {
279 1
                $txt .= $this->decodeCharacterReference();
280 1
                $tok = $this->scanner->current();
281 1
            } else {
282 35
                $txt .= $tok;
283 35
                $tok = $this->scanner->next();
284
            }
285 35
        }
286 37
        $len = strlen($sequence);
287 37
        $this->scanner->consume($len);
288 37
        $len += $this->scanner->whitespace();
289 37
        if ('>' !== $this->scanner->current()) {
290
            $this->parseError('Unclosed RCDATA end tag');
291
        }
292
293 37
        $this->scanner->unconsume($len);
294 37
        $this->events->text($txt);
295 37
        $this->setTextMode(0);
296
297 37
        return $this->endTag();
298
    }
299
300
    /**
301
     * If the document is read, emit an EOF event.
302
     */
303 127
    protected function eof()
304
    {
305
        // fprintf(STDOUT, "EOF");
306 127
        $this->flushBuffer();
307 127
        $this->events->eof();
308 127
        $this->carryOn = false;
309 127
    }
310
311
    /**
312
     * Look for markup.
313
     */
314 101
    protected function markupDeclaration()
315
    {
316 101
        $tok = $this->scanner->next();
317
318
        // Comment:
319 101
        if ('-' == $tok && '-' == $this->scanner->peek()) {
320 6
            $this->scanner->consume(2);
321
322 6
            return $this->comment();
323 98
        } elseif ('D' == $tok || 'd' == $tok) { // Doctype
324 96
            return $this->doctype();
325 7
        } elseif ('[' == $tok) { // CDATA section
326 7
            return $this->cdataSection();
327
        }
328
329
        // FINISH
330 1
        $this->parseError('Expected <!--, <![CDATA[, or <!DOCTYPE. Got <!%s', $tok);
331 1
        $this->bogusComment('<!');
332
333 1
        return true;
334
    }
335
336
    /**
337
     * Consume an end tag. See section 8.2.4.9.
338
     */
339 111
    protected function endTag()
340
    {
341 111
        if ('/' != $this->scanner->current()) {
342 44
            return false;
343
        }
344 111
        $tok = $this->scanner->next();
345
346
        // a-zA-Z -> tagname
347
        // > -> parse error
348
        // EOF -> parse error
349
        // -> parse error
350 111
        if (!ctype_alpha($tok)) {
351 2
            $this->parseError("Expected tag name, got '%s'", $tok);
352 2
            if ("\0" == $tok || false === $tok) {
353
                return false;
354
            }
355
356 2
            return $this->bogusComment('</');
357
        }
358
359 110
        $name = $this->scanner->charsUntil("\n\f \t>");
360 110
        $name = self::CONFORMANT_XML === $this->mode ? $name : strtolower($name);
361
        // Trash whitespace.
362 110
        $this->scanner->whitespace();
363
364 110
        $tok = $this->scanner->current();
365 110
        if ('>' != $tok) {
366 1
            $this->parseError("Expected >, got '%s'", $tok);
367
            // We just trash stuff until we get to the next tag close.
368 1
            $this->scanner->charsUntil('>');
369 1
        }
370
371 110
        $this->events->endTag($name);
372 110
        $this->scanner->consume();
373
374 110
        return true;
375
    }
376
377
    /**
378
     * Consume a tag name and body. See section 8.2.4.10.
379
     */
380 114
    protected function tagName()
381
    {
382
        // We know this is at least one char.
383 114
        $name = $this->scanner->charsWhile(':_-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz');
384 114
        $name = self::CONFORMANT_XML === $this->mode ? $name : strtolower($name);
385 114
        $attributes = array();
386 114
        $selfClose = false;
387
388
        // Handle attribute parse exceptions here so that we can
389
        // react by trying to build a sensible parse tree.
390
        try {
391
            do {
392 114
                $this->scanner->whitespace();
393 114
                $this->attribute($attributes);
394 114
            } while (!$this->isTagEnd($selfClose));
395 114
        } catch (ParseError $e) {
396 2
            $selfClose = false;
397
        }
398
399 114
        $mode = $this->events->startTag($name, $attributes, $selfClose);
400
401 114
        if (is_int($mode)) {
402 107
            $this->setTextMode($mode, $name);
403 107
        }
404
405 114
        $this->scanner->consume();
406
407 114
        return true;
408
    }
409
410
    /**
411
     * Check if the scanner has reached the end of a tag.
412
     */
413 114
    protected function isTagEnd(&$selfClose)
414
    {
415 114
        $tok = $this->scanner->current();
416 114
        if ('/' == $tok) {
417 15
            $this->scanner->consume();
418 15
            $this->scanner->whitespace();
419 15
            $tok = $this->scanner->current();
420
421 15
            if ('>' == $tok) {
422 15
                $selfClose = true;
423
424 15
                return true;
425
            }
426 2
            if (false === $tok) {
427 1
                $this->parseError('Unexpected EOF inside of tag.');
428
429 1
                return true;
430
            }
431
            // Basically, we skip the / token and go on.
432
            // See 8.2.4.43.
433 1
            $this->parseError("Unexpected '%s' inside of a tag.", $tok);
434
435 1
            return false;
436
        }
437
438 114
        if ('>' == $tok) {
439 114
            return true;
440
        }
441 32
        if (false === $tok) {
442 2
            $this->parseError('Unexpected EOF inside of tag.');
443
444 2
            return true;
445
        }
446
447 31
        return false;
448
    }
449
450
    /**
451
     * Parse attributes from inside of a tag.
452
     *
453
     * @param string[] $attributes
454
     *
455
     * @return bool
456
     *
457
     * @throws ParseError
458
     */
459 114
    protected function attribute(&$attributes)
460
    {
461 114
        $tok = $this->scanner->current();
462 114
        if ('/' == $tok || '>' == $tok || false === $tok) {
463 108
            return false;
464
        }
465
466 82
        if ('<' == $tok) {
467 2
            $this->parseError("Unexpected '<' inside of attributes list.");
468
            // Push the < back onto the stack.
469 2
            $this->scanner->unconsume();
470
            // Let the caller figure out how to handle this.
471 2
            throw new ParseError('Start tag inside of attribute.');
472
        }
473
474 82
        $name = strtolower($this->scanner->charsUntil("/>=\n\f\t "));
475
476 82
        if (0 == strlen($name)) {
477 3
            $tok = $this->scanner->current();
478 3
            $this->parseError('Expected an attribute name, got %s.', $tok);
479
            // Really, only '=' can be the char here. Everything else gets absorbed
480
            // under one rule or another.
481 3
            $name = $tok;
482 3
            $this->scanner->consume();
483 3
        }
484
485 82
        $isValidAttribute = true;
486
        // Attribute names can contain most Unicode characters for HTML5.
487
        // But method "DOMElement::setAttribute" is throwing exception
488
        // because of its own internal restriction so these have to be filtered.
489
        // see issue #23: https://github.com/Masterminds/html5-php/issues/23
490
        // and http://www.w3.org/TR/2011/WD-html5-20110525/syntax.html#syntax-attribute-name
491 82
        if (preg_match("/[\x1-\x2C\\/\x3B-\x40\x5B-\x5E\x60\x7B-\x7F]/u", $name)) {
492 4
            $this->parseError('Unexpected characters in attribute name: %s', $name);
493 4
            $isValidAttribute = false;
494 4
        }         // There is no limitation for 1st character in HTML5.
495
        // But method "DOMElement::setAttribute" is throwing exception for the
496
        // characters below so they have to be filtered.
497
        // see issue #23: https://github.com/Masterminds/html5-php/issues/23
498
        // and http://www.w3.org/TR/2011/WD-html5-20110525/syntax.html#syntax-attribute-name
499 79
        elseif (preg_match('/^[0-9.-]/u', $name)) {
500 1
            $this->parseError('Unexpected character at the beginning of attribute name: %s', $name);
501 1
            $isValidAttribute = false;
502 1
        }
503
        // 8.1.2.3
504 82
        $this->scanner->whitespace();
505
506 82
        $val = $this->attributeValue();
507 82
        if ($isValidAttribute) {
508 79
            $attributes[$name] = $val;
509 79
        }
510
511 82
        return true;
512
    }
513
514
    /**
515
     * Consume an attribute value. See section 8.2.4.37 and after.
516
     *
517
     * @return string|null
518
     */
519 82
    protected function attributeValue()
520
    {
521 82
        if ('=' != $this->scanner->current()) {
522 13
            return null;
523
        }
524 78
        $this->scanner->consume();
525
        // 8.1.2.3
526 78
        $this->scanner->whitespace();
527
528 78
        $tok = $this->scanner->current();
529
        switch ($tok) {
530 78
            case "\n":
531 78
            case "\f":
532 78
            case ' ':
533 78
            case "\t":
534
                // Whitespace here indicates an empty value.
535
                return null;
536 78
            case '"':
537 78
            case "'":
538 78
                $this->scanner->consume();
539
540 78
                return $this->quotedAttributeValue($tok);
Security Bug introduced by
It seems that $tok, defined by $this->scanner->current() on line 528, can also be of type false; however, Masterminds\HTML5\Parser...:quotedAttributeValue() seems to accept only string. Did you forget to handle an error condition?

This check looks for type mismatches where the missing type is false. This is usually indicative of an error condition.

Consider the following example:

<?php

function parseDate($date)
{
    if ($date !== null) {
        return new DateTime($date);
    }

    return false;
}

This function either returns a new DateTime object or false, if there was an error. This is a typical pattern in PHP programming to show that an error has occurred without raising an exception. The calling code should check for this returned false before passing on the value to another function or method that may not be able to handle a false.
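The check described above can be sketched as follows. `createDateOrFalse()` mirrors the example function under a different, hypothetical name, and `describeDate()` is an invented caller that tests for false before using the value.

```php
<?php

/**
 * Hypothetical helper in the style of the example above: returns a
 * DateTime on success, or false when no date was given.
 *
 * @return DateTime|false
 */
function createDateOrFalse($date)
{
    if (null !== $date) {
        return new DateTime($date);
    }

    return false;
}

/**
 * Format a date, handling the false return before the value is used.
 */
function describeDate($date): string
{
    $result = createDateOrFalse($date);

    // Without this check, calling format() on false would be a fatal error.
    if (false === $result) {
        return 'no date';
    }

    return $result->format('Y-m-d');
}
```

The strict comparison `false === $result` matters here, because a loose comparison would also treat other falsy values as the error case.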

541 1
            case '>':
542
                // case '/': // 8.2.4.37 seems to allow foo=/ as a valid attr.
543 1
                $this->parseError('Expected attribute value, got tag end.');
544
545 1
                return null;
546 1
            case '=':
547 1
            case '`':
548
                $this->parseError('Expecting quotes, got %s.', $tok);
549
550
                return $this->unquotedAttributeValue();
551 1
            default:
552 1
                return $this->unquotedAttributeValue();
553 1
        }
554
    }
555
556
    /**
557
     * Get an attribute value string.
558
     *
559
     * @param string $quote IMPORTANT: This is a series of chars! Any one of which will be considered
560
     *                      termination of an attribute's value. E.g. "\"'" will stop at either
561
     *                      ' or ".
562
     *
563
     * @return string The attribute value.
564
     */
565 78
    protected function quotedAttributeValue($quote)
566
    {
567 78
        $stoplist = "\f" . $quote;
568 78
        $val = '';
569
570 78
        while (true) {
571 78
            $tokens = $this->scanner->charsUntil($stoplist . '&');
572 78
            if (false !== $tokens) {
573 78
                $val .= $tokens;
574 78
            } else {
575
                break;
576
            }
577
578 78
            $tok = $this->scanner->current();
579 78
            if ('&' == $tok) {
580 3
                $val .= $this->decodeCharacterReference(true);
581 3
                continue;
582
            }
583 78
            break;
584
        }
585 78
        $this->scanner->consume();
586
587 78
        return $val;
588
    }
589
590 1
    protected function unquotedAttributeValue()
591
    {
592 1
        $val = '';
593 1
        $tok = $this->scanner->current();
594 1
        while (false !== $tok) {
595
            switch ($tok) {
596 1
                case "\n":
597 1
                case "\f":
598 1
                case ' ':
599 1
                case "\t":
600 1
                case '>':
601 1
                    break 2;
602
603 1
                case '&':
604 1
                    $val .= $this->decodeCharacterReference(true);
605 1
                    $tok = $this->scanner->current();
606
607 1
                    break;
608
609 1
                case "'":
610 1
                case '"':
611 1
                case '<':
612 1
                case '=':
613 1
                case '`':
614 1
                    $this->parseError('Unexpected chars in unquoted attribute value %s', $tok);
615 1
                    $val .= $tok;
616 1
                    $tok = $this->scanner->next();
617 1
                    break;
618
619 1
                default:
620 1
                    $val .= $this->scanner->charsUntil("\t\n\f >&\"'<=`");
621
622 1
                    $tok = $this->scanner->current();
623 1
            }
624 1
        }
625
626 1
        return $val;
627
    }
628
629
    /**
630
     * Consume malformed markup as if it were a comment.
631
     * 8.2.4.44.
632
     *
633
     * The spec requires that the ENTIRE tag-like thing be enclosed inside of
634
     * the comment. So this will generate comments like:
635
     *
636
     * &lt;!--&lt/+foo&gt;--&gt;
637
     *
638
     * @param string $leading Prepend any leading characters. This essentially
639
     *                        negates the need to backtrack, but it's sort of a hack.
640
     *
641
     * @return bool
642
     */
643 3
    protected function bogusComment($leading = '')
644
    {
645 3
        $comment = $leading;
646 3
        $tokens = $this->scanner->charsUntil('>');
647 3
        if (false !== $tokens) {
648 2
            $comment .= $tokens;
649 2
        }
650 3
        $tok = $this->scanner->current();
651 3
        if (false !== $tok) {
652 2
            $comment .= $tok;
653 2
        }
654
655 3
        $this->flushBuffer();
656 3
        $this->events->comment($comment);
657 3
        $this->scanner->consume();
658
659 3
        return true;
660
    }
661
662
    /**
663
     * Read a comment.
664
     * Expects the first tok to be inside of the comment.
665
     *
666
     * @return bool
667
     */
668 6
    protected function comment()
669
    {
670 6
        $tok = $this->scanner->current();
671 6
        $comment = '';
672
673
        // <!-->. Emit an empty comment because 8.2.4.46 says to.
674 6
        if ('>' == $tok) {
675
            // Parse error. Emit the comment token.
676 1
            $this->parseError("Expected comment data, got '>'");
677 1
            $this->events->comment('');
678 1
            $this->scanner->consume();
679
680 1
            return true;
681
        }
682
683
        // Replace NULL with the replacement char.
684 6
        if ("\0" == $tok) {
685
            $tok = UTF8Utils::FFFD;
686
        }
687 6
        while (!$this->isCommentEnd()) {
688 6
            $comment .= $tok;
689 6
            $tok = $this->scanner->next();
690 6
        }
691
692 6
        $this->events->comment($comment);
693 6
        $this->scanner->consume();
694
695 6
        return true;
696
    }
697
698
    /**
699
     * Check if the scanner has reached the end of a comment.
700
     *
701
     * @return bool
702
     */
703 6
    protected function isCommentEnd()
704
    {
705 6
        $tok = $this->scanner->current();
706
707
        // EOF
708 6
        if (false === $tok) {
709
            // Hit the end.
710 1
            $this->parseError('Unexpected EOF in a comment.');
711
712 1
            return true;
713
        }
714
715
        // If it doesn't start with -, not the end.
716 6
        if ('-' != $tok) {
717 6
            return false;
718
        }
719
720
        // Advance one, and test for '->'
721 6
        if ('-' == $this->scanner->next() && '>' == $this->scanner->peek()) {
722 6
            $this->scanner->consume(); // Consume the last '>'
723 6
            return true;
724
        }
725
        // Unread '-';
726 2
        $this->scanner->unconsume(1);
727
728 2
        return false;
729
    }
730
731
    /**
732
     * Parse a DOCTYPE.
733
     *
734
     * Parse a DOCTYPE declaration. This method has strong bearing on whether or
735
     * not Quirksmode is enabled on the event handler.
736
     *
737
     * @todo This method is a little long. Should probably refactor.
738
     *
739
     * @return bool
740
     */
741 96
    protected function doctype()
742
    {
743
        // Check that string is DOCTYPE.
744 96
        if ($this->scanner->sequenceMatches('DOCTYPE', false)) {
745 95
            $this->scanner->consume(7);
746 95
        } else {
747 1
            $chars = $this->scanner->charsWhile('DOCTYPEdoctype');
748 1
            $this->parseError('Expected DOCTYPE, got %s', $chars);
749
750 1
            return $this->bogusComment('<!' . $chars);
751
        }
752
753 95
        $this->scanner->whitespace();
754 95
        $tok = $this->scanner->current();
755
756
        // EOF: die.
757 95
        if (false === $tok) {
758
            $this->events->doctype('html5', EventHandler::DOCTYPE_NONE, '', true);
759
            $this->eof();
760
761
            return true;
762
        }
763
764
        // NULL char: convert.
765 95
        if ("\0" === $tok) {
766
            $this->parseError('Unexpected null character in DOCTYPE.');
767
        }
768
769 95
        $stop = " \n\f>";
770 95
        $doctypeName = $this->scanner->charsUntil($stop);
771
        // Lowercase ASCII, replace \0 with FFFD
772 95
        $doctypeName = strtolower(strtr($doctypeName, "\0", UTF8Utils::FFFD));
Security Bug introduced by
It seems that $doctypeName can also be of type false; however, strtr() seems to accept only string. Did you forget to handle an error condition?
773
774 95
        $tok = $this->scanner->current();
775
776
        // If false, emit a parse error, DOCTYPE, and return.
777 95
        if (false === $tok) {
778 1
            $this->parseError('Unexpected EOF in DOCTYPE declaration.');
779 1
            $this->events->doctype($doctypeName, EventHandler::DOCTYPE_NONE, null, true);
780
781 1
            return true;
782
        }
783
784
        // Short DOCTYPE, like <!DOCTYPE html>
785 95
        if ('>' == $tok) {
786
            // DOCTYPE without a name.
787 95
            if (0 == strlen($doctypeName)) {
788 1
                $this->parseError('Expected a DOCTYPE name. Got nothing.');
789 1
                $this->events->doctype($doctypeName, 0, null, true);
790 1
                $this->scanner->consume();
791
792 1
                return true;
793
            }
794 95
            $this->events->doctype($doctypeName);
795 95
            $this->scanner->consume();
796
797 95
            return true;
798
        }
799 1
        $this->scanner->whitespace();

        $pub = strtoupper($this->scanner->getAsciiAlpha());
        $white = $this->scanner->whitespace();

        // Get ID, and flag it as pub or system.
        if (('PUBLIC' == $pub || 'SYSTEM' == $pub) && $white > 0) {
            // Get the sys ID.
            $type = 'PUBLIC' == $pub ? EventHandler::DOCTYPE_PUBLIC : EventHandler::DOCTYPE_SYSTEM;
            $id = $this->quotedString("\0>");
            if (false === $id) {
                $this->events->doctype($doctypeName, $type, $pub, false);

                return true;
            }

            // Premature EOF.
            if (false === $this->scanner->current()) {
                $this->parseError('Unexpected EOF in DOCTYPE');
                $this->events->doctype($doctypeName, $type, $id, true);

                return true;
            }

            // Well-formed complete DOCTYPE.
            $this->scanner->whitespace();
            if ('>' == $this->scanner->current()) {
                $this->events->doctype($doctypeName, $type, $id, false);
                $this->scanner->consume();

                return true;
            }

            // If we get here, we have <!DOCTYPE foo PUBLIC "bar" SOME_JUNK
            // Throw away the junk, parse error, quirks mode, return true.
            $this->scanner->charsUntil('>');
            $this->parseError('Malformed DOCTYPE.');
            $this->events->doctype($doctypeName, $type, $id, true);
            $this->scanner->consume();

            return true;
        }

        // Else it's a bogus DOCTYPE.
        // Consume to > and trash.
        $this->scanner->charsUntil('>');

        $this->parseError('Expected PUBLIC or SYSTEM. Got %s.', $pub);
        $this->events->doctype($doctypeName, 0, null, true);
        $this->scanner->consume();

        return true;
    }

    /**
     * Utility for reading a quoted string.
     *
     * @param string $stopchars Characters (in addition to a close-quote) that should stop the string.
     *                          E.g. sometimes '>' is higher precedence than '"' or "'".
     *
     * @return string|false The string (quotation marks omitted), or false if no opening quote is found.
     */
    protected function quotedString($stopchars)
    {
        $tok = $this->scanner->current();
        if ('"' == $tok || "'" == $tok) {
            $this->scanner->consume();
            $ret = $this->scanner->charsUntil($tok . $stopchars);
            if ($this->scanner->current() == $tok) {
                $this->scanner->consume();
            } else {
                // Parse error because no close quote.
                $this->parseError('Expected %s, got %s', $tok, $this->scanner->current());
            }

            return $ret;
        }

        return false;
    }

    /**
     * Handle a CDATA section.
     *
     * @return bool
     */
    protected function cdataSection()
    {
        $cdata = '';
        $this->scanner->consume();

        $chars = $this->scanner->charsWhile('CDAT');
        if ('CDATA' != $chars || '[' != $this->scanner->current()) {
            $this->parseError('Expected [CDATA[, got %s', $chars);

            return $this->bogusComment('<![' . $chars);
        }

        $tok = $this->scanner->next();
        do {
            if (false === $tok) {
                $this->parseError('Unexpected EOF inside CDATA.');
                $this->bogusComment('<![CDATA[' . $cdata);

                return true;
            }
            $cdata .= $tok;
            $tok = $this->scanner->next();
        } while (!$this->scanner->sequenceMatches(']]>'));

        // Consume ]]>
        $this->scanner->consume(3);

        $this->events->cdata($cdata);

        return true;
    }

    // ================================================================
    // Non-HTML5
    // ================================================================

    /**
     * Handle a processing instruction.
     *
     * XML processing instructions are supposed to be ignored in HTML5,
     * treated as "bogus comments". However, since we're not a user
     * agent, we allow them. We consume until ?> and then issue an
     * EventHandler::processingInstruction() event.
     *
     * @return bool
     */
    protected function processingInstruction()
    {
        if ('?' != $this->scanner->current()) {
            return false;
        }

        $tok = $this->scanner->next();
        $procName = $this->scanner->getAsciiAlpha();
        $white = $this->scanner->whitespace();

        // If not a PI, send to bogusComment.
        if (0 == strlen($procName) || 0 == $white || false == $this->scanner->current()) {
            $this->parseError("Expected processing instruction name, got $tok");
            $this->bogusComment('<?' . $tok . $procName);

            return true;
        }

        $data = '';
        // As long as it's not the case that the next two chars are ? and >.
        while (!('?' == $this->scanner->current() && '>' == $this->scanner->peek())) {
            $data .= $this->scanner->current();

            $tok = $this->scanner->next();
            if (false === $tok) {
                $this->parseError('Unexpected EOF in processing instruction.');
                $this->events->processingInstruction($procName, $data);

                return true;
            }
        }

        $this->scanner->consume(2); // Consume the closing tag
        $this->events->processingInstruction($procName, $data);

        return true;
    }

    // ================================================================
    // UTILITY FUNCTIONS
    // ================================================================

    /**
     * Read from the input stream until we get to the desired sequence
     * or hit the end of the input stream.
     *
     * @param string $sequence
     *
     * @return string
     */
    protected function readUntilSequence($sequence)
    {
        $buffer = '';

        // Optimization for reading larger blocks faster.
        $first = substr($sequence, 0, 1);
        while (false !== $this->scanner->current()) {
            $buffer .= $this->scanner->charsUntil($first);

            // Stop as soon as we hit the stopping condition.
            if ($this->scanner->sequenceMatches($sequence, false)) {
                return $buffer;
            }
            $buffer .= $this->scanner->current();
            $this->scanner->consume();
        }

        // If we get here, we hit the EOF.
        $this->parseError('Unexpected EOF during text read.');

        return $buffer;
    }

    /**
     * Check if upcoming chars match the given sequence.
     *
     * This will read the stream for the $sequence. If it's
     * found, this will return true. If not, return false.
     * Since this unconsumes any chars it reads, the caller
     * will still need to read the next sequence, even if
     * this returns true.
     *
     * Example: $this->scanner->sequenceMatches('</script>') will
     * see if the input stream is at the start of a
     * '</script>' string.
     *
     * @deprecated Since version 2.4; will be removed in 3.0. Use Scanner::sequenceMatches() instead.
     *
     * @param string $sequence
     * @param bool   $caseSensitive
     *
     * @return bool
     */
    protected function sequenceMatches($sequence, $caseSensitive = true)
    {
        @trigger_error(__METHOD__ . ' method is deprecated since version 2.4 and will be removed in 3.0. Use Scanner::sequenceMatches() instead.', E_USER_DEPRECATED);

        return $this->scanner->sequenceMatches($sequence, $caseSensitive);
    }

    /**
     * Send a TEXT event with the contents of the text buffer.
     *
     * This emits an EventHandler::text() event with the current contents of the
     * temporary text buffer. (The buffer is used to group as much PCDATA
     * as we can instead of emitting lots and lots of TEXT events.)
     */
    protected function flushBuffer()
    {
        if ('' === $this->text) {
            return;
        }
        $this->events->text($this->text);
        $this->text = '';
    }

    /**
     * Add text to the temporary buffer.
     *
     * @see flushBuffer()
     *
     * @param string $str
     */
    protected function buffer($str)
    {
        $this->text .= $str;
    }

    /**
     * Emit a parse error.
     *
     * Any arguments after $msg are substituted into it with vsprintf().
     * A parse error always returns false because it never consumes any
     * characters.
     *
     * @param string $msg
     *
     * @return bool Always false.
     */
    protected function parseError($msg)
    {
        $args = func_get_args();

        if (count($args) > 1) {
            array_shift($args);
            $msg = vsprintf($msg, $args);
        }

        $line = $this->scanner->currentLine();
        $col = $this->scanner->columnOffset();
        $this->events->parseError($msg, $line, $col);

        return false;
    }

    /**
     * Decode a character reference and return the string.
     *
     * If $inAttribute is set to true, a bare & will be returned as-is.
     *
     * @param bool $inAttribute Set to true if the text is inside of an attribute value.
     *                          false otherwise.
     *
     * @return string
     */
    protected function decodeCharacterReference($inAttribute = false)
    {
        // Next char after &.
        $tok = $this->scanner->next();
        $start = $this->scanner->position();

        if (false === $tok) {
            return '&';
        }

        // These indicate not an entity. We return just
        // the &.
        if ("\t" === $tok || "\n" === $tok || "\f" === $tok || ' ' === $tok || '&' === $tok || '<' === $tok) {
            // $this->scanner->next();
            return '&';
        }

        // Numeric entity
        if ('#' === $tok) {
            $tok = $this->scanner->next();

            // Hexadecimal encoding.
            // X[0-9a-fA-F]+;
            // x[0-9a-fA-F]+;
            if ('x' === $tok || 'X' === $tok) {
                $tok = $this->scanner->next(); // Consume x

                // Convert from hex code to char.
                $hex = $this->scanner->getHex();
                if (empty($hex)) {
                    $this->parseError('Expected &#xHEX;, got &#x%s', $tok);
                    // We unconsume because we don't know what parser rules might
                    // be in effect for the remaining chars. For example, '&#>'
                    // might result in a specific parsing rule inside of tag
                    // contexts, while not inside of pcdata context.
                    $this->scanner->unconsume(2);

                    return '&';
                }
                $entity = CharacterReference::lookupHex($hex);
            } else {
                // Decimal encoding.
                // [0-9]+;
                // Convert from decimal to char.
                $numeric = $this->scanner->getNumeric();
                if (false === $numeric) {
                    $this->parseError('Expected &#DIGITS;, got &#%s', $tok);
                    $this->scanner->unconsume(2);

                    return '&';
                }
                $entity = CharacterReference::lookupDecimal($numeric);
            }
        } elseif ('=' === $tok && $inAttribute) {
            return '&';
        } else { // String entity.
            // Attempt to consume a string up to a ';'.
            // [a-zA-Z0-9]+;
            $cname = $this->scanner->getAsciiAlphaNum();
            $entity = CharacterReference::lookupName($cname);

            // When no entity is found provide the name of the unmatched string
            // and continue on as the & is not part of an entity. The & will
            // be converted to &amp; elsewhere.
            if (null === $entity) {
                if (!$inAttribute || '' === $cname) {
                    $this->parseError("No match in entity table for '%s'", $cname);
                }
                $this->scanner->unconsume($this->scanner->position() - $start);

                return '&';
            }
        }

        // The scanner has advanced the cursor for us.
        $tok = $this->scanner->current();

        // We have an entity. We're done here.
        if (';' === $tok) {
            $this->scanner->consume();

            return $entity;
        }

        // If in an attribute, then failing to match ; means unconsume the
        // entire string. Otherwise, failure to match is an error.
        if ($inAttribute) {
            $this->scanner->unconsume($this->scanner->position() - $start);

            return '&';
        }

        $this->parseError('Expected &ENTITY;, got &ENTITY%s (no trailing ;) ', $tok);

        return '&' . $entity;
    }
}
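The stop-character rule documented on quotedString() in the listing above (a character such as '>' can end the string before the close quote is seen) may be easier to follow on a plain string. The sketch below is not part of the library: readQuoted(), its parameters, and the use of strcspn() in place of Scanner::charsUntil() are all assumptions made for illustration.

```php
<?php
/**
 * Hypothetical stand-in for Tokenizer::quotedString() that works on a
 * plain string and a by-reference cursor instead of a Scanner.
 *
 * @return string|false The quoted text, or false if $pos is not at a quote.
 */
function readQuoted(string $input, int &$pos, string $stopChars)
{
    $quote = $input[$pos] ?? '';
    if ('"' !== $quote && "'" !== $quote) {
        return false; // No opening quote: nothing is consumed.
    }
    ++$pos; // Skip the opening quote.

    // strcspn() gives the length of the run that contains neither the
    // close quote nor any stop character.
    $len = strcspn($input, $quote . $stopChars, $pos);
    $value = substr($input, $pos, $len);
    $pos += $len;

    // Consume the close quote only if that is what ended the run;
    // a stop character is left in place for the caller to handle.
    if (($input[$pos] ?? '') === $quote) {
        ++$pos;
    }

    return $value;
}

$pos = 0;
$closed = readQuoted('"abc" rest', $pos, '>');   // 'abc', $pos is now 5

$pos = 0;
$stopped = readQuoted('"abc>def"', $pos, '>');   // 'abc', $pos is now 4 (at '>')
```

In the second call the '>' wins over the missing close quote, which is the case where the real method reports a parse error and leaves the '>' for the DOCTYPE handling to consume.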