Completed
Push — master ( 3ed3bd...88b7c6 )
by Asmir
03:44 queued 01:45
created

Tokenizer::unquotedAttributeValue()   C

Complexity

Conditions 13
Paths 6

Size

Total Lines 38

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 29
CRAP Score 13

Importance

Changes 0
Metric Value
dl 0
loc 38
ccs 29
cts 29
cp 1
rs 6.6166
c 0
b 0
f 0
cc 13
nc 6
nop 0
crap 13

How to fix   Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
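
The advice above can be sketched as follows. This is a hypothetical, simplified reader written for illustration only, not the html5-php implementation; the class `UnquotedValueReader` and its method names are assumptions. The stop-character check that could otherwise sit in the loop behind an inline comment is extracted into a method whose name carries the comment's meaning.

```php
<?php

// Hypothetical example, not taken from html5-php: read an unquoted
// attribute value from a plain string.
final class UnquotedValueReader
{
    /** Returns the unquoted value that starts at $offset in $input. */
    public function read(string $input, int $offset = 0): string
    {
        $value = '';
        $length = strlen($input);

        for ($i = $offset; $i < $length; ++$i) {
            $char = $input[$i];

            // The extracted method replaces an inline comment such as
            // "whitespace or '>' ends the value".
            if ($this->endsValue($char)) {
                break;
            }

            $value .= $char;
        }

        return $value;
    }

    /** True for the characters that terminate an unquoted value. */
    private function endsValue(string $char): bool
    {
        return in_array($char, ["\n", "\f", ' ', "\t", '>'], true);
    }
}

$reader = new UnquotedValueReader();
echo $reader->read('bar baz>'), "\n"; // prints "bar"
```

Each extracted method removes branches from the caller, which is what lowers the condition count reported for a long method.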

Commonly applied refactorings include Extract Method.

1
<?php
2
3
namespace Masterminds\HTML5\Parser;
4
5
use Masterminds\HTML5\Elements;
6
7
/**
8
 * The HTML5 tokenizer.
9
 *
10
 * The tokenizer's role is reading data from the scanner and gathering it into
11
 * semantic units. From the tokenizer, data is emitted to an event handler,
12
 * which may (for example) create a DOM tree.
13
 *
14
 * The HTML5 specification has a detailed explanation of tokenizing HTML5. We
15
 * follow that specification to the maximum extent that we can. If you find
16
 * a discrepancy that is not documented, please file a bug and/or submit a
17
 * patch.
18
 *
19
 * This tokenizer is implemented as a recursive descent parser.
20
 *
21
 * Within the API documentation, you may see references to the specific section
22
 * of the HTML5 spec that the code attempts to reproduce. Example: 8.2.4.1.
23
 * This refers to section 8.2.4.1 of the HTML5 CR specification.
24
 *
25
 * @see http://www.w3.org/TR/2012/CR-html5-20121217/
26
 */
27
class Tokenizer
28
{
29
    protected $scanner;
30
31
    protected $events;
32
33
    protected $tok;
34
35
    /**
36
     * Buffer for text.
37
     */
38
    protected $text = '';
39
40
    // When this goes to false, the parser stops.
41
    protected $carryOn = true;
42
43
    protected $textMode = 0; // TEXTMODE_NORMAL;
44
    protected $untilTag = null;
45
46
    const CONFORMANT_XML = 'xml';
47
    const CONFORMANT_HTML = 'html';
48
    protected $mode = self::CONFORMANT_HTML;
49
50
    /**
51
     * Create a new tokenizer.
52
     *
53
     * Typically, parsing a document involves creating a new tokenizer, giving
54
     * it a scanner (input) and an event handler (output), and then calling
55
     * the Tokenizer::parse() method.
56
     *
57
     * @param Scanner      $scanner      A scanner initialized with an input stream.
58
     * @param EventHandler $eventHandler An event handler, initialized and ready to receive events.
59
     * @param string       $mode
60
     */
61 127
    public function __construct($scanner, $eventHandler, $mode = self::CONFORMANT_HTML)
62
    {
63 127
        $this->scanner = $scanner;
64 127
        $this->events = $eventHandler;
65 127
        $this->mode = $mode;
66 127
    }
67
68
    /**
69
     * Begin parsing.
70
     *
71
     * This will begin scanning the document, tokenizing as it goes.
72
     * Tokens are emitted into the event handler.
73
     *
74
     * Tokenizing will continue until the document is completely
75
     * read. Errors are emitted into the event handler, but
76
     * the parser will attempt to continue parsing until the
77
     * entire input stream is read.
78
     */
79 127
    public function parse()
80
    {
81
        do {
82 127
            $this->consumeData();
83
            // FIXME: Add infinite loop protection.
84 127
        } while ($this->carryOn);
85 127
    }
86
87
    /**
88
     * Set the text mode for the character data reader.
89
     *
90
     * HTML5 defines three different modes for reading text:
91
     * - Normal: Read until a tag is encountered.
92
     * - RCDATA: Read until a tag is encountered, but skip a few otherwise-
93
     * special characters.
94
     * - Raw: Read until a special closing tag is encountered (viz. pre, script)
95
     *
96
     * This allows those modes to be set.
97
     *
98
     * Normally, setting is done by the event handler via a special return code on
99
     * startTag(), but it can also be set manually using this function.
100
     *
101
     * @param int    $textmode One of Elements::TEXT_*.
102
     * @param string $untilTag The tag that should stop RAW or RCDATA mode. Normal mode does not
103
     *                         use this indicator.
104
     */
105 108
    public function setTextMode($textmode, $untilTag = null)
106
    {
107 108
        $this->textMode = $textmode & (Elements::TEXT_RAW | Elements::TEXT_RCDATA);
108 108
        $this->untilTag = $untilTag;
109 108
    }
110
111
    /**
112
     * Consume a character and make a move.
113
     * HTML5 8.2.4.1.
114
     */
115 127
    protected function consumeData()
116
    {
117 127
        $tok = $this->scanner->current();
118
119 127
        if ('&' === $tok) {
120
            // Character reference
121 8
            $ref = $this->decodeCharacterReference();
122 8
            $this->buffer($ref);
123
124 8
            $tok = $this->scanner->current();
125 8
        }
126
127
        // Parse tag
128 127
        if ('<' === $tok) {
129
            // Any buffered text data can go out now.
130 123
            $this->flushBuffer();
131
132 123
            $tok = $this->scanner->next();
133
134 123
            $this->markupDeclaration($tok)
135 120
                || $this->endTag()
136 120
                || $this->processingInstruction()
137 119
                || $this->tagName()
138
                // This always returns false.
139 114
                || $this->parseError('Illegal tag opening')
140 1
                || $this->characterData();
141
142 123
            $tok = $this->scanner->current();
143 123
        }
144
145
        // Handle end of document
146 127
        $this->eof($tok);
147
148
        // Parse character
149 127
        if (false !== $tok) {
150 112
            switch ($this->textMode) {
151 112
                case Elements::TEXT_RAW:
152 8
                    $this->rawText($tok);
153 8
                    break;
154
155 112
                case Elements::TEXT_RCDATA:
156 37
                    $this->rcdata($tok);
157 37
                    break;
158
159 111
                default:
160 111
                    if ('<' !== $tok && '&' !== $tok) {
161
                        // NULL character
162 87
                        if ("\00" === $tok) {
163
                            $this->parseError('Received null character.');
164
                        }
165
166 87
                        $this->text .= $tok;
167 87
                        $this->scanner->consume();
168 87
                    }
169 112
            }
170 112
        }
171
172 127
        return $this->carryOn;
173
    }
174
175
    /**
176
     * Parse anything that looks like character data.
177
     *
178
     * Different rules apply based on the current text mode.
179
     *
180
     * @see Elements::TEXT_RAW Elements::TEXT_RCDATA.
181
     */
182 1
    protected function characterData()
183
    {
184 1
        $tok = $this->scanner->current();
185 1
        if (false === $tok) {
186
            return false;
187
        }
188 1
        switch ($this->textMode) {
189 1
            case Elements::TEXT_RAW:
190
                return $this->rawText($tok);
191 1
            case Elements::TEXT_RCDATA:
192
                return $this->rcdata($tok);
193 1
            default:
194 1
                if ('<' === $tok || '&' === $tok) {
195
                    return false;
196
                }
197
198 1
                return $this->text($tok);
199 1
        }
200
    }
201
202
    /**
203
     * This buffers the current token as character data.
204
     *
205
     * @param string $tok The current token.
206
     *
207
     * @return bool
208
     */
209 1
    protected function text($tok)
210
    {
211
        // This should never happen...
212 1
        if (false === $tok) {
213
            return false;
214
        }
215
216
        // NULL character
217 1
        if ("\00" === $tok) {
218
            $this->parseError('Received null character.');
219
        }
220
221 1
        $this->buffer($tok);
222 1
        $this->scanner->consume();
223
224 1
        return true;
225
    }
226
227
    /**
228
     * Read text in RAW mode.
229
     *
230
     * @param string $tok The current token.
231
     *
232
     * @return bool
233
     */
234 8
    protected function rawText($tok)
235
    {
236 8
        if (is_null($this->untilTag)) {
237
            return $this->text($tok);
238
        }
239
240 8
        $sequence = '</' . $this->untilTag . '>';
241 8
        $txt = $this->readUntilSequence($sequence);
242 8
        $this->events->text($txt);
243 8
        $this->setTextMode(0);
244
245 8
        return $this->endTag();
246
    }
247
248
    /**
249
     * Read text in RCDATA mode.
250
     *
251
     * @param string $tok The current token.
252
     *
253
     * @return bool
254
     */
255 37
    protected function rcdata($tok)
256
    {
257 37
        if (is_null($this->untilTag)) {
258
            return $this->text($tok);
259
        }
260
261 37
        $sequence = '</' . $this->untilTag;
262 37
        $txt = '';
263
264 37
        $caseSensitive = !Elements::isHtml5Element($this->untilTag);
265 37
        while (false !== $tok && !('<' == $tok && ($this->scanner->sequenceMatches($sequence, $caseSensitive)))) {
266 35
            if ('&' == $tok) {
267 1
                $txt .= $this->decodeCharacterReference();
268 1
                $tok = $this->scanner->current();
269 1
            } else {
270 35
                $txt .= $tok;
271 35
                $tok = $this->scanner->next();
272
            }
273 35
        }
274 37
        $len = strlen($sequence);
275 37
        $this->scanner->consume($len);
276 37
        $len += $this->scanner->whitespace();
277 37
        if ('>' !== $this->scanner->current()) {
278
            $this->parseError('Unclosed RCDATA end tag');
279
        }
280
281 37
        $this->scanner->unconsume($len);
282 37
        $this->events->text($txt);
283 37
        $this->setTextMode(0);
284
285 37
        return $this->endTag();
286
    }
287
288
    /**
289
     * If the document is read, emit an EOF event.
290
     */
291 127
    protected function eof($tok)
292
    {
293 127
        if (false === $tok) {
294
            // fprintf(STDOUT, "EOF");
295 127
            $this->flushBuffer();
296 127
            $this->events->eof();
297 127
            $this->carryOn = false;
298
299 127
            return true;
300
        }
301
302 112
        return false;
303
    }
304
305
    /**
306
     * Look for markup.
307
     */
308 123
    protected function markupDeclaration($tok)
309
    {
310 123
        if ('!' != $tok) {
311 120
            return false;
312
        }
313
314 101
        $tok = $this->scanner->next();
315
316
        // Comment:
317 101
        if ('-' == $tok && '-' == $this->scanner->peek()) {
318 6
            $this->scanner->consume(2);
319
320 6
            return $this->comment();
321 98
        } elseif ('D' == $tok || 'd' == $tok) { // Doctype
322 96
            return $this->doctype();
323 7
        } elseif ('[' == $tok) { // CDATA section
324 7
            return $this->cdataSection();
325
        }
326
327
        // FINISH
328 1
        $this->parseError('Expected <!--, <![CDATA[, or <!DOCTYPE. Got <!%s', $tok);
329 1
        $this->bogusComment('<!');
330
331 1
        return true;
332
    }
333
334
    /**
335
     * Consume an end tag. See section 8.2.4.9.
336
     */
337 120
    protected function endTag()
338
    {
339 120
        if ('/' != $this->scanner->current()) {
340 119
            return false;
341
        }
342 111
        $tok = $this->scanner->next();
343
344
        // a-zA-Z -> tagname
345
        // > -> parse error
346
        // EOF -> parse error
347
        // -> parse error
348 111
        if (!ctype_alpha($tok)) {
349 2
            $this->parseError("Expected tag name, got '%s'", $tok);
350 2
            if ("\0" == $tok || false === $tok) {
351
                return false;
352
            }
353
354 2
            return $this->bogusComment('</');
355
        }
356
357 110
        $name = $this->scanner->charsUntil("\n\f \t>");
358 110
        $name = self::CONFORMANT_XML === $this->mode ? $name : strtolower($name);
359
        // Trash whitespace.
360 110
        $this->scanner->whitespace();
361
362 110
        $tok = $this->scanner->current();
363 110
        if ('>' != $tok) {
364 1
            $this->parseError("Expected >, got '%s'", $tok);
365
            // We just trash stuff until we get to the next tag close.
366 1
            $this->scanner->charsUntil('>');
367 1
        }
368
369 110
        $this->events->endTag($name);
370 110
        $this->scanner->consume();
371
372 110
        return true;
373
    }
374
375
    /**
376
     * Consume a tag name and body. See section 8.2.4.10.
377
     */
378 114
    protected function tagName()
379
    {
380 114
        $tok = $this->scanner->current();
381 114
        if (!ctype_alpha($tok)) {
382 1
            return false;
383
        }
384
385
        // We know this is at least one char.
386 114
        $name = $this->scanner->charsWhile(':_-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz');
387 114
        $name = self::CONFORMANT_XML === $this->mode ? $name : strtolower($name);
388 114
        $attributes = array();
389 114
        $selfClose = false;
390
391
        // Handle attribute parse exceptions here so that we can
392
        // react by trying to build a sensible parse tree.
393
        try {
394
            do {
395 114
                $this->scanner->whitespace();
396 114
                $this->attribute($attributes);
397 114
            } while (!$this->isTagEnd($selfClose));
398 114
        } catch (ParseError $e) {
399 2
            $selfClose = false;
400
        }
401
402 114
        $mode = $this->events->startTag($name, $attributes, $selfClose);
403
404 114
        if (is_int($mode)) {
405 107
            $this->setTextMode($mode, $name);
406 107
        }
407
408 114
        $this->scanner->consume();
409
410 114
        return true;
411
    }
412
413
    /**
414
     * Check if the scanner has reached the end of a tag.
415
     */
416 114
    protected function isTagEnd(&$selfClose)
417
    {
418 114
        $tok = $this->scanner->current();
419 114
        if ('/' == $tok) {
420 15
            $this->scanner->consume();
421 15
            $this->scanner->whitespace();
422 15
            $tok = $this->scanner->current();
423
424 15
            if ('>' == $tok) {
425 15
                $selfClose = true;
426
427 15
                return true;
428
            }
429 2
            if (false === $tok) {
430 1
                $this->parseError('Unexpected EOF inside of tag.');
431
432 1
                return true;
433
            }
434
            // Basically, we skip the / token and go on.
435
            // See 8.2.4.43.
436 1
            $this->parseError("Unexpected '%s' inside of a tag.", $tok);
437
438 1
            return false;
439
        }
440
441 114
        if ('>' == $tok) {
442 114
            return true;
443
        }
444 32
        if (false === $tok) {
445 2
            $this->parseError('Unexpected EOF inside of tag.');
446
447 2
            return true;
448
        }
449
450 31
        return false;
451
    }
452
453
    /**
454
     * Parse attributes from inside of a tag.
455
     *
456
     * @param string[] $attributes
457
     *
458
     * @return bool
459
     *
460
     * @throws ParseError
461
     */
462 114
    protected function attribute(&$attributes)
463
    {
464 114
        $tok = $this->scanner->current();
465 114
        if ('/' == $tok || '>' == $tok || false === $tok) {
466 108
            return false;
467
        }
468
469 82
        if ('<' == $tok) {
470 2
            $this->parseError("Unexpected '<' inside of attributes list.");
471
            // Push the < back onto the stack.
472 2
            $this->scanner->unconsume();
473
            // Let the caller figure out how to handle this.
474 2
            throw new ParseError('Start tag inside of attribute.');
475
        }
476
477 82
        $name = strtolower($this->scanner->charsUntil("/>=\n\f\t "));
478
479 82
        if (0 == strlen($name)) {
480 3
            $tok = $this->scanner->current();
481 3
            $this->parseError('Expected an attribute name, got %s.', $tok);
482
            // Really, only '=' can be the char here. Everything else gets absorbed
483
            // under one rule or another.
484 3
            $name = $tok;
485 3
            $this->scanner->consume();
486 3
        }
487
488 82
        $isValidAttribute = true;
489
        // Attribute names can contain most Unicode characters for HTML5.
490
        // But method "DOMElement::setAttribute" is throwing exception
491
        // because of its own internal restriction so these have to be filtered.
492
        // see issue #23: https://github.com/Masterminds/html5-php/issues/23
493
        // and http://www.w3.org/TR/2011/WD-html5-20110525/syntax.html#syntax-attribute-name
494 82
        if (preg_match("/[\x1-\x2C\\/\x3B-\x40\x5B-\x5E\x60\x7B-\x7F]/u", $name)) {
495 4
            $this->parseError('Unexpected characters in attribute name: %s', $name);
496 4
            $isValidAttribute = false;
497 4
        }         // There is no limitation for 1st character in HTML5.
498
        // But method "DOMElement::setAttribute" is throwing exception for the
499
        // characters below so they have to be filtered.
500
        // see issue #23: https://github.com/Masterminds/html5-php/issues/23
501
        // and http://www.w3.org/TR/2011/WD-html5-20110525/syntax.html#syntax-attribute-name
502 79
        elseif (preg_match('/^[0-9.-]/u', $name)) {
503 1
            $this->parseError('Unexpected character at the begining of attribute name: %s', $name);
504 1
            $isValidAttribute = false;
505 1
        }
506
        // 8.1.2.3
507 82
        $this->scanner->whitespace();
508
509 82
        $val = $this->attributeValue();
510 82
        if ($isValidAttribute) {
511 79
            $attributes[$name] = $val;
512 79
        }
513
514 82
        return true;
515
    }
516
517
    /**
518
     * Consume an attribute value. See section 8.2.4.37 and after.
519
     *
520
     * @return string|null
521
     */
522 82
    protected function attributeValue()
523
    {
524 82
        if ('=' != $this->scanner->current()) {
525 13
            return null;
526
        }
527 78
        $this->scanner->consume();
528
        // 8.1.2.3
529 78
        $this->scanner->whitespace();
530
531 78
        $tok = $this->scanner->current();
532
        switch ($tok) {
533 78
            case "\n":
534 78
            case "\f":
535 78
            case ' ':
536 78
            case "\t":
537
                // Whitespace here indicates an empty value.
538
                return null;
539 78
            case '"':
540 78
            case "'":
541 78
                $this->scanner->consume();
542
543 78
                return $this->quotedAttributeValue($tok);
Security Bug
It seems that $tok, defined by $this->scanner->current() on line 531, can also be of type false; however, Masterminds\HTML5\Parser...:quotedAttributeValue() only seems to accept string. Did you forget to handle an error condition?

This check looks for type mismatches where the missing type is false. This is usually indicative of an error condition.

Consider the following example:

<?php

function createDate($date)
{
    if ($date !== null) {
        return new DateTime($date);
    }

    return false;
}

This function either returns a new DateTime object or false, if there was an error. This is a typical pattern in PHP programming to show that an error has occurred without raising an exception. The calling code should check for this returned false before passing on the value to another function or method that may not be able to handle a false.
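
A minimal sketch of such a check in calling code is shown below. The function names `parseDate()`, `formatYear()`, and `yearOrUnknown()` are hypothetical and chosen for illustration; `parseDate()` is used as the name because PHP already provides a built-in `getdate()` function and function names are case-insensitive.

```php
<?php

// Hypothetical helper mirroring the pattern described above:
// it returns a DateTime on success and false on failure.
function parseDate(?string $date)
{
    if ($date !== null) {
        return new DateTime($date);
    }

    return false;
}

// Accepts only a DateTime, so it cannot handle false.
function formatYear(DateTime $date): string
{
    return $date->format('Y');
}

function yearOrUnknown(?string $raw): string
{
    $date = parseDate($raw);

    // Check for the error value before passing the result on.
    if (false === $date) {
        return 'unknown';
    }

    return formatYear($date);
}

echo yearOrUnknown('2012-12-17'), "\n"; // prints "2012"
echo yearOrUnknown(null), "\n";         // prints "unknown"
```

The same guard applies to the flagged call: testing the scanner's return value against false before handing it to a method that expects a string.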

544 1
            case '>':
545
                // case '/': // 8.2.4.37 seems to allow foo=/ as a valid attr.
546 1
                $this->parseError('Expected attribute value, got tag end.');
547
548 1
                return null;
549 1
            case '=':
550 1
            case '`':
551
                $this->parseError('Expecting quotes, got %s.', $tok);
552
553
                return $this->unquotedAttributeValue();
554 1
            default:
555 1
                return $this->unquotedAttributeValue();
556 1
        }
557
    }
558
559
    /**
560
     * Get an attribute value string.
561
     *
562
     * @param string $quote IMPORTANT: This is a series of chars! Any one of which will be considered
563
     *                      termination of an attribute's value. E.g. "\"'" will stop at either
564
     *                      ' or ".
565
     *
566
     * @return string The attribute value.
567
     */
568 78
    protected function quotedAttributeValue($quote)
569
    {
570 78
        $stoplist = "\f" . $quote;
571 78
        $val = '';
572
573 78
        while (true) {
574 78
            $tokens = $this->scanner->charsUntil($stoplist . '&');
575 78
            if (false !== $tokens) {
576 78
                $val .= $tokens;
577 78
            } else {
578
                break;
579
            }
580
581 78
            $tok = $this->scanner->current();
582 78
            if ('&' == $tok) {
583 3
                $val .= $this->decodeCharacterReference(true);
584 3
                continue;
585
            }
586 78
            break;
587
        }
588 78
        $this->scanner->consume();
589
590 78
        return $val;
591
    }
592
593 1
    protected function unquotedAttributeValue()
594
    {
595 1
        $val = '';
596 1
        $tok = $this->scanner->current();
597 1
        while (false !== $tok) {
598
            switch ($tok) {
599 1
                case "\n":
600 1
                case "\f":
601 1
                case ' ':
602 1
                case "\t":
603 1
                case '>':
604 1
                    break 2;
605
606 1
                case '&':
607 1
                    $val .= $this->decodeCharacterReference(true);
608 1
                    $tok = $this->scanner->current();
609
610 1
                    break;
611
612 1
                case "'":
613 1
                case '"':
614 1
                case '<':
615 1
                case '=':
616 1
                case '`':
617 1
                    $this->parseError('Unexpected chars in unquoted attribute value %s', $tok);
618 1
                    $val .= $tok;
619 1
                    $tok = $this->scanner->next();
620 1
                    break;
621
622 1
                default:
623 1
                    $val .= $this->scanner->charsUntil("\t\n\f >&\"'<=`");
624
625 1
                    $tok = $this->scanner->current();
626 1
            }
627 1
        }
628
629 1
        return $val;
630
    }
631
632
    /**
633
     * Consume malformed markup as if it were a comment.
634
     * 8.2.4.44.
635
     *
636
     * The spec requires that the ENTIRE tag-like thing be enclosed inside of
637
     * the comment. So this will generate comments like:
638
     *
639
     * &lt;!--&lt/+foo&gt;--&gt;
640
     *
641
     * @param string $leading Prepend any leading characters. This essentially
642
     *                        negates the need to backtrack, but it's sort of a hack.
643
     *
644
     * @return bool
645
     */
646 3
    protected function bogusComment($leading = '')
647
    {
648 3
        $comment = $leading;
649 3
        $tokens = $this->scanner->charsUntil('>');
650 3
        if (false !== $tokens) {
651 2
            $comment .= $tokens;
652 2
        }
653 3
        $tok = $this->scanner->current();
654 3
        if (false !== $tok) {
655 2
            $comment .= $tok;
656 2
        }
657
658 3
        $this->flushBuffer();
659 3
        $this->events->comment($comment);
660 3
        $this->scanner->consume();
661
662 3
        return true;
663
    }
664
665
    /**
666
     * Read a comment.
667
     * Expects the first tok to be inside of the comment.
668
     *
669
     * @return bool
670
     */
671 6
    protected function comment()
672
    {
673 6
        $tok = $this->scanner->current();
674 6
        $comment = '';
675
676
        // <!-->. Emit an empty comment because 8.2.4.46 says to.
677 6
        if ('>' == $tok) {
678
            // Parse error. Emit the comment token.
679 1
            $this->parseError("Expected comment data, got '>'");
680 1
            $this->events->comment('');
681 1
            $this->scanner->consume();
682
683 1
            return true;
684
        }
685
686
        // Replace NULL with the replacement char.
687 6
        if ("\0" == $tok) {
688
            $tok = UTF8Utils::FFFD;
689
        }
690 6
        while (!$this->isCommentEnd()) {
691 6
            $comment .= $tok;
692 6
            $tok = $this->scanner->next();
693 6
        }
694
695 6
        $this->events->comment($comment);
696 6
        $this->scanner->consume();
697
698 6
        return true;
699
    }
700
701
    /**
702
     * Check if the scanner has reached the end of a comment.
703
     *
704
     * @return bool
705
     */
706 6
    protected function isCommentEnd()
707
    {
708 6
        $tok = $this->scanner->current();
709
710
        // EOF
711 6
        if (false === $tok) {
712
            // Hit the end.
713 1
            $this->parseError('Unexpected EOF in a comment.');
714
715 1
            return true;
716
        }
717
718
        // If it doesn't start with -, not the end.
719 6
        if ('-' != $tok) {
720 6
            return false;
721
        }
722
723
        // Advance one, and test for '->'
724 6
        if ('-' == $this->scanner->next() && '>' == $this->scanner->peek()) {
725 6
            $this->scanner->consume(); // Consume the last '>'
726 6
            return true;
727
        }
728
        // Unread '-';
729 2
        $this->scanner->unconsume(1);
730
731 2
        return false;
732
    }
733
734
    /**
735
     * Parse a DOCTYPE.
736
     *
737
     * Parse a DOCTYPE declaration. This method has strong bearing on whether or
738
     * not Quirksmode is enabled on the event handler.
739
     *
740
     * @todo This method is a little long. Should probably refactor.
741
     *
742
     * @return bool
743
     */
744 96
    protected function doctype()
745
    {
746 96
        if (strcasecmp($this->scanner->current(), 'D')) {
747
            return false;
748
        }
749
        // Check that string is DOCTYPE.
750 96
        $chars = $this->scanner->charsWhile('DOCTYPEdoctype');
751 96
        if (strcasecmp($chars, 'DOCTYPE')) {
752 1
            $this->parseError('Expected DOCTYPE, got %s', $chars);
753
754 1
            return $this->bogusComment('<!' . $chars);
755
        }
756
757 95
        $this->scanner->whitespace();
758 95
        $tok = $this->scanner->current();
759
760
        // EOF: die.
761 95
        if (false === $tok) {
762
            $this->events->doctype('html5', EventHandler::DOCTYPE_NONE, '', true);
763
764
            return $this->eof($tok);
765
        }
766
767
        // NULL char: convert.
768 95
        if ("\0" === $tok) {
769
            $this->parseError('Unexpected null character in DOCTYPE.');
770
        }
771
772 95
        $stop = " \n\f>";
773 95
        $doctypeName = $this->scanner->charsUntil($stop);
774
        // Lowercase ASCII, replace \0 with FFFD
775 95
        $doctypeName = strtolower(strtr($doctypeName, "\0", UTF8Utils::FFFD));
Security Bug
It seems that $doctypeName can also be of type false; however, strtr() only seems to accept string. Did you forget to handle an error condition?
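
One way to guard against the false value is sketched below, under stated assumptions: `normalizeDoctypeName()` and the constant `REPLACEMENT_CHAR` are hypothetical stand-ins, not part of the library. The sketch uses `str_replace()` instead of the three-argument `strtr()`, because that form of `strtr()` translates byte for byte and ignores the extra characters of the longer argument, so it cannot substitute a multi-byte sequence for a single byte.

```php
<?php

// Hypothetical stand-in for UTF8Utils::FFFD (U+FFFD encoded as UTF-8).
const REPLACEMENT_CHAR = "\xEF\xBF\xBD";

/**
 * Lowercase a DOCTYPE name and replace NUL bytes.
 *
 * @param string|false $raw Value as returned by a scanner-like charsUntil().
 */
function normalizeDoctypeName($raw): string
{
    // Handle the error value before any string function sees it.
    if (false === $raw) {
        return '';
    }

    return strtolower(str_replace("\0", REPLACEMENT_CHAR, $raw));
}

echo normalizeDoctypeName('HTML'), "\n"; // prints "html"
```

Whether an empty string is the right fallback depends on how the event handler treats a nameless DOCTYPE; the sketch only shows where the check would sit.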
776
777 95
        $tok = $this->scanner->current();
778
779
        // If false, emit a parse error, DOCTYPE, and return.
780 95
        if (false === $tok) {
781 1
            $this->parseError('Unexpected EOF in DOCTYPE declaration.');
782 1
            $this->events->doctype($doctypeName, EventHandler::DOCTYPE_NONE, null, true);
783
784 1
            return true;
785
        }
786
787
        // Short DOCTYPE, like <!DOCTYPE html>
788 95
        if ('>' == $tok) {
789
            // DOCTYPE without a name.
790 95
            if (0 == strlen($doctypeName)) {
791 1
                $this->parseError('Expected a DOCTYPE name. Got nothing.');
792 1
                $this->events->doctype($doctypeName, 0, null, true);
793 1
                $this->scanner->consume();
794
795 1
                return true;
796
            }
797 95
            $this->events->doctype($doctypeName);
798 95
            $this->scanner->consume();
799
800 95
            return true;
801
        }
802 1
        $this->scanner->whitespace();
803
804 1
        $pub = strtoupper($this->scanner->getAsciiAlpha());
805 1
        $white = $this->scanner->whitespace();
806
807
        // Get ID, and flag it as pub or system.
808 1
        if (('PUBLIC' == $pub || 'SYSTEM' == $pub) && $white > 0) {
809
            // Get the sys ID.
810 1
            $type = 'PUBLIC' == $pub ? EventHandler::DOCTYPE_PUBLIC : EventHandler::DOCTYPE_SYSTEM;
811 1
            $id = $this->quotedString("\0>");
812 1
            if (false === $id) {
813
                $this->events->doctype($doctypeName, $type, $pub, false);
814
815
                return false;
816
            }
817
818
            // Premature EOF.
819 1
            if (false === $this->scanner->current()) {
820 1
                $this->parseError('Unexpected EOF in DOCTYPE');
821 1
                $this->events->doctype($doctypeName, $type, $id, true);
822
823 1
                return true;
824
            }
825
826
            // Well-formed complete DOCTYPE.
827 1
            $this->scanner->whitespace();
828 1
            if ('>' == $this->scanner->current()) {
829 1
                $this->events->doctype($doctypeName, $type, $id, false);
830 1
                $this->scanner->consume();
831
832 1
                return true;
833
            }
834
835
            // If we get here, we have <!DOCTYPE foo PUBLIC "bar" SOME_JUNK
836
            // Throw away the junk, parse error, quirks mode, return true.
837 1
            $this->scanner->charsUntil('>');
838 1
            $this->parseError('Malformed DOCTYPE.');
839 1
            $this->events->doctype($doctypeName, $type, $id, true);
840 1
            $this->scanner->consume();
841
842 1
            return true;
843
        }
844
845
        // Else it's a bogus DOCTYPE.
846
        // Consume to > and trash.
847 1
        $this->scanner->charsUntil('>');
848
849 1
        $this->parseError('Expected PUBLIC or SYSTEM. Got %s.', $pub);
850 1
        $this->events->doctype($doctypeName, 0, null, true);
851 1
        $this->scanner->consume();
852
853 1
        return true;
854
    }

    /**
     * Utility for reading a quoted string.
     *
     * @param string $stopchars Characters (in addition to a close-quote) that should stop the string.
     *                          E.g. sometimes '>' is higher precedence than '"' or "'".
     *
     * @return string|false The string without its quotation marks, or false if
     *                      the current character is not an opening quote.
     */
    protected function quotedString($stopchars)
    {
        $tok = $this->scanner->current();
        if ('"' == $tok || "'" == $tok) {
            $this->scanner->consume();
            $ret = $this->scanner->charsUntil($tok . $stopchars);
            if ($this->scanner->current() == $tok) {
                $this->scanner->consume();
            } else {
                // Parse error because no close quote.
                $this->parseError('Expected %s, got %s', $tok, $this->scanner->current());
            }

            return $ret;
        }

        return false;
    }

    /**
     * Handle a CDATA section.
     *
     * @return bool
     */
    protected function cdataSection()
    {
        if ('[' != $this->scanner->current()) {
            return false;
        }
        $cdata = '';
        $this->scanner->consume();

        $chars = $this->scanner->charsWhile('CDAT');
        if ('CDATA' != $chars || '[' != $this->scanner->current()) {
            $this->parseError('Expected [CDATA[, got %s', $chars);

            return $this->bogusComment('<![' . $chars);
        }

        $tok = $this->scanner->next();
        do {
            if (false === $tok) {
                $this->parseError('Unexpected EOF inside CDATA.');
                $this->bogusComment('<![CDATA[' . $cdata);

                return true;
            }
            $cdata .= $tok;
            $tok = $this->scanner->next();
        } while (!$this->scanner->sequenceMatches(']]>'));

        // Consume ]]>
        $this->scanner->consume(3);

        $this->events->cdata($cdata);

        return true;
    }

    // ================================================================
    // Non-HTML5
    // ================================================================

    /**
     * Handle a processing instruction.
     *
     * XML processing instructions are supposed to be ignored in HTML5,
     * treated as "bogus comments". However, since we're not a user
     * agent, we allow them. We consume until ?> and then issue an
     * EventHandler::processingInstruction() event.
     *
     * @return bool
     */
    protected function processingInstruction()
    {
        if ('?' != $this->scanner->current()) {
            return false;
        }

        $tok = $this->scanner->next();
        $procName = $this->scanner->getAsciiAlpha();
        $white = $this->scanner->whitespace();

        // If not a PI, send to bogusComment.
        if (0 == strlen($procName) || 0 == $white || false == $this->scanner->current()) {
            $this->parseError("Expected processing instruction name, got $tok");
            $this->bogusComment('<?' . $tok . $procName);

            return true;
        }

        $data = '';
        // As long as it's not the case that the next two chars are ? and >.
        while (!('?' == $this->scanner->current() && '>' == $this->scanner->peek())) {
            $data .= $this->scanner->current();

            $tok = $this->scanner->next();
            if (false === $tok) {
                $this->parseError('Unexpected EOF in processing instruction.');
                $this->events->processingInstruction($procName, $data);

                return true;
            }
        }

        $this->scanner->consume(2); // Consume the closing tag
        $this->events->processingInstruction($procName, $data);

        return true;
    }

    // ================================================================
    // UTILITY FUNCTIONS
    // ================================================================

    /**
     * Read from the input stream until we get to the desired sequence
     * or hit the end of the input stream.
     *
     * @param string $sequence
     *
     * @return string
     */
    protected function readUntilSequence($sequence)
    {
        $buffer = '';

        // Optimization for reading larger blocks faster.
        $first = substr($sequence, 0, 1);
        while (false !== $this->scanner->current()) {
            $buffer .= $this->scanner->charsUntil($first);

            // Stop as soon as we hit the stopping condition.
            if ($this->scanner->sequenceMatches($sequence, false)) {
                return $buffer;
            }
            $buffer .= $this->scanner->current();
            $this->scanner->consume();
        }

        // If we get here, we hit the EOF.
        $this->parseError('Unexpected EOF during text read.');

        return $buffer;
    }

    /**
     * Check if upcoming chars match the given sequence.
     *
     * This will read the stream for the $sequence. If it's
     * found, this will return true. If not, return false.
     * Since this unconsumes any chars it reads, the caller
     * will still need to read the next sequence, even if
     * this returns true.
     *
     * Example: $this->scanner->sequenceMatches('</script>') will
     * see if the input stream is at the start of a
     * '</script>' string.
     *
     * @param string $sequence
     * @param bool   $caseSensitive
     *
     * @return bool
     */
    protected function sequenceMatches($sequence, $caseSensitive = true)
    {
        @trigger_error(__METHOD__ . ' method is deprecated since version 2.4 and will be removed in 3.0. Use Scanner::sequenceMatches() instead.', E_USER_DEPRECATED);

        return $this->scanner->sequenceMatches($sequence, $caseSensitive);
    }

    /**
     * Send a TEXT event with the contents of the text buffer.
     *
     * This emits an EventHandler::text() event with the current contents of the
     * temporary text buffer. (The buffer is used to group as much PCDATA
     * as we can instead of emitting lots and lots of TEXT events.)
     */
    protected function flushBuffer()
    {
        if ('' === $this->text) {
            return;
        }
        $this->events->text($this->text);
        $this->text = '';
    }

    /**
     * Add text to the temporary buffer.
     *
     * @see flushBuffer()
     *
     * @param string $str
     */
    protected function buffer($str)
    {
        $this->text .= $str;
    }

    /**
     * Emit a parse error.
     *
     * A parse error always returns false because it never consumes any
     * characters.
     *
     * @param string $msg
     *
     * @return bool Always false.
     */
    protected function parseError($msg)
    {
        $args = func_get_args();

        if (count($args) > 1) {
            array_shift($args);
            $msg = vsprintf($msg, $args);
        }

        $line = $this->scanner->currentLine();
        $col = $this->scanner->columnOffset();
        $this->events->parseError($msg, $line, $col);

        return false;
    }

    /**
     * Decode a character reference and return the string.
     *
     * If $inAttribute is set to true, a bare & will be returned as-is.
     *
     * @param bool $inAttribute Set to true if the text is inside of an attribute value.
     *                          false otherwise.
     *
     * @return string
     */
    protected function decodeCharacterReference($inAttribute = false)
    {
        // Next char after &.
        $tok = $this->scanner->next();
        $start = $this->scanner->position();

        if (false === $tok) {
            return '&';
        }

        // These indicate not an entity. We return just
        // the &.
        if ("\t" === $tok || "\n" === $tok || "\f" === $tok || ' ' === $tok || '&' === $tok || '<' === $tok) {
            // $this->scanner->next();
            return '&';
        }

        // Numeric entity
        if ('#' === $tok) {
            $tok = $this->scanner->next();

            // Hexadecimal encoding.
            // X[0-9a-fA-F]+;
            // x[0-9a-fA-F]+;
            if ('x' === $tok || 'X' === $tok) {
                $tok = $this->scanner->next(); // Consume x

                // Convert from hex code to char.
                $hex = $this->scanner->getHex();
                if (empty($hex)) {
                    $this->parseError('Expected &#xHEX;, got &#x%s', $tok);
                    // We unconsume because we don't know what parser rules might
                    // be in effect for the remaining chars. For example, '&#>'
                    // might result in a specific parsing rule inside of tag
                    // contexts, while not inside of pcdata context.
                    $this->scanner->unconsume(2);

                    return '&';
                }
                $entity = CharacterReference::lookupHex($hex);
            } else {
                // Decimal encoding.
                // [0-9]+;
                // Convert from decimal to char.
                $numeric = $this->scanner->getNumeric();
                if (false === $numeric) {
                    $this->parseError('Expected &#DIGITS;, got &#%s', $tok);
                    $this->scanner->unconsume(2);

                    return '&';
                }
                $entity = CharacterReference::lookupDecimal($numeric);
            }
        } elseif ('=' === $tok && $inAttribute) {
            return '&';
        } else { // String entity.
            // Attempt to consume a string up to a ';'.
            // [a-zA-Z0-9]+;
            $cname = $this->scanner->getAsciiAlphaNum();
            $entity = CharacterReference::lookupName($cname);

            // When no entity is found provide the name of the unmatched string
            // and continue on as the & is not part of an entity. The & will
            // be converted to &amp; elsewhere.
            if (null === $entity) {
                if (!$inAttribute || '' === $cname) {
                    $this->parseError("No match in entity table for '%s'", $cname);
                }
                $this->scanner->unconsume($this->scanner->position() - $start);

                return '&';
            }
        }

        // The scanner has advanced the cursor for us.
        $tok = $this->scanner->current();

        // We have an entity. We're done here.
        if (';' === $tok) {
            $this->scanner->consume();

            return $entity;
        }

        // If in an attribute, then failing to match ; means unconsume the
        // entire string. Otherwise, failure to match is an error.
        if ($inAttribute) {
            $this->scanner->unconsume($this->scanner->position() - $start);

            return '&';
        }

        $this->parseError('Expected &ENTITY;, got &ENTITY%s (no trailing ;) ', $tok);

        return '&' . $entity;
    }
}
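
The block-read loop in readUntilSequence() can be tried outside the tokenizer. The following is a minimal sketch, not the library's code: it assumes a plain string and an integer offset in place of the Scanner, and the function name readUntilSequenceSketch is hypothetical. As in the method above, it jumps to the next occurrence of the sequence's first character and only then runs the case-insensitive comparison.

```php
<?php

/**
 * Hypothetical stand-alone version of the loop in readUntilSequence().
 * Assumes $sequence is non-empty and $offset is within $input.
 *
 * @return array Text read, offset where reading stopped, and whether the
 *               sequence was found at that offset.
 */
function readUntilSequenceSketch(string $input, int $offset, string $sequence): array
{
    $buffer = '';
    $first = $sequence[0];
    $length = strlen($input);
    $seqLength = strlen($sequence);

    while ($offset < $length) {
        // Skip ahead to the next occurrence of the first character.
        $skip = strcspn($input, $first, $offset);
        $buffer .= substr($input, $offset, $skip);
        $offset += $skip;

        if ($offset >= $length) {
            break;
        }

        // Case-insensitive check for the full sequence at this position.
        $candidate = substr($input, $offset, $seqLength);
        if (0 === strncasecmp($candidate, $sequence, $seqLength)) {
            return [$buffer, $offset, true];
        }

        // False alarm: keep the character and continue scanning.
        $buffer .= $input[$offset];
        ++$offset;
    }

    // End of input reached without finding the sequence.
    return [$buffer, $offset, false];
}
```

Called with the input 'a < b</SCRIPT>rest', offset 0 and the sequence '</script>', the sketch returns the text 'a < b', stops at offset 5, and reports the sequence as found. The stray '<' is kept in the text because the full comparison fails at that position.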
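
The numeric branch of decodeCharacterReference() falls back to a bare '&' whenever the digits are missing. A rough sketch of that idea follows, under loud assumptions: it uses regular expressions and PHP's built-in html_entity_decode() instead of the Scanner and CharacterReference classes, it requires the trailing ';', and the helper name decodeNumericReferenceSketch is hypothetical.

```php
<?php

/**
 * Hypothetical helper, not part of the library: decode a numeric character
 * reference at the start of $input, or return a bare '&' if none is there.
 */
function decodeNumericReferenceSketch(string $input): string
{
    if (1 === preg_match('/^&#[xX]([0-9a-fA-F]{1,6});/', $input, $match)) {
        // Hexadecimal form: &#xHEX;
        $codePoint = hexdec($match[1]);
    } elseif (1 === preg_match('/^&#([0-9]{1,7});/', $input, $match)) {
        // Decimal form: &#DIGITS;
        $codePoint = (int) $match[1];
    } else {
        // Not a numeric reference: hand back the ampersand unchanged.
        return '&';
    }

    // Let PHP map the code point to UTF-8. Code points that PHP considers
    // invalid for HTML5 may come back undecoded.
    return html_entity_decode('&#' . $codePoint . ';', ENT_QUOTES | ENT_HTML5, 'UTF-8');
}
```

Unlike the tokenizer, this sketch does not report parse errors or track a cursor position; it only illustrates the split between the hexadecimal form, the decimal form, and the fallback.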