Completed
Push — master ( 88b7c6...182f34 )
by Asmir
11s
created

Tokenizer::doctype() (rating: C)

Complexity

Conditions 14
Paths 26

Size

Total Lines 111

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 54
CRAP Score 14.2965

Importance

Changes 0
Metric Value
dl 0
loc 111
ccs 54
cts 61
cp 0.8852
rs 5.0133
c 0
b 0
f 0
cc 14
nc 26
nop 0
crap 14.2965
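
The reported CRAP score is consistent with the commonly cited Change Risk Anti-Patterns formula, comp^2 * (1 - cov)^3 + comp, evaluated with the complexity (cc 14) and coverage (cp 0.8852) from the table above. The following is a minimal sketch of that arithmetic; the function name `crapScore` is hypothetical and belongs neither to the analysed library nor to the analysis tool.

```php
<?php

/**
 * Hypothetical helper: CRAP score from cyclomatic complexity and coverage.
 *
 * @param int   $complexity Cyclomatic complexity of the method ("cc").
 * @param float $coverage   Covered fraction between 0 and 1 ("cp").
 */
function crapScore(int $complexity, float $coverage): float
{
    // Untested complexity is penalised cubically; full coverage leaves
    // only the complexity itself.
    return $complexity ** 2 * (1.0 - $coverage) ** 3 + $complexity;
}

// Values taken from the metric table: cc = 14, cp = 0.8852.
echo round(crapScore(14, 0.8852), 4), PHP_EOL; // prints 14.2965
```

With full coverage the score would fall to the bare complexity of 14, so under this formula reducing the number of conditions lowers the score more than adding tests does.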

How to fix: Long Method, Complexity

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include:
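
Extract Method is one such refactoring. As a hedged sketch, the step in doctype() that carries the comment "Get ID, and flag it as pub or system" could become a small function of its own, with the comment supplying the name. The function `doctypeIdType` and the two constants are hypothetical stand-ins (the constant values are assumptions made for illustration), not the library's actual API.

```php
<?php

// Hypothetical stand-ins for EventHandler::DOCTYPE_PUBLIC / DOCTYPE_SYSTEM.
const DOCTYPE_PUBLIC = 1;
const DOCTYPE_SYSTEM = 2;

/**
 * Extracted step: decide whether a DOCTYPE identifier is public or system.
 *
 * @param string $keyword The keyword read after the DOCTYPE name.
 *
 * @return int|null The identifier type, or null for any other keyword.
 */
function doctypeIdType(string $keyword): ?int
{
    switch (strtoupper($keyword)) {
        case 'PUBLIC':
            return DOCTYPE_PUBLIC;
        case 'SYSTEM':
            return DOCTYPE_SYSTEM;
        default:
            return null;
    }
}

var_dump(doctypeIdType('public')); // int(1)
var_dump(doctypeIdType('bogus'));  // NULL
```

Each extracted step removes one or two conditions from the long method, which is how this refactoring also addresses the complexity finding.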

1
<?php
2
3
namespace Masterminds\HTML5\Parser;
4
5
use Masterminds\HTML5\Elements;
6
7
/**
8
 * The HTML5 tokenizer.
9
 *
10
 * The tokenizer's role is reading data from the scanner and gathering it into
11
 * semantic units. From the tokenizer, data is emitted to an event handler,
12
 * which may (for example) create a DOM tree.
13
 *
14
 * The HTML5 specification has a detailed explanation of tokenizing HTML5. We
15
 * follow that specification to the maximum extent that we can. If you find
16
 * a discrepancy that is not documented, please file a bug and/or submit a
17
 * patch.
18
 *
19
 * This tokenizer is implemented as a recursive descent parser.
20
 *
21
 * Within the API documentation, you may see references to the specific section
22
 * of the HTML5 spec that the code attempts to reproduce. Example: 8.2.4.1.
23
 * This refers to section 8.2.4.1 of the HTML5 CR specification.
24
 *
25
 * @see http://www.w3.org/TR/2012/CR-html5-20121217/
26
 */
27
class Tokenizer
28
{
29
    protected $scanner;
30
31
    protected $events;
32
33
    protected $tok;
34
35
    /**
36
     * Buffer for text.
37
     */
38
    protected $text = '';
39
40
    // When this goes to false, the parser stops.
41
    protected $carryOn = true;
42
43
    protected $textMode = 0; // TEXTMODE_NORMAL;
44
    protected $untilTag = null;
45
46
    const CONFORMANT_XML = 'xml';
47
    const CONFORMANT_HTML = 'html';
48
    protected $mode = self::CONFORMANT_HTML;
49
50
    /**
51
     * Create a new tokenizer.
52
     *
53
     * Typically, parsing a document involves creating a new tokenizer, giving
54
     * it a scanner (input) and an event handler (output), and then calling
55
     * the Tokenizer::parse() method.
56
     *
57
     * @param Scanner      $scanner      A scanner initialized with an input stream.
58
     * @param EventHandler $eventHandler An event handler, initialized and ready to receive events.
59
     * @param string       $mode
60
     */
61 127
    public function __construct($scanner, $eventHandler, $mode = self::CONFORMANT_HTML)
62
    {
63 127
        $this->scanner = $scanner;
64 127
        $this->events = $eventHandler;
65 127
        $this->mode = $mode;
66 127
    }
67
68
    /**
69
     * Begin parsing.
70
     *
71
     * This will begin scanning the document, tokenizing as it goes.
72
     * Tokens are emitted into the event handler.
73
     *
74
     * Tokenizing will continue until the document is completely
75
     * read. Errors are emitted into the event handler, but
76
     * the parser will attempt to continue parsing until the
77
     * entire input stream is read.
78
     */
79 127
    public function parse()
80
    {
81
        do {
82 127
            $this->consumeData();
83
            // FIXME: Add infinite loop protection.
84 127
        } while ($this->carryOn);
85 127
    }
86
87
    /**
88
     * Set the text mode for the character data reader.
89
     *
90
     * HTML5 defines three different modes for reading text:
91
     * - Normal: Read until a tag is encountered.
92
     * - RCDATA: Read until a tag is encountered, but skip a few otherwise-
93
     * special characters.
94
     * - Raw: Read until a special closing tag is encountered (viz. pre, script)
95
     *
96
     * This allows those modes to be set.
97
     *
98
     * Normally, setting is done by the event handler via a special return code on
99
     * startTag(), but it can also be set manually using this function.
100
     *
101
     * @param int    $textmode One of Elements::TEXT_*.
102
     * @param string $untilTag The tag that should stop RAW or RCDATA mode. Normal mode does not
103
     *                         use this indicator.
104
     */
105 108
    public function setTextMode($textmode, $untilTag = null)
106
    {
107 108
        $this->textMode = $textmode & (Elements::TEXT_RAW | Elements::TEXT_RCDATA);
108 108
        $this->untilTag = $untilTag;
109 108
    }
110
111
    /**
112
     * Consume a character and make a move.
113
     * HTML5 8.2.4.1.
114
     */
115 127
    protected function consumeData()
116
    {
117 127
        $tok = $this->scanner->current();
118
119 127
        if ('&' === $tok) {
120
            // Character reference
121 8
            $ref = $this->decodeCharacterReference();
122 8
            $this->buffer($ref);
123
124 8
            $tok = $this->scanner->current();
125 8
        }
126
127
        // Parse tag
128 127
        if ('<' === $tok) {
129
            // Any buffered text data can go out now.
130 123
            $this->flushBuffer();
131
132 123
            $tok = $this->scanner->next();
133
134 123
            if ('!' === $tok) {
135 101
                $this->markupDeclaration();
136 123
            } elseif ('/' === $tok) {
137 111
                $this->endTag();
138 120
            } elseif ('?' === $tok) {
139 7
                $this->processingInstruction();
140 119
            } elseif (ctype_alpha($tok)) {
141 114
                $this->tagName();
142 114
            } else {
143 1
                $this->parseError('Illegal tag opening');
144
                // TODO is this necessary ?
145 1
                $this->characterData();
146
            }
147
148 123
            $tok = $this->scanner->current();
149 123
        }
150
151 127
        if (false === $tok) {
152
            // Handle end of document
153 127
            $this->eof();
154 127
        } else {
155
            // Parse character
156 112
            switch ($this->textMode) {
157 112
                case Elements::TEXT_RAW:
158 8
                    $this->rawText($tok);
159 8
                    break;
160
161 112
                case Elements::TEXT_RCDATA:
162 37
                    $this->rcdata($tok);
163 37
                    break;
164
165 111
                default:
166 111
                    if ('<' !== $tok && '&' !== $tok) {
167
                        // NULL character
168 87
                        if ("\00" === $tok) {
169
                            $this->parseError('Received null character.');
170
                        }
171
172 87
                        $this->text .= $tok;
173 87
                        $this->scanner->consume();
174 87
                    }
175 112
            }
176
        }
177
178 127
        return $this->carryOn;
179
    }
180
181
    /**
182
     * Parse anything that looks like character data.
183
     *
184
     * Different rules apply based on the current text mode.
185
     *
186
     * @see Elements::TEXT_RAW Elements::TEXT_RCDATA.
187
     */
188 1
    protected function characterData()
189
    {
190 1
        $tok = $this->scanner->current();
191 1
        if (false === $tok) {
192
            return false;
193
        }
194 1
        switch ($this->textMode) {
195 1
            case Elements::TEXT_RAW:
196
                return $this->rawText($tok);
197 1
            case Elements::TEXT_RCDATA:
198
                return $this->rcdata($tok);
199 1
            default:
200 1
                if ('<' === $tok || '&' === $tok) {
201
                    return false;
202
                }
203
204 1
                return $this->text($tok);
205 1
        }
206
    }
207
208
    /**
209
     * This buffers the current token as character data.
210
     *
211
     * @param string $tok The current token.
212
     *
213
     * @return bool
214
     */
215 1
    protected function text($tok)
216
    {
217
        // This should never happen...
218 1
        if (false === $tok) {
219
            return false;
220
        }
221
222
        // NULL character
223 1
        if ("\00" === $tok) {
224
            $this->parseError('Received null character.');
225
        }
226
227 1
        $this->buffer($tok);
228 1
        $this->scanner->consume();
229
230 1
        return true;
231
    }
232
233
    /**
234
     * Read text in RAW mode.
235
     *
236
     * @param string $tok The current token.
237
     *
238
     * @return bool
239
     */
240 8
    protected function rawText($tok)
241
    {
242 8
        if (is_null($this->untilTag)) {
243
            return $this->text($tok);
244
        }
245
246 8
        $sequence = '</' . $this->untilTag . '>';
247 8
        $txt = $this->readUntilSequence($sequence);
248 8
        $this->events->text($txt);
249 8
        $this->setTextMode(0);
250
251 8
        return $this->endTag();
252
    }
253
254
    /**
255
     * Read text in RCDATA mode.
256
     *
257
     * @param string $tok The current token.
258
     *
259
     * @return bool
260
     */
261 37
    protected function rcdata($tok)
262
    {
263 37
        if (is_null($this->untilTag)) {
264
            return $this->text($tok);
265
        }
266
267 37
        $sequence = '</' . $this->untilTag;
268 37
        $txt = '';
269
270 37
        $caseSensitive = !Elements::isHtml5Element($this->untilTag);
271 37
        while (false !== $tok && !('<' == $tok && ($this->scanner->sequenceMatches($sequence, $caseSensitive)))) {
272 35
            if ('&' == $tok) {
273 1
                $txt .= $this->decodeCharacterReference();
274 1
                $tok = $this->scanner->current();
275 1
            } else {
276 35
                $txt .= $tok;
277 35
                $tok = $this->scanner->next();
278
            }
279 35
        }
280 37
        $len = strlen($sequence);
281 37
        $this->scanner->consume($len);
282 37
        $len += $this->scanner->whitespace();
283 37
        if ('>' !== $this->scanner->current()) {
284
            $this->parseError('Unclosed RCDATA end tag');
285
        }
286
287 37
        $this->scanner->unconsume($len);
288 37
        $this->events->text($txt);
289 37
        $this->setTextMode(0);
290
291 37
        return $this->endTag();
292
    }
293
294
    /**
295
     * If the document is read, emit an EOF event.
296
     */
297 127
    protected function eof()
298
    {
299
        // fprintf(STDOUT, "EOF");
300 127
        $this->flushBuffer();
301 127
        $this->events->eof();
302 127
        $this->carryOn = false;
303 127
    }
304
305
    /**
306
     * Look for markup.
307
     */
308 101
    protected function markupDeclaration()
309
    {
310 101
        $tok = $this->scanner->next();
311
312
        // Comment:
313 101
        if ('-' == $tok && '-' == $this->scanner->peek()) {
314 6
            $this->scanner->consume(2);
315
316 6
            return $this->comment();
317 98
        } elseif ('D' == $tok || 'd' == $tok) { // Doctype
318 96
            return $this->doctype();
319 7
        } elseif ('[' == $tok) { // CDATA section
320 7
            return $this->cdataSection();
321
        }
322
323
        // FINISH
324 1
        $this->parseError('Expected <!--, <![CDATA[, or <!DOCTYPE. Got <!%s', $tok);
325 1
        $this->bogusComment('<!');
326
327 1
        return true;
328
    }
329
330
    /**
331
     * Consume an end tag. See section 8.2.4.9.
332
     */
333 111
    protected function endTag()
334
    {
335 111
        if ('/' != $this->scanner->current()) {
336 44
            return false;
337
        }
338 111
        $tok = $this->scanner->next();
339
340
        // a-zA-Z -> tagname
341
        // > -> parse error
342
        // EOF -> parse error
343
        // -> parse error
344 111
        if (!ctype_alpha($tok)) {
345 2
            $this->parseError("Expected tag name, got '%s'", $tok);
346 2
            if ("\0" == $tok || false === $tok) {
347
                return false;
348
            }
349
350 2
            return $this->bogusComment('</');
351
        }
352
353 110
        $name = $this->scanner->charsUntil("\n\f \t>");
354 110
        $name = self::CONFORMANT_XML === $this->mode ? $name : strtolower($name);
355
        // Trash whitespace.
356 110
        $this->scanner->whitespace();
357
358 110
        $tok = $this->scanner->current();
359 110
        if ('>' != $tok) {
360 1
            $this->parseError("Expected >, got '%s'", $tok);
361
            // We just trash stuff until we get to the next tag close.
362 1
            $this->scanner->charsUntil('>');
363 1
        }
364
365 110
        $this->events->endTag($name);
366 110
        $this->scanner->consume();
367
368 110
        return true;
369
    }
370
371
    /**
372
     * Consume a tag name and body. See section 8.2.4.10.
373
     */
374 114
    protected function tagName()
375
    {
376
        // We know this is at least one char.
377 114
        $name = $this->scanner->charsWhile(':_-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz');
378 114
        $name = self::CONFORMANT_XML === $this->mode ? $name : strtolower($name);
379 114
        $attributes = array();
380 114
        $selfClose = false;
381
382
        // Handle attribute parse exceptions here so that we can
383
        // react by trying to build a sensible parse tree.
384
        try {
385
            do {
386 114
                $this->scanner->whitespace();
387 114
                $this->attribute($attributes);
388 114
            } while (!$this->isTagEnd($selfClose));
389 114
        } catch (ParseError $e) {
390 2
            $selfClose = false;
391
        }
392
393 114
        $mode = $this->events->startTag($name, $attributes, $selfClose);
394
395 114
        if (is_int($mode)) {
396 107
            $this->setTextMode($mode, $name);
397 107
        }
398
399 114
        $this->scanner->consume();
400
401 114
        return true;
402
    }
403
404
    /**
405
     * Check if the scanner has reached the end of a tag.
406
     */
407 114
    protected function isTagEnd(&$selfClose)
408
    {
409 114
        $tok = $this->scanner->current();
410 114
        if ('/' == $tok) {
411 15
            $this->scanner->consume();
412 15
            $this->scanner->whitespace();
413 15
            $tok = $this->scanner->current();
414
415 15
            if ('>' == $tok) {
416 15
                $selfClose = true;
417
418 15
                return true;
419
            }
420 2
            if (false === $tok) {
421 1
                $this->parseError('Unexpected EOF inside of tag.');
422
423 1
                return true;
424
            }
425
            // Basically, we skip the / token and go on.
426
            // See 8.2.4.43.
427 1
            $this->parseError("Unexpected '%s' inside of a tag.", $tok);
428
429 1
            return false;
430
        }
431
432 114
        if ('>' == $tok) {
433 114
            return true;
434
        }
435 32
        if (false === $tok) {
436 2
            $this->parseError('Unexpected EOF inside of tag.');
437
438 2
            return true;
439
        }
440
441 31
        return false;
442
    }
443
444
    /**
445
     * Parse attributes from inside of a tag.
446
     *
447
     * @param string[] $attributes
448
     *
449
     * @return bool
450
     *
451
     * @throws ParseError
452
     */
453 114
    protected function attribute(&$attributes)
454
    {
455 114
        $tok = $this->scanner->current();
456 114
        if ('/' == $tok || '>' == $tok || false === $tok) {
457 108
            return false;
458
        }
459
460 82
        if ('<' == $tok) {
461 2
            $this->parseError("Unexpected '<' inside of attributes list.");
462
            // Push the < back onto the stack.
463 2
            $this->scanner->unconsume();
464
            // Let the caller figure out how to handle this.
465 2
            throw new ParseError('Start tag inside of attribute.');
466
        }
467
468 82
        $name = strtolower($this->scanner->charsUntil("/>=\n\f\t "));
469
470 82
        if (0 == strlen($name)) {
471 3
            $tok = $this->scanner->current();
472 3
            $this->parseError('Expected an attribute name, got %s.', $tok);
473
            // Really, only '=' can be the char here. Everything else gets absorbed
474
            // under one rule or another.
475 3
            $name = $tok;
476 3
            $this->scanner->consume();
477 3
        }
478
479 82
        $isValidAttribute = true;
480
        // Attribute names can contain most Unicode characters for HTML5.
481
        // But the method "DOMElement::setAttribute" throws an exception
482
        // because of its own internal restriction, so these have to be filtered.
483
        // see issue #23: https://github.com/Masterminds/html5-php/issues/23
484
        // and http://www.w3.org/TR/2011/WD-html5-20110525/syntax.html#syntax-attribute-name
485 82
        if (preg_match("/[\x1-\x2C\\/\x3B-\x40\x5B-\x5E\x60\x7B-\x7F]/u", $name)) {
486 4
            $this->parseError('Unexpected characters in attribute name: %s', $name);
487 4
            $isValidAttribute = false;
488 4
        }         // There is no limitation for 1st character in HTML5.
489
        // But the method "DOMElement::setAttribute" throws an exception for the
490
        // characters below, so they have to be filtered.
491
        // see issue #23: https://github.com/Masterminds/html5-php/issues/23
492
        // and http://www.w3.org/TR/2011/WD-html5-20110525/syntax.html#syntax-attribute-name
493 79
        elseif (preg_match('/^[0-9.-]/u', $name)) {
494 1
            $this->parseError('Unexpected character at the beginning of attribute name: %s', $name);
495 1
            $isValidAttribute = false;
496 1
        }
497
        // 8.1.2.3
498 82
        $this->scanner->whitespace();
499
500 82
        $val = $this->attributeValue();
501 82
        if ($isValidAttribute) {
502 79
            $attributes[$name] = $val;
503 79
        }
504
505 82
        return true;
506
    }
507
508
    /**
509
     * Consume an attribute value. See section 8.2.4.37 and after.
510
     *
511
     * @return string|null
512
     */
513 82
    protected function attributeValue()
514
    {
515 82
        if ('=' != $this->scanner->current()) {
516 13
            return null;
517
        }
518 78
        $this->scanner->consume();
519
        // 8.1.2.3
520 78
        $this->scanner->whitespace();
521
522 78
        $tok = $this->scanner->current();
523
        switch ($tok) {
524 78
            case "\n":
525 78
            case "\f":
526 78
            case ' ':
527 78
            case "\t":
528
                // Whitespace here indicates an empty value.
529
                return null;
530 78
            case '"':
531 78
            case "'":
532 78
                $this->scanner->consume();
533
534 78
                return $this->quotedAttributeValue($tok);
Security Bug
It seems like $tok, defined by $this->scanner->current() on line 522, can also be of type false; however, Masterminds\HTML5\Parser...:quotedAttributeValue() seems to accept only string. Did you forget to handle an error condition?

This check looks for type mismatches where the missing type is false. This usually indicates an error condition.

Consider the following example:

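<?php

// Named createDate() because PHP function names are case-insensitive and
// getDate() would clash with the built-in getdate().
function createDate($date)
{
    if ($date !== null) {
        return new DateTime($date);
    }

    return false;
}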

This function either returns a new DateTime object or false, if there was an error. This is a typical pattern in PHP programming to show that an error has occurred without raising an exception. The calling code should check for this returned false before passing on the value to another function or method that may not be able to handle a false.
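
The calling side of that pattern can be sketched as follows. All function names here (`currentChar`, `describeQuote`, `readValue`) are hypothetical and only illustrate checking for false before calling a string-only function; they are not part of the analysed library.

```php
<?php

/**
 * Hypothetical reader: returns the character at $pos, or false at end of input.
 *
 * @return string|false
 */
function currentChar(string $input, int $pos)
{
    return $pos < strlen($input) ? $input[$pos] : false;
}

/** Hypothetical consumer that accepts only a string. */
function describeQuote(string $quote): string
{
    return 'quoted with ' . $quote;
}

function readValue(string $input, int $pos): ?string
{
    $tok = currentChar($input, $pos);

    // Handle the error condition before the value reaches a
    // function that cannot deal with false.
    if (false === $tok) {
        return null;
    }

    return describeQuote($tok);
}

var_dump(readValue('"abc"', 0)); // a string
var_dump(readValue('', 0));      // NULL
```

Using a strict comparison (`false === $tok`) matters here, because a loose check would also treat the string '0' as an error.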

535 1
            case '>':
536
                // case '/': // 8.2.4.37 seems to allow foo=/ as a valid attr.
537 1
                $this->parseError('Expected attribute value, got tag end.');
538
539 1
                return null;
540 1
            case '=':
541 1
            case '`':
542
                $this->parseError('Expecting quotes, got %s.', $tok);
543
544
                return $this->unquotedAttributeValue();
545 1
            default:
546 1
                return $this->unquotedAttributeValue();
547 1
        }
548
    }
549
550
    /**
551
     * Get an attribute value string.
552
     *
553
     * @param string $quote IMPORTANT: This is a series of chars! Any one of which will be considered
554
     *                      termination of an attribute's value. E.g. "\"'" will stop at either
555
     *                      ' or ".
556
     *
557
     * @return string The attribute value.
558
     */
559 78
    protected function quotedAttributeValue($quote)
560
    {
561 78
        $stoplist = "\f" . $quote;
562 78
        $val = '';
563
564 78
        while (true) {
565 78
            $tokens = $this->scanner->charsUntil($stoplist . '&');
566 78
            if (false !== $tokens) {
567 78
                $val .= $tokens;
568 78
            } else {
569
                break;
570
            }
571
572 78
            $tok = $this->scanner->current();
573 78
            if ('&' == $tok) {
574 3
                $val .= $this->decodeCharacterReference(true);
575 3
                continue;
576
            }
577 78
            break;
578
        }
579 78
        $this->scanner->consume();
580
581 78
        return $val;
582
    }
583
584 1
    protected function unquotedAttributeValue()
585
    {
586 1
        $val = '';
587 1
        $tok = $this->scanner->current();
588 1
        while (false !== $tok) {
589
            switch ($tok) {
590 1
                case "\n":
591 1
                case "\f":
592 1
                case ' ':
593 1
                case "\t":
594 1
                case '>':
595 1
                    break 2;
596
597 1
                case '&':
598 1
                    $val .= $this->decodeCharacterReference(true);
599 1
                    $tok = $this->scanner->current();
600
601 1
                    break;
602
603 1
                case "'":
604 1
                case '"':
605 1
                case '<':
606 1
                case '=':
607 1
                case '`':
608 1
                    $this->parseError('Unexpected chars in unquoted attribute value %s', $tok);
609 1
                    $val .= $tok;
610 1
                    $tok = $this->scanner->next();
611 1
                    break;
612
613 1
                default:
614 1
                    $val .= $this->scanner->charsUntil("\t\n\f >&\"'<=`");
615
616 1
                    $tok = $this->scanner->current();
617 1
            }
618 1
        }
619
620 1
        return $val;
621
    }
622
623
    /**
624
     * Consume malformed markup as if it were a comment.
625
     * 8.2.4.44.
626
     *
627
     * The spec requires that the ENTIRE tag-like thing be enclosed inside of
628
     * the comment. So this will generate comments like:
629
     *
630
     * &lt;!--&lt;/+foo&gt;--&gt;
631
     *
632
     * @param string $leading Prepend any leading characters. This essentially
633
     *                        negates the need to backtrack, but it's sort of a hack.
634
     *
635
     * @return bool
636
     */
637 3
    protected function bogusComment($leading = '')
638
    {
639 3
        $comment = $leading;
640 3
        $tokens = $this->scanner->charsUntil('>');
641 3
        if (false !== $tokens) {
642 2
            $comment .= $tokens;
643 2
        }
644 3
        $tok = $this->scanner->current();
645 3
        if (false !== $tok) {
646 2
            $comment .= $tok;
647 2
        }
648
649 3
        $this->flushBuffer();
650 3
        $this->events->comment($comment);
651 3
        $this->scanner->consume();
652
653 3
        return true;
654
    }
655
656
    /**
657
     * Read a comment.
658
     * Expects the first tok to be inside of the comment.
659
     *
660
     * @return bool
661
     */
662 6
    protected function comment()
663
    {
664 6
        $tok = $this->scanner->current();
665 6
        $comment = '';
666
667
        // <!-->. Emit an empty comment because 8.2.4.46 says to.
668 6
        if ('>' == $tok) {
669
            // Parse error. Emit the comment token.
670 1
            $this->parseError("Expected comment data, got '>'");
671 1
            $this->events->comment('');
672 1
            $this->scanner->consume();
673
674 1
            return true;
675
        }
676
677
        // Replace NULL with the replacement char.
678 6
        if ("\0" == $tok) {
679
            $tok = UTF8Utils::FFFD;
680
        }
681 6
        while (!$this->isCommentEnd()) {
682 6
            $comment .= $tok;
683 6
            $tok = $this->scanner->next();
684 6
        }
685
686 6
        $this->events->comment($comment);
687 6
        $this->scanner->consume();
688
689 6
        return true;
690
    }
691
692
    /**
693
     * Check if the scanner has reached the end of a comment.
694
     *
695
     * @return bool
696
     */
697 6
    protected function isCommentEnd()
698
    {
699 6
        $tok = $this->scanner->current();
700
701
        // EOF
702 6
        if (false === $tok) {
703
            // Hit the end.
704 1
            $this->parseError('Unexpected EOF in a comment.');
705
706 1
            return true;
707
        }
708
709
        // If it doesn't start with -, not the end.
710 6
        if ('-' != $tok) {
711 6
            return false;
712
        }
713
714
        // Advance one, and test for '->'
715 6
        if ('-' == $this->scanner->next() && '>' == $this->scanner->peek()) {
716 6
            $this->scanner->consume(); // Consume the last '>'
717 6
            return true;
718
        }
719
        // Unread '-';
720 2
        $this->scanner->unconsume(1);
721
722 2
        return false;
723
    }
724
725
    /**
726
     * Parse a DOCTYPE.
727
     *
728
     * Parse a DOCTYPE declaration. This method has strong bearing on whether or
729
     * not Quirksmode is enabled on the event handler.
730
     *
731
     * @todo This method is a little long. Should probably refactor.
732
     *
733
     * @return bool
734
     */
735 96
    protected function doctype()
736
    {
737
        // Check that string is DOCTYPE.
738 96
        if ($this->scanner->sequenceMatches('DOCTYPE', false)) {
739 95
            $this->scanner->consume(7);
740 95
        } else {
741 1
            $chars = $this->scanner->charsWhile('DOCTYPEdoctype');
742 1
            $this->parseError('Expected DOCTYPE, got %s', $chars);
743
744 1
            return $this->bogusComment('<!' . $chars);
745
        }
746
747 95
        $this->scanner->whitespace();
748 95
        $tok = $this->scanner->current();
749
750
        // EOF: die.
751 95
        if (false === $tok) {
752
            $this->events->doctype('html5', EventHandler::DOCTYPE_NONE, '', true);
753
            $this->eof();
754
755
            return true;
756
        }
757
758
        // NULL char: convert.
759 95
        if ("\0" === $tok) {
760
            $this->parseError('Unexpected null character in DOCTYPE.');
761
        }
762
763 95
        $stop = " \n\f>";
764 95
        $doctypeName = $this->scanner->charsUntil($stop);
765
        // Lowercase ASCII, replace \0 with FFFD
766 95
        $doctypeName = strtolower(strtr($doctypeName, "\0", UTF8Utils::FFFD));
Security Bug
It seems like $doctypeName can also be of type false; however, strtr() seems to accept only string. Did you forget to handle an error condition?
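
One way to address this report is to turn a false result into an empty string before any string function runs. The sketch below works under stated assumptions: `normalizeDoctypeName` and `REPLACEMENT_CHAR` are hypothetical names, the constant standing in for UTF8Utils::FFFD. It also swaps in str_replace() for the strtr() call in the listing, because strtr() with two string arguments maps bytes one-to-one and ignores the extra bytes of a longer replacement.

```php
<?php

// Hypothetical stand-in for UTF8Utils::FFFD (U+FFFD encoded as UTF-8).
const REPLACEMENT_CHAR = "\xEF\xBF\xBD";

/**
 * Lowercase a DOCTYPE name and replace NUL characters.
 *
 * @param string|false $raw The characters read, or false if nothing was read.
 */
function normalizeDoctypeName($raw): string
{
    // Treat the error value as an empty name instead of passing it on.
    $name = false === $raw ? '' : $raw;

    return str_replace("\0", REPLACEMENT_CHAR, strtolower($name));
}

var_dump(normalizeDoctypeName('HTML')); // string(4) "html"
var_dump(normalizeDoctypeName(false));  // string(0) ""
```

An empty name then reaches the existing "DOCTYPE without a name" branch instead of a type mismatch.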
767
768 95
        $tok = $this->scanner->current();
769
770
        // If false, emit a parse error, DOCTYPE, and return.
771 95
        if (false === $tok) {
772 1
            $this->parseError('Unexpected EOF in DOCTYPE declaration.');
773 1
            $this->events->doctype($doctypeName, EventHandler::DOCTYPE_NONE, null, true);
774
775 1
            return true;
776
        }
777
778
        // Short DOCTYPE, like <!DOCTYPE html>
779 95
        if ('>' == $tok) {
780
            // DOCTYPE without a name.
781 95
            if (0 == strlen($doctypeName)) {
782 1
                $this->parseError('Expected a DOCTYPE name. Got nothing.');
783 1
                $this->events->doctype($doctypeName, 0, null, true);
784 1
                $this->scanner->consume();
785
786 1
                return true;
787
            }
788 95
            $this->events->doctype($doctypeName);
789 95
            $this->scanner->consume();
790
791 95
            return true;
792
        }
793 1
        $this->scanner->whitespace();
794
795 1
        $pub = strtoupper($this->scanner->getAsciiAlpha());
796 1
        $white = $this->scanner->whitespace();
797
798
        // Get ID, and flag it as pub or system.
799 1
        if (('PUBLIC' == $pub || 'SYSTEM' == $pub) && $white > 0) {
800
            // Get the sys ID.
801 1
            $type = 'PUBLIC' == $pub ? EventHandler::DOCTYPE_PUBLIC : EventHandler::DOCTYPE_SYSTEM;
802 1
            $id = $this->quotedString("\0>");
803 1
            if (false === $id) {
804
                $this->events->doctype($doctypeName, $type, $pub, false);
805
806
                return true;
807
            }
808
809
            // Premature EOF.
810 1
            if (false === $this->scanner->current()) {
811 1
                $this->parseError('Unexpected EOF in DOCTYPE');
812 1
                $this->events->doctype($doctypeName, $type, $id, true);
813
814 1
                return true;
815
            }
816
817
            // Well-formed complete DOCTYPE.
818 1
            $this->scanner->whitespace();
819 1
            if ('>' == $this->scanner->current()) {
820 1
                $this->events->doctype($doctypeName, $type, $id, false);
821 1
                $this->scanner->consume();
822
823 1
                return true;
824
            }
825
826
            // If we get here, we have <!DOCTYPE foo PUBLIC "bar" SOME_JUNK
827
            // Throw away the junk, parse error, quirks mode, return true.
828 1
            $this->scanner->charsUntil('>');
829 1
            $this->parseError('Malformed DOCTYPE.');
830 1
            $this->events->doctype($doctypeName, $type, $id, true);
831 1
            $this->scanner->consume();
832
833 1
            return true;
834
        }
835
836
        // Else it's a bogus DOCTYPE.
837
        // Consume to > and trash.
838 1
        $this->scanner->charsUntil('>');
839
840 1
        $this->parseError('Expected PUBLIC or SYSTEM. Got %s.', $pub);
841 1
        $this->events->doctype($doctypeName, 0, null, true);
842 1
        $this->scanner->consume();
843
844 1
        return true;
845
    }
846
847
    /**
848
     * Utility for reading a quoted string.
849
     *
850
     * @param string $stopchars Characters (in addition to a close-quote) that should stop the string.
851
     *                          E.g. sometimes '>' is higher precedence than '"' or "'".
852
     *
853
     * @return mixed String if one is found (quotations omitted).
854
     */
855 1
    protected function quotedString($stopchars)
856
    {
857 1
        $tok = $this->scanner->current();
858 1
        if ('"' == $tok || "'" == $tok) {
859 1
            $this->scanner->consume();
860 1
            $ret = $this->scanner->charsUntil($tok . $stopchars);
861 1
            if ($this->scanner->current() == $tok) {
862 1
                $this->scanner->consume();
863 1
            } else {
864
                // Parse error because no close quote.
865
                $this->parseError('Expected %s, got %s', $tok, $this->scanner->current());
866
            }
867
868 1
            return $ret;
869
        }
870
871
        return false;
872
    }
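The quoted-string logic above can be illustrated outside the tokenizer. The following is a minimal sketch that works on a plain string and an offset instead of the Scanner; the function name `readQuoted()` and its return shape are assumptions made for this example and are not part of the library.

```php
<?php

/**
 * Hypothetical stand-alone sketch: read a quoted string starting at $pos.
 *
 * @return array{0: string|false, 1: int} The value (or false) and the new offset.
 */
function readQuoted(string $input, int $pos, string $stopchars): array
{
    $quote = $input[$pos] ?? '';
    if ('"' !== $quote && "'" !== $quote) {
        return [false, $pos]; // Not at a quote: nothing is consumed.
    }

    ++$pos; // Skip the opening quote.

    // Length of the run containing neither the quote nor a stop character.
    $len = strcspn($input, $quote . $stopchars, $pos);
    $value = substr($input, $pos, $len);
    $pos += $len;

    // Consume the closing quote only if that is what stopped the scan.
    if (($input[$pos] ?? '') === $quote) {
        ++$pos;
    }

    return [$value, $pos];
}
```

When a stop character such as `>` appears before the closing quote, the sketch returns the text read so far and leaves the offset on the stop character, which mirrors how the method above leaves the unmatched character for the caller.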
873
874
    /**
875
     * Handle a CDATA section.
876
     *
877
     * @return bool
878
     */
879 7
    protected function cdataSection()
880
    {
881 7
        $cdata = '';
882 7
        $this->scanner->consume();
883
884 7
        $chars = $this->scanner->charsWhile('CDAT');
885 7
        if ('CDATA' != $chars || '[' != $this->scanner->current()) {
886 1
            $this->parseError('Expected [CDATA[, got %s', $chars);
887
888 1
            return $this->bogusComment('<![' . $chars);
889
        }
890
891 7
        $tok = $this->scanner->next();
892
        do {
893 7
            if (false === $tok) {
894 2
                $this->parseError('Unexpected EOF inside CDATA.');
895 2
                $this->bogusComment('<![CDATA[' . $cdata);
896
897 2
                return true;
898
            }
899 7
            $cdata .= $tok;
900 7
            $tok = $this->scanner->next();
901 7
        } while (!$this->scanner->sequenceMatches(']]>'));
902
903
        // Consume ]]>
904 5
        $this->scanner->consume(3);
905
906 5
        $this->events->cdata($cdata);
907
908 5
        return true;
909
    }
910
911
    // ================================================================
912
    // Non-HTML5
913
    // ================================================================
914
915
    /**
916
     * Handle a processing instruction.
917
     *
918
     * XML processing instructions are supposed to be ignored in HTML5,
919
     * treated as "bogus comments". However, since we're not a user
920
     * agent, we allow them. We consume until ?> and then issue an
921
     * EventHandler::processingInstruction() event.
922
     *
923
     * @return bool
924
     */
925 7
    protected function processingInstruction()
926
    {
927 7
        if ('?' != $this->scanner->current()) {
928
            return false;
929
        }
930
931 7
        $tok = $this->scanner->next();
932 7
        $procName = $this->scanner->getAsciiAlpha();
933 7
        $white = $this->scanner->whitespace();
934
935
        // If not a PI, send to bogusComment.
936 7
        if (0 == strlen($procName) || 0 == $white || false === $this->scanner->current()) {
937 1
            $this->parseError("Expected processing instruction name, got $tok");
938 1
            $this->bogusComment('<?' . $tok . $procName);
939
940 1
            return true;
941
        }
942
943 6
        $data = '';
944
        // As long as it's not the case that the next two chars are ? and >.
945 6
        while (!('?' == $this->scanner->current() && '>' == $this->scanner->peek())) {
946 6
            $data .= $this->scanner->current();
947
948 6
            $tok = $this->scanner->next();
949 6
            if (false === $tok) {
950
                $this->parseError('Unexpected EOF in processing instruction.');
951
                $this->events->processingInstruction($procName, $data);
Security Bug introduced by
It seems like $procName defined by $this->scanner->getAsciiAlpha() on line 932 can also be of type false; however, Masterminds\HTML5\Parser...processingInstruction() does only seem to accept string, did you maybe forget to handle an error condition?

This check looks for type mismatches where the missing type is false. This is usually indicative of an error condition.

Consider the following example:

<?php

function parseDate($date)
{
    if ($date !== null) {
        return new DateTime($date);
    }

    return false;
}

This function either returns a new DateTime object or false, if there was an error. This is a typical pattern in PHP programming to show that an error has occurred without raising an exception. The calling code should check for this returned false before passing on the value to another function or method that may not be able to handle a false.
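
As a minimal sketch of the check described above, the caller can test for false with a strict comparison before handing the value to a function that only accepts strings. The names `readName()`, `describe()`, and `describeInput()` are hypothetical and exist only for this illustration.

```php
<?php

/**
 * Hypothetical reader that follows the "string or false" convention.
 *
 * @return string|false
 */
function readName(string $input)
{
    $len = strspn($input, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ');

    return $len > 0 ? substr($input, 0, $len) : false;
}

/** Hypothetical consumer that only accepts strings. */
function describe(string $name): string
{
    return 'name=' . $name;
}

function describeInput(string $input): string
{
    $name = readName($input);
    if (false === $name) {
        // Handle the error condition instead of passing false along.
        return 'no name';
    }

    return describe($name);
}
```

The strict `false === $name` comparison matters here: a loose comparison would also treat values such as `'0'` or `''` as the error case.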

952
953
                return true;
954
            }
955 6
        }
956
957 6
        $this->scanner->consume(2); // Consume the closing tag
958 6
        $this->events->processingInstruction($procName, $data);
959
960 6
        return true;
961
    }
962
963
    // ================================================================
964
    // UTILITY FUNCTIONS
965
    // ================================================================
966
967
    /**
968
     * Read from the input stream until we get to the desired sequence
969
     * or hit the end of the input stream.
970
     *
971
     * @param string $sequence
972
     *
973
     * @return string
974
     */
975 8
    protected function readUntilSequence($sequence)
976
    {
977 8
        $buffer = '';
978
979
        // Optimization for reading larger blocks faster.
980 8
        $first = substr($sequence, 0, 1);
981 8
        while (false !== $this->scanner->current()) {
982 8
            $buffer .= $this->scanner->charsUntil($first);
983
984
            // Stop as soon as we hit the stopping condition.
985 8
            if ($this->scanner->sequenceMatches($sequence, false)) {
986 8
                return $buffer;
987
            }
988 4
            $buffer .= $this->scanner->current();
989 4
            $this->scanner->consume();
990 4
        }
991
992
        // If we get here, we hit the EOF.
993 1
        $this->parseError('Unexpected EOF during text read.');
994
995 1
        return $buffer;
996
    }
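The block-skipping idea in the method above (jump to the next occurrence of the sequence's first character, then test the full sequence case-insensitively) can be sketched on a plain string. The function `readUntil()` is a hypothetical stand-in that assumes a non-empty `$sequence`; it does not use the Scanner and does not report a parse error at end of input.

```php
<?php

/**
 * Hypothetical stand-alone sketch: return the text that precedes $sequence,
 * or the whole remainder if the sequence never occurs.
 */
function readUntil(string $input, string $sequence, int $pos = 0): string
{
    $buffer = '';
    $first = $sequence[0]; // Assumes $sequence is not empty.
    $end = strlen($input);
    $seqLen = strlen($sequence);

    while ($pos < $end) {
        // Jump over everything that cannot start the sequence. As in the
        // method above, the first character is matched case-sensitively.
        $skip = strcspn($input, $first, $pos);
        $buffer .= substr($input, $pos, $skip);
        $pos += $skip;

        if ($pos >= $end) {
            break; // Reached the end without finding the sequence.
        }

        // Case-insensitive test for the full sequence at this offset.
        if (0 === substr_compare($input, $sequence, $pos, $seqLen, true)) {
            return $buffer;
        }

        // False alarm: keep the character and continue scanning.
        $buffer .= $input[$pos];
        ++$pos;
    }

    return $buffer;
}
```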
997
998
    /**
999
     * Check if upcoming chars match the given sequence.
1000
     *
1001
     * This will read the stream for the $sequence. If it's
1002
     * found, this will return true. If not, return false.
1003
     * Since this unconsumes any chars it reads, the caller
1004
     * will still need to read the next sequence, even if
1005
     * this returns true.
1006
     *
1007
     * Example: $this->scanner->sequenceMatches('</script>') will
1008
     * see if the input stream is at the start of a
1009
     * '</script>' string.
1010
     *
1011
     * @param string $sequence
1012
     * @param bool   $caseSensitive
1013
     *
1014
     * @return bool
1015
     */
1016
    protected function sequenceMatches($sequence, $caseSensitive = true)
1017
    {
1018
        @trigger_error(__METHOD__ . ' method is deprecated since version 2.4 and will be removed in 3.0. Use Scanner::sequenceMatches() instead.', E_USER_DEPRECATED);
1019
1020
        return $this->scanner->sequenceMatches($sequence, $caseSensitive);
1021
    }
1022
1023
    /**
1024
     * Send a TEXT event with the contents of the text buffer.
1025
     *
1026
     * This emits an EventHandler::text() event with the current contents of the
1027
     * temporary text buffer. (The buffer is used to group as much PCDATA
1028
     * as we can instead of emitting lots and lots of TEXT events.)
1029
     */
1030 127
    protected function flushBuffer()
1031
    {
1032 127
        if ('' === $this->text) {
1033 125
            return;
1034
        }
1035 87
        $this->events->text($this->text);
1036 87
        $this->text = '';
1037 87
    }
1038
1039
    /**
1040
     * Add text to the temporary buffer.
1041
     *
1042
     * @see flushBuffer()
1043
     *
1044
     * @param string $str
1045
     */
1046 9
    protected function buffer($str)
1047
    {
1048 9
        $this->text .= $str;
1049 9
    }
1050
1051
    /**
1052
     * Emit a parse error.
1053
     *
1054
     * A parse error always returns false because it never consumes any
1055
     * characters.
1056
     *
1057
     * @param string $msg A vsprintf() format string; any additional arguments are used as its values.
1058
     *
1059
     * @return bool Always false.
1060
     */
1061 15
    protected function parseError($msg)
1062
    {
1063 15
        $args = func_get_args();
1064
1065 15
        if (count($args) > 1) {
1066 11
            array_shift($args);
1067 11
            $msg = vsprintf($msg, $args);
1068 11
        }
1069
1070 15
        $line = $this->scanner->currentLine();
1071 15
        $col = $this->scanner->columnOffset();
1072 15
        $this->events->parseError($msg, $line, $col);
1073
1074 15
        return false;
1075
    }
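The message formatting used by parseError() can be shown in isolation. This sketch swaps `func_get_args()` for a variadic parameter, which is a different technique with the same effect; `formatMessage()` is a hypothetical helper, not a library function.

```php
<?php

/**
 * Hypothetical helper: format a message the way parseError() does,
 * treating any extra arguments as vsprintf() values.
 */
function formatMessage(string $msg, ...$args): string
{
    // Without extra arguments the message is used verbatim, so a literal
    // percent sign in it is never interpreted as a format specifier.
    return count($args) > 0 ? vsprintf($msg, $args) : $msg;
}
```

For example, `formatMessage('Expected %s, got %s', '"', '>')` fills both placeholders, while `formatMessage('Malformed DOCTYPE.')` returns the message unchanged.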
1076
1077
    /**
1078
     * Decode a character reference and return the string.
1079
     *
1080
     * If $inAttribute is set to true, a bare & will be returned as-is.
1081
     *
1082
     * @param bool $inAttribute Set to true if the text is inside of an attribute value,
1083
     *                          false otherwise.
1084
     *
1085
     * @return string
1086
     */
1087 12
    protected function decodeCharacterReference($inAttribute = false)
1088
    {
1089
        // Next char after &.
1090 12
        $tok = $this->scanner->next();
1091 12
        $start = $this->scanner->position();
1092
1093 12
        if (false === $tok) {
1094 1
            return '&';
1095
        }
1096
1097
        // These indicate not an entity. We return just
1098
        // the &.
1099 12
        if ("\t" === $tok || "\n" === $tok || "\f" === $tok || ' ' === $tok || '&' === $tok || '<' === $tok) {
1100
            // $this->scanner->next();
1101 2
            return '&';
1102
        }
1103
1104
        // Numeric entity
1105 12
        if ('#' === $tok) {
1106 2
            $tok = $this->scanner->next();
1107
1108
            // Hexadecimal encoding.
1109
            // X[0-9a-fA-F]+;
1110
            // x[0-9a-fA-F]+;
1111 2
            if ('x' === $tok || 'X' === $tok) {
1112 2
                $tok = $this->scanner->next(); // Consume x
1113
1114
                // Convert from hex code to char.
1115 2
                $hex = $this->scanner->getHex();
1116 2
                if (empty($hex)) {
1117
                    $this->parseError('Expected &#xHEX;, got &#x%s', $tok);
1118
                    // We unconsume because we don't know what parser rules might
1119
                    // be in effect for the remaining chars. For example, '&#>'
1120
                    // might result in a specific parsing rule inside of tag
1121
                    // contexts, while not inside of pcdata context.
1122
                    $this->scanner->unconsume(2);
1123
1124
                    return '&';
1125
                }
1126 2
                $entity = CharacterReference::lookupHex($hex);
1127 2
            } else {
1128
                // Decimal encoding.
1129
                // [0-9]+;
1130
                // Convert from decimal to char.
1131 1
                $numeric = $this->scanner->getNumeric();
1132 1
                if (false === $numeric) {
1133
                    $this->parseError('Expected &#DIGITS;, got &#%s', $tok);
1134
                    $this->scanner->unconsume(2);
1135
1136
                    return '&';
1137
                }
1138 1
                $entity = CharacterReference::lookupDecimal($numeric);
1139
            }
1140 12
        } elseif ('=' === $tok && $inAttribute) {
1141 1
            return '&';
1142
        } else { // String entity.
1143
            // Attempt to consume a string up to a ';'.
1144
            // [a-zA-Z0-9]+;
1145 11
            $cname = $this->scanner->getAsciiAlphaNum();
1146 11
            $entity = CharacterReference::lookupName($cname);
Security Bug introduced by
It seems like $cname defined by $this->scanner->getAsciiAlphaNum() on line 1145 can also be of type false; however, Masterminds\HTML5\Parser...Reference::lookupName() does only seem to accept string, did you maybe forget to handle an error condition?

1147
1148
            // When no entity is found, report the name of the unmatched string
1149
            // and continue, because the & is not part of an entity. The & will
1150
            // be converted to &amp; elsewhere.
1151 11
            if (null === $entity) {
1152 6
                if (!$inAttribute || '' === $cname) {
1153 5
                    $this->parseError("No match in entity table for '%s'", $cname);
1154 5
                }
1155 6
                $this->scanner->unconsume($this->scanner->position() - $start);
1156
1157 6
                return '&';
1158
            }
1159
        }
1160
1161
        // The scanner has advanced the cursor for us.
1162 9
        $tok = $this->scanner->current();
1163
1164
        // We have an entity. We're done here.
1165 9
        if (';' === $tok) {
1166 9
            $this->scanner->consume();
1167
1168 9
            return $entity;
1169
        }
1170
1171
        // If in an attribute, then failing to match ; means unconsume the
1172
        // entire string. Otherwise, failure to match is an error.
1173 1
        if ($inAttribute) {
1174
            $this->scanner->unconsume($this->scanner->position() - $start);
1175
1176
            return '&';
1177
        }
1178
1179 1
        $this->parseError('Expected &ENTITY;, got &ENTITY%s (no trailing ;) ', $tok);
1180
1181 1
        return '&' . $entity;
1182
    }
1183
}
1184
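
The two numeric forms recognized by decodeCharacterReference() (`&#xHEX;` and `&#DIGITS;`) can be sketched with PHP's built-in `html_entity_decode()` instead of the library's CharacterReference class. The function `decodeNumericReference()` is hypothetical, and unlike the tokenizer it requires the trailing semicolon.

```php
<?php

/**
 * Hypothetical sketch: decode a complete numeric character reference.
 *
 * @return string|null The decoded character, or null if $ref is not a
 *                     well-formed numeric reference.
 */
function decodeNumericReference(string $ref): ?string
{
    if (!preg_match('/^&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));$/', $ref, $m)) {
        return null;
    }

    // Group 1 holds hex digits; it is an empty string when the decimal
    // alternative (group 2) matched instead.
    $codePoint = '' !== $m[1] ? hexdec($m[1]) : (int) $m[2];

    return html_entity_decode('&#' . $codePoint . ';', ENT_QUOTES | ENT_HTML5, 'UTF-8');
}
```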