Completed
Push — master ( 30aab1...cadcfa )
by Asmir
07:00
created

Tokenizer::attributeValue()   B

Complexity

Conditions 11
Paths 11

Size

Total Lines 33

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 22
CRAP Score 11.209

Importance

Changes 0
Metric Value
dl 0
loc 33
ccs 22
cts 25
cp 0.88
rs 7.3166
c 0
b 0
f 0
cc 11
nc 11
nop 0
crap 11.209
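
The CRAP value in the table is consistent with the commonly cited Change Risk Anti-Patterns formula, CRAP = cc² × (1 − coverage)³ + cc, using cc = 11 and coverage = ccs / cts = 22 / 25 = 0.88. The sketch below assumes that this is the formula behind the report; the function name `crapScore` is hypothetical and not part of any tool's API.

```php
<?php

/**
 * Compute a CRAP score from cyclomatic complexity and line coverage.
 *
 * Assumed formula: complexity^2 * (1 - coverage)^3 + complexity.
 *
 * @param int   $complexity Cyclomatic complexity (the "cc" metric).
 * @param float $coverage   Covered fraction in the range 0..1 (the "cp" metric).
 */
function crapScore(int $complexity, float $coverage): float
{
    return $complexity ** 2 * (1.0 - $coverage) ** 3 + $complexity;
}

// Values taken from the metric table: ccs = 22, cts = 25, cc = 11.
$coverage = 22 / 25;
echo round(crapScore(11, $coverage), 3), PHP_EOL; // prints 11.209
```

With full coverage the second factor vanishes and the score equals the complexity itself, so for this method the score can only drop below 11 by reducing the number of branches.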

How to fix   Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
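
As a minimal sketch of that advice, the snippet below takes a commented block and extracts it into a method named after the comment. The function names are hypothetical and only loosely modeled on the whitespace check in `attributeValue()`; this is an illustration, not a proposed patch.

```php
<?php

// Before: a comment explains what a block inside a longer function does.
function describeValueBefore(string $tok): string
{
    // Whitespace here indicates an empty value.
    if (in_array($tok, ["\n", "\f", ' ', "\t"], true)) {
        return 'empty';
    }

    return 'value';
}

// After: the commented block is extracted and the comment becomes the name.
function indicatesEmptyValue(string $tok): bool
{
    return in_array($tok, ["\n", "\f", ' ', "\t"], true);
}

function describeValueAfter(string $tok): string
{
    return indicatesEmptyValue($tok) ? 'empty' : 'value';
}

echo describeValueAfter(' '), PHP_EOL; // prints empty
echo describeValueAfter('a'), PHP_EOL; // prints value
```

The behavior is unchanged; the caller now reads as a sentence, and the extracted check can be tested on its own.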

Commonly applied refactorings include:

1
<?php
2
namespace Masterminds\HTML5\Parser;
3
4
use Masterminds\HTML5\Elements;
5
6
/**
7
 * The HTML5 tokenizer.
8
 *
9
 * The tokenizer's role is reading data from the scanner and gathering it into
10
 * semantic units. From the tokenizer, data is emitted to an event handler,
11
 * which may (for example) create a DOM tree.
12
 *
13
 * The HTML5 specification has a detailed explanation of tokenizing HTML5. We
14
 * follow that specification to the maximum extent that we can. If you find
15
 * a discrepancy that is not documented, please file a bug and/or submit a
16
 * patch.
17
 *
18
 * This tokenizer is implemented as a recursive descent parser.
19
 *
20
 * Within the API documentation, you may see references to the specific section
21
 * of the HTML5 spec that the code attempts to reproduce. Example: 8.2.4.1.
22
 * This refers to section 8.2.4.1 of the HTML5 CR specification.
23
 *
24
 * @see http://www.w3.org/TR/2012/CR-html5-20121217/
25
 */
26
class Tokenizer
27
{
28
29
    protected $scanner;
30
31
    protected $events;
32
33
    protected $tok;
34
35
    /**
36
     * Buffer for text.
37
     */
38
    protected $text = '';
39
40
    // When this goes to false, the parser stops.
41
    protected $carryOn = true;
42
43
    protected $textMode = 0; // TEXTMODE_NORMAL;
44
    protected $untilTag = null;
45
46
    const CONFORMANT_XML = 'xml';
47
    const CONFORMANT_HTML = 'html';
48
    protected $mode = self::CONFORMANT_HTML;
49
50
    const WHITE = "\t\n\f ";
51
52
    /**
53
     * Create a new tokenizer.
54
     *
55
     * Typically, parsing a document involves creating a new tokenizer, giving
56
     * it a scanner (input) and an event handler (output), and then calling
57
     * the Tokenizer::parse() method.
58
     *
59
     * @param \Masterminds\HTML5\Parser\Scanner $scanner
60
     *            A scanner initialized with an input stream.
61
     * @param \Masterminds\HTML5\Parser\EventHandler $eventHandler
62
     *            An event handler, initialized and ready to receive
63
     *            events.
64
     * @param string $mode
65
     */
66 127
    public function __construct($scanner, $eventHandler, $mode = self::CONFORMANT_HTML)
67
    {
68 127
        $this->scanner = $scanner;
69 127
        $this->events = $eventHandler;
70 127
        $this->mode = $mode;
71 127
    }
72
73
    /**
74
     * Begin parsing.
75
     *
76
     * This will begin scanning the document, tokenizing as it goes.
77
     * Tokens are emitted into the event handler.
78
     *
79
     * Tokenizing will continue until the document is completely
80
     * read. Errors are emitted into the event handler, but
81
     * the parser will attempt to continue parsing until the
82
     * entire input stream is read.
83
     */
84 127
    public function parse()
85
    {
86
        do {
87 127
            $this->consumeData();
88
            // FIXME: Add infinite loop protection.
89 127
        } while ($this->carryOn);
90 127
    }
91
92
    /**
93
     * Set the text mode for the character data reader.
94
     *
95
     * HTML5 defines three different modes for reading text:
96
     * - Normal: Read until a tag is encountered.
97
     * - RCDATA: Read until a tag is encountered, but skip a few otherwise-
98
     * special characters.
99
     * - Raw: Read until a special closing tag is encountered (viz. pre, script)
100
     *
101
     * This allows those modes to be set.
102
     *
103
     * Normally, setting is done by the event handler via a special return code on
104
     * startTag(), but it can also be set manually using this function.
105
     *
106
     * @param integer $textmode
107
     *            One of Elements::TEXT_*
108
     * @param string $untilTag
109
     *            The tag that should stop RAW or RCDATA mode. Normal mode does not
110
     *            use this indicator.
111
     */
112 106
    public function setTextMode($textmode, $untilTag = null)
113
    {
114 106
        $this->textMode = $textmode & (Elements::TEXT_RAW | Elements::TEXT_RCDATA);
115 106
        $this->untilTag = $untilTag;
116 106
    }
117
118
    /**
119
     * Consume a character and make a move.
120
     * HTML5 8.2.4.1
121
     */
122 127
    protected function consumeData()
123
    {
124
        // Character reference
125 127
        $this->characterReference();
126
127 127
        $tok = $this->scanner->current();
128
129
        // Parse tag
130 127
        if ($tok === '<') {
131
            // Any buffered text data can go out now.
132 123
            $this->flushBuffer();
133
134 123
            $tok = $this->scanner->next();
135
136 123
            $this->markupDeclaration($tok)
137 120
                || $this->endTag()
138 120
                || $this->processingInstruction()
139 119
                || $this->tagName()
140
                // This always returns false.
141 114
                || $this->parseError("Illegal tag opening")
142 1
                || $this->characterData();
143
144 123
            $tok = $this->scanner->current();
145 123
        }
146
147
        // Handle end of document
148 127
        $this->eof($tok);
149
150
        // Parse character
151 127
        if ($tok !== false) {
152 112
            switch ($this->textMode) {
153 112
                case Elements::TEXT_RAW:
154 8
                    $this->rawText($tok);
155 8
                    break;
156
157 112
                case Elements::TEXT_RCDATA:
158 37
                    $this->rcdata($tok);
159 37
                    break;
160
161 111
                default:
162 111
                    if (!strspn($tok, "<&")) {
163
                        // NULL character
164 87
                        if ($tok === "\00") {
165
                            $this->parseError("Received null character.");
166
                        }
167
168 87
                        $this->text .= $tok;
169 87
                        $this->scanner->next();
170 87
                    }
171 112
            }
172 112
        }
173
174 127
        return $this->carryOn;
175
    }
176
177
    /**
178
     * Parse anything that looks like character data.
179
     *
180
     * Different rules apply based on the current text mode.
181
     *
182
     * @see Elements::TEXT_RAW Elements::TEXT_RCDATA.
183
     */
184 1
    protected function characterData()
185
    {
186 1
        $tok = $this->scanner->current();
187 1
        if ($tok === false) {
188
            return false;
189
        }
190 1
        switch ($this->textMode) {
191 1
            case Elements::TEXT_RAW:
192
                return $this->rawText($tok);
193 1
            case Elements::TEXT_RCDATA:
194
                return $this->rcdata($tok);
195 1
            default:
196 1
                if (strspn($tok, "<&")) {
197
                    return false;
198
                }
199 1
                return $this->text($tok);
200 1
        }
201
    }
202
203
    /**
204
     * This buffers the current token as character data.
205
     *
206
     * @param string $tok The current token.
207
     *
208
     * @return bool
209
     */
210 1
    protected function text($tok)
211
    {
212
        // This should never happen...
213 1
        if ($tok === false) {
214
            return false;
215
        }
216
217
        // NULL character
218 1
        if ($tok === "\00") {
219
            $this->parseError("Received null character.");
220
        }
221
222 1
        $this->buffer($tok);
223 1
        $this->scanner->next();
224
225 1
        return true;
226
    }
227
228
    /**
229
     * Read text in RAW mode.
230
     *
231
     * @param string $tok The current token.
232
     *
233
     * @return bool
234
     */
235 8
    protected function rawText($tok)
236
    {
237 8
        if (is_null($this->untilTag)) {
238
            return $this->text($tok);
239
        }
240
241 8
        $sequence = '</' . $this->untilTag . '>';
242 8
        $txt = $this->readUntilSequence($sequence);
243 8
        $this->events->text($txt);
244 8
        $this->setTextMode(0);
245
246 8
        return $this->endTag();
247
    }
248
249
    /**
250
     * Read text in RCDATA mode.
251
     *
252
     * @param string $tok The current token.
253
     *
254
     * @return bool
255
     */
256 37
    protected function rcdata($tok)
257
    {
258 37
        if (is_null($this->untilTag)) {
259
            return $this->text($tok);
260
        }
261
262 37
        $sequence = '</' . $this->untilTag;
263 37
        $txt = '';
264
265 37
        $caseSensitive = !Elements::isHtml5Element($this->untilTag);
266 37
        while ($tok !== false && ! ($tok == '<' && ($this->scanner->sequenceMatches($sequence, $caseSensitive)))) {
267 35
            if ($tok == '&') {
268 1
                $txt .= $this->decodeCharacterReference();
269 1
                $tok = $this->scanner->current();
270 1
            } else {
271 35
                $txt .= $tok;
272 35
                $tok = $this->scanner->next();
273
            }
274 35
        }
275 37
        $len = strlen($sequence);
276 37
        $this->scanner->consume($len);
277 37
        $len += strlen($this->scanner->whitespace());
278 37
        if ($this->scanner->current() !== '>') {
279
            $this->parseError("Unclosed RCDATA end tag");
280
        }
281
282 37
        $this->scanner->unconsume($len);
283 37
        $this->events->text($txt);
284 37
        $this->setTextMode(0);
285
286 37
        return $this->endTag();
287
    }
288
289
    /**
290
     * If the document is read, emit an EOF event.
291
     */
292 127
    protected function eof($tok)
293
    {
294 127
        if ($tok === false) {
295
            // fprintf(STDOUT, "EOF");
296 127
            $this->flushBuffer();
297 127
            $this->events->eof();
298 127
            $this->carryOn = false;
299
300 127
            return true;
301
        }
302
303 112
        return false;
304
    }
305
306
    /**
307
     * Handle character references (aka entities).
308
     *
309
     * This version is specific to PCDATA, as it buffers data into the
310
     * text buffer. For a generic version, see decodeCharacterReference().
311
     *
312
     * HTML5 8.2.4.2
313
     */
314 127
    protected function characterReference()
315
    {
316 127
        if ($this->scanner->current() !== '&') {
317 127
            return false;
318
        }
319
320 8
        $ref = $this->decodeCharacterReference();
321 8
        $this->buffer($ref);
322 8
        return true;
323
    }
324
325
    /**
326
     * Look for markup.
327
     */
328 123
    protected function markupDeclaration($tok)
329
    {
330 123
        if ($tok != '!') {
331 120
            return false;
332
        }
333
334 101
        $tok = $this->scanner->next();
335
336
        // Comment:
337 101
        if ($tok == '-' && $this->scanner->peek() == '-') {
338 6
            $this->scanner->next(); // Consume the other '-'
339 6
            $this->scanner->next(); // Next char.
340 6
            return $this->comment();
341
        }
342
343 98
        elseif ($tok == 'D' || $tok == 'd') { // Doctype
344 96
            return $this->doctype();
345
        }
346
347 7
        elseif ($tok == '[') { // CDATA section
348 7
            return $this->cdataSection();
349
        }
350
351
        // FINISH
352 1
        $this->parseError("Expected <!--, <![CDATA[, or <!DOCTYPE. Got <!%s", $tok);
353 1
        $this->bogusComment('<!');
354 1
        return true;
355
    }
356
357
    /**
358
     * Consume an end tag.
359
     * 8.2.4.9
360
     */
361 120
    protected function endTag()
362
    {
363 120
        if ($this->scanner->current() != '/') {
364 119
            return false;
365
        }
366 111
        $tok = $this->scanner->next();
367
368
        // a-zA-Z -> tagname
369
        // > -> parse error
370
        // EOF -> parse error
371
        // -> parse error
372 111
        if (! ctype_alpha($tok)) {
373 2
            $this->parseError("Expected tag name, got '%s'", $tok);
374 2
            if ($tok == "\0" || $tok === false) {
375
                return false;
376
            }
377 2
            return $this->bogusComment('</');
378
        }
379
380 110
        $name = $this->scanner->charsUntil("\n\f \t>");
381 110
        $name = $this->mode === self::CONFORMANT_XML ? $name: strtolower($name);
382
        // Trash whitespace.
383 110
        $this->scanner->whitespace();
384
385 110
        $tok = $this->scanner->current();
386 110
        if ($tok != '>') {
387 1
            $this->parseError("Expected >, got '%s'", $tok);
388
            // We just trash stuff until we get to the next tag close.
389 1
            $this->scanner->charsUntil('>');
390 1
        }
391
392 110
        $this->events->endTag($name);
393 110
        $this->scanner->next();
394 110
        return true;
395
    }
396
397
    /**
398
     * Consume a tag name and body.
399
     * 8.2.4.10
400
     */
401 114
    protected function tagName()
402
    {
403 114
        $tok = $this->scanner->current();
404 114
        if (! ctype_alpha($tok)) {
405 1
            return false;
406
        }
407
408
        // We know this is at least one char.
409 114
        $name = $this->scanner->charsWhile(":_-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
410 114
        $name = $this->mode === self::CONFORMANT_XML ? $name : strtolower($name);
411 114
        $attributes = array();
412 114
        $selfClose = false;
413
414
        // Handle attribute parse exceptions here so that we can
415
        // react by trying to build a sensible parse tree.
416
        try {
417
            do {
418 114
                $this->scanner->whitespace();
419 114
                $this->attribute($attributes);
420 114
            } while (! $this->isTagEnd($selfClose));
421 114
        } catch (ParseError $e) {
422 2
            $selfClose = false;
423
        }
424
425 114
        $mode = $this->events->startTag($name, $attributes, $selfClose);
426
427 114
        if (is_int($mode)) {
428 105
            $this->setTextMode($mode, $name);
429 105
        }
430
431 114
        $this->scanner->next();
432
433 114
        return true;
434
    }
435
436
    /**
437
     * Check if the scanner has reached the end of a tag.
438
     */
439 114
    protected function isTagEnd(&$selfClose)
440
    {
441 114
        $tok = $this->scanner->current();
442 114
        if ($tok == '/') {
443 15
            $this->scanner->next();
444 15
            $this->scanner->whitespace();
445 15
            $tok = $this->scanner->current();
446
447 15
            if ($tok == '>') {
448 15
                $selfClose = true;
449 15
                return true;
450
            }
451 2
            if ($tok === false) {
452 1
                $this->parseError("Unexpected EOF inside of tag.");
453 1
                return true;
454
            }
455
            // Basically, we skip the / token and go on.
456
            // See 8.2.4.43.
457 1
            $this->parseError("Unexpected '%s' inside of a tag.", $tok);
458 1
            return false;
459
        }
460
461 114
        if ($tok == '>') {
462 114
            return true;
463
        }
464 32
        if ($tok === false) {
465 2
            $this->parseError("Unexpected EOF inside of tag.");
466 2
            return true;
467
        }
468
469 31
        return false;
470
    }
471
472
    /**
473
     * Parse attributes from inside of a tag.
474
     *
475
     * @param string[] $attributes
476
     *
477
     * @return bool
478
     *
479
     * @throws ParseError
480
     */
481 114
    protected function attribute(&$attributes)
482
    {
483 114
        $tok = $this->scanner->current();
484 114
        if ($tok == '/' || $tok == '>' || $tok === false) {
485 108
            return false;
486
        }
487
488 82
        if ($tok == '<') {
489 2
            $this->parseError("Unexpected '<' inside of attributes list.");
490
            // Push the < back onto the stack.
491 2
            $this->scanner->unconsume();
492
            // Let the caller figure out how to handle this.
493 2
            throw new ParseError("Start tag inside of attribute.");
494
        }
495
496 82
        $name = strtolower($this->scanner->charsUntil("/>=\n\f\t "));
497
498 82
        if (strlen($name) == 0) {
499 3
            $tok = $this->scanner->current();
500 3
            $this->parseError("Expected an attribute name, got %s.", $tok);
501
            // Really, only '=' can be the char here. Everything else gets absorbed
502
            // under one rule or another.
503 3
            $name = $tok;
504 3
            $this->scanner->next();
505 3
        }
506
507 82
        $isValidAttribute = true;
508
        // Attribute names can contain most Unicode characters for HTML5.
509
        // But method "DOMElement::setAttribute" is throwing exception
510
        // because of its own internal restriction so these have to be filtered.
511
        // see issue #23: https://github.com/Masterminds/html5-php/issues/23
512
        // and http://www.w3.org/TR/2011/WD-html5-20110525/syntax.html#syntax-attribute-name
513 82
        if (preg_match("/[\x1-\x2C\\/\x3B-\x40\x5B-\x5E\x60\x7B-\x7F]/u", $name)) {
514 4
            $this->parseError("Unexpected characters in attribute name: %s", $name);
515 4
            $isValidAttribute = false;
516 4
        }         // There is no limitation for 1st character in HTML5.
517
        // But method "DOMElement::setAttribute" is throwing exception for the
518
        // characters below so they have to be filtered.
519
        // see issue #23: https://github.com/Masterminds/html5-php/issues/23
520
        // and http://www.w3.org/TR/2011/WD-html5-20110525/syntax.html#syntax-attribute-name
521
        else
522 79
            if (preg_match("/^[0-9.-]/u", $name)) {
523 1
                $this->parseError("Unexpected character at the beginning of attribute name: %s", $name);
524 1
                $isValidAttribute = false;
525 1
            }
526
        // 8.1.2.3
527 82
        $this->scanner->whitespace();
528
529 82
        $val = $this->attributeValue();
530 82
        if ($isValidAttribute) {
531 79
            $attributes[$name] = $val;
532 79
        }
533 82
        return true;
534
    }
535
536
    /**
537
     * Consume an attribute value.
538
     * 8.2.4.37 and after.
539
     *
540
     * @return string|null
541
     */
542 82
    protected function attributeValue()
543
    {
544 82
        if ($this->scanner->current() != '=') {
545 13
            return null;
546
        }
547 78
        $this->scanner->next();
548
        // 8.1.2.3
549 78
        $this->scanner->whitespace();
550
551 78
        $tok = $this->scanner->current();
552
        switch ($tok) {
553 78
            case "\n":
554 78
            case "\f":
555 78
            case " ":
556 78
            case "\t":
557
                // Whitespace here indicates an empty value.
558
                return null;
559 78
            case '"':
560 78
            case "'":
561 78
                $this->scanner->next();
562 78
                return $this->quotedAttributeValue($tok);
Security Bug
It seems like $tok defined by $this->scanner->current() on line 551 can also be of type false; however, Masterminds\HTML5\Parser...:quotedAttributeValue() does only seem to accept string, did you maybe forget to handle an error condition?

This check looks for type mismatches where the missing type is false. This is usually indicative of an error condition.

Consider the following example:

<?php

function parseDate($date)
{
    if ($date !== null) {
        return new DateTime($date);
    }

    return false;
}

This function either returns a new DateTime object or false, if there was an error. This is a typical pattern in PHP programming to show that an error has occurred without raising an exception. The calling code should check for this returned false before passing on the value to another function or method that may not be able to handle a false.
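
A minimal sketch of the caller-side check described above. The helper names `parseDateOrFalse` and `formatYear` are hypothetical and chosen only for illustration.

```php
<?php

// Hypothetical helper following the pattern described above:
// returns a DateTime on success, or false when no date was given.
function parseDateOrFalse(?string $date)
{
    if ($date !== null) {
        return new DateTime($date);
    }

    return false;
}

// Accepts only a DateTime; passing false here would raise a TypeError.
function formatYear(DateTime $date): string
{
    return $date->format('Y');
}

$result = parseDateOrFalse(null);

// Check for the false return before handing the value on.
if ($result === false) {
    echo 'no date', PHP_EOL;
} else {
    echo formatYear($result), PHP_EOL;
}
```

The strict comparison `=== false` matters: a loose check such as `!$result` would also reject other falsy values that a function might legitimately return.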

563 1
            case '>':
564
                // case '/': // 8.2.4.37 seems to allow foo=/ as a valid attr.
565 1
                $this->parseError("Expected attribute value, got tag end.");
566 1
                return null;
567 1
            case '=':
568 1
            case '`':
569
                $this->parseError("Expecting quotes, got %s.", $tok);
570
                return $this->unquotedAttributeValue();
571 1
            default:
572 1
                return $this->unquotedAttributeValue();
573 1
        }
574
    }
575
576
    /**
577
     * Get an attribute value string.
578
     *
579
     * @param string $quote
580
     *            IMPORTANT: This is a series of chars! Any one of which will be considered
581
     *            termination of an attribute's value. E.g. "\"'" will stop at either
582
     *            ' or ".
583
     * @return string The attribute value.
584
     */
585 78
    protected function quotedAttributeValue($quote)
586
    {
587 78
        $stoplist = "\f" . $quote;
588 78
        $val = '';
589
590 78
        while (true) {
591 78
            $tokens = $this->scanner->charsUntil($stoplist.'&');
592 78
            if ($tokens !== false) {
593 78
                $val .= $tokens;
594 78
            } else {
595
                break;
596
            }
597
598 78
            $tok = $this->scanner->current();
599 78
            if ($tok == '&') {
600 3
                $val .= $this->decodeCharacterReference(true);
601 3
                continue;
602
            }
603 78
            break;
604
        }
605 78
        $this->scanner->next();
606 78
        return $val;
607
    }
608
609 1
    protected function unquotedAttributeValue()
610
    {
611 1
        $stoplist = "\t\n\f >";
612 1
        $val = '';
613 1
        $tok = $this->scanner->current();
614 1
        while (strspn($tok, $stoplist) == 0 && $tok !== false) {
615 1
            if ($tok == '&') {
616 1
                $val .= $this->decodeCharacterReference(true);
617 1
                $tok = $this->scanner->current();
618 1
            } else {
619 1
                if (strspn($tok, "\"'<=`") > 0) {
620 1
                    $this->parseError("Unexpected chars in unquoted attribute value %s", $tok);
621 1
                }
622 1
                $val .= $tok;
623 1
                $tok = $this->scanner->next();
624
            }
625 1
        }
626 1
        return $val;
627
    }
628
629
    /**
630
     * Consume malformed markup as if it were a comment.
631
     * 8.2.4.44
632
     *
633
     * The spec requires that the ENTIRE tag-like thing be enclosed inside of
634
     * the comment. So this will generate comments like:
635
     *
636
     * &lt;!--&lt;/+foo&gt;--&gt;
637
     *
638
     * @param string $leading
639
     *            Prepend any leading characters. This essentially
640
     *            negates the need to backtrack, but it's sort of
641
     *            a hack.
642
     *
643
     * @return bool
644
     */
645 3
    protected function bogusComment($leading = '')
646
    {
647 3
        $comment = $leading;
648 3
        $tokens = $this->scanner->charsUntil('>');
649 3
        if ($tokens !== false) {
650 2
            $comment .= $tokens;
651 2
        }
652 3
        $tok = $this->scanner->current();
653 3
        if ($tok !== false) {
654 2
            $comment .= $tok;
655 2
        }
656
657 3
        $this->flushBuffer();
658 3
        $this->events->comment($comment);
659 3
        $this->scanner->next();
660
661 3
        return true;
662
    }
663
664
    /**
665
     * Read a comment.
666
     *
667
     * Expects the first tok to be inside of the comment.
668
     *
669
     * @return bool
670
     */
671 6
    protected function comment()
672
    {
673 6
        $tok = $this->scanner->current();
674 6
        $comment = '';
675
676
        // <!-->. Emit an empty comment because 8.2.4.46 says to.
677 6
        if ($tok == '>') {
678
            // Parse error. Emit the comment token.
679 1
            $this->parseError("Expected comment data, got '>'");
680 1
            $this->events->comment('');
681 1
            $this->scanner->next();
682 1
            return true;
683
        }
684
685
        // Replace NULL with the replacement char.
686 6
        if ($tok == "\0") {
687
            $tok = UTF8Utils::FFFD;
688
        }
689 6
        while (! $this->isCommentEnd()) {
690 6
            $comment .= $tok;
691 6
            $tok = $this->scanner->next();
692 6
        }
693
694 6
        $this->events->comment($comment);
695 6
        $this->scanner->next();
696 6
        return true;
697
    }
698
699
    /**
700
     * Check if the scanner has reached the end of a comment.
701
     *
702
     * @return bool
703
     */
704 6
    protected function isCommentEnd()
705
    {
706 6
        $tok = $this->scanner->current();
707
708
        // EOF
709 6
        if ($tok === false) {
710
            // Hit the end.
711 1
            $this->parseError("Unexpected EOF in a comment.");
712 1
            return true;
713
        }
714
715
        // If it doesn't start with -, not the end.
716 6
        if ($tok != '-') {
717 6
            return false;
718
        }
719
720
        // Advance one, and test for '->'
721 6
        if ($this->scanner->next() == '-' && $this->scanner->peek() == '>') {
722 6
            $this->scanner->next(); // Consume the last '>'
723 6
            return true;
724
        }
725
        // Unread '-';
726 2
        $this->scanner->unconsume(1);
727 2
        return false;
728
    }
729
730
    /**
731
     * Parse a DOCTYPE.
732
     *
733
     * Parse a DOCTYPE declaration. This method has strong bearing on whether or
734
     * not Quirksmode is enabled on the event handler.
735
     *
736
     * @todo This method is a little long. Should probably refactor.
737
     *
738
     * @return bool
739
     */
740 96
    protected function doctype()
741
    {
742 96
        if (strcasecmp($this->scanner->current(), 'D')) {
743
            return false;
744
        }
745
        // Check that string is DOCTYPE.
746 96
        $chars = $this->scanner->charsWhile("DOCTYPEdoctype");
747 96
        if (strcasecmp($chars, 'DOCTYPE')) {
748 1
            $this->parseError('Expected DOCTYPE, got %s', $chars);
749 1
            return $this->bogusComment('<!' . $chars);
750
        }
751
752 95
        $this->scanner->whitespace();
753 95
        $tok = $this->scanner->current();
754
755
        // EOF: die.
756 95
        if ($tok === false) {
757
            $this->events->doctype('html5', EventHandler::DOCTYPE_NONE, '', true);
758
            return $this->eof($tok);
759
        }
760
761
        // NULL char: convert.
762 95
        if ($tok === "\0") {
763
            $this->parseError("Unexpected null character in DOCTYPE.");
764
        }
765
766 95
        $stop = " \n\f>";
767 95
        $doctypeName = $this->scanner->charsUntil($stop);
768
        // Lowercase ASCII, replace \0 with FFFD
769 95
        $doctypeName = strtolower(strtr($doctypeName, array("\0" => UTF8Utils::FFFD)));
Security Bug
It seems like $doctypeName can also be of type false; however, strtr() does only seem to accept string, did you maybe forget to handle an error condition?
770
771 95
        $tok = $this->scanner->current();
772
773
        // If false, emit a parse error, DOCTYPE, and return.
774 95
        if ($tok === false) {
775 1
            $this->parseError('Unexpected EOF in DOCTYPE declaration.');
776 1
            $this->events->doctype($doctypeName, EventHandler::DOCTYPE_NONE, null, true);
777 1
            return true;
778
        }
779
780
        // Short DOCTYPE, like <!DOCTYPE html>
781 95
        if ($tok == '>') {
782
            // DOCTYPE without a name.
783 95
            if (strlen($doctypeName) == 0) {
784 1
                $this->parseError("Expected a DOCTYPE name. Got nothing.");
785 1
                $this->events->doctype($doctypeName, 0, null, true);
786 1
                $this->scanner->next();
787 1
                return true;
788
            }
789 95
            $this->events->doctype($doctypeName);
790 95
            $this->scanner->next();
791 95
            return true;
792
        }
793 1
        $this->scanner->whitespace();
794
795 1
        $pub = strtoupper($this->scanner->getAsciiAlpha());
796 1
        $white = strlen($this->scanner->whitespace());
797
798
        // Get ID, and flag it as pub or system.
799 1
        if (($pub == 'PUBLIC' || $pub == 'SYSTEM') && $white > 0) {
800
            // Get the sys ID.
801 1
            $type = $pub == 'PUBLIC' ? EventHandler::DOCTYPE_PUBLIC : EventHandler::DOCTYPE_SYSTEM;
802 1
            $id = $this->quotedString("\0>");
803 1
            if ($id === false) {
804
                $this->events->doctype($doctypeName, $type, $pub, false);
805
                return false;
806
            }
807
808
            // Premature EOF.
809 1
            if ($this->scanner->current() === false) {
810 1
                $this->parseError("Unexpected EOF in DOCTYPE");
811 1
                $this->events->doctype($doctypeName, $type, $id, true);
812 1
                return true;
813
            }
814
815
            // Well-formed complete DOCTYPE.
816 1
            $this->scanner->whitespace();
817 1
            if ($this->scanner->current() == '>') {
818 1
                $this->events->doctype($doctypeName, $type, $id, false);
819 1
                $this->scanner->next();
820 1
                return true;
821
            }
822
823
            // If we get here, we have <!DOCTYPE foo PUBLIC "bar" SOME_JUNK
824
            // Throw away the junk, parse error, quirks mode, return true.
825 1
            $this->scanner->charsUntil(">");
826 1
            $this->parseError("Malformed DOCTYPE.");
827 1
            $this->events->doctype($doctypeName, $type, $id, true);
828 1
            $this->scanner->next();
829 1
            return true;
830
        }
831
832
        // Else it's a bogus DOCTYPE.
833
        // Consume to > and trash.
834 1
        $this->scanner->charsUntil('>');
835
836 1
        $this->parseError("Expected PUBLIC or SYSTEM. Got %s.", $pub);
837 1
        $this->events->doctype($doctypeName, 0, null, true);
838 1
        $this->scanner->next();
839 1
        return true;
840
    }
841
842
    /**
843
     * Utility for reading a quoted string.
844
     *
845
     * @param string $stopchars
846
     *            Characters (in addition to a close-quote) that should stop the string.
847
     *            E.g. sometimes '>' is higher precedence than '"' or "'".
848
     *
849
     * @return mixed String if one is found (quotations omitted)
850
     */
851 1
    protected function quotedString($stopchars)
852
    {
853 1
        $tok = $this->scanner->current();
854 1
        if ($tok == '"' || $tok == "'") {
855 1
            $this->scanner->next();
856 1
            $ret = $this->scanner->charsUntil($tok . $stopchars);
857 1
            if ($this->scanner->current() == $tok) {
858 1
                $this->scanner->next();
859 1
            } else {
860
                // Parse error because no close quote.
861
                $this->parseError("Expected %s, got %s", $tok, $this->scanner->current());
862
            }
863 1
            return $ret;
864
        }
865
        return false;
866
    }
867
868
    /**
869
     * Handle a CDATA section.
870
     *
871
     * @return bool
872
     */
873 7
    protected function cdataSection()
874
    {
875 7
        if ($this->scanner->current() != '[') {
876
            return false;
877
        }
878 7
        $cdata = '';
879 7
        $this->scanner->next();
880
881 7
        $chars = $this->scanner->charsWhile('CDAT');
882 7
        if ($chars != 'CDATA' || $this->scanner->current() != '[') {
883 1
            $this->parseError('Expected [CDATA[, got %s', $chars);
884 1
            return $this->bogusComment('<![' . $chars);
885
        }
886
887 7
        $tok = $this->scanner->next();
888
        do {
889 7
            if ($tok === false) {
890 2
                $this->parseError('Unexpected EOF inside CDATA.');
891 2
                $this->bogusComment('<![CDATA[' . $cdata);
892 2
                return true;
893
            }
894 7
            $cdata .= $tok;
895 7
            $tok = $this->scanner->next();
896 7
        } while (! $this->scanner->sequenceMatches(']]>'));
897
898
        // Consume ]]>
899 5
        $this->scanner->consume(3);
900
901 5
        $this->events->cdata($cdata);
902 5
        return true;
903
    }
904
905
    // ================================================================
906
    // Non-HTML5
907
    // ================================================================
908
    /**
909
     * Handle a processing instruction.
910
     *
911
     * XML processing instructions are supposed to be ignored in HTML5,
912
     * treated as "bogus comments". However, since we're not a user
913
     * agent, we allow them. We consume until ?> and then issue an
914
     * EventHandler::processingInstruction() event.
915
     *
916
     * @return bool
917
     */
918 119
    protected function processingInstruction()
919
    {
920 119
        if ($this->scanner->current() != '?') {
921 114
            return false;
922
        }
923
924 7
        $tok = $this->scanner->next();
925 7
        $procName = $this->scanner->getAsciiAlpha();
926 7
        $white = strlen($this->scanner->whitespace());
927
928
        // If not a PI, send to bogusComment.
929 7
        if (strlen($procName) == 0 || $white == 0 || $this->scanner->current() === false) {
930 1
            $this->parseError("Expected processing instruction name, got $tok");
931 1
            $this->bogusComment('<?' . $tok . $procName);
932 1
            return true;
933
        }
934
935 6
        $data = '';
936
        // As long as it's not the case that the next two chars are ? and >.
937 6
        while (! ($this->scanner->current() == '?' && $this->scanner->peek() == '>')) {
938 6
            $data .= $this->scanner->current();
939
940 6
            $tok = $this->scanner->next();
941 6
            if ($tok === false) {
942
                $this->parseError("Unexpected EOF in processing instruction.");
943
                $this->events->processingInstruction($procName, $data);
Security Bug
It seems like $procName, defined by $this->scanner->getAsciiAlpha() on line 925, can also be of type false; however, Masterminds\HTML5\Parser...processingInstruction() only seems to accept string. Did you maybe forget to handle an error condition?

This check looks for type mismatches where the missing type is false. This is usually indicative of an error condition.

Consider the following example:

<?php

function createDate($date)
{
    if ($date !== null) {
        return new DateTime($date);
    }

    return false;
}

This function either returns a new DateTime object or false, if there was an error. This is a typical pattern in PHP programming to show that an error has occurred without raising an exception. The calling code should check for this returned false before passing on the value to another function or method that may not be able to handle a false.
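
As a minimal sketch of that advice, the caller below checks for false with a strict comparison before using the value. The names parseDateOrFalse() and formatDate() and the 'unknown' fallback are hypothetical, chosen only for illustration; they are not part of the library under review.

```php
<?php

/**
 * Hypothetical helper (not part of the library): returns a DateTime,
 * or false when no date string is supplied.
 */
function parseDateOrFalse($date)
{
    if ($date !== null) {
        return new DateTime($date);
    }

    return false;
}

/**
 * Hypothetical caller: checks for false with a strict comparison before
 * using the value, and falls back to a default string otherwise.
 */
function formatDate($date, $default = 'unknown')
{
    $parsed = parseDateOrFalse($date);
    if ($parsed === false) {
        // Handle the error condition here instead of passing false along.
        return $default;
    }

    return $parsed->format('Y-m-d');
}

echo formatDate('2020-01-02'), "\n";
echo formatDate(null), "\n";
```

The strict comparison (=== false) matters here: a loose comparison (== false) would also match values such as '0' or an empty string, which may be valid results.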

944
                return true;
945
            }
946 6
        }
947
948 6
        $this->scanner->next(); // >
949 6
        $this->scanner->next(); // Next token.
950 6
        $this->events->processingInstruction($procName, $data);
Security Bug
It seems like $procName, defined by $this->scanner->getAsciiAlpha() on line 925, can also be of type false; however, Masterminds\HTML5\Parser...processingInstruction() only seems to accept string. Did you maybe forget to handle an error condition?

951 6
        return true;
952
    }
953
954
    // ================================================================
955
    // UTILITY FUNCTIONS
956
    // ================================================================
957
958
    /**
959
     * Read from the input stream until we get to the desired sequence
960
     * or hit the end of the input stream.
961
     *
962
     * @param string $sequence
963
     *
964
     * @return string
965
     */
966 8
    protected function readUntilSequence($sequence)
967
    {
968 8
        $buffer = '';
969
970
        // Optimization for reading larger blocks faster.
971 8
        $first = substr($sequence, 0, 1);
972 8
        while ($this->scanner->current() !== false) {
973 8
            $buffer .= $this->scanner->charsUntil($first);
974
975
            // Stop as soon as we hit the stopping condition.
976 8
            if ($this->scanner->sequenceMatches($sequence, false)) {
977 8
                return $buffer;
978
            }
979 4
            $buffer .= $this->scanner->current();
980 4
            $this->scanner->next();
981 4
        }
982
983
        // If we get here, we hit the EOF.
984 1
        $this->parseError("Unexpected EOF during text read.");
985 1
        return $buffer;
986
    }
987
988
    /**
989
     * Check if upcoming chars match the given sequence.
990
     *
991
     * This will read the stream for the $sequence. If it's
992
     * found, this will return true. If not, return false.
993
     * Since this unconsumes any chars it reads, the caller
994
     * will still need to read the next sequence, even if
995
     * this returns true.
996
     *
997
     * Example: $this->scanner->sequenceMatches('</script>') will
998
     * see if the input stream is at the start of a
999
     * '</script>' string.
1000
     *
1001
     * @param string $sequence
1002
     * @param bool $caseSensitive
1003
     *
1004
     * @return bool
1005
     */
1006
    protected function sequenceMatches($sequence, $caseSensitive = true)
1007
    {
1008
        @trigger_error(__METHOD__ . ' method is deprecated since version 2.4 and will be removed in 3.0. Use Scanner::sequenceMatches() instead.', E_USER_DEPRECATED);
1009
1010
        return $this->scanner->sequenceMatches($sequence, $caseSensitive);
1011
    }
1012
1013
    /**
1014
     * Send a TEXT event with the contents of the text buffer.
1015
     *
1016
     * This emits an EventHandler::text() event with the current contents of the
1017
     * temporary text buffer. (The buffer is used to group as much PCDATA
1018
     * as we can instead of emitting lots and lots of TEXT events.)
1019
     */
1020 127
    protected function flushBuffer()
1021
    {
1022 127
        if ($this->text === '') {
1023 125
            return;
1024
        }
1025 87
        $this->events->text($this->text);
1026 87
        $this->text = '';
1027 87
    }
1028
1029
    /**
1030
     * Add text to the temporary buffer.
1031
     *
1032
     * @see flushBuffer()
1033
     *
1034
     * @param string $str
1035
     */
1036 9
    protected function buffer($str)
1037
    {
1038 9
        $this->text .= $str;
1039 9
    }
1040
1041
    /**
1042
     * Emit a parse error.
1043
     *
1044
     * A parse error always returns false because it never consumes any
1045
     * characters.
1046
     *
1047
     * @param string $msg
1048
     *
1049
     * @return bool
1050
     */
1051 15
    protected function parseError($msg)
1052
    {
1053 15
        $args = func_get_args();
1054
1055 15
        if (count($args) > 1) {
1056 11
            array_shift($args);
1057 11
            $msg = vsprintf($msg, $args);
1058 11
        }
1059
1060 15
        $line = $this->scanner->currentLine();
1061 15
        $col = $this->scanner->columnOffset();
1062 15
        $this->events->parseError($msg, $line, $col);
1063
1064 15
        return false;
1065
    }
1066
1067
    /**
1068
     * Decode a character reference and return the string.
1069
     *
1070
     * If $inAttribute is set to true, a bare & will be returned as-is.
1071
     *
1072
     * @param bool $inAttribute
1073
     *            Set to true if the text is inside of an attribute value.
1074
     *            false otherwise.
1075
     *
1076
     * @return string
1077
     */
1078 12
    protected function decodeCharacterReference($inAttribute = false)
1079
    {
1080
        // Next char after &.
1081 12
        $tok = $this->scanner->next();
1082 12
        $start = $this->scanner->position();
1083
1084 12
        if ($tok == false) {
1085 1
            return '&';
1086
        }
1087
1088
        // These indicate not an entity. We return just
1089
        // the &.
1090 12
        if (strspn($tok, static::WHITE . "&<") == 1) {
1091
            // $this->scanner->next();
1092 2
            return '&';
1093
        }
1094
1095
        // Numeric entity
1096 12
        if ($tok == '#') {
1097 2
            $tok = $this->scanner->next();
1098
1099
            // Hexadecimal encoding.
1100
            // X[0-9a-fA-F]+;
1101
            // x[0-9a-fA-F]+;
1102 2
            if ($tok == 'x' || $tok == 'X') {
1103 2
                $tok = $this->scanner->next(); // Consume x
1104
1105
                // Convert from hex code to char.
1106 2
                $hex = $this->scanner->getHex();
1107 2
                if (empty($hex)) {
1108
                    $this->parseError("Expected &#xHEX;, got &#x%s", $tok);
1109
                    // We unconsume because we don't know what parser rules might
1110
                    // be in effect for the remaining chars. For example, '&#>'
1111
                    // might result in a specific parsing rule inside of tag
1112
                    // contexts, while not inside of pcdata context.
1113
                    $this->scanner->unconsume(2);
1114
                    return '&';
1115
                }
1116 2
                $entity = CharacterReference::lookupHex($hex);
1117 2
            }             // Decimal encoding.
1118
            // [0-9]+;
1119
            else {
1120
                // Convert from decimal to char.
1121 1
                $numeric = $this->scanner->getNumeric();
1122 1
                if ($numeric === false) {
1123
                    $this->parseError("Expected &#DIGITS;, got &#%s", $tok);
1124
                    $this->scanner->unconsume(2);
1125
                    return '&';
1126
                }
1127 1
                $entity = CharacterReference::lookupDecimal($numeric);
1128
            }
1129 12
        } elseif ($tok === '=' && $inAttribute) {
1130 1
            return '&';
1131
        } else { // String entity.
1132
1133
            // Attempt to consume a string up to a ';'.
1134
            // [a-zA-Z0-9]+;
1135 11
            $cname = $this->scanner->getAsciiAlphaNum();
1136 11
            $entity = CharacterReference::lookupName($cname);
Security Bug
It seems like $cname, defined by $this->scanner->getAsciiAlphaNum() on line 1135, can also be of type false; however, Masterminds\HTML5\Parser...Reference::lookupName() only seems to accept string. Did you maybe forget to handle an error condition?

1137
1138
            // When no entity is found provide the name of the unmatched string
1139
            // and continue on as the & is not part of an entity. The & will
1140
            // be converted to &amp; elsewhere.
1141 11
            if ($entity == null) {
1142 6
                if (!$inAttribute || strlen($cname) === 0) {
1143 5
                    $this->parseError("No match in entity table for '%s'", $cname);
1144 5
                }
1145 6
                $this->scanner->unconsume($this->scanner->position() - $start);
1146 6
                return '&';
1147
            }
1148
        }
1149
1150
        // The scanner has advanced the cursor for us.
1151 9
        $tok = $this->scanner->current();
1152
1153
        // We have an entity. We're done here.
1154 9
        if ($tok == ';') {
1155 9
            $this->scanner->next();
1156 9
            return $entity;
1157
        }
1158
1159
        // If in an attribute, then failing to match ; means unconsume the
1160
        // entire string. Otherwise, failure to match is an error.
1161 1
        if ($inAttribute) {
1162
            $this->scanner->unconsume($this->scanner->position() - $start);
1163
            return '&';
1164
        }
1165
1166 1
        $this->parseError("Expected &ENTITY;, got &ENTITY%s (no trailing ;) ", $tok);
1167 1
        return '&' . $entity;
1168
    }
1169
}
1170