Completed
Push — master ( 398493...46f448 )
by Richard
13:15
created

HTMLPurifier_Lexer_DOMLex::createStartNode()   C

Complexity

Conditions 16
Paths 18

Size

Total Lines 57
Code Lines 35

Duplication

Lines 5
Ratio 8.77 %

Importance

Changes 0
Metric Value
cc 16
eloc 35
nc 18
nop 4
dl 5
loc 57
rs 6.5273
c 0
b 0
f 0

How to fix   Long Method    Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include:
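Extract Method is one such refactoring. As a minimal sketch, assuming the comment-stripping logic in the CDATA branch of createStartNode() is a reasonable candidate, that logic could be pulled into a helper of its own. The name stripCommentWrapper() is hypothetical and is not part of HTML Purifier; the body mirrors the behaviour of the listed code, including leaving the data untouched when no comment opener is found.

```php
<?php

/**
 * Hypothetical helper (not part of HTML Purifier), extracted from the
 * CDATA branch of createStartNode().
 *
 * Removes a leading "<!--" and, if present, a trailing "-->" from the
 * trimmed contents of a <script> or <style> element.
 */
function stripCommentWrapper($data)
{
    $trimmed = trim($data);
    if (substr($trimmed, 0, 4) !== '<!--') {
        // No comment opener: return the data exactly as it was passed in.
        return $data;
    }
    $inner = substr($trimmed, 4);
    if (substr($inner, -3) === '-->') {
        // Only strip the closer when it is actually there.
        $inner = substr($inner, 0, -3);
    }
    return $inner;
}
```

With such a helper in place, the nested if/else in the CDATA branch would shrink to a single assignment, which should also lower the condition and path counts reported above.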

<?php

/**
 * Parser that uses PHP 5's DOM extension (part of the core).
 *
 * In PHP 5, the DOM XML extension was revamped into DOM and added to the core.
 * It gives us a forgiving HTML parser, which we use to transform the HTML
 * into a DOM, and then into the tokens.  It is blazingly fast (for large
 * documents, it performs twenty times faster than
 * HTMLPurifier_Lexer_DirectLex), and is the default choice for PHP 5.
 *
 * @note Any empty elements will have empty tokens associated with them, even if
 * this is prohibited by the spec. This cannot be fixed until the spec
 * comes into play.
 *
 * @note PHP's DOM extension does not actually parse any entities, we use
 *       our own function to do that.
 *
 * @warning DOM tends to drop whitespace, which may wreak havoc on indenting.
 *          If this is a huge problem, due to the fact that HTML is hand
 *          edited and you are unable to get a parser cache that caches
 *          the output of HTML Purifier while keeping the original HTML lying
 *          around, you may want to run Tidy on the resulting output or use
 *          HTMLPurifier_Lexer_DirectLex
 */

class HTMLPurifier_Lexer_DOMLex extends HTMLPurifier_Lexer
{

    /**
     * @type HTMLPurifier_TokenFactory
     */
    private $factory;

    public function __construct()
    {
        // setup the factory
        parent::__construct();
        $this->factory = new HTMLPurifier_TokenFactory();
    }

    /**
     * @param string $html
     * @param HTMLPurifier_Config $config
     * @param HTMLPurifier_Context $context
     * @return HTMLPurifier_Token[]
     */
    public function tokenizeHTML($html, $config, $context)
    {
        $html = $this->normalize($html, $config, $context);

        // attempt to armor stray angled brackets that cannot possibly
        // form tags and thus are probably being used as emoticons
        if ($config->get('Core.AggressivelyFixLt')) {
            $char = '[^a-z!\/]';
            $comment = "/<!--(.*?)(-->|\z)/is";
            $html = preg_replace_callback($comment, array($this, 'callbackArmorCommentEntities'), $html);
            do {
                $old = $html;
                $html = preg_replace("/<($char)/i", '&lt;\\1', $html);
            } while ($html !== $old);
            $html = preg_replace_callback($comment, array($this, 'callbackUndoCommentSubst'), $html); // fix comments
        }

        // preprocess html, essential for UTF-8
        $html = $this->wrapHTML($html, $config, $context);

        $doc = new DOMDocument();
        $doc->encoding = 'UTF-8'; // theoretically, the above has this covered

        set_error_handler(array($this, 'muteErrorHandler'));
        $doc->loadHTML($html);
        restore_error_handler();

        $body = $doc->getElementsByTagName('html')->item(0)-> // <html>
                      getElementsByTagName('body')->item(0);  // <body>

        $div = $body->getElementsByTagName('div')->item(0); // <div>
        $tokens = array();
        $this->tokenizeDOM($div, $tokens, $config);
        // If the div has a sibling, that means we tripped across
        // a premature </div> tag.  So remove the div we parsed,
        // and then tokenize the rest of body.  We can't tokenize
        // the sibling directly as we'll lose the tags in that case.
        if ($div->nextSibling) {
            $body->removeChild($div);
            $this->tokenizeDOM($body, $tokens, $config);
        }
        return $tokens;
    }

    /**
     * Iterative function that tokenizes a node, putting it into an accumulator.
     * To iterate is human, to recurse divine - L. Peter Deutsch
     * @param DOMNode $node DOMNode to be tokenized.
     * @param HTMLPurifier_Token[] $tokens   Array-list of already tokenized tokens.
     * @return HTMLPurifier_Token of node appended to previously passed tokens.
Documentation: Should the return type not be HTMLPurifier_Token|null?

This check compares the return type specified in the @return annotation of a function or method doc comment with the types returned by the function and raises an issue if they mismatch.

     */
    protected function tokenizeDOM($node, &$tokens, $config)
    {
        $level = 0;
        $nodes = array($level => new HTMLPurifier_Queue(array($node)));
        $closingNodes = array();
        do {
            while (!$nodes[$level]->isEmpty()) {
                $node = $nodes[$level]->shift(); // FIFO
                $collect = $level > 0 ? true : false;
                $needEndingTag = $this->createStartNode($node, $tokens, $collect, $config);
                if ($needEndingTag) {
                    $closingNodes[$level][] = $node;
                }
                if ($node->childNodes && $node->childNodes->length) {
                    $level++;
                    $nodes[$level] = new HTMLPurifier_Queue();
                    foreach ($node->childNodes as $childNode) {
                        $nodes[$level]->push($childNode);
                    }
                }
            }
            $level--;
            if ($level && isset($closingNodes[$level])) {
                while ($node = array_pop($closingNodes[$level])) {
                    $this->createEndNode($node, $tokens);
                }
            }
        } while ($level > 0);
    }

    /**
     * Portably retrieve the tag name of a node; deals with older versions
     * of libxml like 2.7.6
     * @param DOMNode $node
     */
    protected function getTagName($node)
Documentation: The return type could not be reliably inferred; please add a @return annotation.

Our type inference engine is quite powerful, but sometimes the code does not provide enough clues to go by. In these cases we request you to add a @return annotation.

    {
        if (property_exists($node, 'tagName')) {
            return $node->tagName;
Bug: The property tagName does not seem to exist in DOMNode.

An attempt at access to an undefined property has been detected. This may either be a typographical error or the property has been renamed but there are still references to its old name.

If you really want to allow access to undefined properties, you can define magic methods to allow access. See the PHP core documentation on Overloading.

        } else if (property_exists($node, 'nodeName')) {
            return $node->nodeName;
        } else if (property_exists($node, 'localName')) {
            return $node->localName;
        }
        return null;
    }

    /**
     * Portably retrieve the data of a node; deals with older versions
     * of libxml like 2.7.6
     * @param DOMNode $node
     */
    protected function getData($node)
Documentation: The return type could not be reliably inferred; please add a @return annotation.

Our type inference engine is quite powerful, but sometimes the code does not provide enough clues to go by. In these cases we request you to add a @return annotation.

    {
        if (property_exists($node, 'data')) {
            return $node->data;
Bug: The property data does not seem to exist in DOMNode.

An attempt at access to an undefined property has been detected. This may either be a typographical error or the property has been renamed but there are still references to its old name.

If you really want to allow access to undefined properties, you can define magic methods to allow access. See the PHP core documentation on Overloading.

        } else if (property_exists($node, 'nodeValue')) {
            return $node->nodeValue;
        } else if (property_exists($node, 'textContent')) {
            return $node->textContent;
        }
        return null;
    }


    /**
     * @param DOMNode $node DOMNode to be tokenized.
     * @param HTMLPurifier_Token[] $tokens   Array-list of already tokenized tokens.
     * @param bool $collect  Says whether or not start and close are collected, set to
     *                    false at first recursion because it's the implicit DIV
     *                    tag you're dealing with.
     * @return bool if the token needs an endtoken
     * @todo data and tagName properties don't seem to exist in DOMNode?
     */
    protected function createStartNode($node, &$tokens, $collect, $config)
    {
        // intercept non element nodes. WE MUST catch all of them,
        // but we're not getting the character reference nodes because
        // those should have been preprocessed
        if ($node->nodeType === XML_TEXT_NODE) {
            $data = $this->getData($node); // Handle variable data property
            if ($data !== null) {
                $tokens[] = $this->factory->createText($data);
            }
            return false;
        } elseif ($node->nodeType === XML_CDATA_SECTION_NODE) {
            // undo libxml's special treatment of <script> and <style> tags
            $last = end($tokens);
            $data = $node->data;
Bug: The property data does not seem to exist in DOMNode.

An attempt at access to an undefined property has been detected. This may either be a typographical error or the property has been renamed but there are still references to its old name.

If you really want to allow access to undefined properties, you can define magic methods to allow access. See the PHP core documentation on Overloading.

            // (note $node->tagname is already normalized)
            if ($last instanceof HTMLPurifier_Token_Start && ($last->name == 'script' || $last->name == 'style')) {
                $new_data = trim($data);
                if (substr($new_data, 0, 4) === '<!--') {
                    $data = substr($new_data, 4);
                    if (substr($data, -3) === '-->') {
                        $data = substr($data, 0, -3);
                    } else {
Unused Code: This else statement is empty and can be removed.

This check looks for the else branches of if statements that have no statements or where all statements have been commented out. This may be the result of changes for debugging or the code may simply be obsolete.

These else branches can be removed.

if (rand(1, 6) > 3) {
    print "Check failed";
} else {
    //print "Check succeeded";
}

could be turned into

if (rand(1, 6) > 3) {
    print "Check failed";
}

This is much more concise to read.

                        // Highly suspicious! Not sure what to do...
                    }
                }
            }
            $tokens[] = $this->factory->createText($this->parseText($data, $config));
            return false;
        } elseif ($node->nodeType === XML_COMMENT_NODE) {
            // this code is only invoked for comments in script/style in versions
            // of libxml pre-2.6.28 (regular comments, of course, are still
            // handled regularly)
            $tokens[] = $this->factory->createComment($node->data);
            return false;
        } elseif ($node->nodeType !== XML_ELEMENT_NODE) {
            // not-well tested: there may be other nodes we have to grab
            return false;
        }
        $attr = $node->hasAttributes() ? $this->transformAttrToAssoc($node->attributes) : array();
        $tag_name = $this->getTagName($node); // Handle variable tagName property
        if (empty($tag_name)) {
            return (bool) $node->childNodes->length;
        }
        // We still have to make sure that the element actually IS empty
        if (!$node->childNodes->length) {
            if ($collect) {
                $tokens[] = $this->factory->createEmpty($tag_name, $attr);
            }
            return false;
        } else {
            if ($collect) {
                $tokens[] = $this->factory->createStart($tag_name, $attr);
            }
            return true;
        }
    }

    /**
     * @param DOMNode $node
     * @param HTMLPurifier_Token[] $tokens
     */
    protected function createEndNode($node, &$tokens)
    {
        $tag_name = $this->getTagName($node); // Handle variable tagName property
        $tokens[] = $this->factory->createEnd($tag_name);
    }

    /**
     * Converts a DOMNamedNodeMap of DOMAttr objects into an assoc array.
     *
     * @param DOMNamedNodeMap $node_map DOMNamedNodeMap of DOMAttr objects.
     * @return array Associative array of attributes.
     */
    protected function transformAttrToAssoc($node_map)
    {
        // NamedNodeMap is documented very well, so we're using undocumented
        // features, namely, the fact that it implements Iterator and
        // has a ->length attribute
        if ($node_map->length === 0) {
Bug: The property length does not seem to exist in DOMNamedNodeMap.

An attempt at access to an undefined property has been detected. This may either be a typographical error or the property has been renamed but there are still references to its old name.

If you really want to allow access to undefined properties, you can define magic methods to allow access. See the PHP core documentation on Overloading.

            return array();
        }
        $array = array();
        foreach ($node_map as $attr) {
            $array[$attr->name] = $attr->value;
        }
        return $array;
    }

    /**
     * An error handler that mutes all errors
     * @param int $errno
     * @param string $errstr
     */
    public function muteErrorHandler($errno, $errstr)
Unused Code: The parameter $errno is not used and could be removed.

This check looks for parameters that have been defined for a function or method, but which are not used in the method body.

Unused Code: The parameter $errstr is not used and could be removed.

This check looks for parameters that have been defined for a function or method, but which are not used in the method body.

    {
    }

    /**
     * Callback function for undoing escaping of stray angled brackets
     * in comments
     * @param array $matches
     * @return string
     */
    public function callbackUndoCommentSubst($matches)
    {
        return '<!--' . strtr($matches[1], array('&amp;' => '&', '&lt;' => '<')) . $matches[2];
    }

    /**
     * Callback function that entity-izes ampersands in comments so that
     * callbackUndoCommentSubst doesn't clobber them
     * @param array $matches
     * @return string
     */
    public function callbackArmorCommentEntities($matches)
    {
        return '<!--' . str_replace('&', '&amp;', $matches[1]) . $matches[2];
    }

    /**
     * Wraps an HTML fragment in the necessary HTML
     * @param string $html
     * @param HTMLPurifier_Config $config
     * @param HTMLPurifier_Context $context
     * @return string
     */
    protected function wrapHTML($html, $config, $context, $use_div = true)
Unused Code: The parameter $context is not used and could be removed.

This check looks for parameters that have been defined for a function or method, but which are not used in the method body.

    {
        $def = $config->getDefinition('HTML');
        $ret = '';

        if (!empty($def->doctype->dtdPublic) || !empty($def->doctype->dtdSystem)) {
            $ret .= '<!DOCTYPE html ';
            if (!empty($def->doctype->dtdPublic)) {
                $ret .= 'PUBLIC "' . $def->doctype->dtdPublic . '" ';
Bug: The property doctype does not seem to exist in HTMLPurifier_Definition.

An attempt at access to an undefined property has been detected. This may either be a typographical error or the property has been renamed but there are still references to its old name.

If you really want to allow access to undefined properties, you can define magic methods to allow access. See the PHP core documentation on Overloading.

            }
            if (!empty($def->doctype->dtdSystem)) {
                $ret .= '"' . $def->doctype->dtdSystem . '" ';
            }
            $ret .= '>';
        }

        $ret .= '<html><head>';
        $ret .= '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />';
        // No protection if $html contains a stray </div>!
        $ret .= '</head><body>';
        if ($use_div) $ret .= '<div>';
        $ret .= $html;
        if ($use_div) $ret .= '</div>';
        $ret .= '</body></html>';
        return $ret;
    }
}

// vim: et sw=4 sts=4
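The "property ... does not seem to exist in DOMNode" reports in the listing arise because tagName is declared on DOMElement and data on DOMCharacterData, not on the DOMNode base class. One possible way to satisfy the analyzer is to narrow the type with instanceof before reading the property. The sketch below is an assumption, not the project's fix: the function names nodeData() and nodeTagName() are hypothetical, and it deliberately drops the nodeValue/nodeName fallbacks that the original keeps for older libxml versions.

```php
<?php

/**
 * Hypothetical alternative to getData(): read ->data only from nodes
 * whose class declares it (DOMText, DOMComment, DOMCdataSection all
 * extend DOMCharacterData).
 */
function nodeData(DOMNode $node)
{
    if ($node instanceof DOMCharacterData) {
        return $node->data;
    }
    return null;
}

/**
 * Hypothetical alternative to getTagName(): read ->tagName only from
 * element nodes, where the property is declared.
 */
function nodeTagName(DOMNode $node)
{
    if ($node instanceof DOMElement) {
        return $node->tagName;
    }
    return null;
}
```

Whether dropping the fallbacks is acceptable depends on which libxml versions still need to be supported, which the report does not say.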