HTMLPurifier_Lexer_DOMLex::wrapHTML() - Code Metrics - Hannesalm/EventCalendar - Measure and Improve Code Quality continuously with Scrutinizer

HTMLPurifier_Lexer_DOMLex::wrapHTML() B
last analyzed 2016-06-19 14:11 UTC

↳ Parent: HTMLPurifier_Lexer_DOMLex

Complexity

Conditions	5
Paths	5

Size

Total Lines	22
Code Lines	14

Duplication

Lines	0
Ratio	0 %

Importance

Changes	1
Bugs	0	Features	0

Metric	Value
cc	5
eloc	14
c	1
b	0
f	0
nc	5
nop	3
dl	0
loc	22
rs	8.6737

<?php

/**
 * Parser that uses PHP 5's DOM extension (part of the core).
 *
 * In PHP 5, the DOM XML extension was revamped into DOM and added to the core.
 * It gives us a forgiving HTML parser, which we use to transform the HTML
 * into a DOM, and then into the tokens.  It is blazingly fast (for large
 * documents, it performs twenty times faster than
 * HTMLPurifier_Lexer_DirectLex), and is the default choice for PHP 5.
 *
 * @note Any empty elements will have empty tokens associated with them, even if
 * this is prohibited by the spec. This cannot be fixed until the spec
 * comes into play.
 *
 * @note PHP's DOM extension does not actually parse any entities; we use
 *       our own function to do that.
 *
 * @warning DOM tends to drop whitespace, which may wreak havoc on indenting.
 *          If this is a huge problem, due to the fact that HTML is hand
 *          edited and you are unable to get a parser cache that caches the
 *          output of HTML Purifier while keeping the original HTML lying
 *          around, you may want to run Tidy on the resulting output or use
 *          HTMLPurifier_Lexer_DirectLex.
 */

class HTMLPurifier_Lexer_DOMLex extends HTMLPurifier_Lexer
{

    /**
     * @type HTMLPurifier_TokenFactory
     */
    private $factory;

    public function __construct()
    {
        // setup the factory
        parent::__construct();
        $this->factory = new HTMLPurifier_TokenFactory();
    }

    /**
     * @param string $html
     * @param HTMLPurifier_Config $config
     * @param HTMLPurifier_Context $context
     * @return HTMLPurifier_Token[]
     */
    public function tokenizeHTML($html, $config, $context)
    {
        $html = $this->normalize($html, $config, $context);

        // attempt to armor stray angled brackets that cannot possibly
        // form tags and thus are probably being used as emoticons
        if ($config->get('Core.AggressivelyFixLt')) {
            $char = '[^a-z!\/]';
            $comment = "/<!--(.*?)(-->|\z)/is";
            $html = preg_replace_callback($comment, array($this, 'callbackArmorCommentEntities'), $html);
            do {
                $old = $html;
                $html = preg_replace("/<($char)/i", '&lt;\\1', $html);
            } while ($html !== $old);
            $html = preg_replace_callback($comment, array($this, 'callbackUndoCommentSubst'), $html); // fix comments
        }

        // preprocess html, essential for UTF-8
        $html = $this->wrapHTML($html, $config, $context);

        $doc = new DOMDocument();
        $doc->encoding = 'UTF-8'; // theoretically, the above has this covered

        set_error_handler(array($this, 'muteErrorHandler'));
        $doc->loadHTML($html);
        restore_error_handler();

        $tokens = array();
        $this->tokenizeDOM(
            $doc->getElementsByTagName('html')->item(0)-> // <html>
            getElementsByTagName('body')->item(0), //   <body>
            $tokens
        );
        return $tokens;
    }

    /**
     * Iterative function that tokenizes a node, putting it into an accumulator.
     * To iterate is human, to recurse divine - L. Peter Deutsch
     * @param DOMNode $node DOMNode to be tokenized.
     * @param HTMLPurifier_Token[] $tokens   Array-list of already tokenized tokens.
     * @return void The tokens for the node are appended to $tokens.
     */
    protected function tokenizeDOM($node, &$tokens)
    {
        $level = 0;
        $nodes = array($level => new HTMLPurifier_Queue(array($node)));
        $closingNodes = array();
        do {
            while (!$nodes[$level]->isEmpty()) {
                $node = $nodes[$level]->shift(); // FIFO
                $collect = $level > 0;
                $needEndingTag = $this->createStartNode($node, $tokens, $collect);
                if ($needEndingTag) {
                    $closingNodes[$level][] = $node;
                }
                if ($node->childNodes && $node->childNodes->length) {
                    $level++;
                    $nodes[$level] = new HTMLPurifier_Queue();
                    foreach ($node->childNodes as $childNode) {
                        $nodes[$level]->push($childNode);
                    }
                }
            }
            $level--;
            if ($level && isset($closingNodes[$level])) {
                while ($node = array_pop($closingNodes[$level])) {
                    $this->createEndNode($node, $tokens);
                }
            }
        } while ($level > 0);
    }

    /**
     * @param DOMNode $node DOMNode to be tokenized.
     * @param HTMLPurifier_Token[] $tokens   Array-list of already tokenized tokens.
     * @param bool $collect  Says whether or not start and close tokens are
     *                    collected; set to false at first recursion because
     *                    it's the implicit DIV tag you're dealing with.
     * @return bool True if the token needs an end token.
     * @todo data and tagName properties don't seem to exist in DOMNode?
     */
    protected function createStartNode($node, &$tokens, $collect)
    {
        // intercept non element nodes. WE MUST catch all of them,
        // but we're not getting the character reference nodes because
        // those should have been preprocessed
        if ($node->nodeType === XML_TEXT_NODE) {
            $tokens[] = $this->factory->createText($node->data);
            return false;
        } elseif ($node->nodeType === XML_CDATA_SECTION_NODE) {
            // undo libxml's special treatment of <script> and <style> tags
            $last = end($tokens);
            $data = $node->data;
            // (note $node->tagname is already normalized)
            if ($last instanceof HTMLPurifier_Token_Start && ($last->name == 'script' || $last->name == 'style')) {
                $new_data = trim($data);
                if (substr($new_data, 0, 4) === '<!--') {
                    $data = substr($new_data, 4);
                    if (substr($data, -3) === '-->') {
                        $data = substr($data, 0, -3);
                    } else {
                        // Highly suspicious! Not sure what to do...
                    }
                }
            }
            $tokens[] = $this->factory->createText($this->parseData($data));
            return false;
        } elseif ($node->nodeType === XML_COMMENT_NODE) {
            // this code is only invoked for comments in script/style in versions
            // of libxml pre-2.6.28 (regular comments, of course, are still
            // handled regularly)
            $tokens[] = $this->factory->createComment($node->data);
            return false;
        } elseif ($node->nodeType !== XML_ELEMENT_NODE) {
            // not well tested: there may be other nodes we have to grab
            return false;
        }

        $attr = $node->hasAttributes() ? $this->transformAttrToAssoc($node->attributes) : array();

        // We still have to make sure that the element actually IS empty
        if (!$node->childNodes->length) {
            if ($collect) {
                $tokens[] = $this->factory->createEmpty($node->tagName, $attr);
            }
            return false;
        } else {
            if ($collect) {
                $tokens[] = $this->factory->createStart(
                    $tag_name = $node->tagName, // somehow, it gets dropped
                    $attr
                );
            }
            return true;
        }
    }

    /**
     * @param DOMNode $node
     * @param HTMLPurifier_Token[] $tokens
     */
    protected function createEndNode($node, &$tokens)
    {
        $tokens[] = $this->factory->createEnd($node->tagName);
    }


    /**
     * Converts a DOMNamedNodeMap of DOMAttr objects into an assoc array.
     *
     * @param DOMNamedNodeMap $node_map DOMNamedNodeMap of DOMAttr objects.
     * @return array Associative array of attributes.
     */
    protected function transformAttrToAssoc($node_map)
    {
        // NamedNodeMap is not documented very well, so we're using undocumented
        // features, namely, the fact that it implements Iterator and
        // has a ->length attribute
        if ($node_map->length === 0) {
            return array();
        }
        $array = array();
        foreach ($node_map as $attr) {
            $array[$attr->name] = $attr->value;
        }
        return $array;
    }

    /**
     * An error handler that mutes all errors
     * @param int $errno
     * @param string $errstr
     */
    public function muteErrorHandler($errno, $errstr)
    {
    }

    /**
     * Callback function for undoing escaping of stray angled brackets
     * in comments
     * @param array $matches
     * @return string
     */
    public function callbackUndoCommentSubst($matches)
    {
        return '<!--' . strtr($matches[1], array('&amp;' => '&', '&lt;' => '<')) . $matches[2];
    }

    /**
     * Callback function that entity-izes ampersands in comments so that
     * callbackUndoCommentSubst doesn't clobber them
     * @param array $matches
     * @return string
     */
    public function callbackArmorCommentEntities($matches)
    {
        return '<!--' . str_replace('&', '&amp;', $matches[1]) . $matches[2];
    }

    /**
     * Wraps an HTML fragment in the necessary HTML
     * @param string $html
     * @param HTMLPurifier_Config $config
     * @param HTMLPurifier_Context $context
     * @return string
     */
    protected function wrapHTML($html, $config, $context)
    {
        $def = $config->getDefinition('HTML');
        $ret = '';

        if (!empty($def->doctype->dtdPublic) || !empty($def->doctype->dtdSystem)) {
            $ret .= '<!DOCTYPE html ';
            if (!empty($def->doctype->dtdPublic)) {
                $ret .= 'PUBLIC "' . $def->doctype->dtdPublic . '" ';
            }
            if (!empty($def->doctype->dtdSystem)) {
                $ret .= '"' . $def->doctype->dtdSystem . '" ';
            }
            $ret .= '>';
        }

        $ret .= '<html><head>';
        $ret .= '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />';
        // No protection if $html contains a stray </div>!
        $ret .= '</head><body>' . $html . '</body></html>';
        return $ret;
    }
}

// vim: et sw=4 sts=4
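
The wrapping step measured on this page can be tried in isolation. The following is a minimal sketch, not HTML Purifier's API: `wrapFragment()` and `bodyChildNames()` are hypothetical helper names, the doctype identifiers are passed as plain strings rather than read from a config definition, and `libxml_use_internal_errors()` stands in for the class's `muteErrorHandler` approach to silencing parser warnings.

```php
<?php

/**
 * Hypothetical standalone helper (not part of HTML Purifier): wraps a
 * fragment the same way wrapHTML() does, but takes the doctype identifiers
 * as plain strings instead of reading them from a config definition.
 */
function wrapFragment($html, $dtdPublic = '', $dtdSystem = '')
{
    $ret = '';
    if ($dtdPublic !== '' || $dtdSystem !== '') {
        $ret .= '<!DOCTYPE html ';
        if ($dtdPublic !== '') {
            $ret .= 'PUBLIC "' . $dtdPublic . '" ';
        }
        if ($dtdSystem !== '') {
            $ret .= '"' . $dtdSystem . '" ';
        }
        $ret .= '>';
    }
    // The meta tag tells libxml to treat the input as UTF-8.
    $ret .= '<html><head>';
    $ret .= '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />';
    $ret .= '</head><body>' . $html . '</body></html>';
    return $ret;
}

/**
 * Hypothetical helper: parses a wrapped fragment and returns the node names
 * of the direct children of <body>.
 */
function bodyChildNames($html)
{
    $doc = new DOMDocument();
    // Assumption: collect libxml warnings internally instead of installing
    // a muting error handler as the class above does.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML(wrapFragment($html));
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $names = array();
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $names[] = $child->nodeName;
        }
    }
    return $names;
}
```

Calling `bodyChildNames()` on a small fragment is a quick way to see what the parser places under `<body>` before any tokens are built. As in the original method, the fragment is concatenated unescaped, so a stray closing tag in the input can still change the resulting tree.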


Issues

All issues were introduced 2016-06-19 13:37 UTC. For each "property does not seem to exist" bug, the report adds: an attempt at access to an undefined property has been detected. This may either be a typographical error or the property has been renamed but there are still references to its old name. If you really want to allow access to undefined properties, you can define magic methods to allow access. See the PHP core documentation on Overloading.

Coding Style Compatibility, at `class HTMLPurifier_Lexer_DOMLex extends HTMLPurifier_Lexer`
PSR1 recommends that each class must be in a namespace of at least one level to avoid collisions. You can fix this by adding a namespace to your class: `namespace YourVendor; class YourClass { }`. When choosing a vendor namespace, try to pick something that is not too generic to avoid conflicts with other libraries.

Bug, at `$tokens[] = $this->factory->createText($node->data);` in createStartNode()
The property `data` does not seem to exist in `DOMNode`.

Unused Code, at the `} else {` branch holding only the "Highly suspicious!" comment in createStartNode()
This `else` statement is empty and can be removed. This check looks for the `else` branches of `if` statements that have no statements or where all statements have been commented out. This may be the result of changes for debugging or the code may simply be obsolete. The report's generic example: `if (rand(1, 6) > 3) { print "Check failed"; } else { //print "Check succeeded"; }` could be turned into `if (rand(1, 6) > 3) { print "Check failed"; }`, which is much more concise to read.

Bug, at `$tokens[] = $this->factory->createEmpty($node->tagName, $attr);` in createStartNode()
The property `tagName` does not seem to exist in `DOMNode`.

Bug, at `$tokens[] = $this->factory->createEnd($node->tagName);` in createEndNode()
The property `tagName` does not seem to exist in `DOMNode`.

Bug, at `if ($node_map->length === 0) {` in transformAttrToAssoc()
The property `length` does not seem to exist in `DOMNamedNodeMap`.

Unused Code, at `public function muteErrorHandler($errno, $errstr)`
The parameters `$errno` and `$errstr` are not used and could be removed. This check looks for parameters that have been defined for a function or method, but which are not used in the method body.

Unused Code, at `protected function wrapHTML($html, $config, $context)`
The parameter `$context` is not used and could be removed.

Bug, at `$ret .= 'PUBLIC "' . $def->doctype->dtdPublic . '" ';` in wrapHTML()
The property `doctype` does not seem to exist in `HTMLPurifier_Definition`.
