Completed
Push — master ( e162f1...95ce61 )
by Richard
13s
created

HTMLPurifier_Lexer   B

Complexity

Total Complexity 50

Size/Duplication

Total Lines 339
Duplicated Lines 0 %

Coupling/Cohesion

Components 1
Dependencies 8

Importance

Changes 0
Metric Value
dl 0
loc 339
rs 8.6206
c 0
b 0
f 0
wmc 50
lcom 1
cbo 8

12 Methods

Rating   Name   Duplication   Size   Complexity  
D create() 0 80 16
A __construct() 0 4 1
A parseText() 0 3 1
A parseAttr() 0 3 1
C parseData() 0 37 8
A tokenizeHTML() 0 4 1
A escapeCDATA() 0 8 1
A escapeCommentedCDATA() 0 8 1
A removeIEConditional() 0 8 1
A CDATACallback() 0 5 1
C normalize() 0 55 13
B extractBody() 0 15 5

How to fix   Complexity   

Complex Class

Complex classes like HTMLPurifier_Lexer often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes or suffixes. You can also look at the cohesion graph to spot any unconnected or weakly connected components.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a subclass, Extract Subclass is also a candidate, and is often faster.

While breaking up the class, it is a good idea to analyze how other classes use HTMLPurifier_Lexer and, based on these observations, apply Extract Interface too.
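
As a rough illustration of Extract Class followed by Extract Interface, the sketch below moves a special-entity table out of a lexer-like class and hides it behind a small interface. All three type names are hypothetical and do not exist in HTML Purifier; only the entity table mirrors the $_special_entity2str property of the class under review.

```php
<?php
// Hypothetical names throughout: none of these types exist in HTML Purifier.

/** Extracted interface: what callers need from any entity decoder. */
interface EntityDecoderSketch
{
    public function decode($string);
}

/** Extracted class: owns the special-entity table the lexer used to hold. */
final class SpecialEntityDecoderSketch implements EntityDecoderSketch
{
    private $table = array(
        '&quot;' => '"',
        '&amp;'  => '&',
        '&lt;'   => '<',
        '&gt;'   => '>',
        '&#39;'  => "'",
        '&#039;' => "'",
        '&#x27;' => "'",
    );

    public function decode($string)
    {
        // strtr() with an array replaces each key once, without rescanning.
        return strtr($string, $this->table);
    }
}

/** The slimmed-down lexer delegates instead of decoding inline. */
class LexerSketch
{
    private $decoder;

    public function __construct(EntityDecoderSketch $decoder)
    {
        $this->decoder = $decoder;
    }

    public function parseText($string)
    {
        return $this->decoder->decode($string);
    }
}

$lexer = new LexerSketch(new SpecialEntityDecoderSketch());
echo $lexer->parseText('a &lt; b &amp;&amp; c'), "\n"; // a < b && c
```

Depending on the interface rather than the concrete decoder is what would let a test or an alternative lexer swap in a different decoding strategy.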

<?php

/**
 * Forgivingly lexes HTML (SGML-style) markup into tokens.
 *
 * A lexer parses a string of SGML-style markup and converts them into
 * corresponding tokens.  It doesn't check for well-formedness, although its
 * internal mechanism may make this automatic (such as the case of
 * HTMLPurifier_Lexer_DOMLex).  There are several implementations to choose
 * from.
 *
 * A lexer is HTML-oriented: it might work with XML, but it's not
 * recommended, as we adhere to a subset of the specification for optimization
 * reasons. This might change in the future. Also, most tokenizers are not
 * expected to handle DTDs or PIs.
 *
 * This class should not be directly instantiated, but you may use create() to
 * retrieve a default copy of the lexer.  Being a supertype, this class
 * does not actually define any implementation, but offers commonly used
 * convenience functions for subclasses.
 *
 * @note The unit tests will instantiate this class for testing purposes, as
 *       many of the utility functions require a class to be instantiated.
 *       This means that, even though this class is not runnable, it will
 *       not be declared abstract.
 *
 * @par
 *
 * @note
 * We use tokens rather than create a DOM representation because DOM would:
 *
 * @par
 *  -# Require more processing and memory to create,
 *  -# Is not streamable, and
 *  -# Has the entire document structure (html and body not needed).
 *
 * @par
 * However, DOM is helpful in that it makes it easy to move around nodes
 * without a lot of lookaheads to see when a tag is closed. This is a
 * limitation of the token system and some workarounds would be nice.
 */
class HTMLPurifier_Lexer
{

    /**
     * Whether or not this lexer implements line-number/column-number tracking.
     * If it does, set to true.
     */
    public $tracksLineNumbers = false;

    // -- STATIC ----------------------------------------------------------

    /**
     * Retrieves or sets the default Lexer as a Prototype Factory.
     *
     * By default HTMLPurifier_Lexer_DOMLex will be returned. There are
     * a few exceptions involving special features that only DirectLex
     * implements.
     *
     * @note The behavior of this class has changed, rather than accepting
     *       a prototype object, it now accepts a configuration object.
     *       To specify your own prototype, set %Core.LexerImpl to it.
     *       This change in behavior de-singletonizes the lexer object.
     *
     * @param HTMLPurifier_Config $config
     * @return HTMLPurifier_Lexer
     * @throws HTMLPurifier_Exception
     */
    public static function create($config)
    {
        if (!($config instanceof HTMLPurifier_Config)) {
            $lexer = $config;
            trigger_error(
                "Passing a prototype to
                HTMLPurifier_Lexer::create() is deprecated, please instead
                use %Core.LexerImpl",
                E_USER_WARNING
            );
        } else {
            $lexer = $config->get('Core.LexerImpl');
        }

        $needs_tracking =
            $config->get('Core.MaintainLineNumbers') ||
            $config->get('Core.CollectErrors');

        $inst = null;
Unused Code
$inst is not used; you could remove the assignment.

This check looks for variable assignments that are either overwritten by other assignments or where the variable is not used subsequently.

$myVar = 'Value';
$higher = false;

if (rand(1, 6) > 3) {
    $higher = true;
} else {
    $higher = false;
}

Both the $myVar assignment in line 1 and the $higher assignment in line 2 are dead. The first because $myVar is never used and the second because $higher is always overwritten on every possible execution path.

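One way to avoid a placeholder assignment of this kind is to move the selection into a helper that returns early. The sketch below mirrors the auto-detection conditions used in create(), but pickLexerNameSketch() is a hypothetical function, not part of HTML Purifier, and it only chooses a name rather than instantiating a lexer.

```php
<?php
/**
 * Hypothetical helper: picks a lexer name using the same conditions as the
 * auto-detection in create(), with early returns instead of a
 * do..while(0) block and a pre-initialised variable.
 */
function pickLexerNameSketch($needsTracking)
{
    if ($needsTracking) {
        // The reviewed code selects DirectLex whenever tracking is needed.
        return 'DirectLex';
    }

    $hasDom = class_exists('DOMDocument', false)
        && method_exists('DOMDocument', 'loadHTML')
        && !extension_loaded('domxml');

    return $hasDom ? 'DOMLex' : 'DirectLex';
}

echo pickLexerNameSketch(true), "\n"; // DirectLex
```

With tracking disabled the result depends on whether the DOM extension is available in the running PHP build, so no fixed output is shown for that case.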
        if (is_object($lexer)) {
            $inst = $lexer;
        } else {
            if (is_null($lexer)) {
                do {
                    // auto-detection algorithm
                    if ($needs_tracking) {
                        $lexer = 'DirectLex';
                        break;
                    }

                    if (class_exists('DOMDocument', false) &&
                        method_exists('DOMDocument', 'loadHTML') &&
                        !extension_loaded('domxml')
                    ) {
                        // check for DOM support, because while it's part of the
                        // core, it can be disabled compile time. Also, the PECL
                        // domxml extension overrides the default DOM, and is evil
                        // and nasty and we shan't bother to support it
                        $lexer = 'DOMLex';
                    } else {
                        $lexer = 'DirectLex';
                    }
                } while (0);
            } // do..while so we can break

            // instantiate recognized string names
            switch ($lexer) {
                case 'DOMLex':
                    $inst = new HTMLPurifier_Lexer_DOMLex();
                    break;
                case 'DirectLex':
                    $inst = new HTMLPurifier_Lexer_DirectLex();
                    break;
                case 'PH5P':
                    $inst = new HTMLPurifier_Lexer_PH5P();
                    break;
                default:
                    throw new HTMLPurifier_Exception(
                        "Cannot instantiate unrecognized Lexer type " .
                        htmlspecialchars($lexer)
                    );
            }
        }

        if (!$inst) {
            throw new HTMLPurifier_Exception('No lexer was instantiated');
        }

        // once PHP DOM implements native line numbers, or we
        // hack out something using XSLT, remove this stipulation
        if ($needs_tracking && !$inst->tracksLineNumbers) {
            throw new HTMLPurifier_Exception(
                'Cannot use lexer that does not support line numbers with ' .
                'Core.MaintainLineNumbers or Core.CollectErrors (use DirectLex instead)'
            );
        }

        return $inst;

    }

    // -- CONVENIENCE MEMBERS ---------------------------------------------

    public function __construct()
    {
        $this->_entity_parser = new HTMLPurifier_EntityParser();
Bug
The property _entity_parser does not exist. Did you maybe forget to declare it?

In PHP it is possible to write to properties without declaring them. For example, the following is perfectly valid PHP code:

class MyClass { }

$x = new MyClass();
$x->foo = true;

Generally, it is good practice to explicitly declare properties to avoid accidental typos and to provide IDE auto-completion:

class MyClass {
    public $foo;
}

$x = new MyClass();
$x->foo = true;
    }

    /**
     * Most common entity to raw value conversion table for special entities.
     * @type array
     */
    protected $_special_entity2str =
        array(
            '&quot;' => '"',
            '&amp;' => '&',
            '&lt;' => '<',
            '&gt;' => '>',
            '&#39;' => "'",
            '&#039;' => "'",
            '&#x27;' => "'"
        );

    public function parseText($string, $config) {
        return $this->parseData($string, false, $config);
    }

    public function parseAttr($string, $config) {
        return $this->parseData($string, true, $config);
    }

    /**
     * Parses special entities into the proper characters.
     *
     * This string will translate escaped versions of the special characters
     * into the correct ones.
     *
     * @param string $string String character data to be parsed.
     * @return string Parsed character data.
     */
    public function parseData($string, $is_attr, $config)
    {
        // following functions require at least one character
        if ($string === '') {
            return '';
        }

        // subtracts amps that cannot possibly be escaped
        $num_amp = substr_count($string, '&') - substr_count($string, '& ') -
            ($string[strlen($string) - 1] === '&' ? 1 : 0);

        if (!$num_amp) {
            return $string;
        } // abort if no entities
        $num_esc_amp = substr_count($string, '&amp;');
        $string = strtr($string, $this->_special_entity2str);

        // code duplication for sake of optimization, see above
        $num_amp_2 = substr_count($string, '&') - substr_count($string, '& ') -
            ($string[strlen($string) - 1] === '&' ? 1 : 0);

        if ($num_amp_2 <= $num_esc_amp) {
            return $string;
        }

        // hmm... now we have some uncommon entities. Use the callback.
        if ($config->get('Core.LegacyEntityDecoder')) {
            $string = $this->_entity_parser->substituteSpecialEntities($string);
        } else {
            if ($is_attr) {
                $string = $this->_entity_parser->substituteAttrEntities($string);
            } else {
                $string = $this->_entity_parser->substituteTextEntities($string);
            }
        }
        return $string;
    }

    /**
     * Lexes an HTML string into tokens.
     * @param $string String HTML.
     * @param HTMLPurifier_Config $config
     * @param HTMLPurifier_Context $context
     * @return HTMLPurifier_Token[] array representation of HTML.
Documentation
Should the return type not be HTMLPurifier_Token[]|null?

This check compares the return type specified in the @return annotation of a function or method doc comment with the types returned by the function and raises an issue if they mismatch.
     */
    public function tokenizeHTML($string, $config, $context)
    {
        trigger_error('Call to abstract class', E_USER_ERROR);
    }

    /**
     * Translates CDATA sections into regular sections (through escaping).
     * @param string $string HTML string to process.
     * @return string HTML with CDATA sections escaped.
     */
    protected static function escapeCDATA($string)
    {
        return preg_replace_callback(
            '/<!\[CDATA\[(.+?)\]\]>/s',
            array('HTMLPurifier_Lexer', 'CDATACallback'),
            $string
        );
    }

    /**
     * Special CDATA case that is especially convoluted for <script>
     * @param string $string HTML string to process.
     * @return string HTML with CDATA sections escaped.
     */
    protected static function escapeCommentedCDATA($string)
    {
        return preg_replace_callback(
            '#<!--//--><!\[CDATA\[//><!--(.+?)//--><!\]\]>#s',
            array('HTMLPurifier_Lexer', 'CDATACallback'),
            $string
        );
    }

    /**
     * Special Internet Explorer conditional comments should be removed.
     * @param string $string HTML string to process.
     * @return string HTML with conditional comments removed.
     */
    protected static function removeIEConditional($string)
    {
        return preg_replace(
            '#<!--\[if [^>]+\]>.*?<!\[endif\]-->#si', // probably should generalize for all strings
            '',
            $string
        );
    }

    /**
     * Callback function for escapeCDATA() that does the work.
     *
     * @warning Though this is public in order to let the callback happen,
     *          calling it directly is not recommended.
     * @param array $matches PCRE matches array, with index 0 the entire match
     *                  and 1 the inside of the CDATA section.
     * @return string Escaped internals of the CDATA section.
     */
    protected static function CDATACallback($matches)
    {
        // not exactly sure why the character set is needed, but whatever
        return htmlspecialchars($matches[1], ENT_COMPAT, 'UTF-8');
    }

    /**
     * Takes a piece of HTML and normalizes it by converting entities, fixing
     * encoding, extracting bits, and other good stuff.
     * @param string $html HTML.
     * @param HTMLPurifier_Config $config
     * @param HTMLPurifier_Context $context
     * @return string
     * @todo Consider making protected
     */
    public function normalize($html, $config, $context)
    {
        // normalize newlines to \n
        if ($config->get('Core.NormalizeNewlines')) {
            $html = str_replace("\r\n", "\n", $html);
            $html = str_replace("\r", "\n", $html);
        }

        if ($config->get('HTML.Trusted')) {
            // escape convoluted CDATA
            $html = $this->escapeCommentedCDATA($html);
        }

        // escape CDATA
        $html = $this->escapeCDATA($html);

        $html = $this->removeIEConditional($html);

        // extract body from document if applicable
        if ($config->get('Core.ConvertDocumentToFragment')) {
            $e = false;
            if ($config->get('Core.CollectErrors')) {
                $e =& $context->get('ErrorCollector');
            }
            $new_html = $this->extractBody($html);
            if ($e && $new_html != $html) {
                $e->send(E_WARNING, 'Lexer: Extracted body');
            }
            $html = $new_html;
        }

        // expand entities that aren't the big five
        if ($config->get('Core.LegacyEntityDecoder')) {
            $html = $this->_entity_parser->substituteNonSpecialEntities($html);
        }

        // clean into wellformed UTF-8 string for an SGML context: this has
        // to be done after entity expansion because the entities sometimes
        // represent non-SGML characters (horror, horror!)
        $html = HTMLPurifier_Encoder::cleanUTF8($html);

        // if processing instructions are to be removed, remove them now
        if ($config->get('Core.RemoveProcessingInstructions')) {
            $html = preg_replace('#<\?.+?\?>#s', '', $html);
        }

        $hidden_elements = $config->get('Core.HiddenElements');
        if ($config->get('Core.AggressivelyRemoveScript') &&
            !($config->get('HTML.Trusted') || !$config->get('Core.RemoveScriptContents')
            || empty($hidden_elements["script"]))) {
            $html = preg_replace('#<script[^>]*>.*?</script>#i', '', $html);
        }

        return $html;
    }

    /**
     * Takes a string of HTML (fragment or document) and returns the content
     * @todo Consider making protected
     */
    public function extractBody($html)
Documentation
The return type could not be reliably inferred; please add a @return annotation.

Our type inference engine is quite powerful, but sometimes the code does not provide enough clues to go by. In these cases we request that you add a @return annotation.

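A minimal sketch of what such an annotation could look like, using a standalone copy of the extractBody() logic from this listing. The function name extractBodySketch() is hypothetical, and the wording of the @param and @return descriptions is an assumption rather than the project's own documentation.

```php
<?php
/**
 * Standalone copy of the extractBody() logic with the annotation added.
 *
 * @param string $html HTML fragment or document.
 * @return string Contents of the body element if one is found outside a
 *                comment, otherwise the input unchanged.
 */
function extractBodySketch($html)
{
    $matches = array();
    if (preg_match('|(.*?)<body[^>]*>(.*)</body>|is', $html, $matches)) {
        // Make sure the opening tag is not inside an unclosed comment.
        $comment_start = strrpos($matches[1], '<!--');
        $comment_end   = strrpos($matches[1], '-->');
        if ($comment_start === false ||
            ($comment_end !== false && $comment_end > $comment_start)) {
            return $matches[2];
        }
    }
    return $html;
}

echo extractBodySketch('<html><body class="x"><p>Hi</p></body></html>'), "\n"; // <p>Hi</p>
echo extractBodySketch('<!-- <body>no</body> -->'), "\n"; // <!-- <body>no</body> -->
```

Both branches return a string, which is why a plain `@return string` appears sufficient here.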
    {
        $matches = array();
        $result = preg_match('|(.*?)<body[^>]*>(.*)</body>|is', $html, $matches);
        if ($result) {
            // Make sure it's not in a comment
            $comment_start = strrpos($matches[1], '<!--');
            $comment_end   = strrpos($matches[1], '-->');
            if ($comment_start === false ||
                ($comment_end !== false && $comment_end > $comment_start)) {
                return $matches[2];
            }
        }
        return $html;
    }
}

// vim: et sw=4 sts=4
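
The ampersand-counting step that parseData() uses to skip entity decoding can be tried in isolation. The sketch below copies that expression into a hypothetical free function, countCandidateAmpsSketch(), which is not part of HTML Purifier; it counts the ampersands that could still begin an entity, ignoring any followed by a space and one at the very end of the string.

```php
<?php
/**
 * Hypothetical standalone version of the counting step in parseData().
 *
 * @param string $string Character data to inspect.
 * @return int Number of ampersands that might start an entity.
 */
function countCandidateAmpsSketch($string)
{
    // Indexing the last character requires a non-empty string.
    if ($string === '') {
        return 0;
    }
    return substr_count($string, '&')
        - substr_count($string, '& ')
        - ($string[strlen($string) - 1] === '&' ? 1 : 0);
}

echo countCandidateAmpsSketch('Tom & Jerry'), "\n";      // 0
echo countCandidateAmpsSketch('a &lt; b &'), "\n";       // 1
echo countCandidateAmpsSketch('&eacute; &amp; &'), "\n"; // 2
```

A result of zero corresponds to the early return in parseData(), where the string is handed back without any entity decoding.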