Complex classes like HtmlTokenizer often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common way to find such a component is to look for fields and methods that share the same prefix or suffix. You can also look at the cohesion graph to spot any unconnected or weakly connected components.
Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
While breaking up the class, it is a good idea to analyze how other classes use HtmlTokenizer, and based on these observations, apply Extract Interface, too.
<?php

// ...

class HtmlTokenizer
{
    /**
     * Current tokenizer position. Tokenizer is a linear processor (no regular expression is
     * involved). This slows it down, but the results are much more reliable.
     */
    const POSITION_PLAIN_TEXT = 0x001;
    const POSITION_IN_TAG     = 0x002;
    const POSITION_IN_QUOTAS  = 0x003;

    /**
     * Token types detected and processed by tokenizer.
     */
    const PLAIN_TEXT = 'plain';
    const TAG_OPEN   = 'open';
    const TAG_CLOSE  = 'close';
    const TAG_SHORT  = 'short';
    const TAG_VOID   = 'void';

    /**
     * Token fields. There are a lot of tokens in HTML (up to 10,000 different ones). We better to
     * use numeric keys for array than any text fields or even objects.
     */
    const TOKEN_NAME       = 0;
    const TOKEN_TYPE       = 1;
    const TOKEN_CONTENT    = 2;
    const TOKEN_ATTRIBUTES = 3;

    /**
     * List of void tags.
     *
     * @link http://www.w3.org/TR/html5/syntax.html#void-elements
     * @var array
     */
    protected $voidTags = [
        'area',
        'base',
        'br',
        'col',
        'embed',
        'hr',
        'img',
        'input',
        'keygen',
        'link',
        'meta',
        'param',
        'source',
        'track',
        'wbr'
    ];

    /**
     * Array of parsed tokens. Every token has fields name, type, content and arguments.
     *
     * @var array
     */
    protected $tokens = [];

    /**
     * PHP block should be isolated while parsing, Keep enabled.
     *
     * @var bool
     */
    protected $isolatePHP = false;

    /**
     * PHP Blocks isolator, which holds all existing PHP blocks and restores them in output.
     *
     * @var Isolator|null
     */
    protected $isolator = null;

    /**
     * @param bool     $isolatePHP PHP block should be isolated and enabled by default
     * @param Isolator $isolator
     */
    public function __construct(bool $isolatePHP = true, Isolator $isolator = null)
    {
        // ...
    }

    /**
     * Parse HTML content and return it's tokens.
     *
     * @param string $source HTML source.
     *
     * @return array
     */
    public function parse(string $source): array
    {
        // ...
    }

    /**
     * Compile all parsed tokens back into html form.
     *
     * @return string
     */
    public function compile(): string
    {
        // ...
    }

    /**
     * Compile parsed token.
     *
     * @param array $token
     *
     * @return string
     */
    public function compileToken(array $token): string
    {
        // ...
    }

    /**
     * Parses tag body for arguments, name, etc.
     *
     * @param string $content Tag content to be parsed (from < till >).
     *
     * @return array
     */
    protected function parseToken(string $content): array
    {
        // ...
    }

    /**
     * Handles single token and passes it to a callback function if specified.
     *
     * @param int|null $tokenType Token type.
     * @param string   $content   Non parsed token content.
     */
    protected function handleToken($tokenType, string $content)
    {
        // ...
    }

    /**
     * Will restore all existing PHP blocks to their original content.
     *
     * @param string $source
     *
     * @return string
     */
    protected function repairPHP(string $source): string
    {
        // ...
    }
}
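As a rough illustration of Extract Class applied to the listing above, the void-tag list could move into a small class of its own. The names `VoidTagList` and `TagClassifier` are hypothetical and do not exist in the original code. `TagClassifier` is only a simplified stand-in for the tokenizer, and the sketch assumes PHP 7.4 or later for typed properties.

```php
<?php

declare(strict_types=1);

// Hypothetical class produced by Extract Class: it owns the void-tag list
// that the listing keeps in HtmlTokenizer::$voidTags.
final class VoidTagList
{
    /** @var array<string, true> Lowercased tag names stored as keys for fast lookup. */
    private array $tags;

    /** @param string[] $tags */
    public function __construct(array $tags)
    {
        $this->tags = array_fill_keys(array_map('strtolower', $tags), true);
    }

    public function contains(string $name): bool
    {
        return isset($this->tags[strtolower($name)]);
    }
}

// Simplified stand-in for the tokenizer (an assumption, not the real class):
// it delegates the void-tag question instead of holding the list itself.
final class TagClassifier
{
    private VoidTagList $voidTags;

    public function __construct(VoidTagList $voidTags)
    {
        $this->voidTags = $voidTags;
    }

    // Returns 'void' or 'open', mirroring the TAG_VOID and TAG_OPEN values in the listing.
    public function classify(string $tagName): string
    {
        return $this->voidTags->contains($tagName) ? 'void' : 'open';
    }
}

$classifier = new TagClassifier(new VoidTagList(['br', 'hr', 'img']));
echo $classifier->classify('BR'), "\n";  // void
echo $classifier->classify('div'), "\n"; // open
```

Passing the list in through the constructor keeps the extracted class independently testable, and the tokenizer no longer needs to know how void tags are stored.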
This check compares calls to functions or methods with their respective definitions. If the call has more arguments than are defined, it raises an issue.
If a function is defined several times with a different number of parameters, the check may pick up the wrong definition and report false positives. One codebase where this has been known to happen is WordPress.
In this case you can add the @ignore PhpDoc annotation to the duplicate definition and it will be ignored.
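A minimal sketch of what such a duplicate definition might look like, assuming the common pattern of guarding each definition with `function_exists`. The function name `render_label` is made up for illustration and does not come from any real codebase.

```php
<?php

declare(strict_types=1);

if (!function_exists('render_label')) {
    /**
     * Primary definition, taking two parameters.
     */
    function render_label(string $text, string $prefix = ''): string
    {
        return $prefix . $text;
    }
}

if (!function_exists('render_label')) {
    /**
     * Fallback definition with fewer parameters. It is never declared at
     * runtime here because the first definition already exists, but a static
     * check still sees it and could match calls against it.
     *
     * @ignore
     */
    function render_label(string $text): string
    {
        return $text;
    }
}

echo render_label('name', '#'), "\n"; // #name
```

Without the annotation, a call such as `render_label('name', '#')` could be compared against the one-parameter fallback and reported as passing too many arguments.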