Duplicate code is one of the most pungent code smells. A common rule of thumb is to restructure code once it is duplicated in three or more places.
Common duplication problems and their corresponding solutions are:
Complex classes like HTMLPurifier_Encoder often do many different things. To break such a class down, first identify a cohesive component within it. A common approach is to look for fields and methods that share the same prefix or suffix. You can also inspect the cohesion graph to spot unconnected or weakly connected components.
Once you have determined which fields and methods belong together, you can apply the Extract Class refactoring. If the component makes sense as a subclass, Extract Subclass is also a candidate, and is often faster.
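As a rough illustration of Extract Class: several members of HTMLPurifier_Encoder share the "iconv" affix (unsafeIconv, iconv, iconvAvailable, testIconvTruncateBug), which suggests they could form one component. The sketch below is a minimal example under that assumption. The class names `IconvConverter` and `Encoder` and the method `toUtf8` are hypothetical and are not part of HTML Purifier.

```php
<?php
// Hypothetical sketch: neither class below is part of HTML Purifier.

/** Extracted component: everything that deals with iconv lives here. */
final class IconvConverter
{
    /** @return bool Whether the iconv extension is loaded. */
    public function isAvailable()
    {
        return function_exists('iconv');
    }

    /**
     * Converts $text between encodings while muting iconv notices.
     * @return string|false Converted text, or false on failure.
     */
    public function convert($in, $out, $text)
    {
        // Temporarily install a handler that swallows errors.
        set_error_handler(function () {
            return true;
        });
        try {
            return iconv($in, $out, $text);
        } finally {
            restore_error_handler();
        }
    }
}

/** The slimmed-down class delegates to the extracted component. */
final class Encoder
{
    private $iconv;

    public function __construct(IconvConverter $iconv)
    {
        $this->iconv = $iconv;
    }

    public function toUtf8($text, $sourceEncoding)
    {
        if (!$this->iconv->isAvailable()) {
            return $text; // simplification: pass the input through unchanged
        }
        $converted = $this->iconv->convert($sourceEncoding, 'UTF-8', $text);
        return $converted === false ? $text : $converted;
    }
}

$encoder = new Encoder(new IconvConverter());
echo $encoder->toUtf8("caf\xE9", 'ISO-8859-1'), "\n";
```

Passing the extracted object through the constructor, rather than calling static methods, also makes it easy to substitute a test double for the iconv component.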
While breaking up the class, it is a good idea to analyze how other classes use HTMLPurifier_Encoder and, based on these observations, apply Extract Interface too.
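Extract Interface could look roughly like the sketch below, assuming that callers only need the two UTF-8 conversion methods. The interface `Utf8Converter`, the class `Latin1Converter`, and the function `roundTrip` are hypothetical names used for illustration. The signatures are simplified and omit the `$config` and `$context` parameters of the real methods.

```php
<?php
// Hypothetical sketch: these names are illustrative, not HTML Purifier's API.

/** The narrow contract that callers doing UTF-8 conversion depend on. */
interface Utf8Converter
{
    public function convertToUTF8($str);
    public function convertFromUTF8($str);
}

/** One possible implementation for ISO-8859-1, using plain byte arithmetic. */
final class Latin1Converter implements Utf8Converter
{
    public function convertToUTF8($str)
    {
        $out = '';
        $len = strlen($str);
        for ($i = 0; $i < $len; $i++) {
            $byte = ord($str[$i]);
            // Bytes below 0x80 are ASCII; the rest become two-byte sequences.
            $out .= $byte < 0x80
                ? $str[$i]
                : chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
        }
        return $out;
    }

    public function convertFromUTF8($str)
    {
        // Assumes well-formed UTF-8; characters outside ISO-8859-1 are dropped.
        $out = '';
        $len = strlen($str);
        for ($i = 0; $i < $len; $i++) {
            $byte = ord($str[$i]);
            if ($byte < 0x80) {
                $out .= $str[$i];
            } elseif (($byte === 0xC2 || $byte === 0xC3) && $i + 1 < $len) {
                $out .= chr((($byte & 0x03) << 6) | (ord($str[++$i]) & 0x3F));
            }
        }
        return $out;
    }
}

/** A caller that depends only on the interface, not on a concrete class. */
function roundTrip(Utf8Converter $converter, $str)
{
    return $converter->convertFromUTF8($converter->convertToUTF8($str));
}

echo bin2hex((new Latin1Converter())->convertToUTF8("caf\xE9")), "\n";
```

Because `roundTrip` is typed against the interface, another implementation (for example one backed by iconv) could be swapped in without touching the caller.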
<?php

class HTMLPurifier_Encoder
{

    /**
     * Constructor throws fatal error if you attempt to instantiate class
     */
    private function __construct()
    {
        // ...
    }

    /**
     * Error-handler that mutes errors, alternative to shut-up operator.
     */
    public static function muteErrorHandler()
    {
        // ...
    }

    /**
     * iconv wrapper which mutes errors, but doesn't work around bugs.
     * @param string $in Input encoding
     * @param string $out Output encoding
     * @param string $text The text to convert
     * @return string
     */
    public static function unsafeIconv($in, $out, $text)
    {
        // ...
    }

    /**
     * iconv wrapper which mutes errors and works around bugs.
     * @param string $in Input encoding
     * @param string $out Output encoding
     * @param string $text The text to convert
     * @param int $max_chunk_size
     * @return string
     */
    public static function iconv($in, $out, $text, $max_chunk_size = 8000)
    {
        // ...
    }

    /**
     * Cleans a UTF-8 string for well-formedness and SGML validity
     *
     * It will parse according to UTF-8 and return a valid UTF8 string, with
     * non-SGML codepoints excluded.
     *
     * @param string $str The string to clean
     * @param bool $force_php
     * @return string
     *
     * @note Just for reference, the non-SGML code points are 0 to 31 and
     *       127 to 159, inclusive. However, we allow code points 9, 10
     *       and 13, which are the tab, line feed and carriage return
     *       respectively. 128 and above the code points map to multibyte
     *       UTF-8 representations.
     *
     * @note Fallback code adapted from utf8ToUnicode by Henri Sivonen and
     *       [email protected] at <http://iki.fi/hsivonen/php-utf8/> under the
     *       LGPL license. Notes on what changed are inside, but in general,
     *       the original code transformed UTF-8 text into an array of integer
     *       Unicode codepoints. Understandably, transforming that back to
     *       a string would be somewhat expensive, so the function was modded to
     *       directly operate on the string. However, this discourages code
     *       reuse, and the logic enumerated here would be useful for any
     *       function that needs to be able to understand UTF-8 characters.
     *       As of right now, only smart lossless character encoding converters
     *       would need that, and I'm probably not going to implement them.
     *       Once again, PHP 6 should solve all our problems.
     */
    public static function cleanUTF8($str, $force_php = false)
    {
        // ...
    }

    /**
     * Translates a Unicode codepoint into its corresponding UTF-8 character.
     * @note Based on Feyd's function at
     *       <http://forums.devnetwork.net/viewtopic.php?p=191404#191404>,
     *       which is in public domain.
     * @note While we're going to do code point parsing anyway, a good
     *       optimization would be to refuse to translate code points that
     *       are non-SGML characters. However, this could lead to duplication.
     * @note This is very similar to the unichr function in
     *       maintenance/generate-entity-file.php (although this is superior,
     *       due to its sanity checks).
     */

    // +----------+----------+----------+----------+
    // | 33222222 | 22221111 | 111111   |          |
    // | 10987654 | 32109876 | 54321098 | 76543210 | bit
    // +----------+----------+----------+----------+
    // |          |          |          | 0xxxxxxx | 1 byte 0x00000000..0x0000007F
    // |          |          | 110yyyyy | 10xxxxxx | 2 byte 0x00000080..0x000007FF
    // |          | 1110zzzz | 10yyyyyy | 10xxxxxx | 3 byte 0x00000800..0x0000FFFF
    // | 11110www | 10wwzzzz | 10yyyyyy | 10xxxxxx | 4 byte 0x00010000..0x0010FFFF
    // +----------+----------+----------+----------+
    // | 00000000 | 00011111 | 11111111 | 11111111 | Theoretical upper limit of legal scalars: 2097151 (0x001FFFFF)
    // | 00000000 | 00010000 | 11111111 | 11111111 | Defined upper limit of legal scalar codes
    // +----------+----------+----------+----------+

    public static function unichr($code)
    {
        // ...
    }

    /**
     * @return bool
     */
    public static function iconvAvailable()
    {
        // ...
    }

    /**
     * Convert a string to UTF-8 based on configuration.
     * @param string $str The string to convert
     * @param HTMLPurifier_Config $config
     * @param HTMLPurifier_Context $context
     * @return string
     */
    public static function convertToUTF8($str, $config, $context)
    {
        // ...
    }

    /**
     * Converts a string from UTF-8 based on configuration.
     * @param string $str The string to convert
     * @param HTMLPurifier_Config $config
     * @param HTMLPurifier_Context $context
     * @return string
     * @note Currently, this is a lossy conversion, with unexpressable
     *       characters being omitted.
     */
    public static function convertFromUTF8($str, $config, $context)
    {
        // ...
    }

    /**
     * Lossless (character-wise) conversion of HTML to ASCII
     * @param string $str UTF-8 string to be converted to ASCII
     * @return string ASCII encoded string with non-ASCII character entity-ized
     * @warning Adapted from MediaWiki, claiming fair use: this is a common
     *       algorithm. If you disagree with this license fudgery,
     *       implement it yourself.
     * @note Uses decimal numeric entities since they are best supported.
     * @note This is a DUMB function: it has no concept of keeping
     *       character entities that the projected character encoding
     *       can allow. We could possibly implement a smart version
     *       but that would require it to also know which Unicode
     *       codepoints the charset supported (not an easy task).
     * @note Sort of with cleanUTF8() but it assumes that $str is
     *       well-formed UTF-8
     */
    public static function convertToASCIIDumbLossless($str)
    {
        // ...
    }

    /** No bugs detected in iconv. */
    const ICONV_OK = 0;

    /** Iconv truncates output if converting from UTF-8 to another
     *  character set with //IGNORE, and a non-encodable character is found */
    const ICONV_TRUNCATES = 1;

    /** Iconv does not support //IGNORE, making it unusable for
     *  transcoding purposes */
    const ICONV_UNUSABLE = 2;

    /**
     * glibc iconv has a known bug where it doesn't handle the magic
     * //IGNORE stanza correctly. In particular, rather than ignore
     * characters, it will return an EILSEQ after consuming some number
     * of characters, and expect you to restart iconv as if it were
     * an E2BIG. Old versions of PHP did not respect the errno, and
     * returned the fragment, so as a result you would see iconv
     * mysteriously truncating output. We can work around this by
     * manually chopping our input into segments of about 8000
     * characters, as long as PHP ignores the error code. If PHP starts
     * paying attention to the error code, iconv becomes unusable.
     *
     * @return int Error code indicating severity of bug.
     */
    public static function testIconvTruncateBug()
    {
        // ...
    }

    /**
     * This expensive function tests whether or not a given character
     * encoding supports ASCII. 7/8-bit encodings like Shift_JIS will
     * fail this test, and require special processing. Variable width
     * encodings shouldn't ever fail.
     *
     * @param string $encoding Encoding name to test, as per iconv format
     * @param bool $bypass Whether or not to bypass the precompiled arrays.
     * @return Array of UTF-8 characters to their corresponding ASCII,
     *      which can be used to "undo" any overzealous iconv action.
     */
    public static function testEncodingSupportsASCII($encoding, $bypass = false)
    {
        // ...
    }
}
You can fix this by adding a namespace to your class:
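A minimal sketch, assuming a hypothetical vendor namespace `AcmeHtmlTools` and a hypothetical class name `Encoder` (neither refers to a real package):

```php
<?php
// Hypothetical vendor and package names, chosen for illustration only.
namespace AcmeHtmlTools\Encoding;

class Encoder
{
}

// Code elsewhere refers to the class by its fully qualified name.
$encoder = new \AcmeHtmlTools\Encoding\Encoder();
echo get_class($encoder), "\n"; // prints "AcmeHtmlTools\Encoding\Encoder"
```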
When choosing a vendor namespace, try to pick something that is not too generic to avoid conflicts with other libraries.