HTMLPurifier_Encoder::convertFromUTF8()   D

Complexity

Conditions 10
Paths 18

Size

Total Lines 37
Code Lines 23

Duplication

Lines 0
Ratio 0 %

Importance

Changes 1
Bugs 0 Features 0
Metric  Value
cc      10
eloc    23
c       1
b       0
f       0
nc      18
nop     3
dl      0
loc     37
rs      4.8196

How to fix: Complexity

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
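
As a minimal sketch of that advice, consider the commented "undo our previous fix" step inside convertFromUTF8() in the listing further down. The comment can become the name of a small extracted function. The name removeAsciiFixCharacters, its signature, and the sample mapping are assumptions made for illustration; they are not part of HTML Purifier.

```php
<?php
// Hypothetical helper for illustration; not part of HTML Purifier.

/**
 * Deletes from $str every byte sequence that appears as a key of $asciiFix.
 * The comment that sat above the inline loop has become the function name.
 */
function removeAsciiFixCharacters(string $str, array $asciiFix): string
{
    if ($asciiFix === []) {
        return $str; // nothing to undo
    }
    // Map each key to the empty string so that strtr() removes it.
    $clearFix = array_fill_keys(array_keys($asciiFix), '');
    return strtr($str, $clearFix);
}

// Usage with example data: U+00A5 is removed from the string.
echo removeAsciiFixCharacters("a\u{A5}b", ["\u{A5}" => '\\']), "\n"; // prints "ab"
```

The calling method then reads as a sequence of named steps, and the extracted step can be tested on its own.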

Commonly applied refactorings include:

<?php

/**
 * A UTF-8 specific character encoder that handles cleaning and transforming.
 * @note All functions in this class should be static.
 */
class HTMLPurifier_Encoder

Coding Style, Compatibility
PSR1 recommends that each class must be in a namespace of at least one level to avoid collisions.

You can fix this by adding a namespace to your class:

namespace YourVendor;

class YourClass { }

When choosing a vendor namespace, try to pick something that is not too generic to avoid conflicts with other libraries.

{

    /**
     * Constructor throws fatal error if you attempt to instantiate class
     */
    private function __construct()
    {
        trigger_error('Cannot instantiate encoder, call methods statically', E_USER_ERROR);
    }

    /**
     * Error-handler that mutes errors, alternative to shut-up operator.
     */
    public static function muteErrorHandler()
    {
    }

    /**
     * iconv wrapper which mutes errors, but doesn't work around bugs.
     * @param string $in Input encoding
     * @param string $out Output encoding
     * @param string $text The text to convert
     * @return string
     */
    public static function unsafeIconv($in, $out, $text)
    {
        set_error_handler(array('HTMLPurifier_Encoder', 'muteErrorHandler'));
        $r = iconv($in, $out, $text);
        restore_error_handler();
        return $r;
    }
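
The mute-then-restore pattern used by unsafeIconv() above can be sketched in a more general form. The helper name callMuted is hypothetical, and the try/finally block is an addition that the original method does not have; it only guarantees that the previous handler is restored if the callback throws.

```php
<?php
// Hypothetical helper for illustration; not part of HTML Purifier.

/**
 * Calls $fn with $args while warnings and notices are swallowed, and
 * restores the previous error handler even if $fn throws.
 */
function callMuted(callable $fn, ...$args)
{
    // Returning true tells PHP the error was handled, so nothing is reported.
    set_error_handler(static function (): bool {
        return true;
    });
    try {
        return $fn(...$args);
    } finally {
        restore_error_handler();
    }
}

// Usage: the warning raised inside the callback is muted, the return value survives.
$result = callMuted(function (string $s): int {
    trigger_error('this warning is muted', E_USER_WARNING);
    return strlen($s);
}, 'abc');
```

Compared with the shut-up operator, a temporary handler keeps the muting limited to one call and makes it explicit in the code.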

    /**
     * iconv wrapper which mutes errors and works around bugs.
     * @param string $in Input encoding
     * @param string $out Output encoding
     * @param string $text The text to convert
     * @param int $max_chunk_size
     * @return string
     */
    public static function iconv($in, $out, $text, $max_chunk_size = 8000)
    {
        $code = self::testIconvTruncateBug();
        if ($code == self::ICONV_OK) {
            return self::unsafeIconv($in, $out, $text);
        } elseif ($code == self::ICONV_TRUNCATES) {
            // we can only work around this if the input character set
            // is utf-8
            if ($in == 'utf-8') {
                if ($max_chunk_size < 4) {
                    trigger_error('max_chunk_size is too small', E_USER_WARNING);
                    return false;
                }
                // split into chunks of at most $max_chunk_size bytes (8000 by
                // default), but be careful to handle multibyte boundaries properly
                if (($c = strlen($text)) <= $max_chunk_size) {
                    return self::unsafeIconv($in, $out, $text);
                }
                $r = '';
                $i = 0;
                while (true) {
                    if ($i + $max_chunk_size >= $c) {
                        $r .= self::unsafeIconv($in, $out, substr($text, $i));
                        break;
                    }
                    // wibble the boundary
                    if (0x80 != (0xC0 & ord($text[$i + $max_chunk_size]))) {
                        $chunk_size = $max_chunk_size;
                    } elseif (0x80 != (0xC0 & ord($text[$i + $max_chunk_size - 1]))) {

Duplication
This code seems to be duplicated across your project.

Duplicated code is one of the most pungent code smells. If you need to duplicate the same code in three or more different places, we strongly encourage you to look into extracting the code into a single class or operation.

You can also find more detailed suggestions in the “Code” section of your repository.

                        $chunk_size = $max_chunk_size - 1;
                    } elseif (0x80 != (0xC0 & ord($text[$i + $max_chunk_size - 2]))) {
                        $chunk_size = $max_chunk_size - 2;
                    } elseif (0x80 != (0xC0 & ord($text[$i + $max_chunk_size - 3]))) {

Duplication
This code seems to be duplicated across your project.

                        $chunk_size = $max_chunk_size - 3;
                    } else {
                        return false; // rather confusing UTF-8...
                    }
                    $chunk = substr($text, $i, $chunk_size); // substr doesn't mind overlong lengths
                    $r .= self::unsafeIconv($in, $out, $chunk);
                    $i += $chunk_size;
                }
                return $r;
            } else {
                return false;
            }
        } else {
            return false;
        }
    }
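
The boundary handling in iconv() above can be isolated into a standalone sketch. It replaces the four near-identical elseif branches with a short loop, which is one possible answer to the duplication notes. The function name splitUtf8Chunks and the use of exceptions instead of a false return are assumptions for illustration; HTML Purifier itself does not work this way.

```php
<?php
// Hypothetical helper for illustration; not part of HTML Purifier.

/**
 * Splits a UTF-8 string into chunks of at most $maxChunkSize bytes without
 * cutting a multibyte character in half.
 *
 * @return string[]
 */
function splitUtf8Chunks(string $text, int $maxChunkSize): array
{
    if ($maxChunkSize < 4) {
        throw new InvalidArgumentException('maxChunkSize must be at least 4');
    }
    $chunks = [];
    $len = strlen($text);
    $i = 0;
    while ($len - $i > $maxChunkSize) {
        $size = null;
        // Back off by up to 3 bytes until the first byte after the chunk is
        // not a continuation byte (10xxxxxx), i.e. it starts a new character.
        for ($back = 0; $back <= 3; $back++) {
            if ((ord($text[$i + $maxChunkSize - $back]) & 0xC0) !== 0x80) {
                $size = $maxChunkSize - $back;
                break;
            }
        }
        if ($size === null) {
            throw new UnexpectedValueException('Malformed UTF-8 near byte ' . $i);
        }
        $chunks[] = substr($text, $i, $size);
        $i += $size;
    }
    $chunks[] = substr($text, $i); // remainder, at most $maxChunkSize bytes
    return $chunks;
}

// Usage: five two-byte characters with a 5-byte limit give chunks of 4, 4 and 2 bytes.
$lengths = array_map('strlen', splitUtf8Chunks(str_repeat("\xC3\xA9", 5), 5));
```

Each chunk can then be converted separately and the results concatenated, which is what the original loop does with unsafeIconv().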

    /**
     * Cleans a UTF-8 string for well-formedness and SGML validity
     *
     * It will parse according to UTF-8 and return a valid UTF8 string, with
     * non-SGML codepoints excluded.
     *
     * @param string $str The string to clean
     * @param bool $force_php
     * @return string
     *
     * @note Just for reference, the non-SGML code points are 0 to 31 and
     *       127 to 159, inclusive.  However, we allow code points 9, 10
     *       and 13, which are the tab, line feed and carriage return
     *       respectively. 128 and above the code points map to multibyte
     *       UTF-8 representations.
     *
     * @note Fallback code adapted from utf8ToUnicode by Henri Sivonen and
     *       [email protected] at <http://iki.fi/hsivonen/php-utf8/> under the
     *       LGPL license.  Notes on what changed are inside, but in general,
     *       the original code transformed UTF-8 text into an array of integer
     *       Unicode codepoints. Understandably, transforming that back to
     *       a string would be somewhat expensive, so the function was modded to
     *       directly operate on the string.  However, this discourages code
     *       reuse, and the logic enumerated here would be useful for any
     *       function that needs to be able to understand UTF-8 characters.
     *       As of right now, only smart lossless character encoding converters
     *       would need that, and I'm probably not going to implement them.
     *       Once again, PHP 6 should solve all our problems.
     */
    public static function cleanUTF8($str, $force_php = false)

Unused Code
The parameter $force_php is not used and could be removed.

This check looks for parameters that have been defined for a function or method, but which are not used in the method body.

    {
        // UTF-8 validity is checked since PHP 4.3.5
        // This is an optimization: if the string is already valid UTF-8, no
        // need to do PHP stuff. 99% of the time, this will be the case.
        // The regexp matches the XML char production, as well as excluding
        // non-SGML codepoints U+007F to U+009F
        if (preg_match(
            '/^[\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*$/Du',
            $str
        )) {
            return $str;
        }

        $mState = 0; // cached expected number of octets after the current octet
                     // until the beginning of the next UTF8 character sequence
        $mUcs4  = 0; // cached Unicode character
        $mBytes = 1; // cached expected number of octets in the current sequence

        // original code involved an $out that was an array of Unicode
        // codepoints.  Instead of having to convert back into UTF-8, we've
        // decided to directly append valid UTF-8 characters onto a string
        // $out once they're done.  $char accumulates raw bytes, while $mUcs4
        // turns into the Unicode code point, so there's some redundancy.

        $out = '';
        $char = '';

        $len = strlen($str);
        for ($i = 0; $i < $len; $i++) {
            $in = ord($str[$i]);
            $char .= $str[$i]; // append byte to char
            if (0 == $mState) {
                // When mState is zero we expect either a US-ASCII character
                // or a multi-octet sequence.
                if (0 == (0x80 & ($in))) {
                    // US-ASCII, pass straight through.
                    if (($in <= 31 || $in == 127) &&

Unused Code
This if statement is empty and can be removed.

This check looks for the bodies of if statements that have no statements or where all statements have been commented out. This may be the result of changes for debugging or the code may simply be obsolete.

These if bodies can be removed. If you have an empty if but statements in the else branch, consider inverting the condition.

if (rand(1, 6) > 3) {
//print "Check failed";
} else {
    print "Check succeeded";
}

could be turned into

if (rand(1, 6) <= 3) {
    print "Check succeeded";
}

This is much more concise to read.

                        !($in == 9 || $in == 13 || $in == 10) // save \r\t\n
                    ) {
                        // control characters, remove
                    } else {
                        $out .= $char;
                    }
                    // reset
                    $char = '';
                    $mBytes = 1;
                } elseif (0xC0 == (0xE0 & ($in))) {

Duplication
This code seems to be duplicated across your project.

                    // First octet of 2 octet sequence
                    $mUcs4 = ($in);
                    $mUcs4 = ($mUcs4 & 0x1F) << 6;
                    $mState = 1;
                    $mBytes = 2;
                } elseif (0xE0 == (0xF0 & ($in))) {
                    // First octet of 3 octet sequence
                    $mUcs4 = ($in);
                    $mUcs4 = ($mUcs4 & 0x0F) << 12;
                    $mState = 2;
                    $mBytes = 3;
                } elseif (0xF0 == (0xF8 & ($in))) {

Duplication
This code seems to be duplicated across your project.

                    // First octet of 4 octet sequence
                    $mUcs4 = ($in);
                    $mUcs4 = ($mUcs4 & 0x07) << 18;
                    $mState = 3;
                    $mBytes = 4;
                } elseif (0xF8 == (0xFC & ($in))) {
                    // First octet of 5 octet sequence.
                    //
                    // This is illegal because the encoded codepoint must be
                    // either:
                    // (a) not the shortest form or
                    // (b) outside the Unicode range of 0-0x10FFFF.
                    // Rather than trying to resynchronize, we will carry on
                    // until the end of the sequence and let the later error
                    // handling code catch it.
                    $mUcs4 = ($in);
                    $mUcs4 = ($mUcs4 & 0x03) << 24;
                    $mState = 4;
                    $mBytes = 5;
                } elseif (0xFC == (0xFE & ($in))) {

Duplication
This code seems to be duplicated across your project.

                    // First octet of 6 octet sequence, see comments for 5
                    // octet sequence.
                    $mUcs4 = ($in);
                    $mUcs4 = ($mUcs4 & 1) << 30;
                    $mState = 5;
                    $mBytes = 6;
                } else {
                    // Current octet is neither in the US-ASCII range nor a
                    // legal first octet of a multi-octet sequence.
                    $mState = 0;
                    $mUcs4  = 0;
                    $mBytes = 1;
                    $char = '';
                }
            } else {
                // When mState is non-zero, we expect a continuation of the
                // multi-octet sequence
                if (0x80 == (0xC0 & ($in))) {
                    // Legal continuation.
                    $shift = ($mState - 1) * 6;
                    $tmp = $in;
                    $tmp = ($tmp & 0x0000003F) << $shift;
                    $mUcs4 |= $tmp;

                    if (0 == --$mState) {
                        // End of the multi-octet sequence. mUcs4 now contains
                        // the final Unicode codepoint to be output

                        // Check for illegal sequences and codepoints.

                        // From Unicode 3.1, non-shortest form is illegal
                        if (((2 == $mBytes) && ($mUcs4 < 0x0080)) ||

Unused Code
This if statement is empty and can be removed.

                            ((3 == $mBytes) && ($mUcs4 < 0x0800)) ||
                            ((4 == $mBytes) && ($mUcs4 < 0x10000)) ||
                            (4 < $mBytes) ||
                            // From Unicode 3.2, surrogate characters = illegal
                            (($mUcs4 & 0xFFFFF800) == 0xD800) ||
                            // Codepoints outside the Unicode range are illegal
                            ($mUcs4 > 0x10FFFF)
                        ) {

                        } elseif (0xFEFF != $mUcs4 && // omit BOM
                            // check for valid Char unicode codepoints
                            (
                                0x9 == $mUcs4 ||
                                0xA == $mUcs4 ||
                                0xD == $mUcs4 ||
                                (0x20 <= $mUcs4 && 0x7E >= $mUcs4) ||
                                // 7F-9F is not strictly prohibited by XML,
                                // but it is non-SGML, and thus we don't allow it
                                (0xA0 <= $mUcs4 && 0xD7FF >= $mUcs4) ||
                                // keep U+E000..U+FFFD, which the fast-path regexp also accepts
                                (0xE000 <= $mUcs4 && 0xFFFD >= $mUcs4) ||
                                (0x10000 <= $mUcs4 && 0x10FFFF >= $mUcs4)
                            )
                        ) {
                            $out .= $char;
                        }
                        // initialize UTF8 cache (reset)
                        $mState = 0;
                        $mUcs4  = 0;
                        $mBytes = 1;
                        $char = '';
                    }
                } else {
                    // ((0xC0 & (*in) != 0x80) && (mState != 0))

Unused Code, Comprehensibility
52% of this comment could be valid code. Did you maybe forget this after debugging?

Sometimes obsolete code just ends up commented out instead of removed. In this case it is better to remove the code once you have checked you do not need it.

The code might also have been commented out for debugging purposes. In this case it is vital that someone uncomments it again or your project may behave in very unexpected ways in production.

This check looks for comments that seem to be mostly valid code and reports them.

                    // Incomplete multi-octet sequence.
                    // used to result in complete fail, but we'll reset
                    $mState = 0;
                    $mUcs4  = 0;
                    $mBytes = 1;
                    $char = '';
                }
            }
        }
        return $out;
    }
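
The per-character checks that the fallback loop in cleanUTF8() above performs (lead-byte pattern, continuation bytes, shortest form, surrogates, upper limit) can be shown on a single character. This is a compact standalone sketch, not the library's state machine; the name decodeUtf8Char and the null return for invalid input are assumptions for illustration.

```php
<?php
// Hypothetical helper for illustration; not part of HTML Purifier.

/**
 * Decodes one UTF-8 encoded character into its code point.
 * Returns null for malformed, overlong, surrogate or out-of-range input.
 */
function decodeUtf8Char(string $bytes): ?int
{
    $n = strlen($bytes);
    if ($n < 1 || $n > 4) {
        return null; // 5 and 6 byte forms are never legal
    }
    $lead = ord($bytes[0]);
    if ($n === 1) {
        return $lead < 0x80 ? $lead : null;
    }
    // Per length: lead-byte mask, expected pattern, payload mask, smallest legal code point.
    $leadInfo = [
        2 => [0xE0, 0xC0, 0x1F, 0x80],
        3 => [0xF0, 0xE0, 0x0F, 0x800],
        4 => [0xF8, 0xF0, 0x07, 0x10000],
    ];
    [$mask, $pattern, $payload, $min] = $leadInfo[$n];
    if (($lead & $mask) !== $pattern) {
        return null;
    }
    $cp = $lead & $payload;
    for ($i = 1; $i < $n; $i++) {
        $b = ord($bytes[$i]);
        if (($b & 0xC0) !== 0x80) {
            return null; // not a continuation byte
        }
        $cp = ($cp << 6) | ($b & 0x3F);
    }
    // Reject non-shortest forms, surrogates and values beyond U+10FFFF.
    if ($cp < $min || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        return null;
    }
    return $cp;
}

// Usage: the euro sign E2 82 AC decodes to U+20AC.
$euro = decodeUtf8Char("\xE2\x82\xAC");
```

The original loop additionally filters non-SGML code points and the BOM after decoding; the sketch stops at well-formedness.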

    /**
     * Translates a Unicode codepoint into its corresponding UTF-8 character.
     * @note Based on Feyd's function at
     *       <http://forums.devnetwork.net/viewtopic.php?p=191404#191404>,
     *       which is in public domain.
     * @note While we're going to do code point parsing anyway, a good
     *       optimization would be to refuse to translate code points that
     *       are non-SGML characters.  However, this could lead to duplication.
     * @note This is very similar to the unichr function in
     *       maintenance/generate-entity-file.php (although this is superior,
     *       due to its sanity checks).
     */

    // +----------+----------+----------+----------+
    // | 33222222 | 22221111 | 111111   |          |
    // | 10987654 | 32109876 | 54321098 | 76543210 | bit
    // +----------+----------+----------+----------+
    // |          |          |          | 0xxxxxxx | 1 byte 0x00000000..0x0000007F
    // |          |          | 110yyyyy | 10xxxxxx | 2 byte 0x00000080..0x000007FF
    // |          | 1110zzzz | 10yyyyyy | 10xxxxxx | 3 byte 0x00000800..0x0000FFFF
    // | 11110www | 10wwzzzz | 10yyyyyy | 10xxxxxx | 4 byte 0x00010000..0x0010FFFF
    // +----------+----------+----------+----------+
    // | 00000000 | 00011111 | 11111111 | 11111111 | Theoretical upper limit of legal scalars: 2097151 (0x001FFFFF)
    // | 00000000 | 00010000 | 11111111 | 11111111 | Defined upper limit of legal scalar codes
    // +----------+----------+----------+----------+

Unused Code, Comprehensibility
Seven rows of the comment table above were flagged: the two bit-index rows (50% and 45%), the four byte-layout rows (41%, 40%, 40% and 39%) and the "Theoretical upper limit" row (37%). For each, that share of the comment could be valid code. Did you maybe forget this after debugging?

    public static function unichr($code)
    {
        if ($code > 1114111 or $code < 0 or

Comprehensibility, Best Practice
Using logical operators such as or instead of || is generally not recommended.

PHP has two types of connecting operators (logical operators, and boolean operators):

                 Logical Operators   Boolean Operators
AND - meaning    and                 &&
OR - meaning     or                  ||

The difference between these is their precedence, that is, the order in which they are evaluated relative to other operators such as assignment. In most cases, you would want to use a boolean operator like &&, or ||.

Let’s take a look at a few examples:

// Logical operators have lower precedence:
$f = false or true;

// is executed like this:
($f = false) or true;


// Boolean operators have higher precedence:
$f = false || true;

// is executed like this:
$f = (false || true);

Logical Operators are used for Control-Flow

One case where you explicitly want to use logical operators is for control-flow such as this:

$x === 5
    or die('$x must be 5.');

// Instead of
if ($x !== 5) {
    die('$x must be 5.');
}

Since die introduces problems of its own (for example, it makes our code hard to test and prevents any kind of more sophisticated error handling), you probably do not want to use this in real-world code. Unfortunately, in PHP versions before 8.0, logical operators cannot be combined with throw:

// The following is a parse error before PHP 8.0.
$x === 5
    or throw new RuntimeException('$x must be 5.');

These limitations lead to logical operators rarely being of use in current PHP code.

          ($code >= 55296 and $code <= 57343) ) {

Comprehensibility, Best Practice
Using logical operators such as and instead of && is generally not recommended.

            // bits are set outside the "valid" range as defined
            // by UNICODE 4.1.0
            return '';
        }

        $x = $y = $z = $w = 0;

Unused Code
$x is not used, you could remove the assignment.

This check looks for variable assignments that are either overwritten by other assignments or where the variable is not used subsequently.

$myVar = 'Value';
$higher = false;

if (rand(1, 6) > 3) {
    $higher = true;
} else {
    $higher = false;
}

Both the $myVar assignment in line 1 and the $higher assignment in line 2 are dead. The first because $myVar is never used and the second because $higher is always overwritten on every possible execution path.

        if ($code < 128) {
            // regular ASCII character
            $x = $code;
        } else {
            // set up bits for UTF-8
            $x = ($code & 63) | 128;
            if ($code < 2048) {
                $y = (($code & 2047) >> 6) | 192;
            } else {
                $y = (($code & 4032) >> 6) | 128;
                if ($code < 65536) {
                    $z = (($code >> 12) & 15) | 224;
                } else {
                    $z = (($code >> 12) & 63) | 128;
                    $w = (($code >> 18) & 7)  | 240;
                }
            }
        }
        // set up the actual character
        $ret = '';
        if ($w) {
            $ret .= chr($w);
        }
        if ($z) {
            $ret .= chr($z);
        }
        if ($y) {
            $ret .= chr($y);
        }
        $ret .= chr($x);

        return $ret;
    }
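
The bit layout table above unichr() can be checked with a worked example: U+20AC (the euro sign) falls in the 3-byte row, so its 16 bits are split 4/6/6 across the patterns 1110zzzz 10yyyyyy 10xxxxxx, giving E2 82 AC. The sketch below is a standalone variant that uses one branch per byte length and || / && instead of or / and; the name encodeCodePoint is hypothetical and the code is not the library's implementation.

```php
<?php
// Hypothetical helper for illustration; not part of HTML Purifier.

/**
 * Encodes a Unicode code point as UTF-8.
 * Returns '' for surrogates and for values outside 0..0x10FFFF.
 */
function encodeCodePoint(int $code): string
{
    if ($code < 0 || $code > 0x10FFFF || ($code >= 0xD800 && $code <= 0xDFFF)) {
        return '';
    }
    if ($code < 0x80) {
        return chr($code); // 0xxxxxxx
    }
    if ($code < 0x800) {
        // 110yyyyy 10xxxxxx
        return chr(0xC0 | ($code >> 6))
            . chr(0x80 | ($code & 0x3F));
    }
    if ($code < 0x10000) {
        // 1110zzzz 10yyyyyy 10xxxxxx
        return chr(0xE0 | ($code >> 12))
            . chr(0x80 | (($code >> 6) & 0x3F))
            . chr(0x80 | ($code & 0x3F));
    }
    // 11110www 10wwzzzz 10yyyyyy 10xxxxxx
    return chr(0xF0 | ($code >> 18))
        . chr(0x80 | (($code >> 12) & 0x3F))
        . chr(0x80 | (($code >> 6) & 0x3F))
        . chr(0x80 | ($code & 0x3F));
}

// Usage: U+20AC encodes to the three bytes E2 82 AC.
$euroHex = strtoupper(bin2hex(encodeCodePoint(0x20AC)));
```

Returning early per byte length avoids the $x, $y, $z, $w temporaries, including the initial assignment that the checker reports as unused.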

    /**
     * @return bool
     */
    public static function iconvAvailable()
    {
        static $iconv = null;
        if ($iconv === null) {
            $iconv = function_exists('iconv') && self::testIconvTruncateBug() != self::ICONV_UNUSABLE;
        }
        return $iconv;
    }

    /**
     * Convert a string to UTF-8 based on configuration.
     * @param string $str The string to convert
     * @param HTMLPurifier_Config $config
     * @param HTMLPurifier_Context $context
     * @return string
     */
    public static function convertToUTF8($str, $config, $context)

Unused Code
The parameter $context is not used and could be removed.

    {
        $encoding = $config->get('Core.Encoding');
        if ($encoding === 'utf-8') {
            return $str;
        }
        static $iconv = null;
        if ($iconv === null) {
            $iconv = self::iconvAvailable();
        }
        if ($iconv && !$config->get('Test.ForceNoIconv')) {
            // unaffected by bugs, since UTF-8 supports all characters
            $str = self::unsafeIconv($encoding, 'utf-8//IGNORE', $str);
            if ($str === false) {
                // $encoding is not a valid encoding
                trigger_error('Invalid encoding ' . $encoding, E_USER_ERROR);
                return '';
            }
            // If the string is bjorked by Shift_JIS or a similar encoding
            // that doesn't support all of ASCII, convert the naughty
            // characters to their true byte-wise ASCII/UTF-8 equivalents.
            $str = strtr($str, self::testEncodingSupportsASCII($encoding));
            return $str;
        } elseif ($encoding === 'iso-8859-1') {
            $str = utf8_encode($str);
            return $str;
        }
        $bug = HTMLPurifier_Encoder::testIconvTruncateBug();
        if ($bug == self::ICONV_OK) {
            trigger_error('Encoding not supported, please install iconv', E_USER_ERROR);
        } else {
            trigger_error(
                'You have a buggy version of iconv, see https://bugs.php.net/bug.php?id=48147 ' .
                'and http://sourceware.org/bugzilla/show_bug.cgi?id=13541',
                E_USER_ERROR
            );
        }
    }
410
411
    /**
412
     * Converts a string from UTF-8 based on configuration.
413
     * @param string $str The string to convert
414
     * @param HTMLPurifier_Config $config
415
     * @param HTMLPurifier_Context $context
416
     * @return string
417
     * @note Currently, this is a lossy conversion, with unexpressable
418
     *       characters being omitted.
419
     */
420
    public static function convertFromUTF8($str, $config, $context)
Unused Code: The parameter $context is not used and could be removed.

This check looks for parameters that have been defined for a function or method, but which are not used in the method body.
    {
        $encoding = $config->get('Core.Encoding');
        if ($escape = $config->get('Core.EscapeNonASCIICharacters')) {
            $str = self::convertToASCIIDumbLossless($str);
        }
        if ($encoding === 'utf-8') {
            return $str;
        }
        static $iconv = null;
        if ($iconv === null) {
            $iconv = self::iconvAvailable();
        }
        if ($iconv && !$config->get('Test.ForceNoIconv')) {
            // Undo our previous fix in convertToUTF8, otherwise iconv will barf
            $ascii_fix = self::testEncodingSupportsASCII($encoding);
            if (!$escape && !empty($ascii_fix)) {
                $clear_fix = array();
                foreach ($ascii_fix as $utf8 => $native) {
                    $clear_fix[$utf8] = '';
                }
                $str = strtr($str, $clear_fix);
            }
            $str = strtr($str, array_flip($ascii_fix));
            // Normal stuff
            $str = self::iconv('utf-8', $encoding . '//IGNORE', $str);
            return $str;
        } elseif ($encoding === 'iso-8859-1') {
            $str = utf8_decode($str);
            return $str;
        }
        trigger_error('Encoding not supported', E_USER_ERROR);
        // You might be tempted to assume that the ASCII representation
        // might be OK, however, this is *not* universally true over all
        // encodings.  So we take the conservative route here, rather
        // than forcibly turn on %Core.EscapeNonASCIICharacters
    }

    /**
     * Lossless (character-wise) conversion of HTML to ASCII
     * @param string $str UTF-8 string to be converted to ASCII
     * @return string ASCII encoded string with non-ASCII characters entity-ized
     * @warning Adapted from MediaWiki, claiming fair use: this is a common
     *       algorithm. If you disagree with this license fudgery,
     *       implement it yourself.
     * @note Uses decimal numeric entities since they are best supported.
     * @note This is a DUMB function: it has no concept of keeping
     *       character entities that the projected character encoding
     *       can allow. We could possibly implement a smart version
     *       but that would require it to also know which Unicode
     *       codepoints the charset supported (not an easy task).
     * @note Sort of like cleanUTF8(), but it assumes that $str is
     *       well-formed UTF-8
     */
    public static function convertToASCIIDumbLossless($str)
    {
        $bytesleft = 0; // continuation bytes still expected
        $result = '';
        $working = 0;   // code point being assembled
        $len = strlen($str);
        for ($i = 0; $i < $len; $i++) {
            $bytevalue = ord($str[$i]);
            if ($bytevalue <= 0x7F) { // 0xxx xxxx: ASCII, copied as-is
                $result .= chr($bytevalue);
                $bytesleft = 0;
            } elseif ($bytevalue <= 0xBF) { // 10xx xxxx: continuation byte
                $working = $working << 6;
                $working += ($bytevalue & 0x3F);
                $bytesleft--;
                if ($bytesleft <= 0) {
                    $result .= "&#" . $working . ";";
                }
            } elseif ($bytevalue <= 0xDF) { // 110x xxxx: lead byte of a 2-byte sequence
                $working = $bytevalue & 0x1F;
                $bytesleft = 1;
            } elseif ($bytevalue <= 0xEF) { // 1110 xxxx: lead byte of a 3-byte sequence
                $working = $bytevalue & 0x0F;
                $bytesleft = 2;
            } else { // 1111 0xxx: lead byte of a 4-byte sequence
                $working = $bytevalue & 0x07;
                $bytesleft = 3;
            }
        }
        return $result;
    }

    /** No bugs detected in iconv. */
    const ICONV_OK = 0;

    /** Iconv truncates output if converting from UTF-8 to another
     *  character set with //IGNORE, and a non-encodable character is found */
    const ICONV_TRUNCATES = 1;

    /** Iconv does not support //IGNORE, making it unusable for
     *  transcoding purposes */
    const ICONV_UNUSABLE = 2;

    /**
     * glibc iconv has a known bug where it doesn't handle the magic
     * //IGNORE stanza correctly.  In particular, rather than ignore
     * characters, it will return an EILSEQ after consuming some number
     * of characters, and expect you to restart iconv as if it were
     * an E2BIG.  Old versions of PHP did not respect the errno, and
     * returned the fragment, so as a result you would see iconv
     * mysteriously truncating output. We can work around this by
     * manually chopping our input into segments of about 8000
     * characters, as long as PHP ignores the error code.  If PHP starts
     * paying attention to the error code, iconv becomes unusable.
     *
     * @return int Error code indicating severity of bug.
     */
    public static function testIconvTruncateBug()
    {
        static $code = null;
        if ($code === null) {
            // better not use iconv, otherwise infinite loop!
            $r = self::unsafeIconv('utf-8', 'ascii//IGNORE', "\xCE\xB1" . str_repeat('a', 9000));
            if ($r === false) {
                $code = self::ICONV_UNUSABLE;
            } elseif (($c = strlen($r)) < 9000) {
                $code = self::ICONV_TRUNCATES;
            } elseif ($c > 9000) {
                trigger_error(
                    'Your copy of iconv is extremely buggy. Please notify HTML Purifier maintainers: ' .
                    'include your iconv version as per phpversion()',
                    E_USER_ERROR
                );
            } else {
                $code = self::ICONV_OK;
            }
        }
        return $code;
    }

    /**
     * This expensive function tests whether or not a given character
     * encoding supports ASCII. 7/8-bit encodings like Shift_JIS will
     * fail this test, and require special processing. Variable width
     * encodings shouldn't ever fail.
     *
     * @param string $encoding Encoding name to test, as per iconv format
     * @param bool $bypass Whether or not to bypass the precompiled arrays.
     * @return Array of UTF-8 characters to their corresponding ASCII,
     *      which can be used to "undo" any overzealous iconv action.
     */
    public static function testEncodingSupportsASCII($encoding, $bypass = false)
    {
        // All calls to iconv here are unsafe, proof by case analysis:
        // If ICONV_OK, no difference.
        // If ICONV_TRUNCATES, all calls involve one-character inputs,
        // so bug is not triggered.
        // If ICONV_UNUSABLE, this call is irrelevant
        static $encodings = array();
        if (!$bypass) {
            if (isset($encodings[$encoding])) {
                return $encodings[$encoding];
            }
            $lenc = strtolower($encoding);
            switch ($lenc) {
                case 'shift_jis':
                    return array("\xC2\xA5" => '\\', "\xE2\x80\xBE" => '~');
                case 'johab':
                    return array("\xE2\x82\xA9" => '\\');
            }
            if (strpos($lenc, 'iso-8859-') === 0) {
                return array();
            }
        }
        $ret = array();
        if (self::unsafeIconv('UTF-8', $encoding, 'a') === false) {
            return false;
Bug Best Practice: The return type of return false; (false) is incompatible with the return type documented by HTMLPurifier_Encoder::testEncodingSupportsASCII of type array.

If you return a value from a function or method, it should be a sub-type of the type that is given by the parent type, e.g. an interface or abstract method. This is more formally defined by the Liskov substitution principle, and guarantees that classes that depend on the parent type can use any instance of a child type interchangeably. This principle also belongs to the SOLID principles for object-oriented design.

Let’s take a look at an example:

class Author {
    private $name;

    public function __construct($name) {
        $this->name = $name;
    }

    public function getName() {
        return $this->name;
    }
}

abstract class Post {
    public function getAuthor() {
        return 'Johannes';
    }
}

class BlogPost extends Post {
    public function getAuthor() {
        return new Author('Johannes');
    }
}

class ForumPost extends Post { /* ... */ }

function my_function(Post $post) {
    echo strtoupper($post->getAuthor());
}

Our function my_function expects a Post object, and outputs the author of the post. The base class Post returns a simple string, and outputting a simple string will work just fine. However, the child class BlogPost, which is a sub-type of Post, instead returns an object and therefore violates the Liskov substitution principle. If a BlogPost were passed to my_function, PHP would not complain up front, but the call would ultimately fail when executing strtoupper in its body.
        }
        for ($i = 0x20; $i <= 0x7E; $i++) { // all printable ASCII chars
            $c = chr($i); // UTF-8 char
            $r = self::unsafeIconv('UTF-8', "$encoding//IGNORE", $c); // initial conversion
            if ($r === '' ||
                // This line is needed for iconv implementations that do not
                // omit characters that do not exist in the target character set
                ($r === $c && self::unsafeIconv($encoding, 'UTF-8//IGNORE', $r) !== $c)
            ) {
                // Reverse engineer: what's the UTF-8 equiv of this byte
                // sequence? This assumes that there's no variable width
                // encoding that doesn't support ASCII.
                $ret[self::unsafeIconv($encoding, 'UTF-8//IGNORE', $c)] = $c;
            }
        }
        $encodings[$encoding] = $ret;
        return $ret;
    }
}

// vim: et sw=4 sts=4
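
The byte-walking loop in convertToASCIIDumbLossless() can be tried in isolation. The following is a minimal standalone sketch of the same idea, not the library's code: the function name utf8_to_decimal_entities() is hypothetical, and like the method above it assumes the input is already well-formed UTF-8.

```php
<?php
// Hypothetical standalone helper for illustration; not part of HTML Purifier.
// Assumes $str is well-formed UTF-8 (no validation is performed).
function utf8_to_decimal_entities($str)
{
    $result = '';
    $working = 0;   // code point being assembled
    $bytesleft = 0; // continuation bytes still expected
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($str[$i]);
        if ($byte <= 0x7F) {        // 0xxx xxxx: ASCII, copied as-is
            $result .= chr($byte);
            $bytesleft = 0;
        } elseif ($byte <= 0xBF) {  // 10xx xxxx: continuation byte
            $working = ($working << 6) | ($byte & 0x3F);
            if (--$bytesleft <= 0) {
                // Last byte of the sequence: emit a decimal entity
                $result .= '&#' . $working . ';';
            }
        } elseif ($byte <= 0xDF) {  // 110x xxxx: 2-byte lead
            $working = $byte & 0x1F;
            $bytesleft = 1;
        } elseif ($byte <= 0xEF) {  // 1110 xxxx: 3-byte lead
            $working = $byte & 0x0F;
            $bytesleft = 2;
        } else {                    // 1111 0xxx: 4-byte lead
            $working = $byte & 0x07;
            $bytesleft = 3;
        }
    }
    return $result;
}

// "café €" as raw UTF-8 bytes
echo utf8_to_decimal_entities("caf\xC3\xA9 \xE2\x82\xAC"), "\n"; // prints "caf&#233; &#8364;"
```

Because every non-ASCII character becomes an entity regardless of the target encoding, the output is pure ASCII, which is why the docblock calls the function "dumb" but lossless.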
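
The docblock of testIconvTruncateBug() mentions working around the truncation bug by chopping the input into segments of about 8000 characters. The wrapper that does this is not part of this section, so the sketch below is only one possible way to split a string safely, under stated assumptions: split_utf8_chunks() is a hypothetical name, the limit is counted in bytes rather than characters, and the limit is assumed to be at least 4 so that a valid character always fits in one chunk.

```php
<?php
// Hypothetical helper for illustration; not HTML Purifier's implementation.
// Splits $str into pieces of at most $maxBytes bytes without cutting a
// multi-byte UTF-8 character in half. Assumes $maxBytes >= 4.
function split_utf8_chunks($str, $maxBytes = 8000)
{
    $chunks = array();
    $len = strlen($str);
    $pos = 0;
    while ($pos < $len) {
        $end = min($pos + $maxBytes, $len);
        // If the cut would land on a continuation byte (10xx xxxx),
        // back up until it sits on the start of a character.
        while ($end < $len && $end > $pos && (ord($str[$end]) & 0xC0) === 0x80) {
            $end--;
        }
        if ($end === $pos) {
            // Malformed input (nothing but continuation bytes): cut anyway
            // so the loop always makes progress.
            $end = min($pos + $maxBytes, $len);
        }
        $chunks[] = substr($str, $pos, $end - $pos);
        $pos = $end;
    }
    return $chunks;
}

// Tiny limit just to make the boundaries visible
$parts = split_utf8_chunks("ab\xCE\xB1cd\xE2\x82\xACef", 4);
echo count($parts), "\n"; // prints "4"
```

Each chunk could then be converted separately and the results concatenated, which is the general shape of the workaround the docblock describes.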