Completed
Push — master ( 7cd0ea...1cc8fa )
by Rob
02:09
created

Utf8::isUtf8()   F

Complexity

Conditions 26
Paths 29

Size

Total Lines 133

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 0
CRAP Score 702

Importance

Changes 0
Metric Value
dl 0
loc 133
ccs 0
cts 65
cp 0
rs 3.3333
c 0
b 0
f 0
cc 26
nc 29
nop 1
crap 702

How to fix   Long Method    Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include:
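Extract Method is one such refactoring. The sketch below is a hypothetical illustration, not code from the analysed class: the `InvoiceCalculator` class and its method names are assumptions made up for the example. It shows a commented block being moved into its own small method, with the comment supplying the name.

```php
<?php

// Hypothetical example (not taken from the analysed class): a commented
// block inside a longer method becomes its own small, named method.
final class InvoiceCalculator
{
    /** @param float[] $prices */
    public function total(array $prices, float $taxRate): float
    {
        // Before the refactoring, the summing loop lived here under a
        // "sum all prices" comment. The comment became the method name.
        $net = $this->sumPrices($prices);

        return round($net * (1 + $taxRate), 2);
    }

    /** @param float[] $prices */
    private function sumPrices(array $prices): float
    {
        $sum = 0.0;
        foreach ($prices as $price) {
            $sum += $price;
        }

        return $sum;
    }
}
```

The caller now reads as a short summary of the steps, and the extracted method can be tested and renamed independently.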

<?php

namespace devtoolboxuk\soteria\voku\Resources;

class Utf8 extends Resources
Coding Style introduced by
The property $BROKEN_UTF8_FIX is not named in camelCase.

This check marks property names that have not been written in camelCase.

In camelCase, names are written without any punctuation, with the start of each new word marked by a capital letter. Thus the name "database connection string" becomes databaseConnectionString.

Coding Style introduced by
The property $WIN1252_TO_UTF8 is not named in camelCase.

Coding Style introduced by
The property $BIDI_UNI_CODE_CONTROLS_TABLE is not named in camelCase.

Coding Style introduced by
The property $WHITESPACE_TABLE is not named in camelCase.

Complexity introduced by
This class has 1976 lines of code which exceeds the configured maximum of 1000.

Really long classes often contain too much logic and violate the single responsibility principle.

We suggest taking a look at the “Code” section for options on how to refactor this code.

Complexity introduced by
This class has a complexity of 347 which exceeds the configured maximum of 50.

The class complexity is the sum of the complexity of all methods. A very high value usually indicates that your class does not follow the single responsibility principle and does more than one job.

Some resources for further reading:

You can also find more detailed suggestions for refactoring in the “Code” section of your repository.

{

    private $system;
    private $ENCODINGS;
    private $SUPPORT = [];
    private $BROKEN_UTF8_FIX;
    private $ORD;
    private $CHR;
    private $WIN1252_TO_UTF8;
    private $BOM = [
        "\xef\xbb\xbf" => 3, // UTF-8 BOM
        'ï»¿' => 6, // UTF-8 BOM as "WINDOWS-1252" (one char has [maybe] more than one byte ...)
        "\x00\x00\xfe\xff" => 4, // UTF-32 (BE) BOM
        '  þÿ' => 6, // UTF-32 (BE) BOM as "WINDOWS-1252"
        "\xff\xfe\x00\x00" => 4, // UTF-32 (LE) BOM
        'ÿþ  ' => 6, // UTF-32 (LE) BOM as "WINDOWS-1252"
        "\xfe\xff" => 2, // UTF-16 (BE) BOM
        'þÿ' => 4, // UTF-16 (BE) BOM as "WINDOWS-1252"
        "\xff\xfe" => 2, // UTF-16 (LE) BOM
        'ÿþ' => 4, // UTF-16 (LE) BOM as "WINDOWS-1252"
    ];

    private $BIDI_UNI_CODE_CONTROLS_TABLE = [
        // LEFT-TO-RIGHT EMBEDDING (use -> dir = "ltr")
        8234 => "\xE2\x80\xAA",
        // RIGHT-TO-LEFT EMBEDDING (use -> dir = "rtl")
        8235 => "\xE2\x80\xAB",
        // POP DIRECTIONAL FORMATTING // (use -> </bdo>)
        8236 => "\xE2\x80\xAC",
        // LEFT-TO-RIGHT OVERRIDE // (use -> <bdo dir = "ltr">)
        8237 => "\xE2\x80\xAD",
        // RIGHT-TO-LEFT OVERRIDE // (use -> <bdo dir = "rtl">)
        8238 => "\xE2\x80\xAE",
        // LEFT-TO-RIGHT ISOLATE // (use -> dir = "ltr")
        8294 => "\xE2\x81\xA6",
        // RIGHT-TO-LEFT ISOLATE // (use -> dir = "rtl")
        8295 => "\xE2\x81\xA7",
        // FIRST STRONG ISOLATE // (use -> dir = "auto")
        8296 => "\xE2\x81\xA8",
        // POP DIRECTIONAL ISOLATE
        8297 => "\xE2\x81\xA9",
    ];
    private $WHITESPACE = [
Unused Code introduced by
The property $WHITESPACE is not used and could be removed.

This check marks private properties in classes that are never used. Those properties can be removed.

        // NUL Byte
        0 => "\x0",
        // Tab
        9 => "\x9",
        // New Line
        10 => "\xa",
        // Vertical Tab
        11 => "\xb",
        // Carriage Return
        13 => "\xd",
        // Ordinary Space
        32 => "\x20",
        // NO-BREAK SPACE
        160 => "\xc2\xa0",
        // OGHAM SPACE MARK
        5760 => "\xe1\x9a\x80",
        // MONGOLIAN VOWEL SEPARATOR
        6158 => "\xe1\xa0\x8e",
        // EN QUAD
        8192 => "\xe2\x80\x80",
        // EM QUAD
        8193 => "\xe2\x80\x81",
        // EN SPACE
        8194 => "\xe2\x80\x82",
        // EM SPACE
        8195 => "\xe2\x80\x83",
        // THREE-PER-EM SPACE
        8196 => "\xe2\x80\x84",
        // FOUR-PER-EM SPACE
        8197 => "\xe2\x80\x85",
        // SIX-PER-EM SPACE
        8198 => "\xe2\x80\x86",
        // FIGURE SPACE
        8199 => "\xe2\x80\x87",
        // PUNCTUATION SPACE
        8200 => "\xe2\x80\x88",
        // THIN SPACE
        8201 => "\xe2\x80\x89",
        // HAIR SPACE
        8202 => "\xe2\x80\x8a",
        // LINE SEPARATOR
        8232 => "\xe2\x80\xa8",
        // PARAGRAPH SEPARATOR
        8233 => "\xe2\x80\xa9",
        // NARROW NO-BREAK SPACE
        8239 => "\xe2\x80\xaf",
        // MEDIUM MATHEMATICAL SPACE
        8287 => "\xe2\x81\x9f",
        // IDEOGRAPHIC SPACE
        12288 => "\xe3\x80\x80",
    ];
    /**
     * @var array
     */
    private $WHITESPACE_TABLE = [
        'SPACE' => "\x20",
        'NO-BREAK SPACE' => "\xc2\xa0",
        'OGHAM SPACE MARK' => "\xe1\x9a\x80",
        'EN QUAD' => "\xe2\x80\x80",
        'EM QUAD' => "\xe2\x80\x81",
        'EN SPACE' => "\xe2\x80\x82",
        'EM SPACE' => "\xe2\x80\x83",
        'THREE-PER-EM SPACE' => "\xe2\x80\x84",
        'FOUR-PER-EM SPACE' => "\xe2\x80\x85",
        'SIX-PER-EM SPACE' => "\xe2\x80\x86",
        'FIGURE SPACE' => "\xe2\x80\x87",
        'PUNCTUATION SPACE' => "\xe2\x80\x88",
        'THIN SPACE' => "\xe2\x80\x89",
        'HAIR SPACE' => "\xe2\x80\x8a",
        'LINE SEPARATOR' => "\xe2\x80\xa8",
        'PARAGRAPH SEPARATOR' => "\xe2\x80\xa9",
        'ZERO WIDTH SPACE' => "\xe2\x80\x8b",
        'NARROW NO-BREAK SPACE' => "\xe2\x80\xaf",
        'MEDIUM MATHEMATICAL SPACE' => "\xe2\x81\x9f",
        'IDEOGRAPHIC SPACE' => "\xe3\x80\x80",
    ];

    function __construct()
Best Practice introduced by
It is generally recommended to explicitly declare the visibility for methods.

Adding explicit visibility (private, protected, or public) is generally recommended to communicate to other developers how, and from where, this method is intended to be used.

Comprehensibility Best Practice introduced by
It is recommended to declare an explicit visibility for __construct.

Generally, we recommend declaring visibility for all methods in your source code. This has the advantage of clearly communicating to other developers, and also yourself, how this method should be consumed.

If you are not sure which visibility to choose, it is a good idea to start with the most restrictive visibility, and then raise visibility as needed, i.e. start with private, and only raise it to protected if a sub-class needs to have access, or public if an external class needs access.
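As a minimal sketch of this advice, consider the hypothetical `Counter` class below (an assumption for illustration, not part of the analysed code). In PHP a method declared without a modifier is public, so adding `public` to a constructor does not change behaviour; it only states the intent explicitly.

```php
<?php

// Hypothetical class for illustration only.
class Counter
{
    private $count = 0;

    // Explicitly public: part of the API that callers are meant to use.
    public function __construct(int $start = 0)
    {
        $this->count = $start;
    }

    public function increment(): int
    {
        $this->bump();

        return $this->count;
    }

    // Starts out private; raise to protected only if a subclass needs it.
    private function bump(): void
    {
        ++$this->count;
    }
}
```

Starting with `private` keeps the public surface small, and widening visibility later is a smaller change than narrowing it.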

    {
        $this->system = new System();
        $this->checkForSupport();
    }

    private function checkForSupport()
Complexity introduced by
This operation has 13 execution paths which exceeds the configured maximum of 10.

A high number of execution paths generally suggests many nested conditional statements and makes the code less readable. This can usually be fixed by splitting the method into several smaller methods.

You can also find more information in the “Code” section of your repository.
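To illustrate the idea of splitting, here is a hedged sketch using hypothetical functions (none of these names come from the analysed class). Three independent conditions written as sequential `if` statements in one function would combine to roughly 2 * 2 * 2 = 8 paths; moving each decision into its own helper keeps every function small.

```php
<?php

// Hypothetical helpers: each one makes a single decision.
function roleLabel(array $user): string
{
    return !empty($user['admin']) ? 'admin' : 'member';
}

function statusLabel(array $user): string
{
    return !empty($user['active']) ? 'active' : 'inactive';
}

function verifiedLabel(array $user): string
{
    return !empty($user['verified']) ? 'verified' : 'unverified';
}

// Straight-line code: the branching now lives in the helpers, so no single
// function carries all of the combined paths.
function describeUser(array $user): string
{
    return implode(', ', [roleLabel($user), statusLabel($user), verifiedLabel($user)]);
}
```

How a given analyser counts paths for ternaries may differ; the point is only that each unit can now be read and tested on its own.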

    {
        if (!isset($this->SUPPORT['already_checked_via_portable_utf8'])) {
            $this->SUPPORT['already_checked_via_portable_utf8'] = true;

            // http://php.net/manual/en/book.mbstring.php
            $this->SUPPORT['mbstring'] = $this->system->mbstring_loaded();
            $this->SUPPORT['mbstring_func_overload'] = $this->system->mbstring_overloaded();
            if ($this->SUPPORT['mbstring'] === true) {
                \mb_internal_encoding('UTF-8');
                /** @noinspection UnusedFunctionResultInspection */
                /** @noinspection PhpComposerExtensionStubsInspection */
                \mb_regex_encoding('UTF-8');
                $this->SUPPORT['mbstring_internal_encoding'] = 'UTF-8';
            }

            // http://php.net/manual/en/book.iconv.php
            $this->SUPPORT['iconv'] = $this->system->iconv_loaded();

            // http://php.net/manual/en/book.intl.php
            $this->SUPPORT['intl'] = $this->system->intl_loaded();
            $this->SUPPORT['intl__transliterator_list_ids'] = [];

            if (
                $this->SUPPORT['intl'] === true
                &&
                \function_exists('transliterator_list_ids') === true
            ) {
                /** @noinspection PhpComposerExtensionStubsInspection */
                $this->SUPPORT['intl__transliterator_list_ids'] = \transliterator_list_ids();
            }

            // http://php.net/manual/en/class.intlchar.php
            $this->SUPPORT['intlChar'] = $this->system->intlChar_loaded();

            // http://php.net/manual/en/book.ctype.php
            $this->SUPPORT['ctype'] = $this->system->ctype_loaded();

            // http://php.net/manual/en/class.finfo.php
            $this->SUPPORT['finfo'] = $this->system->finfo_loaded();

            // http://php.net/manual/en/book.json.php
            $this->SUPPORT['json'] = $this->system->json_loaded();

            // http://php.net/manual/en/book.pcre.php
            $this->SUPPORT['pcre_utf8'] = $this->system->pcre_utf8_support();

            $this->SUPPORT['symfony_polyfill_used'] = $this->system->symfony_polyfill_used();
            if ($this->SUPPORT['symfony_polyfill_used'] === true) {
                \mb_internal_encoding('UTF-8');
                $this->SUPPORT['mbstring_internal_encoding'] = 'UTF-8';
            }
        }
    }

    public function rawurldecode($str, $multi_decode = true)
introduced by
The method rawurldecode has a boolean flag argument $multi_decode, which is a certain sign of a Single Responsibility Principle violation.
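One common response to a flag argument is to split the method into two entry points. The sketch below is a simplified, hypothetical illustration: the function names are assumptions, and it uses only PHP's built-in `rawurldecode()`, not the analysed method's full decoding pipeline.

```php
<?php

// Hypothetical functions for illustration; not the analysed class.

// Before: one function, two behaviours selected by a flag.
function decodeWithFlag(string $str, bool $multiDecode = true): string
{
    if ($multiDecode === false) {
        return rawurldecode($str);
    }

    do {
        $previous = $str;
        $str = rawurldecode($str);
    } while ($previous !== $str);

    return $str;
}

// After: two intention-revealing entry points, each doing one thing.
function decodeOnce(string $str): string
{
    return rawurldecode($str);
}

function decodeUntilStable(string $str): string
{
    do {
        $previous = $str;
        $str = decodeOnce($str);
    } while ($previous !== $str);

    return $str;
}
```

Call sites then read `decodeUntilStable($input)` instead of `decodeWithFlag($input, true)`, so the behaviour is visible without looking up what the boolean means.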
Coding Style Naming introduced by
The parameter $multi_decode is not named in camelCase.

This check marks parameter names that have not been written in camelCase.

Coding Style Naming introduced by
The variable $multi_decode is not named in camelCase.

This check marks variable names that have not been written in camelCase.

Coding Style Naming introduced by
The variable $str_compare is not named in camelCase.

Complexity introduced by
This operation has 60 execution paths which exceeds the configured maximum of 10.

    {
        if ($str === '') {
            return '';
        }

        if (strpos($str, '&') === false && strpos($str, '%') === false && strpos($str, '+') === false && strpos($str, '\u') === false) {
            return $this->fixSimpleUtf8($str);
        }

        $pattern = '/%u([0-9a-fA-F]{3,4})/';
        if (preg_match($pattern, $str)) {
            $str = (string)preg_replace($pattern, '&#x\\1;', rawurldecode($str));
        }

        $flags = \ENT_QUOTES | \ENT_HTML5;

        if ($multi_decode === true) {
            do {
                $str_compare = $str;

                /**
                 * @psalm-suppress PossiblyInvalidArgument
                 */
                $str = $this->fixSimpleUtf8(rawurldecode($this->htmlEntityDecode($this->toUtf8($str), $flags)));
            } while ($str_compare !== $str);
        }

        return $str;
    }

    private function fixSimpleUtf8($str)
Coding Style Naming introduced by
The variable $BROKEN_UTF8_TO_UTF8_KEYS_CACHE is not named in camelCase.

Coding Style Naming introduced by
The variable $BROKEN_UTF8_TO_UTF8_VALUES_CACHE is not named in camelCase.

    {
        if ($str === '') {
            return '';
        }

        static $BROKEN_UTF8_TO_UTF8_KEYS_CACHE = null;
Comprehensibility Naming introduced by
The variable name $BROKEN_UTF8_TO_UTF8_KEYS_CACHE exceeds the maximum configured length of 20.

Very long variable names usually make code harder to read. It is therefore recommended not to make variable names too verbose.

        static $BROKEN_UTF8_TO_UTF8_VALUES_CACHE = null;
Comprehensibility Naming introduced by
The variable name $BROKEN_UTF8_TO_UTF8_VALUES_CACHE exceeds the maximum configured length of 20.

        if ($BROKEN_UTF8_TO_UTF8_KEYS_CACHE === null) {
            if ($this->BROKEN_UTF8_FIX === null) {
                $this->BROKEN_UTF8_FIX = $this->getData('utf8_fix');
            }

            $BROKEN_UTF8_TO_UTF8_KEYS_CACHE = array_keys($this->BROKEN_UTF8_FIX);
            $BROKEN_UTF8_TO_UTF8_VALUES_CACHE = array_values($this->BROKEN_UTF8_FIX);
        }

        return str_replace($BROKEN_UTF8_TO_UTF8_KEYS_CACHE, $BROKEN_UTF8_TO_UTF8_VALUES_CACHE, $str);
    }

    private function getData($file)
    {
        return include __DIR__ . '/../Data/' . $file . '.php';
    }

    private function htmlEntityDecode($str, $flags = null, $encoding = 'UTF-8')
Complexity introduced by
This operation has 1440 execution paths which exceeds the configured maximum of 200.

Coding Style Naming introduced by
The variable $str_compare is not named in camelCase.

    {
        if (
            !isset($str[3]) // examples: &; || &x;
            ||
            strpos($str, '&') === false // no "&"
        ) {
            return $str;
        }

        if ($encoding !== 'UTF-8' && $encoding !== 'CP850') {
            $encoding = $this->normalize_encoding($encoding, 'UTF-8');
        }

        if ($flags === null) {
            $flags = \ENT_QUOTES | \ENT_HTML5;
        }

        if ($encoding !== 'UTF-8' && $encoding !== 'ISO-8859-1' && $encoding !== 'WINDOWS-1252' && $this->SUPPORT['mbstring'] === false) {
            trigger_error('UTF8::htmlEntityDecode() without mbstring cannot handle "' . $encoding . '" encoding', \E_USER_WARNING);
        }

        do {
            $str_compare = $str;

            // INFO: http://stackoverflow.com/questions/35854535/better-explanation-of-convmap-in-mb-encode-numericentity
            if ($this->SUPPORT['mbstring'] === true) {
                if ($encoding === 'UTF-8') {
                    $str = mb_decode_numericentity($str, [0x80, 0xfffff, 0, 0xfffff, 0]);
                } else {
Coding Style introduced by
The method htmlEntityDecode uses an else expression. Else is never necessary and you can simplify the code to work without else.
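The usual way to satisfy this check is an early return. The sketch below uses hypothetical functions (assumptions for illustration, not the analysed method) to show the same behaviour written with and without `else`. Whether `else` is truly "never necessary" is a style position taken by the checker rather than a language rule.

```php
<?php

// Hypothetical functions for illustration; not the analysed method.

// With else: both outcomes are assigned and returned at the end.
function signWithElse(int $n): string
{
    if ($n < 0) {
        $label = 'negative';
    } else {
        $label = 'non-negative';
    }

    return $label;
}

// With an early return: the same behaviour, no else branch.
function signWithEarlyReturn(int $n): string
{
    if ($n < 0) {
        return 'negative';
    }

    return 'non-negative';
}
```

Inside a loop body, such as the one flagged here, the equivalent move is typically to extract the branch into a helper that returns early.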
                    $str = mb_decode_numericentity($str, [0x80, 0xfffff, 0, 0xfffff, 0], $encoding);
                }
            } else {
Coding Style introduced by
The method htmlEntityDecode uses an else expression. Else is never necessary and you can simplify the code to work without else.
                $str = (string)preg_replace_callback(
                    "/&#\d{2,6};/",
                    /**
                     * @param string[] $matches
                     *
                     * @return string
                     */
                    static function ($matches) use ($encoding) {
                        $returnTmp = \mb_convert_encoding($matches[0], $encoding, 'HTML-ENTITIES');
                        if ($returnTmp !== '"' && $returnTmp !== "'") {
                            return $returnTmp;
                        }

                        return $matches[0];
                    },
                    $str
                );
            }

            if (strpos($str, '&') !== false) {
                if (strpos($str, '&#') !== false) {
                    // decode also numeric & UTF16 two byte entities
                    $str = (string)preg_replace('/(&#(?:x0*[0-9a-fA-F]{2,6}(?![0-9a-fA-F;])|(?:0*\d{2,6}(?![0-9;]))))/S', '$1;', $str);
                }

                $str = html_entity_decode($str, $flags, $encoding);
            }
        } while ($str_compare !== $str);

        return $str;
    }

    private function normalize_encoding($encoding, $fallback = '')
Complexity introduced by
This operation has 2592 execution paths which exceeds the configured maximum of 200.

Coding Style Naming introduced by
The method normalize_encoding is not named in camelCase.

This check marks method names that have not been written in camelCase.

Coding Style Naming introduced by
The variable $STATIC_NORMALIZE_ENCODING_CACHE is not named in camelCase.

Coding Style introduced by
Method name "Utf8::normalize_encoding" is not in camel caps format
    {
        static $STATIC_NORMALIZE_ENCODING_CACHE = [];
Comprehensibility Naming introduced by
The variable name $STATIC_NORMALIZE_ENCODING_CACHE exceeds the maximum configured length of 20.

        // init
        $encoding = (string)$encoding;

        if (!$encoding) {
            return $fallback;
        }

        if ($encoding === 'UTF-8' || $encoding === 'UTF8') {
            return 'UTF-8';
        }

        if ($encoding === '8BIT' || $encoding === 'BINARY') {
            return 'CP850';
        }

        if ($encoding === 'HTML' || $encoding === 'HTML-ENTITIES') {
            return 'HTML-ENTITIES';
        }

        if (
            $encoding === '1' // only a fallback, for non "strict_types" usage ...
            ||
            $encoding === '0' // only a fallback, for non "strict_types" usage ...
        ) {
            return $fallback;
        }

        if (isset($STATIC_NORMALIZE_ENCODING_CACHE[$encoding])) {
            return $STATIC_NORMALIZE_ENCODING_CACHE[$encoding];
        }

        if ($this->ENCODINGS === null) {
            $this->ENCODINGS = $this->getData('encodings');
        }

        if (in_array($encoding, $this->ENCODINGS, true)) {
            $STATIC_NORMALIZE_ENCODING_CACHE[$encoding] = $encoding;

            return $encoding;
        }

        $encodingOrig = $encoding;
        $encoding = strtoupper($encoding);
        $encodingUpperHelper = (string)preg_replace('/[^a-zA-Z0-9\s]/u', '', $encoding);

        $equivalences = [
            'ISO8859' => 'ISO-8859-1',
            'ISO88591' => 'ISO-8859-1',
            'ISO' => 'ISO-8859-1',
            'LATIN' => 'ISO-8859-1',
            'LATIN1' => 'ISO-8859-1', // Western European
            'ISO88592' => 'ISO-8859-2',
            'LATIN2' => 'ISO-8859-2', // Central European
            'ISO88593' => 'ISO-8859-3',
            'LATIN3' => 'ISO-8859-3', // Southern European
            'ISO88594' => 'ISO-8859-4',
            'LATIN4' => 'ISO-8859-4', // Northern European
            'ISO88595' => 'ISO-8859-5',
            'ISO88596' => 'ISO-8859-6', // Arabic
            'ISO88597' => 'ISO-8859-7', // Greek
            'ISO88598' => 'ISO-8859-8', // Hebrew
            'ISO88599' => 'ISO-8859-9',
            'LATIN5' => 'ISO-8859-9', // Turkish
            'ISO885911' => 'ISO-8859-11',
            'TIS620' => 'ISO-8859-11', // Thai
            'ISO885910' => 'ISO-8859-10',
            'LATIN6' => 'ISO-8859-10', // Nordic
            'ISO885913' => 'ISO-8859-13',
            'LATIN7' => 'ISO-8859-13', // Baltic
            'ISO885914' => 'ISO-8859-14',
            'LATIN8' => 'ISO-8859-14', // Celtic
            'ISO885915' => 'ISO-8859-15',
            'LATIN9' => 'ISO-8859-15', // Western European (with some extra chars e.g. €)
            'ISO885916' => 'ISO-8859-16',
            'LATIN10' => 'ISO-8859-16', // Southeast European
            'CP1250' => 'WINDOWS-1250',
            'WIN1250' => 'WINDOWS-1250',
            'WINDOWS1250' => 'WINDOWS-1250',
            'CP1251' => 'WINDOWS-1251',
            'WIN1251' => 'WINDOWS-1251',
            'WINDOWS1251' => 'WINDOWS-1251',
            'CP1252' => 'WINDOWS-1252',
            'WIN1252' => 'WINDOWS-1252',
            'WINDOWS1252' => 'WINDOWS-1252',
            'CP1253' => 'WINDOWS-1253',
            'WIN1253' => 'WINDOWS-1253',
            'WINDOWS1253' => 'WINDOWS-1253',
            'CP1254' => 'WINDOWS-1254',
            'WIN1254' => 'WINDOWS-1254',
            'WINDOWS1254' => 'WINDOWS-1254',
            'CP1255' => 'WINDOWS-1255',
            'WIN1255' => 'WINDOWS-1255',
            'WINDOWS1255' => 'WINDOWS-1255',
            'CP1256' => 'WINDOWS-1256',
            'WIN1256' => 'WINDOWS-1256',
            'WINDOWS1256' => 'WINDOWS-1256',
            'CP1257' => 'WINDOWS-1257',
            'WIN1257' => 'WINDOWS-1257',
            'WINDOWS1257' => 'WINDOWS-1257',
            'CP1258' => 'WINDOWS-1258',
            'WIN1258' => 'WINDOWS-1258',
            'WINDOWS1258' => 'WINDOWS-1258',
            'UTF16' => 'UTF-16',
            'UTF32' => 'UTF-32',
            'UTF8' => 'UTF-8',
            'UTF' => 'UTF-8',
            'UTF7' => 'UTF-7',
            '8BIT' => 'CP850',
            'BINARY' => 'CP850',
        ];

        if (!empty($equivalences[$encodingUpperHelper])) {
            $encoding = $equivalences[$encodingUpperHelper];
        }

        $STATIC_NORMALIZE_ENCODING_CACHE[$encodingOrig] = $encoding;

        return $encoding;
    }

    private function toUtf8($str, $decodeHtmlEntityToUtf8 = false)
introduced by
The method toUtf8 has a boolean flag argument $decodeHtmlEntityToUtf8, which is a certain sign of a Single Responsibility Principle violation.
Complexity introduced by
This operation has 196032 execution paths which exceeds the configured maximum of 200.

Comprehensibility Naming introduced by
The variable name $decodeHtmlEntityToUtf8 exceeds the maximum configured length of 20.

    {
        if (is_array($str) === true) {
            foreach ($str as $k => $v) {
                $str[$k] = $this->toUtf8($v, $decodeHtmlEntityToUtf8);
            }
            return $str;
        }

        $str = (string)$str;
        if ($str === '') {
            return $str;
        }

        $max = \strlen($str);
        $buf = '';

        for ($i = 0; $i < $max; ++$i) {
            $c1 = $str[$i];
Comprehensibility introduced by
Avoid variables with short names like $c1. Configured minimum length is 3.

Short variable names may make your code harder to understand. Variable names should be self-descriptive. This check looks for variable names that are shorter than a configured minimum.

            if ($c1 >= "\xC0") { // should be converted to UTF8, if it's not UTF8 already
                if ($c1 <= "\xDF") { // looks like 2 bytes UTF8
                    $c2 = $i + 1 >= $max ? "\x00" : $str[$i + 1];
Comprehensibility introduced by
Avoid variables with short names like $c2. Configured minimum length is 3.

                    if ($c2 >= "\x80" && $c2 <= "\xBF") { // yeah, almost sure it's UTF8 already
                        $buf .= $c1 . $c2;
                        ++$i;
                    } else { // not valid UTF8 - convert it
Coding Style introduced by
The method toUtf8 uses an else expression. Else is never necessary and you can simplify the code to work without else.
                        $buf .= $this->toUtf8ConvertHelper($c1);
                    }
                } elseif ($c1 >= "\xE0" && $c1 <= "\xEF") { // looks like 3 bytes UTF8
                    $c2 = $i + 1 >= $max ? "\x00" : $str[$i + 1];
                    $c3 = $i + 2 >= $max ? "\x00" : $str[$i + 2];
Comprehensibility introduced by
Avoid variables with short names like $c3. Configured minimum length is 3.

                    if ($c2 >= "\x80" && $c2 <= "\xBF" && $c3 >= "\x80" && $c3 <= "\xBF") { // yeah, almost sure it's UTF8 already
                        $buf .= $c1 . $c2 . $c3;
                        $i += 2;
                    } else { // not valid UTF8 - convert it
Coding Style introduced by
The method toUtf8 uses an else expression. Else is never necessary and you can simplify the code to work without else.
                        $buf .= $this->toUtf8ConvertHelper($c1);
                    }
                } elseif ($c1 >= "\xF0" && $c1 <= "\xF7") { // looks like 4 bytes UTF8
                    $c2 = $i + 1 >= $max ? "\x00" : $str[$i + 1];
                    $c3 = $i + 2 >= $max ? "\x00" : $str[$i + 2];
                    $c4 = $i + 3 >= $max ? "\x00" : $str[$i + 3];
Comprehensibility introduced by
Avoid variables with short names like $c4. Configured minimum length is 3.

Short variable names may make your code harder to understand. Variable names should be self-descriptive. This check looks for variable names who are shorter than a configured minimum.

Loading history...
484
485
                    if ($c2 >= "\x80" && $c2 <= "\xBF" && $c3 >= "\x80" && $c3 <= "\xBF" && $c4 >= "\x80" && $c4 <= "\xBF") { // yeah, almost sure it's UTF8 already
486
                        $buf .= $c1 . $c2 . $c3 . $c4;
487
                        $i += 3;
488
                    } else { // not valid UTF8 - convert it
Coding Style
The method toUtf8 uses an else expression. Else is never necessary and you can simplify the code to work without else.
489
                        $buf .= $this->toUtf8ConvertHelper($c1);
490
                    }
491
                } else { // doesn't look like UTF8, but should be converted
Coding Style
The method toUtf8 uses an else expression. Else is never necessary and you can simplify the code to work without else.
492
493
                    $buf .= $this->toUtf8ConvertHelper($c1);
494
                }
495 6
            } elseif (($c1 & "\xC0") === "\x80") { // needs conversion
496
497
                $buf .= $this->toUtf8ConvertHelper($c1);
498
            } else { // it doesn't need conversion
Coding Style
The method toUtf8 uses an else expression. Else is never necessary and you can simplify the code to work without else.
499
500 6
                $buf .= $c1;
501
            }
502
        }
503
504
        // decode unicode escape sequences + unicode surrogate pairs
505 6
        $buf = preg_replace_callback(
506 6
            '/\\\\u([dD][89abAB][0-9a-fA-F]{2})\\\\u([dD][cdefCDEF][\da-fA-F]{2})|\\\\u([0-9a-fA-F]{4})/',
507
            /**
508
             * @param array $matches
509
             *
510
             * @return string
511
             */
512
            function (array $matches) {
513 1
                if (isset($matches[3])) {
514 1
                    $cp = (int)hexdec($matches[3]);
Comprehensibility
Avoid variables with short names like $cp. Configured minimum length is 3.
515
                } else {
Coding Style
The method toUtf8 uses an else expression. Else is never necessary and you can simplify the code to work without else.
516
                    // http://unicode.org/faq/utf_bom.html#utf16-4
517
                    $cp = ((int)hexdec($matches[1]) << 10)
518
                        + (int)hexdec($matches[2])
519
                        + 0x10000
520
                        - (0xD800 << 10)
521
                        - 0xDC00;
522
                }
523
524
                // https://github.com/php/php-src/blob/php-7.3.2/ext/standard/html.c#L471
525
                //
526
                // php_utf32_utf8(unsigned char *buf, unsigned k)
527
528 1
                if ($cp < 0x80) {
529 1
                    return (string)$this->chr($cp);
530
                }
531
532
                if ($cp < 0xA0) {
533
                    /** @noinspection UnnecessaryCastingInspection */
534
                    return (string)$this->chr(0xC0 | $cp >> 6) . (string)$this->chr(0x80 | $cp & 0x3F);
535
                }
536
537
                return $this->decimalToChr($cp);
538 6
            },
539 6
            $buf
540
        );
541
542 6
        if ($buf === null) {
543
            return '';
544
        }
545
546
        // decode UTF-8 codepoints
547 6
        if ($decodeHtmlEntityToUtf8 === true) {
548
            $buf = $this->htmlEntityDecode($buf);
549
        }
550
551 6
        return $buf;
552
    }
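
The surrogate-pair arithmetic used in the callback above can be checked in isolation. The following is a minimal sketch, not code from the reviewed class: `surrogatePairToCodePoint` is a hypothetical helper name, and the formula is the one from the Unicode FAQ linked in the source, rearranged so that each half is offset before it is shifted.

```php
<?php

/**
 * Hypothetical helper (not part of the reviewed Utf8 class).
 * Combines a UTF-16 high and low surrogate into one Unicode code point.
 */
function surrogatePairToCodePoint(int $high, int $low): int
{
    // High surrogates live in D800..DBFF, low surrogates in DC00..DFFF.
    if ($high < 0xD800 || $high > 0xDBFF || $low < 0xDC00 || $low > 0xDFFF) {
        throw new InvalidArgumentException('Not a valid surrogate pair');
    }

    // Same result as ($high << 10) + $low + 0x10000 - (0xD800 << 10) - 0xDC00.
    return (($high - 0xD800) << 10) + ($low - 0xDC00) + 0x10000;
}

// The escape sequence \uD83D\uDE00 encodes U+1F600.
printf("U+%X\n", surrogatePairToCodePoint(0xD83D, 0xDE00)); // prints "U+1F600"
```

Validating the two ranges up front mirrors what the regular expression in the source does with its `[dD][89abAB]` and `[dD][cdefCDEF]` prefixes.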
553
554
    private function toUtf8ConvertHelper($input)
Complexity
This operation has 16 execution paths which exceeds the configured maximum of 10.

A high number of execution paths generally suggests many nested conditional statements and makes the code less readable. This can usually be fixed by splitting the method into several smaller methods.
555
    {
556
        // init
557
        $buf = '';
558
559
        if ($this->ORD === null) {
560
            $this->ORD = $this->getData('ord');
561
        }
562
563
        if ($this->CHR === null) {
564
            $this->CHR = $this->getData('chr');
565
        }
566
567
        if ($this->WIN1252_TO_UTF8 === null) {
568
            $this->WIN1252_TO_UTF8 = $this->getData('win1252_to_utf8');
569
        }
570
571
        $ordC1 = $this->ORD[$input];
572
        if (isset($this->WIN1252_TO_UTF8[$ordC1])) { // found in Windows-1252 special cases
573
            $buf .= $this->WIN1252_TO_UTF8[$ordC1];
574
        } else {
Coding Style
The method toUtf8ConvertHelper uses an else expression. Else is never necessary and you can simplify the code to work without else.
575
            $cc1 = $this->CHR[$ordC1 >> 6] | "\xC0"; // integer shift: avoids a float array key from "/ 64"
576
            $cc2 = ((string)$input & "\x3F") | "\x80";
577
            $buf .= $cc1 . $cc2;
578
        }
579
580
        return $buf;
581
    }
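
For bytes that are not in the Windows-1252 special-case table, the helper above spreads the eight bits of the input over a two-byte UTF-8 sequence. The sketch below shows that bit layout with PHP's built-in `ord()` and `chr()` instead of the class lookup tables. `latin1ByteToUtf8` is a hypothetical name, and the sketch deliberately ignores the `WIN1252_TO_UTF8` table, so it treats every high byte as ISO-8859-1.

```php
<?php

/**
 * Hypothetical helper (not part of the reviewed Utf8 class).
 * Converts a single ISO-8859-1 byte to its UTF-8 representation.
 */
function latin1ByteToUtf8(string $byte): string
{
    $ord = ord($byte[0]);

    // ASCII bytes are identical in ISO-8859-1 and UTF-8.
    if ($ord < 0x80) {
        return $byte[0];
    }

    // Lead byte:         110000xx, carrying the top two bits of the input.
    // Continuation byte: 10xxxxxx, carrying the low six bits of the input.
    return chr(0xC0 | ($ord >> 6)) . chr(0x80 | ($ord & 0x3F));
}

// 0xE9 is "é" in ISO-8859-1; in UTF-8 it is the two bytes C3 A9.
echo bin2hex(latin1ByteToUtf8("\xE9")), "\n"; // prints "c3a9"
```

Returning early for the ASCII case also shows how the else branch flagged above can be avoided.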
582
583 1
    private function chr($code_point, $encoding = 'UTF-8')
Complexity
This operation has 7200 execution paths which exceeds the configured maximum of 200.

A high number of execution paths generally suggests many nested conditional statements and makes the code less readable. This can usually be fixed by splitting the method into several smaller methods.

Coding Style Naming
The parameter $code_point is not named in camelCase.
The variable $code_point is not named in camelCase.
The variable $CHAR_CACHE is not named in camelCase.

In camelCase names are written without any punctuation, the start of each new word being marked by a capital letter. Thus the name database connection string becomes databaseConnectionString.
584
    {
585
        // init
586 1
        static $CHAR_CACHE = [];
587
588 1
        if ($encoding !== 'UTF-8' && $encoding !== 'CP850') {
589
            $encoding = $this->normalize_encoding($encoding, 'UTF-8');
590
        }
591
592 1
        if ($encoding !== 'UTF-8' && $encoding !== 'ISO-8859-1' && $encoding !== 'WINDOWS-1252' && $this->SUPPORT['mbstring'] === false) {
593
            trigger_error('UTF8::chr() without mbstring cannot handle "' . $encoding . '" encoding', \E_USER_WARNING);
594
        }
595
596 1
        $cacheKey = $code_point . $encoding;
597 1
        if (isset($CHAR_CACHE[$cacheKey]) === true) {
598
            return $CHAR_CACHE[$cacheKey];
599
        }
600
601 1
        if ($code_point <= 127) { // use "simple"-char only until "\x80"
602
603 1
            if ($this->CHR === null) {
604 1
                $this->CHR = (array)$this->getData('chr');
605
            }
606
607
            /**
608
             * @psalm-suppress PossiblyNullArrayAccess
609
             */
610 1
            $chr = $this->CHR[$code_point];
611
612 1
            if ($encoding !== 'UTF-8') {
613
                $chr = $this->encode($encoding, $chr);
614
            }
615
616 1
            return $CHAR_CACHE[$cacheKey] = $chr;
617
        }
618
619
        //
620
        // fallback via "IntlChar"
621
        //
622
623
        if ($this->SUPPORT['intlChar'] === true) {
624
            /** @noinspection PhpComposerExtensionStubsInspection */
625
            $chr = \IntlChar::chr($code_point);
626
627
            if ($encoding !== 'UTF-8') {
628
                $chr = $this->encode($encoding, $chr);
629
            }
630
631
            return $CHAR_CACHE[$cacheKey] = $chr;
632
        }
633
634
        //
635
        // fallback via vanilla php
636
        //
637
638
        if ($this->CHR === null) {
639
            $this->CHR = (array)$this->getData('chr');
640
        }
641
642
        $code_point = (int)$code_point;
643
        if ($code_point <= 0x7F) {
644
            /**
645
             * @psalm-suppress PossiblyNullArrayAccess
646
             */
647
            $chr = $this->CHR[$code_point];
648
        } elseif ($code_point <= 0x7FF) {
649
            /**
650
             * @psalm-suppress PossiblyNullArrayAccess
651
             */
652
            $chr = $this->CHR[($code_point >> 6) + 0xC0] .
653
                $this->CHR[($code_point & 0x3F) + 0x80];
654
        } elseif ($code_point <= 0xFFFF) {
655
            /**
656
             * @psalm-suppress PossiblyNullArrayAccess
657
             */
658
            $chr = $this->CHR[($code_point >> 12) + 0xE0] .
659
                $this->CHR[(($code_point >> 6) & 0x3F) + 0x80] .
660
                $this->CHR[($code_point & 0x3F) + 0x80];
661
        } else {
Coding Style
The method chr uses an else expression. Else is never necessary and you can simplify the code to work without else.
662
            /**
663
             * @psalm-suppress PossiblyNullArrayAccess
664
             */
665
            $chr = $this->CHR[($code_point >> 18) + 0xF0] .
666
                $this->CHR[(($code_point >> 12) & 0x3F) + 0x80] .
667
                $this->CHR[(($code_point >> 6) & 0x3F) + 0x80] .
668
                $this->CHR[($code_point & 0x3F) + 0x80];
669
        }
670
671
        if ($encoding !== 'UTF-8') {
672
            $chr = $this->encode($encoding, $chr);
673
        }
674
675
        return $CHAR_CACHE[$cacheKey] = $chr;
676
    }
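
The "vanilla PHP" fallback in chr() above is the standard UTF-8 encoding of a code point into one to four bytes. A compact sketch of the same branching, written with early returns instead of an if/elseif/else chain, might look as follows. `codePointToUtf8` is a hypothetical name, and the range check is an addition that the reviewed method does not perform.

```php
<?php

/**
 * Hypothetical helper (not part of the reviewed Utf8 class).
 * Encodes a Unicode code point as a UTF-8 byte string.
 */
function codePointToUtf8(int $cp): string
{
    // Assumption: reject values outside Unicode and lone surrogates.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Invalid code point');
    }

    if ($cp <= 0x7F) {
        return chr($cp);
    }

    if ($cp <= 0x7FF) {
        return chr(0xC0 | ($cp >> 6))
            . chr(0x80 | ($cp & 0x3F));
    }

    if ($cp <= 0xFFFF) {
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }

    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codePointToUtf8(0x20AC)), "\n"; // prints "e282ac" (the euro sign)
```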
677
678
    private function encode($toEncoding, $str)
Complexity
This operation has 540 execution paths which exceeds the configured maximum of 200.
679
    {
680
        if ($str === '' || $toEncoding === '') {
681
            return $str;
682
        }
683
684
        if ($toEncoding !== 'UTF-8' && $toEncoding !== 'CP850') {
685
            $toEncoding = $this->normalize_encoding($toEncoding, 'UTF-8');
686
        }
687
688
//        if ($fromEncoding && $fromEncoding !== 'UTF-8' && $fromEncoding !== 'CP850') {
689
//            $fromEncoding = $this->normalize_encoding($fromEncoding, null);
690
//        }
691
692
//        if ($toEncoding && $fromEncoding && $fromEncoding === $toEncoding) {
693
//            return $str;
694
//        }
695
696
        if ($toEncoding === 'JSON') {
697
            $return = $this->json_encode($str);
698
            if ($return === false) {
699
                throw new \InvalidArgumentException('The input string [' . $str . '] can not be used for json_encode().');
700
            }
701
702
            return $return;
703
        }
704
//        if ($fromEncoding === 'JSON') {
705
//            $str = $this->json_decode($str);
706
//            $fromEncoding = '';
707
//        }
708
709
        if ($toEncoding === 'BASE64') {
710
            return base64_encode($str);
711
        }
712
//        if ($fromEncoding === 'BASE64') {
713
//            $str = base64_decode($str, true);
714
//            $fromEncoding = '';
715
//        }
716
717
        if ($toEncoding === 'HTML-ENTITIES') {
718
            return $this->htmlEncode($str, true, 'UTF-8');
719
        }
720
//        if ($fromEncoding === 'HTML-ENTITIES') {
721
//            $str = $this->html_decode($str, \ENT_COMPAT, 'UTF-8');
722
//            $fromEncoding = '';
723
//        }
724
725
        $fromEncodingDetected = false;
Unused Code
$fromEncodingDetected is not used; you could remove the assignment.

This check looks for variable assignments that are either overwritten by other assignments or where the variable is not used subsequently.

$myVar = 'Value';
$higher = false;

if (rand(1, 6) > 3) {
    $higher = true;
} else {
    $higher = false;
}

Both the $myVar assignment in line 1 and the $higher assignment in line 2 are dead. The first because $myVar is never used and the second because $higher is always overwritten on every possible execution path.
726
//        if ($autodetectFromEncoding === true || !$fromEncoding) {
727
//            $fromEncodingDetected = $this->str_detect_encoding($str);
728
//        }
729
730
        // DEBUG
731
        //var_dump($toEncoding, $fromEncoding, $fromEncodingDetected, $str, "\n\n");
732
733
//        if ($fromEncodingDetected !== false) {
734
//            $fromEncoding = $fromEncodingDetected;
735
//        } elseif ($autodetectFromEncoding === true) {
736
//            // fallback for the "autodetect"-mode
737
//            return $this->toUtf8($str);
738
//        }
739
740
//        if (!$fromEncoding || $fromEncoding === $toEncoding) {
741
//            return $str;
742
//        }
743
744
//        if ($toEncoding === 'UTF-8' && ($fromEncoding === 'WINDOWS-1252' || $fromEncoding === 'ISO-8859-1')) {
745
//            return $this->toUtf8($str);
746
//        }
747
748
//        if ($toEncoding === 'ISO-8859-1' && ($fromEncoding === 'WINDOWS-1252' || $fromEncoding === 'UTF-8')) {
749
//            return $this->to_iso8859($str);
750
//        }
751
752
        if ($toEncoding !== 'UTF-8' && $toEncoding !== 'ISO-8859-1' && $toEncoding !== 'WINDOWS-1252' && $this->SUPPORT['mbstring'] === false) {
753
            trigger_error('UTF8::encode() without mbstring cannot handle "' . $toEncoding . '" encoding', E_USER_WARNING);
754
        }
755
//
756
//        if ($this->SUPPORT['mbstring'] === true) {
757
//            // warning: do not use the symfony polyfill here
758
//            $strEncoded = mb_convert_encoding(
759
//                $str,
760
//                $toEncoding,
761
//                $fromEncoding
762
//            );
763
//
764
//            if ($strEncoded) {
765
//                return $strEncoded;
766
//            }
767
//        }
768
//
769
//        $return = \iconv($fromEncoding, $toEncoding, $str);
770
//        if ($return !== false) {
771
//            return $return;
772
//        }
773
774
        return $str;
775
    }
776
777
    private function json_encode($value, $options = 0, $depth = 512)
Coding Style Naming
The method json_encode is not named in camelCase.
Coding Style
Method name "Utf8::json_encode" is not in camel caps format
778
    {
779
        $value = $this->filter($value);
780
781
        if ($this->SUPPORT['json'] === false) {
782
            throw new \RuntimeException('ext-json: is not installed');
783
        }
784
785
        /** @noinspection PhpComposerExtensionStubsInspection */
786
        return json_encode($value, $options, $depth);
787
    }
788
789
    private function filter($var, $normalization_form = \Normalizer::NFC, $leading_combining = '◌')
Coding Style Naming
The parameter $normalization_form is not named in camelCase.
The parameter $leading_combining is not named in camelCase.
The variable $normalization_form is not named in camelCase.
The variable $leading_combining is not named in camelCase.
Complexity
This operation has 30 execution paths which exceeds the configured maximum of 10.
790
    {
791
        switch (\gettype($var)) {
792
            case 'array':
793
                foreach ($var as $k => $v) {
794
                    $var[$k] = $this->filter($v, $normalization_form, $leading_combining);
795
                }
796
                unset($v);
797
798
                break;
799
            case 'object':
800
                foreach ($var as $k => $v) {
801
                    $var->{$k} = $this->filter($v, $normalization_form, $leading_combining);
Coding Style Comprehensibility
Results in the object branch need to be written back to $var->{$k}, mirroring $var[$k] in the array branch; assigning to an undeclared local such as $str[$k] discards them.

More generally, although PHP does not strictly require it, it is good practice to initialize an array explicitly before writing to it, because that guarantees a stable state of the code.

Let’s take a look at an example:

foreach ($collection as $item) {
    $myArray['foo'] = $item->getFoo();

    if ($item->hasBar()) {
        $myArray['bar'] = $item->getBar();
    }

    // do something with $myArray
}

As you can see in this example, the array $myArray is initialized the first time the foreach loop is entered. You can also see that the value of the bar key is only written conditionally; thus, its value might result from a previous iteration.

This might or might not be intended. To make your intention clear, make your code more readable, and avoid accidental bugs, we recommend adding an explicit initialization $myArray = array() either outside or inside the foreach loop.
802
                }
803
                unset($v);
804
805
                break;
806
            case 'string':
Coding Style
The case body in a switch statement must start on the line following the statement.

According to PSR-2, the body of a case statement must start on the line immediately following the case statement.

switch ($expr) {
case "A":
    doSomething(); //right
    break;
case "B":

    doSomethingElse(); //wrong
    break;

}

To learn more about the PSR-2 coding standard, please refer to PHP-FIG.
807
808
                if (strpos($var, "\r") !== false) {
809
                    // Workaround https://bugs.php.net/65732
810
                    $var = $this->normalize_line_ending($var);
811
                }
812
813
                if ($this->isAscii($var) === false) {
814
                    if (\Normalizer::isNormalized($var, $normalization_form)) {
815
                        $n = '-';
Comprehensibility
Avoid variables with short names like $n. Configured minimum length is 3.
816
                    } else {
Coding Style
The method filter uses an else expression. Else is never necessary and you can simplify the code to work without else.
817
                        $n = \Normalizer::normalize($var, $normalization_form);
818
819
                        if (isset($n[0])) {
820
                            $var = $n;
821
                        } else {
Coding Style
The method filter uses an else expression. Else is never necessary and you can simplify the code to work without else.
822
                            $var = $this->encode('UTF-8', $var);
Unused Code
Utf8::encode() is declared with two parameters ($toEncoding, $str), so any further argument, such as a trailing true, is silently ignored.

This check compares calls to functions or methods with their respective definitions. If the call has more arguments than are defined, it raises an issue.

If a function is defined several times with a different number of parameters, the check may pick up the wrong definition and report false positives. One codebase where this has been known to happen is WordPress.

In this case you can add the @ignore PhpDoc annotation to the duplicate definition and it will be ignored.
823
                        }
824
                    }
825
826
                    if (
827
                        $var[0] >= "\x80"
828
                        &&
829
                        isset($n[0], $leading_combining[0])
830
                        &&
831
                        preg_match('/^\p{Mn}/u', $var)
832
                    ) {
833
                        // Prevent leading combining chars
834
                        // for NFC-safe concatenations.
835
                        $var = $leading_combining . $var;
836
                    }
837
                }
838
839
                break;
840
        }
841
842
        return $var;
843
    }
844
845
    private function normalize_line_ending($str)
Coding Style Naming
The method normalize_line_ending is not named in camelCase.
Coding Style
Method name "Utf8::normalize_line_ending" is not in camel caps format
846
    {
847
        return str_replace(["\r\n", "\r"], "\n", $str);
848
    }
849
850
    private function isAscii($str)
851
    {
852
        if ($str === '') {
853
            return true;
854
        }
855
856
        return !preg_match('/[^\x09\x10\x13\x0A\x0D\x20-\x7E]/', $str);
857
    }
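
The character class in isAscii() above lists `\x10` and `\x13`, which are the control characters DLE and DC3; line feed and carriage return are already covered by `\x0A` and `\x0D`. If the two extra entries were meant as decimal 10 and 13 (an assumption, since the source does not say), a stricter variant could look like the sketch below. `isPrintableAscii` is a hypothetical name, not a method of the reviewed class.

```php
<?php

/**
 * Hypothetical helper (not part of the reviewed Utf8 class).
 * Returns true if the string contains only tab, LF, CR and printable ASCII.
 */
function isPrintableAscii(string $str): bool
{
    if ($str === '') {
        return true;
    }

    // Any byte outside tab, LF, CR and 0x20..0x7E makes the string non-ASCII here.
    return preg_match('/[^\x09\x0A\x0D\x20-\x7E]/', $str) === 0;
}

var_dump(isPrintableAscii("plain text\r\n")); // bool(true)
var_dump(isPrintableAscii("caf\xC3\xA9"));    // bool(false)
```

Comparing the result of `preg_match()` with `0` rather than negating it keeps a `false` return (a regex error) from being reported as "is ASCII".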
858
859
    private function htmlEncode($str, $keepAsciiChars = false, $encoding = 'UTF-8')
The method htmlEncode has a boolean flag argument $keepAsciiChars, which is a certain sign of a Single Responsibility Principle violation.
Complexity
This operation has 30 execution paths which exceeds the configured maximum of 10.
860
    {
861
        if ($str === '') {
862
            return '';
863
        }
864
865
        if ($encoding !== 'UTF-8' && $encoding !== 'CP850') {
866
            $encoding = $this->normalize_encoding($encoding, 'UTF-8');
867
        }
868
869
        // INFO: http://stackoverflow.com/questions/35854535/better-explanation-of-convmap-in-mb-encode-numericentity
870
        if ($this->SUPPORT['mbstring'] === true) {
871
            $startCode = 0x00;
872
            if ($keepAsciiChars === true) {
873
                $startCode = 0x80;
874
            }
875
876
            if ($encoding === 'UTF-8') {
877
                return mb_encode_numericentity(
878
                    $str,
879
                    [$startCode, 0xfffff, 0, 0xfffff] // convmap entries come in groups of four
880
                );
881
            }
882
883
            return mb_encode_numericentity(
884
                $str,
885
                [$startCode, 0xfffff, 0, 0xfffff], // convmap entries come in groups of four
886
                $encoding
887
            );
888
        }
889
890
        return implode(
891
            '',
892
            \array_map(
893
                function (string $chr) use ($keepAsciiChars, $encoding) {
894
                    return $this->singleChrHtmlEncode($chr, $keepAsciiChars, $encoding);
895
                },
896
                $this->strSplit($str)
897
            )
898
        );
899
    }
900
901
    private function singleChrHtmlEncode($char, $keepAsciiChars = false, $encoding = 'UTF-8')
The method singleChrHtmlEncode has a boolean flag argument $keepAsciiChars, which is a certain sign of a Single Responsibility Principle violation.
902
    {
903
        if ($char === '') {
904
            return '';
905
        }
906
907
        if ($keepAsciiChars === true && $this->isAscii($char) === true) {
908
            return $char;
909
        }
910
911
        return '&#' . $this->ord($char, $encoding) . ';';
912
    }
913
914
    private function ord($chr, $encoding = 'UTF-8')
Complexity
This operation has 19440 execution paths which exceeds the configured maximum of 200.
Coding Style Naming
The variable $CHAR_CACHE is not named in camelCase.
915
    {
916
        static $CHAR_CACHE = [];
917
918
        // init
919
        $chr = (string)$chr;
920
921
        if ($encoding !== 'UTF-8' && $encoding !== 'CP850') {
922
            $encoding = $this->normalize_encoding($encoding, 'UTF-8');
923
        }
924
925
        $cacheKey = $chr . $encoding;
926
        if (isset($CHAR_CACHE[$cacheKey]) === true) {
927
            return $CHAR_CACHE[$cacheKey];
928
        }
929
930
        // check again, if it's still not UTF-8
931
        if ($encoding !== 'UTF-8') {
932
            $chr = $this->encode($encoding, $chr);
933
        }
934
935
        if ($this->ORD === null) {
936
            $this->ORD = $this->getData('ord');
937
        }
938
939
        if (isset($this->ORD[$chr])) {
940
            return $CHAR_CACHE[$cacheKey] = $this->ORD[$chr];
941
        }
942
943
        //
944
        // fallback via "IntlChar"
945
        //
946
947
        if ($this->SUPPORT['intlChar'] === true) {
948
            /** @noinspection PhpComposerExtensionStubsInspection */
949
            $code = \IntlChar::ord($chr);
950
            if ($code) {
951
                return $CHAR_CACHE[$cacheKey] = $code;
952
            }
953
        }
954
955
        //
956
        // fallback via vanilla php
957
        //
958
959
        /** @noinspection CallableParameterUseCaseInTypeContextInspection */
960
        $chr = \unpack('C*', (string)\substr($chr, 0, 4));
961
        $code = $chr ? $chr[1] : 0;
962
963
        if ($code >= 0xF0 && isset($chr[4])) {
964
            /** @noinspection UnnecessaryCastingInspection */
965
            return $CHAR_CACHE[$cacheKey] = (int)((($code - 0xF0) << 18) + (($chr[2] - 0x80) << 12) + (($chr[3] - 0x80) << 6) + $chr[4] - 0x80);
966
        }
967
968
        if ($code >= 0xE0 && isset($chr[3])) {
969
            /** @noinspection UnnecessaryCastingInspection */
970
            return $CHAR_CACHE[$cacheKey] = (int)((($code - 0xE0) << 12) + (($chr[2] - 0x80) << 6) + $chr[3] - 0x80);
971
        }
972
973
        if ($code >= 0xC0 && isset($chr[2])) {
974
            /** @noinspection UnnecessaryCastingInspection */
975
            return $CHAR_CACHE[$cacheKey] = (int)((($code - 0xC0) << 6) + $chr[2] - 0x80);
976
        }
977
978
        return $CHAR_CACHE[$cacheKey] = $code;
979
    }
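
The last fallback in ord() above reverses the UTF-8 encoding: it reads up to four bytes and reassembles the code point from the payload bits. The sketch below does the same with bit masks instead of the subtractions used in the source; for well-formed input both give the same result. `utf8ToCodePoint` is a hypothetical name, and like the reviewed method it does not validate continuation bytes.

```php
<?php

/**
 * Hypothetical helper (not part of the reviewed Utf8 class).
 * Returns the code point of the first UTF-8 character in $char, or 0 for ''.
 */
function utf8ToCodePoint(string $char): int
{
    if ($char === '') {
        return 0;
    }

    $len = strlen($char);
    $b0 = ord($char[0]);

    // Four-byte sequence: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    if ($b0 >= 0xF0 && $len >= 4) {
        return (($b0 & 0x07) << 18)
            | ((ord($char[1]) & 0x3F) << 12)
            | ((ord($char[2]) & 0x3F) << 6)
            | (ord($char[3]) & 0x3F);
    }

    // Three-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
    if ($b0 >= 0xE0 && $len >= 3) {
        return (($b0 & 0x0F) << 12)
            | ((ord($char[1]) & 0x3F) << 6)
            | (ord($char[2]) & 0x3F);
    }

    // Two-byte sequence: 110xxxxx 10xxxxxx
    if ($b0 >= 0xC0 && $len >= 2) {
        return (($b0 & 0x1F) << 6) | (ord($char[1]) & 0x3F);
    }

    // Single byte (ASCII, or a stray byte returned as-is).
    return $b0;
}

printf("U+%04X\n", utf8ToCodePoint("\xE2\x82\xAC")); // prints "U+20AC"
```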
980
981
    private function strSplit($str, $length = 1, $cleanUtf8 = false, $tryToUseMbFunction = true)
The method strSplit has a boolean flag argument $cleanUtf8, which is a certain sign of a Single Responsibility Principle violation.
The method strSplit has a boolean flag argument $tryToUseMbFunction, which is a certain sign of a Single Responsibility Principle violation.
Complexity
This operation has 4032 execution paths which exceeds the configured maximum of 200.
982
    {
983
        if ($length <= 0) {
984
            return [];
985
        }
986
987
        if (is_array($str) === true) {
988
            foreach ($str as $k => $v) {
989
                $str[$k] = $this->strSplit($v, $length, $cleanUtf8, $tryToUseMbFunction);
990
            }
991
992
            return $str;
993
        }
994
995
        // init
996
        $str = (string)$str;
997
998
        if ($str === '') {
999
            return [];
1000
        }
1001
1002
        if ($cleanUtf8 === true) {
1003
            $str = $this->clean($str);
1004
        }
1005
1006
        if ($tryToUseMbFunction === true && $this->SUPPORT['mbstring'] === true) {
1007
            $iMax = \mb_strlen($str);
1008
            if ($iMax <= 127) {
1009
                $ret = [];
1010
                for ($i = 0; $i < $iMax; ++$i) {
1011
                    $ret[] = \mb_substr($str, $i, 1);
1012
                }
1013
            } else {
Coding Style
The method strSplit uses an else expression. Else is never necessary and you can simplify the code to work without else.
                $retArray = [];
                preg_match_all('/./us', $str, $retArray);
                $ret = isset($retArray[0]) ? $retArray[0] : [];
            }
        } elseif ($this->SUPPORT['pcre_utf8'] === true) {
            $retArray = [];
            preg_match_all('/./us', $str, $retArray);
            $ret = isset($retArray[0]) ? $retArray[0] : [];
        } else {
Coding Style
The method strSplit uses an else expression. Else is never necessary and you can simplify the code to work without else.
Coding Style
Blank line found at start of control structure

            // fallback

            $ret = [];
            $len = \strlen($str);

            /** @noinspection ForeachInvariantsInspection */
            for ($i = 0; $i < $len; ++$i) {
                if (($str[$i] & "\x80") === "\x00") {
                    $ret[] = $str[$i];
                } elseif (
                    isset($str[$i + 1])
                    &&
                    ($str[$i] & "\xE0") === "\xC0"
                ) {
                    if (($str[$i + 1] & "\xC0") === "\x80") {
                        $ret[] = $str[$i] . $str[$i + 1];

                        ++$i;
                    }
                } elseif (
                    isset($str[$i + 2])
                    &&
                    ($str[$i] & "\xF0") === "\xE0"
                ) {
                    if (
                        ($str[$i + 1] & "\xC0") === "\x80"
                        &&
                        ($str[$i + 2] & "\xC0") === "\x80"
                    ) {
                        $ret[] = $str[$i] . $str[$i + 1] . $str[$i + 2];

                        $i += 2;
                    }
                } elseif (
                    isset($str[$i + 3])
                    &&
                    ($str[$i] & "\xF8") === "\xF0"
                ) {
                    if (
                        ($str[$i + 1] & "\xC0") === "\x80"
                        &&
                        ($str[$i + 2] & "\xC0") === "\x80"
                        &&
                        ($str[$i + 3] & "\xC0") === "\x80"
                    ) {
                        $ret[] = $str[$i] . $str[$i + 1] . $str[$i + 2] . $str[$i + 3];

                        $i += 3;
                    }
                }
            }
        }

        if ($length > 1) {
            $ret = \array_chunk($ret, $length);

            return array_map(
                static function (&$item) {
                    return implode('', $item);
                },
                $ret
            );
        }

        if (isset($ret[0]) && $ret[0] === '') {
            return [];
        }

        return $ret;
    }

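One way to act on the complexity finding for strSplit is to move the byte-mask fallback into small functions of its own. The sketch below is not the library's code: utf8SequenceLength, isCompleteSequence and splitUtf8Bytes are hypothetical names, and it assumes the same leniency as the loop above (overlong forms and surrogates are not rejected, and a byte that does not start a complete sequence is skipped).

```php
<?php

/**
 * Hypothetical helper: number of bytes announced by a UTF-8 lead byte,
 * or 0 if the byte cannot start a sequence (e.g. a stray 10xxxxxx byte).
 */
function utf8SequenceLength(string $byte): int
{
    $ord = \ord($byte);

    if (($ord & 0x80) === 0x00) {
        return 1; // 0xxxxxxx
    }
    if (($ord & 0xE0) === 0xC0) {
        return 2; // 110xxxxx
    }
    if (($ord & 0xF0) === 0xE0) {
        return 3; // 1110xxxx
    }
    if (($ord & 0xF8) === 0xF0) {
        return 4; // 11110xxx
    }

    return 0;
}

/**
 * Hypothetical helper: true if $char has exactly $size bytes and every
 * byte after the first is a continuation byte (10xxxxxx).
 */
function isCompleteSequence(string $char, int $size): bool
{
    if ($size === 0 || \strlen($char) !== $size) {
        return false;
    }

    for ($j = 1; $j < $size; ++$j) {
        if ((\ord($char[$j]) & 0xC0) !== 0x80) {
            return false;
        }
    }

    return true;
}

/**
 * Hypothetical replacement for the fallback branch: split a byte string
 * into UTF-8 characters without mbstring or PCRE.
 */
function splitUtf8Bytes(string $str): array
{
    $ret = [];
    $len = \strlen($str);
    $i = 0;

    while ($i < $len) {
        $size = utf8SequenceLength($str[$i]);
        $char = (string)\substr($str, $i, $size);

        if (isCompleteSequence($char, $size)) {
            $ret[] = $char;
            $i += $size;
            continue;
        }

        ++$i; // skip a byte that does not start a complete sequence
    }

    return $ret;
}
```

Each helper has only a handful of paths, and the early returns avoid the else branches flagged above.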
    private function clean($str, $remove_bom = false, $normalize_whitespace = false, $normalize_msword = false, $keep_non_breaking_space = false, $replace_diamond_question_mark = false, $remove_invisible_characters = true)
The method clean has six boolean flag arguments, each of which is a certain sign of a Single Responsibility Principle violation: $remove_bom, $normalize_whitespace, $normalize_msword, $keep_non_breaking_space, $replace_diamond_question_mark, $remove_invisible_characters.
Coding Style Naming
The parameters $remove_bom, $normalize_whitespace, $normalize_msword, $keep_non_breaking_space, $replace_diamond_question_mark and $remove_invisible_characters are not named in camelCase.
The variables $replace_diamond_question_mark, $remove_invisible_characters, $normalize_whitespace, $keep_non_breaking_space, $normalize_msword and $remove_bom are not named in camelCase.

This check marks names that have not been written in camelCase.

In camelCase names are written without any punctuation, the start of each new word being marked by a capital letter. Thus the name database connection string becomes databaseConnectionString.

Comprehensibility Naming
The variable names $keep_non_breaking_space, $replace_diamond_question_mark and $remove_invisible_characters exceed the maximum configured length of 20.

Very long variable names usually make code harder to read. It is therefore recommended not to make variable names too verbose.

Complexity
This operation has 32 execution paths, which exceeds the configured maximum of 10.

A high number of execution paths generally suggests many nested conditional statements and makes the code less readable. This can usually be fixed by splitting the method into several smaller methods.
    {
        // http://stackoverflow.com/questions/1401317/remove-non-utf8-characters-from-string
        // caused connection reset problem on larger strings

        $regx = '/
          (
            (?: [\x00-\x7F]               # single-byte sequences   0xxxxxxx
            |   [\xC0-\xDF][\x80-\xBF]    # double-byte sequences   110xxxxx 10xxxxxx
            |   [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences   1110xxxx 10xxxxxx * 2
            |   [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3
            ){1,100}                      # ...one or more times
          )
        | ( [\x80-\xBF] )                 # invalid byte in range 10000000 - 10111111
        | ( [\xC0-\xFF] )                 # invalid byte in range 11000000 - 11111111
        /x';
        $str = (string)preg_replace($regx, '$1', $str);

        if ($replace_diamond_question_mark === true) {
            $str = $this->replace_diamond_question_mark($str, '');
        }

        if ($remove_invisible_characters === true) {
            $str = $this->remove_invisible_characters($str);
        }

        if ($normalize_whitespace === true) {
            $str = $this->normalize_whitespace($str, $keep_non_breaking_space);
        }

        if ($normalize_msword === true) {
            $str = $this->normalize_msword($str);
        }

        if ($remove_bom === true) {
            $str = $this->remove_bom($str);
        }

        return $str;
    }

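The boolean-flag findings for clean could be addressed by letting the caller list the cleaning steps it wants instead of switching them on with flags. This is only a sketch of that idea under stated assumptions: applySteps, removeUtf8Bom and normalizeDoubleQuotes are hypothetical stand-ins, not methods of the class under review.

```php
<?php

/**
 * Hypothetical helper: run a string through the given steps in order.
 * Each step is any callable that takes a string and returns a string.
 */
function applySteps(string $str, callable ...$steps): string
{
    foreach ($steps as $step) {
        $str = (string)$step($str);
    }

    return $str;
}

/** Hypothetical step: drop a leading UTF-8 byte order mark. */
function removeUtf8Bom(string $str): string
{
    if (\strncmp($str, "\xEF\xBB\xBF", 3) === 0) {
        return (string)\substr($str, 3);
    }

    return $str;
}

/** Hypothetical step: turn curly double quotes (U+201C, U+201D) into '"'. */
function normalizeDoubleQuotes(string $str): string
{
    return \str_replace(["\xe2\x80\x9c", "\xe2\x80\x9d"], '"', $str);
}

// The caller names the steps it wants; no flag arguments are involved.
$cleaned = applySteps(
    "\xEF\xBB\xBF\xe2\x80\x9chi\xe2\x80\x9d",
    'removeUtf8Bom',
    'normalizeDoubleQuotes'
);
```

With this shape every step has a single responsibility, and the number of execution paths in the composing function no longer grows with the number of options.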
    public function replace_diamond_question_mark($str, $replacementChar = '', $processInvalidUtf8 = true)
The method replace_diamond_question_mark has a boolean flag argument $processInvalidUtf8, which is a certain sign of a Single Responsibility Principle violation.
Coding Style Naming
The method replace_diamond_question_mark is not named in camelCase.
Complexity
This operation has 10 execution paths which exceeds the configured maximum of 10.
Coding Style
Method name "Utf8::replace_diamond_question_mark" is not in camel caps format
    {
        if ($str === '') {
            return '';
        }

        if ($processInvalidUtf8 === true) {
            $replacementCharHelper = $replacementChar;
Comprehensibility Naming
The variable name $replacementCharHelper exceeds the maximum configured length of 20.
            if ($replacementChar === '') {
                $replacementCharHelper = 'none';
            }

            if ($this->SUPPORT['mbstring'] === false) {
                // if there is no native support for "mbstring",
                // then we need to clean the string before ...
                $str = $this->clean($str);
            }

            $save = \mb_substitute_character();
            \mb_substitute_character($replacementCharHelper);
            // the polyfill may return false, so cast to string
            $str = (string)\mb_convert_encoding($str, 'UTF-8', 'UTF-8');
            \mb_substitute_character($save);
        }

        return str_replace(
            [
                "\xEF\xBF\xBD",
                '�',
            ],
            [
                $replacementChar,
                $replacementChar,
            ],
            $str
        );
    }

    public function remove_invisible_characters($str, $url_encoded = true, $replacement = '')
The method remove_invisible_characters has a boolean flag argument $url_encoded, which is a certain sign of a Single Responsibility Principle violation.
Coding Style Naming
The method remove_invisible_characters is not named in camelCase.
The parameter $url_encoded is not named in camelCase.
The variable $non_displayables is not named in camelCase.
The variable $url_encoded is not named in camelCase.
Coding Style
Method name "Utf8::remove_invisible_characters" is not in camel caps format
    {
        // init
        $non_displayables = [];

        // every control character except newline (dec 10),
        // carriage return (dec 13) and horizontal tab (dec 09)
        if ($url_encoded) {
            $non_displayables[] = '/%0[0-8bcefBCEF]/'; // url encoded 00-08, 11, 12, 14, 15
            $non_displayables[] = '/%1[0-9a-fA-F]/'; // url encoded 16-31
        }

        $non_displayables[] = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]+/S'; // 00-08, 11, 12, 14-31, 127

        do {
            $str = (string)preg_replace($non_displayables, $replacement, $str, -1, $count);
        } while ($count !== 0);

        return $str;
    }

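The do/while loop in remove_invisible_characters matters because removing one match can leave behind a new one. The sketch below reuses the three patterns from the method and compares a single pass with the repeat-until-stable loop; stripOnce and stripUntilStable are hypothetical names introduced for illustration, and the replacement is assumed to be the empty string.

```php
<?php

// Patterns copied from remove_invisible_characters (URL-encoded variant enabled).
const NON_DISPLAYABLES = [
    '/%0[0-8bcefBCEF]/',
    '/%1[0-9a-fA-F]/',
    '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]+/S',
];

/** Hypothetical helper: apply every pattern exactly once. */
function stripOnce(string $str): string
{
    return (string)\preg_replace(NON_DISPLAYABLES, '', $str);
}

/** Hypothetical helper: repeat until a pass makes no replacement. */
function stripUntilStable(string $str): string
{
    do {
        $str = (string)\preg_replace(NON_DISPLAYABLES, '', $str, -1, $count);
    } while ($count !== 0);

    return $str;
}

// '%0%010' hides '%01' inside what becomes '%00' once the inner match is gone.
$afterOnePass = stripOnce('%0%010');        // '%00'
$afterLoop    = stripUntilStable('%0%010'); // ''
```

A single pass removes the inner %01 and leaves %00 behind; only the repeated pass removes that as well.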
    public function normalize_whitespace($str, $keepNonBreakingSpace = false, $keepBidiUnicodeControls = false)
The method normalize_whitespace has a boolean flag argument $keepNonBreakingSpace, which is a certain sign of a Single Responsibility Principle violation.
The method normalize_whitespace has a boolean flag argument $keepBidiUnicodeControls, which is a certain sign of a Single Responsibility Principle violation.
Coding Style Naming
The method normalize_whitespace is not named in camelCase.
The variable $WHITESPACE_CACHE is not named in camelCase.
The variable $BIDI_UNICODE_CONTROLS_CACHE is not named in camelCase.
Comprehensibility Naming
The variable name $keepBidiUnicodeControls exceeds the maximum configured length of 20.
Complexity
This operation has 18 execution paths, which exceeds the configured maximum of 10.
Coding Style
Method name "Utf8::normalize_whitespace" is not in camel caps format
    {
        if ($str === '') {
            return '';
        }

        static $WHITESPACE_CACHE = [];
        $cacheKey = (int)$keepNonBreakingSpace;

        if (!isset($WHITESPACE_CACHE[$cacheKey])) {
            $WHITESPACE_CACHE[$cacheKey] = $this->WHITESPACE_TABLE;

            if ($keepNonBreakingSpace === true) {
                unset($WHITESPACE_CACHE[$cacheKey]['NO-BREAK SPACE']);
            }

            $WHITESPACE_CACHE[$cacheKey] = array_values($WHITESPACE_CACHE[$cacheKey]);
        }

        if ($keepBidiUnicodeControls === false) {
            static $BIDI_UNICODE_CONTROLS_CACHE = null;
Comprehensibility Naming
The variable name $BIDI_UNICODE_CONTROLS_CACHE exceeds the maximum configured length of 20.

            if ($BIDI_UNICODE_CONTROLS_CACHE === null) {
                $BIDI_UNICODE_CONTROLS_CACHE = array_values($this->BIDI_UNI_CODE_CONTROLS_TABLE);
            }

            $str = \str_replace($BIDI_UNICODE_CONTROLS_CACHE, '', $str);
        }

        return str_replace($WHITESPACE_CACHE[$cacheKey], ' ', $str);
    }

    private function normalize_msword($str)
Coding Style Naming
The method normalize_msword is not named in camelCase.
Coding Style
Method name "Utf8::normalize_msword" is not in camel caps format
    {
        if ($str === '') {
            return '';
        }

        $keys = [
            "\xc2\xab", // « (U+00AB) in UTF-8
            "\xc2\xbb", // » (U+00BB) in UTF-8
            "\xe2\x80\x98", // ‘ (U+2018) in UTF-8
            "\xe2\x80\x99", // ’ (U+2019) in UTF-8
            "\xe2\x80\x9a", // ‚ (U+201A) in UTF-8
            "\xe2\x80\x9b", // ‛ (U+201B) in UTF-8
            "\xe2\x80\x9c", // “ (U+201C) in UTF-8
            "\xe2\x80\x9d", // ” (U+201D) in UTF-8
            "\xe2\x80\x9e", // „ (U+201E) in UTF-8
            "\xe2\x80\x9f", // ‟ (U+201F) in UTF-8
            "\xe2\x80\xb9", // ‹ (U+2039) in UTF-8
            "\xe2\x80\xba", // › (U+203A) in UTF-8
            "\xe2\x80\x93", // – (U+2013) in UTF-8
            "\xe2\x80\x94", // — (U+2014) in UTF-8
            "\xe2\x80\xa6", // … (U+2026) in UTF-8
        ];

        $values = [
            '"', // « (U+00AB) in UTF-8
            '"', // » (U+00BB) in UTF-8
            "'", // ‘ (U+2018) in UTF-8
            "'", // ’ (U+2019) in UTF-8
            "'", // ‚ (U+201A) in UTF-8
            "'", // ‛ (U+201B) in UTF-8
            '"', // “ (U+201C) in UTF-8
            '"', // ” (U+201D) in UTF-8
            '"', // „ (U+201E) in UTF-8
            '"', // ‟ (U+201F) in UTF-8
            "'", // ‹ (U+2039) in UTF-8
            "'", // › (U+203A) in UTF-8
            '-', // – (U+2013) in UTF-8
            '-', // — (U+2014) in UTF-8
            '...', // … (U+2026) in UTF-8
        ];

        return str_replace($keys, $values, $str);
    }

    public function remove_bom($str)
Coding Style Naming
The method remove_bom is not named in camelCase.
Coding Style
Method name "Utf8::remove_bom" is not in camel caps format
    {
        if ($str === '') {
            return '';
        }

        $strLength = \strlen($str);
        foreach ($this->BOM as $bomString => $bomByteLength) {
            if (strpos($str, $bomString, 0) === 0) {
                $strTmp = \substr($str, $bomByteLength, $strLength);
                if ($strTmp === false) {
                    return '';
                }

                $strLength -= (int)$bomByteLength;
                $str = (string)$strTmp;
            }
        }

        return $str;
    }

    private function str_detect_encoding($str)
Unused Code
This method is not used, and could be removed.
Complexity
This operation has 1224 execution paths, which exceeds the configured maximum of 200.
Coding Style Naming
The method str_detect_encoding is not named in camelCase.
Coding Style
Method name "Utf8::str_detect_encoding" is not in camel caps format
    {
        // init
        $str = (string)$str;

        //
        // 1.) check binary strings (010001001...) like UTF-16 / UTF-32 / PDF / Images / ...
        //

        if ($this->is_binary($str, true) === true) {
            $isUtf16 = $this->is_utf16($str, false);
            if ($isUtf16 === 1) {
                return 'UTF-16LE';
            }
            if ($isUtf16 === 2) {
                return 'UTF-16BE';
            }

            $isUtf32 = $this->is_utf32($str, false);
            if ($isUtf32 === 1) {
                return 'UTF-32LE';
            }
            if ($isUtf32 === 2) {
                return 'UTF-32BE';
            }

            // is binary but not "UTF-16" or "UTF-32"
            return false;
        }

        //
        // 2.) simple check for ASCII chars
        //

        if ($this->isAscii($str) === true) {
            return 'ASCII';
        }

        //
        // 3.) simple check for UTF-8 chars
        //

        if ($this->isUtf8($str) === true) {
            return 'UTF-8';
        }

        //
        // 4.) check via "mb_detect_encoding()"
        //
        // INFO: UTF-16, UTF-32, UCS2 and UCS4, encoding detection will fail always with "mb_detect_encoding()"

        $detectOrder = [
            'ISO-8859-1',
            'ISO-8859-2',
            'ISO-8859-3',
            'ISO-8859-4',
            'ISO-8859-5',
            'ISO-8859-6',
            'ISO-8859-7',
            'ISO-8859-8',
            'ISO-8859-9',
            'ISO-8859-10',
            'ISO-8859-13',
            'ISO-8859-14',
            'ISO-8859-15',
            'ISO-8859-16',
            'WINDOWS-1251',
            'WINDOWS-1252',
            'WINDOWS-1254',
            'CP932',
            'CP936',
            'CP950',
            'CP866',
            'CP850',
            'CP51932',
            'CP50220',
            'CP50221',
            'CP50222',
            'ISO-2022-JP',
            'ISO-2022-KR',
            'JIS',
            'JIS-ms',
            'EUC-CN',
            'EUC-JP',
        ];

        if ($this->SUPPORT['mbstring'] === true) {
            // info: do not use the symfony polyfill here
            $encoding = \mb_detect_encoding($str, $detectOrder, true);
            if ($encoding) {
                return $encoding;
            }
        }

        //
        // 5.) check via "iconv()"
        //

        if ($this->ENCODINGS === null) {
            $this->ENCODINGS = $this->getData('encodings');
        }

        foreach ($this->ENCODINGS as $encodingTmp) {
            // INFO: //IGNORE but still throw notice
            /** @noinspection PhpUsageOfSilenceOperatorInspection */
            if ((string)@\iconv($encodingTmp, $encodingTmp . '//IGNORE', $str) === $str) {
                return $encodingTmp;
            }
        }

        return false;
    }

    private function is_binary($input, $strict = false)
The method is_binary has a boolean flag argument $strict, which is a certain sign of a Single Responsibility Principle violation.
Coding Style Naming
The method is_binary is not named in camelCase.
The variable $finfo_encoding is not named in camelCase.
Complexity
This operation has 112 execution paths, which exceeds the configured maximum of 10.
Coding Style
Method name "Utf8::is_binary" is not in camel caps format
    {
        $input = (string)$input;
        if ($input === '') {
            return false;
        }

        if (preg_match('~^[01]+$~', $input)) {
            return true;
        }

        $ext = $this->get_file_type($input);
        if ($ext['type'] === 'binary') {
            return true;
        }

        $testLength = \strlen($input);
        $testNull = \substr_count($input, "\x0", 0, $testLength);
        if (($testNull / $testLength) > 0.25) {
            return true;
        }

        if ($strict === true) {
            if ($this->SUPPORT['finfo'] === false) {
                throw new \RuntimeException('ext-fileinfo: is not installed');
            }

            /** @noinspection PhpComposerExtensionStubsInspection */
            $finfo_encoding = (new \finfo(\FILEINFO_MIME_ENCODING))->buffer($input);
            if ($finfo_encoding && $finfo_encoding === 'binary') {
                return true;
            }
        }

        return false;
    }

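The NUL-byte check inside is_binary is one of the pieces that could be extracted to bring the path count down. A minimal sketch, assuming the 0.25 threshold used above; nullByteRatio and looksBinaryByNullBytes are hypothetical names, not part of the class under review.

```php
<?php

/** Hypothetical helper: share of NUL bytes in the input (0.0 for an empty string). */
function nullByteRatio(string $input): float
{
    $length = \strlen($input);
    if ($length === 0) {
        return 0.0;
    }

    return \substr_count($input, "\x00") / $length;
}

/** Hypothetical helper: the "more than a quarter NUL bytes" heuristic. */
function looksBinaryByNullBytes(string $input, float $threshold = 0.25): bool
{
    return nullByteRatio($input) > $threshold;
}

// UTF-16LE text made of ASCII letters has a NUL in every second byte,
// so the heuristic flags it, which is why the UTF-16 checks follow it.
$utf16leSample = "a\x00b\x00c\x00";
$isBinary = looksBinaryByNullBytes($utf16leSample);
```

Keeping the threshold as a parameter makes the heuristic testable on its own, independent of the fileinfo and file-type checks.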
    private function get_file_type(
Coding Style Naming introduced by
The method get_file_type is not named in camelCase.

This check marks method names that have not been written in camelCase.

In camelCase names are written without any punctuation, the start of each new word being marked by a capital letter. Thus the name database connection string becomes databaseConnectionString.

Coding Style Naming introduced by
The variable $str_info is not named in camelCase.

Coding Style Naming introduced by
The variable $type_code is not named in camelCase.

Complexity introduced by
This operation has 120 execution paths, which exceeds the configured maximum of 10.

A high number of execution paths generally suggests many nested conditional statements and makes the code less readable. This can usually be fixed by splitting the method into several smaller methods.

You can also find more information in the “Code” section of your repository.

Coding Style introduced by
Method name "Utf8::get_file_type" is not in camel caps format
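A table-driven lookup is one way to remove most of the branches from this method. The sketch below is an assumption about how that could look, not the project's code: `FILE_SIGNATURES`, `signatureCode()` and `detectFileType()` are hypothetical names, and only four of the signatures are listed. As in the method, the key is formed by concatenating the decimal values of the first two bytes, so `%P` (37, 80) becomes 3780.

```php
<?php

// Hypothetical lookup table: key = decimal values of the first two bytes, concatenated.
const FILE_SIGNATURES = [
    3780   => ['ext' => 'pdf', 'mime' => 'application/pdf'], // "%P"      -> 37, 80
    8075   => ['ext' => 'zip', 'mime' => 'application/zip'], // "PK"      -> 80, 75
    255216 => ['ext' => 'jpg', 'mime' => 'image/jpeg'],      // FF D8     -> 255, 216
    13780  => ['ext' => 'png', 'mime' => 'image/png'],       // 89 "P"    -> 137, 80
];

/**
 * Hypothetical helper: build the numeric signature code from the first two bytes.
 */
function signatureCode(string $str): ?int
{
    if (strlen($str) < 2) {
        return null;
    }

    $bytes = unpack('C2chars', substr($str, 0, 2));
    if ($bytes === false) {
        return null;
    }

    return (int) ($bytes['chars1'] . $bytes['chars2']);
}

/**
 * Hypothetical helper: return the table entry for a known signature, or null.
 */
function detectFileType(string $str): ?array
{
    $code = signatureCode($str);
    if ($code === null || !array_key_exists($code, FILE_SIGNATURES)) {
        return null;
    }

    return FILE_SIGNATURES[$code] + ['type' => 'binary'];
}
```

Adding a new file type then means adding one table row instead of another `case` block.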
        $str,
        $fallback = [
            'ext' => null,
            'mime' => 'application/octet-stream',
            'type' => null,
        ]
    ) {
        if ($str === '') {
            return $fallback;
        }

        $str_info = \substr($str, 0, 2);
        if ($str_info === false || \strlen($str_info) !== 2) {
            return $fallback;
        }

        $str_info = \unpack('C2chars', $str_info);
        if ($str_info === false) {
            return $fallback;
        }
        // Decimal values of the first two bytes, concatenated (e.g. "%P" => 37, 80 => 3780).
        $type_code = (int)($str_info['chars1'] . $str_info['chars2']);

        // DEBUG
        //var_dump($type_code);

        switch ($type_code) {
            case 3780: // "%P"
                $ext = 'pdf';
                $mime = 'application/pdf';
                $type = 'binary';

                break;
            case 7790: // "MZ"
                $ext = 'exe';
                $mime = 'application/octet-stream';
                $type = 'binary';

                break;
            case 7784: // "MT"
                $ext = 'midi';
                $mime = 'audio/x-midi';
                $type = 'binary';

                break;
            case 8075: // "PK"
                $ext = 'zip';
                $mime = 'application/zip';
                $type = 'binary';

                break;
            case 8297: // "Ra"
                $ext = 'rar';
                $mime = 'application/rar';
                $type = 'binary';

                break;
            case 255216: // 0xFF 0xD8
                $ext = 'jpg';
                $mime = 'image/jpeg';
                $type = 'binary';

                break;
            case 7173: // "GI"
                $ext = 'gif';
                $mime = 'image/gif';
                $type = 'binary';

                break;
            case 6677: // "BM"
                $ext = 'bmp';
                $mime = 'image/bmp';
                $type = 'binary';

                break;
            case 13780: // 0x89 "P"
                $ext = 'png';
                $mime = 'image/png';
                $type = 'binary';

                break;
            default:
                return $fallback;
        }

        return [
            'ext' => $ext,
            'mime' => $mime,
            'type' => $type,
        ];
    }

    private function is_utf16($str, $checkIfStringIsBinary = true)
The method is_utf16 has a boolean flag argument $checkIfStringIsBinary, which is a certain sign of a Single Responsibility Principle violation.

Complexity introduced by
This operation has 1152 execution paths, which exceeds the configured maximum of 200.

A high number of execution paths generally suggests many nested conditional statements and makes the code less readable. This can usually be fixed by splitting the method into several smaller methods.

You can also find more information in the “Code” section of your repository.

Coding Style Naming introduced by
The method is_utf16 is not named in camelCase.

This check marks method names that have not been written in camelCase.

In camelCase names are written without any punctuation, the start of each new word being marked by a capital letter. Thus the name database connection string becomes databaseConnectionString.

Comprehensibility Naming introduced by
The variable name $checkIfStringIsBinary exceeds the maximum configured length of 20.

Very long variable names usually make code harder to read. It is therefore recommended not to make variable names too verbose.

Coding Style introduced by
Method name "Utf8::is_utf16" is not in camel caps format
    {

        // init
        $str = (string)$str;
        $strChars = [];

        if (
            $checkIfStringIsBinary === true
            &&
            $this->is_binary($str, true) === false
        ) {
            return false;
        }

        if ($this->SUPPORT['mbstring'] === false) {
            \trigger_error('UTF8::is_utf16() without mbstring may not work correctly', \E_USER_WARNING);
        }

        $str = $this->remove_bom($str);


        $maybeUTF16LE = 0;
        $test = \mb_convert_encoding($str, 'UTF-8', 'UTF-16LE');
        if ($test) {
            $test2 = \mb_convert_encoding($test, 'UTF-16LE', 'UTF-8');
            $test3 = \mb_convert_encoding($test2, 'UTF-8', 'UTF-16LE');
            if ($test3 === $test) {
                if (\count($strChars) === 0) {
                    $strChars = $this->count_chars($str, true, false);
                }
                $countChars = $this->count_chars($test3);
                foreach ($countChars as $test3char => $test3charEmpty) {
                    if (\in_array($test3char, $strChars, true) === true) {
                        ++$maybeUTF16LE;
                    }
                    unset($countChars[$test3char]);
                }

Coding Style introduced by
Blank line found at end of control structure
            }
        }

        $maybeUTF16BE = 0;
        $test = \mb_convert_encoding($str, 'UTF-8', 'UTF-16BE');
        if ($test) {
            $test2 = \mb_convert_encoding($test, 'UTF-16BE', 'UTF-8');
            $test3 = \mb_convert_encoding($test2, 'UTF-8', 'UTF-16BE');
            if ($test3 === $test) {
                if (\count($strChars) === 0) {
                    $strChars = $this->count_chars($str, true, false);
                }
                $countChars = $this->count_chars($test3);
                foreach ($countChars as $test3char => $test3charEmpty) {
                    if (\in_array($test3char, $strChars, true) === true) {
                        ++$maybeUTF16BE;
                    }
                    unset($countChars[$test3char]);
                }

Coding Style introduced by
Blank line found at end of control structure
            }
        }

        if ($maybeUTF16BE !== $maybeUTF16LE) {
            if ($maybeUTF16LE > $maybeUTF16BE) {
                return 1;
            }

            return 2;
        }

        return false;
    }

    private function count_chars($str, $cleanUtf8 = false, $tryToUseMbFunction = true)
The method count_chars has a boolean flag argument $cleanUtf8, which is a certain sign of a Single Responsibility Principle violation.

The method count_chars has a boolean flag argument $tryToUseMbFunction, which is a certain sign of a Single Responsibility Principle violation.

Coding Style Naming introduced by
The method count_chars is not named in camelCase.

This check marks method names that have not been written in camelCase.

In camelCase names are written without any punctuation, the start of each new word being marked by a capital letter. Thus the name database connection string becomes databaseConnectionString.

Coding Style introduced by
Method name "Utf8::count_chars" is not in camel caps format
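A flag argument can often be replaced by letting the caller compose two single-purpose functions. The sketch below assumes that "clean" means dropping bytes that are not part of a well-formed UTF-8 sequence; `keepValidUtf8()`, `splitCodePoints()` and `countChars()` are hypothetical names, and the byte ranges follow the well-formed sequence table of the Unicode Standard.

```php
<?php

/**
 * Hypothetical helper: keep only well-formed UTF-8 sequences, drop everything else.
 */
function keepValidUtf8(string $str): string
{
    // Byte-level pattern (no /u modifier) listing every well-formed sequence shape.
    $wellFormed = '/[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}/';

    preg_match_all($wellFormed, $str, $matches);

    return implode('', $matches[0]);
}

/**
 * Hypothetical helper: split a valid UTF-8 string into code points.
 */
function splitCodePoints(string $str): array
{
    $parts = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false for invalid UTF-8 when /u is used.
    return $parts === false ? [] : $parts;
}

/**
 * Hypothetical helper: count how often each code point occurs.
 */
function countChars(string $str): array
{
    return array_count_values(splitCodePoints($str));
}
```

A caller that previously passed `true` for `$cleanUtf8` would then write `countChars(keepValidUtf8($str))`, which makes the extra step visible at the call site.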
    {
        return array_count_values($this->strSplit($str, 1, $cleanUtf8, $tryToUseMbFunction));
    }

    /**
     * Check if the string is UTF-32.
     *
     * @param mixed $str <p>The input string.</p>
     * @param bool $checkIfStringIsBinary
     *
     * @return false|int
     *                   <strong>false</strong> if it's not UTF-32,<br>
     *                   <strong>1</strong> for UTF-32LE,<br>
     *                   <strong>2</strong> for UTF-32BE
     */
    private function is_utf32($str, $checkIfStringIsBinary = true)
The method is_utf32 has a boolean flag argument $checkIfStringIsBinary, which is a certain sign of a Single Responsibility Principle violation.

Complexity introduced by
This operation has 1152 execution paths, which exceeds the configured maximum of 200.

A high number of execution paths generally suggests many nested conditional statements and makes the code less readable. This can usually be fixed by splitting the method into several smaller methods.

You can also find more information in the “Code” section of your repository.

Coding Style Naming introduced by
The method is_utf32 is not named in camelCase.

This check marks method names that have not been written in camelCase.

In camelCase names are written without any punctuation, the start of each new word being marked by a capital letter. Thus the name database connection string becomes databaseConnectionString.

Comprehensibility Naming introduced by
The variable name $checkIfStringIsBinary exceeds the maximum configured length of 20.

Very long variable names usually make code harder to read. It is therefore recommended not to make variable names too verbose.

Coding Style introduced by
Method name "Utf8::is_utf32" is not in camel caps format
    {
        // init
        $str = (string)$str;
        $strChars = [];

        if ($checkIfStringIsBinary === true && $this->is_binary($str, true) === false) {
            return false;
        }

        if ($this->SUPPORT['mbstring'] === false) {
            \trigger_error('UTF8::is_utf32() without mbstring may not work correctly', \E_USER_WARNING);
        }

        $str = $this->remove_bom($str);

        $maybeUTF32LE = 0;
        $test = \mb_convert_encoding($str, 'UTF-8', 'UTF-32LE');
        if ($test) {
            $test2 = \mb_convert_encoding($test, 'UTF-32LE', 'UTF-8');
            $test3 = \mb_convert_encoding($test2, 'UTF-8', 'UTF-32LE');
            if ($test3 === $test) {
                if (\count($strChars) === 0) {
                    $strChars = $this->count_chars($str, true, false);
                }
                $countChars = $this->count_chars($test3);
                foreach ($countChars as $test3char => $test3charEmpty) {
                    if (\in_array($test3char, $strChars, true) === true) {
                        ++$maybeUTF32LE;
                    }
                    unset($countChars[$test3char]);
                }
            }
        }

        $maybeUTF32BE = 0;
        $test = \mb_convert_encoding($str, 'UTF-8', 'UTF-32BE');
        if ($test) {
            $test2 = \mb_convert_encoding($test, 'UTF-32BE', 'UTF-8');
            $test3 = \mb_convert_encoding($test2, 'UTF-8', 'UTF-32BE');
            if ($test3 === $test) {
                if (\count($strChars) === 0) {
                    $strChars = $this->count_chars($str, true, false);
                }
                $countChars = $this->count_chars($test3);
                foreach ($countChars as $test3char => $test3charEmpty) {
                    if (\in_array($test3char, $strChars, true) === true) {
                        ++$maybeUTF32BE;
                    }
                    unset($countChars[$test3char]);
                }
            }
        }

        if ($maybeUTF32BE !== $maybeUTF32LE) {
            if ($maybeUTF32LE > $maybeUTF32BE) {
                return 1;
            }

            return 2;
        }

        return false;
    }

    /**
     * Checks whether the passed string contains only byte sequences that appear to be valid UTF-8 characters.
     *
     * @see    http://hsivonen.iki.fi/php-utf8/
     *
     * @param string|string[] $str <p>The string to be checked.</p>
     * @param bool $strict <p>Check also if the string is not UTF-16 or UTF-32.</p>
Bug introduced by
There is no parameter named $strict. Was it maybe removed?

This check looks for PHPDoc comments describing methods or function parameters that do not exist on the corresponding method or function.

Consider the following example. The parameter $italy is not defined by the method finale(...).

/**
 * @param array $germany
 * @param array $island
 * @param array $italy
 */
function finale($germany, $island) {
    return "2:1";
}

The most likely cause is that the parameter was removed, but the annotation was not.

     *
     * @return bool
     */
    private function isUtf8($str)
Complexity introduced by
This operation has 640 execution paths, which exceeds the configured maximum of 200.

A high number of execution paths generally suggests many nested conditional statements and makes the code less readable. This can usually be fixed by splitting the method into several smaller methods.

You can also find more information in the “Code” section of your repository.
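Much of the branching in this method comes from classifying each octet. As a sketch only, that classification could live in small helpers with early returns; `utf8SequenceLength()` and `isContinuationOctet()` are hypothetical names. Unlike the method, which reads 5- and 6-octet sequences to the end before rejecting them, this sketch rejects their lead octets immediately.

```php
<?php

/**
 * Hypothetical helper: number of octets in the sequence started by $octet (0-255),
 * or 0 if $octet cannot start a UTF-8 sequence.
 */
function utf8SequenceLength(int $octet): int
{
    if (($octet & 0x80) === 0x00) {
        return 1; // 0xxxxxxx: US-ASCII
    }
    if (($octet & 0xE0) === 0xC0) {
        return 2; // 110xxxxx
    }
    if (($octet & 0xF0) === 0xE0) {
        return 3; // 1110xxxx
    }
    if (($octet & 0xF8) === 0xF0) {
        return 4; // 11110xxx
    }

    return 0; // continuation octet, or a 5/6-octet lead that Unicode does not allow
}

/**
 * Hypothetical helper: true if $octet has the form 10xxxxxx.
 */
function isContinuationOctet(int $octet): bool
{
    return ($octet & 0xC0) === 0x80;
}
```

The checks for overlong forms, surrogates and the 0x10FFFF limit would still be needed after the sequence has been read; the helpers only cover the shape of each octet.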
    {
        if (\is_array($str) === true) {
            foreach ($str as $v) {
                if ($this->isUtf8($v) === false) {
                    return false;
                }
            }

            return true;
        }

        if ($str === '') {
            return true;
        }

        if ($this->system->pcre_utf8_support() !== true) {
Coding Style introduced by
Blank line found at start of control structure

            // If even just the first character can be matched, when the /u
            // modifier is used, then it's valid UTF-8. If the UTF-8 is somehow
            // invalid, nothing at all will match, even if the string contains
            // some valid sequences
            return preg_match('/^.{1}/us', $str, $ar) === 1;
Comprehensibility introduced by
Avoid variables with short names like $ar. Configured minimum length is 3.

Short variable names may make your code harder to understand. Variable names should be self-descriptive. This check looks for variable names that are shorter than a configured minimum.
        }

        $mState = 0; // cached expected number of octets after the current octet
        // until the beginning of the next UTF8 character sequence
        $mUcs4 = 0; // cached Unicode character
        $mBytes = 1; // cached expected number of octets in the current sequence

        if ($this->ORD === null) {
            $this->ORD = $this->getData('ord');
        }

        $len = \strlen((string)$str);
        /** @noinspection ForeachInvariantsInspection */
        for ($i = 0; $i < $len; ++$i) {
            $in = $this->ORD[$str[$i]];
Comprehensibility introduced by
Avoid variables with short names like $in. Configured minimum length is 3.

Short variable names may make your code harder to understand. Variable names should be self-descriptive. This check looks for variable names that are shorter than a configured minimum.
            if ($mState === 0) {
                // When mState is zero we expect either a US-ASCII character or a
                // multi-octet sequence.
                if ((0x80 & $in) === 0) {
                    // US-ASCII, pass straight through.
                    $mBytes = 1;
                } elseif ((0xE0 & $in) === 0xC0) {
                    // First octet of 2 octet sequence.
                    $mUcs4 = $in;
                    $mUcs4 = ($mUcs4 & 0x1F) << 6;
                    $mState = 1;
                    $mBytes = 2;
                } elseif ((0xF0 & $in) === 0xE0) {
                    // First octet of 3 octet sequence.
                    $mUcs4 = $in;
                    $mUcs4 = ($mUcs4 & 0x0F) << 12;
                    $mState = 2;
                    $mBytes = 3;
                } elseif ((0xF8 & $in) === 0xF0) {
                    // First octet of 4 octet sequence.
                    $mUcs4 = $in;
                    $mUcs4 = ($mUcs4 & 0x07) << 18;
                    $mState = 3;
                    $mBytes = 4;
                } elseif ((0xFC & $in) === 0xF8) {
                    /* First octet of 5 octet sequence.
                     *
                     * This is illegal because the encoded codepoint must be either
                     * (a) not the shortest form or
                     * (b) outside the Unicode range of 0-0x10FFFF.
                     * Rather than trying to resynchronize, we will carry on until the end
                     * of the sequence and let the later error handling code catch it.
                     */
                    $mUcs4 = $in;
                    $mUcs4 = ($mUcs4 & 0x03) << 24;
                    $mState = 4;
                    $mBytes = 5;
                } elseif ((0xFE & $in) === 0xFC) {
                    // First octet of 6 octet sequence, see comments for 5 octet sequence.
                    $mUcs4 = $in;
                    $mUcs4 = ($mUcs4 & 1) << 30;
                    $mState = 5;
                    $mBytes = 6;
                } else {
Coding Style introduced by
The method isUtf8 uses an else expression. Else is never necessary and you can simplify the code to work without else.
                    // Current octet is neither in the US-ASCII range nor a legal first
                    // octet of a multi-octet sequence.
                    return false;
                }
            } elseif ((0xC0 & $in) === 0x80) {
Coding Style introduced by
Blank line found at start of control structure

                // When mState is non-zero, we expect a continuation of the multi-octet
                // sequence

                // Legal continuation.
                $shift = ($mState - 1) * 6;
                $tmp = $in;
                $tmp = ($tmp & 0x0000003F) << $shift;
                $mUcs4 |= $tmp;
                // Prefix: End of the multi-octet sequence. mUcs4 now contains the final
                // Unicode code point to be output.
                if (--$mState === 0) {
                    // Check for illegal sequences and code points.
                    //
                    // From Unicode 3.1, non-shortest form is illegal
                    if (
                        ($mBytes === 2 && $mUcs4 < 0x0080)
                        ||
                        ($mBytes === 3 && $mUcs4 < 0x0800)
                        ||
                        ($mBytes === 4 && $mUcs4 < 0x10000)
                        ||
                        ($mBytes > 4)
                        ||
                        // From Unicode 3.2, surrogate characters are illegal.
                        (($mUcs4 & 0xFFFFF800) === 0xD800)
                        ||
                        // Code points outside the Unicode range are illegal.
                        ($mUcs4 > 0x10FFFF)
                    ) {
                        return false;
                    }
                    // initialize UTF8 cache
                    $mState = 0;
                    $mUcs4 = 0;
                    $mBytes = 1;
                }
            } else {
Coding Style introduced by
The method isUtf8 uses an else expression. Else is never necessary and you can simplify the code to work without else.
                // ((0xC0 & (*in) != 0x80) && (mState != 0))
                // Incomplete multi-octet sequence.
                return false;
            }
        }

        return true;
    }

    private function to_iso8859($str)
Unused Code introduced by
This method is not used, and could be removed.

Coding Style Naming introduced by
The method to_iso8859 is not named in camelCase.

This check marks method names that have not been written in camelCase.

In camelCase names are written without any punctuation, the start of each new word being marked by a capital letter. Thus the name database connection string becomes databaseConnectionString.

Coding Style introduced by
Method name "Utf8::to_iso8859" is not in camel caps format
    {
        if (is_array($str) === true) {
Coding Style introduced by
Blank line found at start of control structure

            foreach ($str as $k => $v) {
                $str[$k] = $this->to_iso8859($v);
            }

            return $str;
        }

        $str = (string)$str;
        if ($str === '') {
            return '';
        }

        return $this->utf8_decode($str);
    }

    /**
     * Decodes a UTF-8 string to ISO-8859-1.
     *
     * @param string $str <p>The input string.</p>
     * @param bool $keepUtf8Chars
     *
     * @return string
     */
    private function utf8_decode($str, $keepUtf8Chars = false)
The method utf8_decode has a boolean flag argument $keepUtf8Chars, which is a certain sign of a Single Responsibility Principle violation.

Complexity introduced by
This operation has 480 execution paths, which exceeds the configured maximum of 200.

A high number of execution paths generally suggests many nested conditional statements and makes the code less readable. This can usually be fixed by splitting the method into several smaller methods.

You can also find more information in the “Code” section of your repository.

Coding Style Naming introduced by
The method utf8_decode is not named in camelCase.

This check marks method names that have not been written in camelCase.

In camelCase names are written without any punctuation, the start of each new word being marked by a capital letter. Thus the name database connection string becomes databaseConnectionString.

Coding Style Naming introduced by
The variable $str_backup is not named in camelCase.

Coding Style introduced by
Method name "Utf8::utf8_decode" is not in camel caps format
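The two-octet branch of this method is the part that actually produces ISO-8859-1 output, so it is a natural candidate for extraction. The sketch below restates its arithmetic with `ord()` and `chr()` instead of the class's lookup tables; `decodeTwoOctetSequence()` and `toLatin1Byte()` are hypothetical names, and no validation of the input octets is attempted.

```php
<?php

/**
 * Hypothetical helper: combine a two-octet UTF-8 sequence into its code point.
 * The lead octet contributes its low 5 bits, the continuation octet its low 6 bits.
 */
function decodeTwoOctetSequence(string $lead, string $continuation): int
{
    return ((ord($lead) & 0x1F) << 6) | (ord($continuation) & 0x3F);
}

/**
 * Hypothetical helper: map a code point to one ISO-8859-1 byte, or "?" if it does not fit.
 */
function toLatin1Byte(int $codePoint): string
{
    return $codePoint < 256 ? chr($codePoint) : '?';
}
```

For example, "é" is encoded as the octets C3 A9: the lead octet yields 3, shifted left by six bits to 192, and the continuation octet yields 41, giving 233 (0xE9), which fits into ISO-8859-1. The octets C4 80 give 256, which does not fit and is replaced by "?".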
    {
        if ($str === '') {
            return '';
        }

        // save for later comparison
        $str_backup = $str;
        $len = \strlen($str);

        if ($this->ORD === null) {
            $this->ORD = $this->getData('ord');
        }

        if ($this->CHR === null) {
            $this->CHR = $this->getData('chr');
        }

        $noCharFound = '?';
        /** @noinspection ForeachInvariantsInspection */
        for ($i = 0, $j = 0; $i < $len; ++$i, ++$j) {
            switch ($str[$i] & "\xF0") {
                case "\xC0":
                case "\xD0":
                    $c = ($this->ORD[$str[$i] & "\x1F"] << 6) | $this->ORD[$str[++$i] & "\x3F"];
Comprehensibility introduced by
Avoid variables with short names like $c. Configured minimum length is 3.

Short variable names may make your code harder to understand. Variable names should be self-descriptive. This check looks for variable names that are shorter than a configured minimum.
                    $str[$j] = $c < 256 ? $this->CHR[$c] : $noCharFound;

                    break;

                /** @noinspection PhpMissingBreakStatementInspection */
                case "\xF0":
                    ++$i;

                // no break

                case "\xE0":
                    $str[$j] = $noCharFound;
                    $i += 2;

                    break;

                default:
                    $str[$j] = $str[$i];
            }
        }

        $return = substr($str, 0, $j);
        if ($return === false) {
            $return = '';
        }

        if (
            $keepUtf8Chars === true
            &&
            $this->stringLength($return) >= (int)$this->stringLength($str_backup)
        ) {
            return $str_backup;
        }

        return $return;
    }

    /**
     * @param $str
     * @param string $encoding
Bug introduced by
There is no parameter named $encoding. Was it maybe removed?

This check looks for PHPDoc comments describing methods or function parameters that do not exist on the corresponding method or function.

Consider the following example. The parameter $italy is not defined by the method finale(...).

/**
 * @param array $germany
 * @param array $island
 * @param array $italy
 */
function finale($germany, $island) {
    return "2:1";
}

The most likely cause is that the parameter was removed, but the annotation was not.

     * @param bool $cleanUtf8
Bug introduced by
There is no parameter named $cleanUtf8. Was it maybe removed?

This check looks for PHPDoc comments describing methods or function parameters that do not exist on the corresponding method or function. The most likely cause is that the parameter was removed, but the annotation was not.
     * @return bool|int
     */
    private function stringLength($str)
Complexity introduced by
This operation has 144 execution paths, which exceeds the configured maximum of 10.

A high number of execution paths generally suggests many nested conditional statements and makes the code less readable. This can usually be fixed by splitting the method into several smaller methods.

You can also find more information in the “Code” section of your repository.
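Each strategy in this method (mbstring, iconv, intl, plain PCRE) could become its own small function, leaving the method to pick the first one that is available. As a sketch, the PCRE-only fallback might be extracted as follows; `countCodePoints()` is a hypothetical name, and it returns `null` for invalid input where the method returns `false`.

```php
<?php

/**
 * Hypothetical helper: count the code points of a UTF-8 string using PCRE only.
 * Returns null if the input is not valid UTF-8.
 */
function countCodePoints(string $str): ?int
{
    if ($str === '') {
        return 0;
    }

    // With /u, "." matches one code point; /s lets it match line breaks as well.
    $count = preg_match_all('/./us', $str);

    return $count === false ? null : $count;
}
```

With this helper, a string such as "héllo" (six bytes) is reported as five code points.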
    {
        if ($str === '') {
            return 0;
        }

        if ($this->SUPPORT['mbstring'] === true) {
            return mb_strlen($str, 'UTF-8');
        }

        if ($this->SUPPORT['iconv'] === true) {
            $returnTmp = \iconv_strlen($str, 'UTF-8');
            if ($returnTmp !== false) {
                return $returnTmp;
            }
        }

        if (
            $this->SUPPORT['intl'] === true
        ) {
            $returnTmp = \grapheme_strlen($str);
            if ($returnTmp !== null) {
                return $returnTmp;
            }
        }

        if ($this->isAscii($str)) {
            return strlen($str);
        }

        //
        // fallback via vanilla php
        //

        \preg_match_all('/./us', $str, $parts);

        $returnTmp = \count($parts[0]);
        if ($returnTmp === 0) {
            return false;
        }

        return $returnTmp;
    }

    private function decimalToChr($int)
    {
        return $this->htmlEntityDecode('&#' . $int . ';', \ENT_QUOTES | \ENT_HTML5);
    }


}