Completed
Push — master ( 8eae75...67f352 )
by Lars
05:02
created

Normalizer   C

Complexity

Total Complexity 63

Size/Duplication

Total Lines 361
Duplicated Lines 2.49 %

Coupling/Cohesion

Components 1
Dependencies 0

Test Coverage

Coverage 98.43%

Importance

Changes 3
Bugs 0 Features 2
Metric Value
wmc 63
c 3
b 0
f 2
lcom 1
cbo 0
dl 9
loc 361
ccs 188
cts 191
cp 0.9843
rs 5.8893

5 Methods

Rating  Name            Duplication  Size  Complexity
B       isNormalized()  0            18    5
F       decompose()     0            117   22
C       normalize()     0            52    13
A       getData()       9            9     2
D       recompose()     0            93    21


Duplicated Code

Duplicate code is one of the most pungent code smells. A rule that is often used is to re-structure code once it is duplicated in three or more places.

Common duplication problems, and corresponding solutions are:
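For the getData() method that this report flags as duplicated, one possible remedy is to keep the loading logic in a single shared place. The following is a minimal sketch, not the project's actual code: the trait name `UnidataLoaderTrait`, the class `ExampleShim`, and the extra `$baseDir` parameter are assumptions made for illustration.

```php
<?php

/**
 * Hypothetical trait (the name is an assumption, not part of the project).
 * It holds the loading logic once so that each shim class can reuse it
 * instead of carrying its own copy of getData().
 */
trait UnidataLoaderTrait
{
    /**
     * Load a serialized lookup table.
     *
     * @param string      $file    Base name of the table, without extension.
     * @param string|null $baseDir Directory with the .ser files (assumed parameter).
     *
     * @return mixed|false The unserialized data, or false if the file is missing.
     */
    protected static function getData($file, $baseDir = null)
    {
        // Fall back to a "unidata" directory next to this file, as the original does.
        $baseDir = $baseDir === null ? __DIR__ . '/unidata' : $baseDir;
        $path = $baseDir . '/' . $file . '.ser';

        if (!is_file($path)) {
            return false;
        }

        return unserialize(file_get_contents($path));
    }
}

/** Hypothetical consumer; any class needing the tables would do the same. */
final class ExampleShim
{
    use UnidataLoaderTrait;

    public static function load($file, $baseDir)
    {
        return static::getData($file, $baseDir);
    }
}
```

With this shape, every class that needs the lookup tables adds one `use` line, and a change to the file format only has to be made once.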

Complex Class

Tip: Before tackling complexity, make sure that you eliminate any duplication first. This can often reduce the size of classes significantly.

Complex classes like Normalizer often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes or suffixes. You can also look at the cohesion graph to spot any unconnected or weakly connected components.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

While breaking up the class, it is a good idea to analyze how other classes use Normalizer, and based on these observations, apply Extract Interface, too.
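As an illustration of Extract Class, the "Hangul chars" branches in decompose() and recompose() look like one cohesive component. The sketch below is hypothetical: the class name `HangulSyllable` and its methods are assumptions, and it works on integer code points (using `intdiv()`, so PHP 7 or later) rather than on UTF-8 bytes as the original does. The constants follow the standard Hangul syllable arithmetic, which matches the 21, 28 and 588 factors in the listing.

```php
<?php

/**
 * Hypothetical class extracted from the Hangul branches of Normalizer.
 * Operates on integer code points to keep the sketch short.
 */
final class HangulSyllable
{
    const S_BASE  = 0xAC00; // first precomposed syllable
    const L_BASE  = 0x1100; // first leading consonant
    const V_BASE  = 0x1161; // first vowel
    const T_BASE  = 0x11A7; // one before the first trailing consonant
    const V_COUNT = 21;
    const T_COUNT = 28;
    const S_COUNT = 11172;  // 19 * 21 * 28

    /** @return bool true if $cp is a precomposed Hangul syllable */
    public static function isSyllable($cp)
    {
        return $cp >= self::S_BASE && $cp < self::S_BASE + self::S_COUNT;
    }

    /** @return int[] the jamo code points, or array($cp) if not a syllable */
    public static function decompose($cp)
    {
        if (!self::isSyllable($cp)) {
            return array($cp);
        }

        $index = $cp - self::S_BASE;
        $nCount = self::V_COUNT * self::T_COUNT; // 588

        $jamo = array(
            self::L_BASE + intdiv($index, $nCount),
            self::V_BASE + intdiv($index % $nCount, self::T_COUNT),
        );

        // A trailing consonant is present only when the remainder is non-zero.
        $t = $index % self::T_COUNT;
        if ($t !== 0) {
            $jamo[] = self::T_BASE + $t;
        }

        return $jamo;
    }

    /** @return int the precomposed syllable for the given jamo */
    public static function compose($l, $v, $t = self::T_BASE)
    {
        return self::S_BASE
            + (($l - self::L_BASE) * self::V_COUNT + ($v - self::V_BASE)) * self::T_COUNT
            + ($t - self::T_BASE);
    }
}
```

For example, `HangulSyllable::decompose(0xD55C)` yields the three code points 0x1112, 0x1161 and 0x11AB, and `compose()` maps them back to 0xD55C. Moving this arithmetic out would shorten both decompose() and recompose(), the two methods with the worst ratings in this report.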

<?php

/*
 * Copyright (C) 2013 Nicolas Grekas - [email protected]
 *
 * This library is free software; you can redistribute it and/or modify it
 * under the terms of the (at your option):
 * Apache License v2.0 (http://apache.org/licenses/LICENSE-2.0.txt), or
 * GNU General Public License v2.0 (http://gnu.org/licenses/gpl-2.0.txt).
 */

namespace voku\helper\shim;

/**
 * Normalizer is a PHP fallback implementation of the Normalizer class provided by the intl extension.
 *
 * It has been validated with Unicode 6.3 Normalization Conformance Test.
 * See http://www.unicode.org/reports/tr15/ for detailed info about Unicode normalizations.
 *
 * @package voku\helper\shim
 */
class Normalizer
{
  const NONE    = 1;
  const FORM_D  = 2;
  const NFD     = 2;
  const FORM_KD = 3;
  const NFKD    = 3;
  const FORM_C  = 4;
  const NFC     = 4;
  const FORM_KC = 5;
  const NFKC    = 5;

  protected static $C, $D, $KD, $cC;

Coding Style: It is generally advisable to only define one property per statement.

Only declaring a single property per statement allows you to later on add doc comments more easily.

It is also recommended by PSR2, so it is a common style that many people expect.

  /**
   * @var array
   */
  protected static $ulen_mask = array(
      "\xC0" => 2,
      "\xD0" => 2,
      "\xE0" => 3,
      "\xF0" => 4,
  );

  /**
   * @var string
   */
  protected static $ASCII = "\x20\x65\x69\x61\x73\x6E\x74\x72\x6F\x6C\x75\x64\x5D\x5B\x63\x6D\x70\x27\x0A\x67\x7C\x68\x76\x2E\x66\x62\x2C\x3A\x3D\x2D\x71\x31\x30\x43\x32\x2A\x79\x78\x29\x28\x4C\x39\x41\x53\x2F\x50\x22\x45\x6A\x4D\x49\x6B\x33\x3E\x35\x54\x3C\x44\x34\x7D\x42\x7B\x38\x46\x77\x52\x36\x37\x55\x47\x4E\x3B\x4A\x7A\x56\x23\x48\x4F\x57\x5F\x26\x21\x4B\x3F\x58\x51\x25\x59\x5C\x09\x5A\x2B\x7E\x5E\x24\x40\x60\x7F\x00\x01\x02\x03\x04\x05\x06\x07\x08\x0B\x0C\x0D\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F";

  /**
   * is normalized
   *
   * @param string $str
   * @param int    $form
   *
   * @return bool
   */
  public static function isNormalized($str, $form = self::NFC)
  {
    if (strspn($str .= '', self::$ASCII) === strlen($str)) {
      return true;
    }

    if (
        self::NFC === $form
        &&
        preg_match('//u', $str)
        &&
        !preg_match('/[^\x00-\x{2FF}]/u', $str)
    ) {
      return true;
    }

    return false; // Pretend false as quick checks implemented in PHP won't be so quick
  }

  /**
   * normalize
   *
   * @param string $str
   * @param int    $form
   *
   * @return false|string false on error
   */
  public static function normalize($str, $form = self::NFC)
  {
    if (!preg_match('//u', $str .= '')) {
      return false;
    }

    switch ($form) {
      case self::NONE:
        return $str;
      case self::NFC:
        $C = true;
        $K = false;
        break;
      case self::NFD:
        $C = false;
        $K = false;
        break;
      case self::NFKC:
        $C = true;
        $K = true;
        break;
      case self::NFKD:
        $C = false;
        $K = true;
        break;
      default:
        return false;
    }

    if ('' === $str) {
      return '';
    }

    if ($K && empty(self::$KD)) {
      self::$KD = static::getData('compatibilityDecomposition');
    }

    if (empty(self::$D)) {
      self::$D = static::getData('canonicalDecomposition');
      self::$cC = static::getData('combiningClass');
    }

    if ($C) {
      if (empty(self::$C)) {
        self::$C = static::getData('canonicalComposition');
      }

      return self::recompose(self::decompose($str, $K));
    } else {
      return self::decompose($str, $K);
    }
  }

  /**
   * getData
   *
   * @param string $file
   *
   * @return bool|mixed
   */
  protected static function getData($file)

Duplication: This method seems to be duplicated in your project.

Duplicated code is one of the most pungent code smells. If you need to duplicate the same code in three or more different places, we strongly encourage you to look into extracting the code into a single class or operation.

You can also find more detailed suggestions in the “Code” section of your repository.

  {
    $file = __DIR__ . '/unidata/' . $file . '.ser';
    if (file_exists($file)) {
      return unserialize(file_get_contents($file));
    } else {
      return false;
    }
  }

  /**
   * recompose
   *
   * @param string $s
   *
   * @return string
   */
  protected static function recompose($s)
  {
    $ASCII = self::$ASCII;
    $compMap = self::$C;
    $combClass = self::$cC;
    $ulen_mask = self::$ulen_mask;

    $result = $tail = '';

    $i = $s[0] < "\x80" ? 1 : $ulen_mask[$s[0] & "\xF0"];
    $len = strlen($s);

    $last_uchr = substr($s, 0, $i);
    $last_ucls = isset($combClass[$last_uchr]) ? 256 : 0;

    while ($i < $len) {
      if ($s[$i] < "\x80") {
        // ASCII chars

        if ($tail) {
          $last_uchr .= $tail;
          $tail = '';
        }

        $j = strspn($s, $ASCII, $i + 1);

        if ($j) {
          $last_uchr .= substr($s, $i, $j);
          $i += $j;
        }

        $result .= $last_uchr;
        $last_uchr = $s[$i];
        ++$i;
      } else {
        $ulen = $ulen_mask[$s[$i] & "\xF0"];
        $uchr = substr($s, $i, $ulen);

        if (
            $last_uchr < "\xE1\x84\x80"
            ||
            "\xE1\x84\x92" < $last_uchr
            ||
            $uchr < "\xE1\x85\xA1"
            ||
            "\xE1\x85\xB5" < $uchr
            ||
            $last_ucls
        ) {
          // Table lookup and combining chars composition

          $ucls = isset($combClass[$uchr]) ? $combClass[$uchr] : 0;

          if (isset($compMap[$last_uchr . $uchr]) && (!$last_ucls || $last_ucls < $ucls)) {
            $last_uchr = $compMap[$last_uchr . $uchr];
          } elseif ($last_ucls = $ucls) { // this "=" isn't a typo
            $tail .= $uchr;
          } else {
            if ($tail) {
              $last_uchr .= $tail;
              $tail = '';
            }

            $result .= $last_uchr;
            $last_uchr = $uchr;
          }
        } else {
          // Hangul chars

          $L = ord($last_uchr[2]) - 0x80;
          $V = ord($uchr[2]) - 0xA1;
          $T = 0;

          $uchr = substr($s, $i + $ulen, 3);

          if ("\xE1\x86\xA7" <= $uchr && $uchr <= "\xE1\x87\x82") {
            $T = ord($uchr[2]) - 0xA7;
            0 > $T && $T += 0x40;
            $ulen += 3;
          }

          $L = 0xAC00 + ($L * 21 + $V) * 28 + $T;
          $last_uchr = chr(0xE0 | $L >> 12) . chr(0x80 | $L >> 6 & 0x3F) . chr(0x80 | $L & 0x3F);
        }

        $i += $ulen;
      }
    }

    $result = $result . $last_uchr . $tail;

    return $result;
  }

  /**
   * decompose
   *
   * @param string $s
   * @param bool   $c
   *
   * @return string
   */
  protected static function decompose($s, $c)
  {
    $result = '';

    $ASCII = self::$ASCII;
    $decompMap = self::$D;
    $combClass = self::$cC;
    $ulen_mask = self::$ulen_mask;

    if ($c) {
      $compatMap = self::$KD;
    }

    // init
    $c = array();
    $i = 0;
    $len = strlen($s);

    while ($i < $len) {
      if ($s[$i] < "\x80") {
        // ASCII chars

        if ($c) {

Bug (Best Practice): The expression $c of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using ! empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

          ksort($c);
          $result .= implode('', $c);
          $c = array();
        }

        $j = 1 + strspn($s, $ASCII, $i + 1);
        $result .= substr($s, $i, $j);
        $i += $j;
      } else {
        $ulen = $ulen_mask[$s[$i] & "\xF0"];
        $uchr = substr($s, $i, $ulen);
        $i += $ulen;

        if (isset($combClass[$uchr])) {
          // Combining chars, for sorting

          isset($c[$combClass[$uchr]]) || $c[$combClass[$uchr]] = '';

          if (isset($compatMap[$uchr])) {
            $c[$combClass[$uchr]] .= $compatMap[$uchr];
          } else {
            $c[$combClass[$uchr]] .= (isset($decompMap[$uchr]) ? $decompMap[$uchr] : $uchr);
          }

        } else {

          if ($c) {

Bug (Best Practice): The expression $c of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using ! empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

            ksort($c);
            $result .= implode('', $c);
            $c = array();
          }

          if ($uchr < "\xEA\xB0\x80" || "\xED\x9E\xA3" < $uchr) {
            // Table lookup

            if (isset($compatMap[$uchr])) {
              $j = $compatMap[$uchr];
            } else {
              $j = (isset($decompMap[$uchr]) ? $decompMap[$uchr] : $uchr);
            }

            if ($uchr != $j) {
              $uchr = $j;

              $j = strlen($uchr);
              $ulen = $uchr[0] < "\x80" ? 1 : $ulen_mask[$uchr[0] & "\xF0"];

              if ($ulen != $j) {
                // Put trailing chars in $s

                $j -= $ulen;
                $i -= $j;

                if (0 > $i) {
                  $s = str_repeat(' ', -$i) . $s;
                  $len -= $i;
                  $i = 0;
                }

                while ($j--) {
                  $s[$i + $j] = $uchr[$ulen + $j];
                }

                $uchr = substr($uchr, 0, $ulen);
              }
            }

          } else {
            // Hangul chars

            $uchr = unpack('C*', $uchr);
            $j = (($uchr[1] - 224) << 12) + (($uchr[2] - 128) << 6) + $uchr[3] - 0xAC80;

            $uchr = "\xE1\x84" . chr(0x80 + (int)($j / 588)) . "\xE1\x85" . chr(0xA1 + (int)(($j % 588) / 28));

            $j = $j % 28;

            if ($j) {
              $uchr .= $j < 25 ? ("\xE1\x86" . chr(0xA7 + $j)) : ("\xE1\x87" . chr(0x67 + $j));
            }
          }

          $result .= $uchr;
        }
      }
    }

    if ($c) {

Bug (Best Practice): The expression $c of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using ! empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

      ksort($c);
      $result .= implode('', $c);
    }

    return $result;
  }
}
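
The Bug (Best Practice) notes in decompose() all concern the same pattern: testing the array $c directly in an `if`. The following is a minimal sketch of the explicit form the notes recommend. The helper name `flushPending` is hypothetical and does not exist in the class; it only mirrors the repeated ksort/implode/reset sequence from the listing.

```php
<?php

$pending = array();

// Loose comparison treats an empty array as equal to false...
$looselyEqual = ($pending == false);   // true
// ...but the two values are not identical.
$identical = ($pending === false);     // false

/**
 * Hypothetical helper: emit buffered combining characters ordered by
 * their combining class (the array key), then reset the buffer.
 */
function flushPending(array &$pending)
{
    // Explicit emptiness check instead of relying on array-to-bool conversion.
    if (empty($pending)) {
        return '';
    }

    ksort($pending);
    $out = implode('', $pending);
    $pending = array();

    return $out;
}
```

Extracting the sequence into one helper would also remove the three near-identical blocks in decompose(), so the explicit check would only need to be written once.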