Passed: Pull Request, master (#362), created by unknown, 01:49

Font   F

Complexity

Total Complexity 75

Size/Duplication

Total Lines 459
Duplicated Lines 0 %

Test Coverage

Coverage 91.13%

Importance

Changes 13
Bugs 2 Features 1
Metric Value
eloc 201
c 13
b 2
f 1
dl 0
loc 459
ccs 185
cts 203
cp 0.9113
rs 2.4
wmc 75

15 Methods

Rating   Name   Duplication   Size   Complexity  
A getName() 0 3 2
A getType() 0 3 1
A init() 0 4 1
A getDetails() 0 11 2
A setTable() 0 3 1
C loadTranslateTable() 0 106 16
A translateChar() 0 15 6
A uchr() 0 5 1
A decodeOctal() 0 14 3
A getFontSpaceLimit() 0 3 1
B decodeText() 0 46 7
B decodeHexadecimal() 0 31 8
A decodeUnicode() 0 14 3
A decodeEntities() 0 14 3
D decodeContent() 0 84 20

How to fix   Complexity   

Complex Class

Complex classes like Font often do a lot of different things. To break such a class down, identify a cohesive component within it. A common way to find one is to look for fields or methods that share the same prefix or suffix. In Font, for example, six methods share the decode prefix: decodeText(), decodeHexadecimal(), decodeOctal(), decodeUnicode(), decodeEntities(), and decodeContent().

Once you have determined which fields and methods belong together, you can apply the Extract Class refactoring. If the component makes sense as a subclass, Extract Subclass is also a candidate, and is often faster.

While breaking up the class, it is a good idea to analyze how other classes use Font and, based on these observations, apply Extract Interface as well.
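As a rough illustration of Extract Class, the sketch below moves one stateless decode helper into its own class and leaves a thin delegating method behind so existing callers keep working. The names TextDecoder and FontSketch are hypothetical and not part of PdfParser. The octal handling is a simplified stand-in, not the library's implementation: it only accepts the digits 0-7, whereas the original pattern accepts any three digits.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical extracted class holding a stateless decode helper.
 * The name TextDecoder is an assumption for this sketch.
 */
final class TextDecoder
{
    /** Replace octal escapes such as \101 with the byte they encode. */
    public static function decodeOctal(string $text): string
    {
        $decoded = preg_replace_callback(
            '/\\\\([0-7]{3})/',
            static function (array $m): string {
                // Mask to one byte so values above 0377 wrap explicitly.
                return chr(((int) octdec($m[1])) & 0xFF);
            },
            $text
        );

        // preg_replace_callback() returns null on a regex error.
        return $decoded ?? $text;
    }
}

/** Stand-in for Font: it keeps its public method and delegates. */
class FontSketch
{
    public static function decodeOctal(string $text): string
    {
        return TextDecoder::decodeOctal($text);
    }
}

echo FontSketch::decodeOctal('\\101\\102C'), PHP_EOL;
```

Keeping the delegating method means the extraction can be done in one step, with callers migrated to the new class later.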

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 * @date    2017-01-03
 *
 * @license LGPLv3
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser;

/**
 * Class Font
 */
class Font extends PDFObject
{
    const MISSING = '?';

    /**
     * @var array
     */
    protected $table = null;

    /**
     * @var array
     */
    protected $tableSizes = null;

    public function init()
    {
        // Load translate table.
        $this->loadTranslateTable();
    }

    /**
     * @return string
     */
    public function getName()
    {
        return $this->has('BaseFont') ? (string) $this->get('BaseFont') : '[Unknown]';
    }

    /**
     * @return string
     */
    public function getType()
    {
        return (string) $this->header->get('Subtype');
    }

    /**
     * @return array
     */
    public function getDetails($deep = true)
    {
        $details = [];

        $details['Name'] = $this->getName();
        $details['Type'] = $this->getType();
        $details['Encoding'] = ($this->has('Encoding') ? (string) $this->get('Encoding') : 'Ansi');

        $details += parent::getDetails($deep);

        return $details;
    }

    /**
     * @param string $char
     * @param bool   $use_default
     *
     * @return string|bool
     */
    public function translateChar($char, $use_default = true)
    {
        $dec = hexdec(bin2hex($char));

        if (\array_key_exists($dec, $this->table)) {
            return $this->table[$dec];
        }

        // fallback for decoding single-byte ANSI characters that are not in the lookup table
        $fallbackDecoded = $char;
        if (\strlen($char) < 2 && $this->has('Encoding') && 'WinAnsiEncoding' === $this->get('Encoding')->__toString()) {
            $fallbackDecoded = self::uchr($dec);

Bug: $dec can also be of type double, but parameter $code of Smalot\PdfParser\Font::uchr() appears to accept only integer. Consider adding a type check.

If this is a false positive, the issue can be ignored in code via the ignore-type annotation:

            $fallbackDecoded = self::uchr(/** @scrutinizer ignore-type */ $dec);

        }

        return $use_default ? self::MISSING : $fallbackDecoded;
    }

    /**
     * @param int $code
     *
     * @return string
     */
    public static function uchr($code)
    {
        // html_entity_decode() will not work with UTF-16 or UTF-32 char entities,
        // therefore, we use mb_convert_encoding() instead
        return mb_convert_encoding('&#'.((int) $code).';', 'UTF-8', 'HTML-ENTITIES');
    }

    /**
     * @return array
     */
    public function loadTranslateTable()
    {
        if (null !== $this->table) {
            return $this->table;
        }

        $this->table = [];
        $this->tableSizes = [
            'from' => 1,
            'to' => 1,
        ];

        if ($this->has('ToUnicode')) {
            $content = $this->get('ToUnicode')->getContent();
            $matches = [];

            // Support for multiple spacerange sections
            if (preg_match_all('/begincodespacerange(?P<sections>.*?)endcodespacerange/s', $content, $matches)) {
                foreach ($matches['sections'] as $section) {
                    $regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)>[ \r\n]+/is';

                    preg_match_all($regexp, $section, $matches);

                    $this->tableSizes = [
                        'from' => max(1, \strlen(current($matches['from'])) / 2),
                        'to' => max(1, \strlen(current($matches['to'])) / 2),
                    ];

                    break;
                }
            }

            // Support for multiple bfchar sections
            if (preg_match_all('/beginbfchar(?P<sections>.*?)endbfchar/s', $content, $matches)) {
                foreach ($matches['sections'] as $section) {
                    $regexp = '/<(?P<from>[0-9A-F]+)> +<(?P<to>[0-9A-F]+)>[ \r\n]+/is';

                    preg_match_all($regexp, $section, $matches);

                    $this->tableSizes['from'] = max(1, \strlen(current($matches['from'])) / 2);

                    foreach ($matches['from'] as $key => $from) {
                        $parts = preg_split(
                            '/([0-9A-F]{4})/i',
                            $matches['to'][$key],
                            0,
                            PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
                        );
                        $text = '';
                        foreach ($parts as $part) {
                            $text .= self::uchr(hexdec($part));

Bug: hexdec($part) can also be of type double, but parameter $code of Smalot\PdfParser\Font::uchr() appears to accept only integer. Consider adding a type check.

If this is a false positive, the issue can be ignored in code via the ignore-type annotation:

                            $text .= self::uchr(/** @scrutinizer ignore-type */ hexdec($part));

                        }
                        $this->table[hexdec($from)] = $text;
                    }
                }
            }

            // Support for multiple bfrange sections
            if (preg_match_all('/beginbfrange(?P<sections>.*?)endbfrange/s', $content, $matches)) {
                foreach ($matches['sections'] as $section) {
                    // Support for : <srcCode1> <srcCode2> <dstString>
                    $regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *<(?P<offset>[0-9A-F]+)>[ \r\n]+/is';

                    preg_match_all($regexp, $section, $matches);

                    foreach ($matches['from'] as $key => $from) {
                        $char_from = hexdec($from);
                        $char_to = hexdec($matches['to'][$key]);
                        $offset = hexdec($matches['offset'][$key]);

                        for ($char = $char_from; $char <= $char_to; ++$char) {
                            $this->table[$char] = self::uchr($char - $char_from + $offset);
                        }
                    }

                    // Support for : <srcCode1> <srcCodeN> [<dstString1> <dstString2> ... <dstStringN>]
                    // Some PDF file has 2-byte Unicode values on new lines > added \r\n
                    $regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *\[(?P<strings>[\r\n<>0-9A-F ]+)\][ \r\n]+/is';

                    preg_match_all($regexp, $section, $matches);

                    foreach ($matches['from'] as $key => $from) {
                        $char_from = hexdec($from);
                        $strings = [];

                        preg_match_all('/<(?P<string>[0-9A-F]+)> */is', $matches['strings'][$key], $strings);

                        foreach ($strings['string'] as $position => $string) {
                            $parts = preg_split(
                                '/([0-9A-F]{4})/i',
                                $string,
                                0,
                                PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
                            );
                            $text = '';
                            foreach ($parts as $part) {
                                $text .= self::uchr(hexdec($part));
                            }
                            $this->table[$char_from + $position] = $text;
                        }
                    }
                }
            }
        }

        return $this->table;
    }

    /**
     * @param array $table
     */
    public function setTable($table)
    {
        $this->table = $table;
    }

    /**
     * @param string $hexa
     * @param bool   $add_braces
     *
     * @return string
     */
    public static function decodeHexadecimal($hexa, $add_braces = false)
    {
        // Special shortcut for XML content.
        if (false !== stripos($hexa, '<?xml')) {
            return $hexa;
        }

        $text = '';
        $parts = preg_split('/(<[a-f0-9]+>)/si', $hexa, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

        foreach ($parts as $part) {
            if (preg_match('/^<.*>$/s', $part) && false === stripos($part, '<?xml')) {
                // strip line breaks
                $part = preg_replace("/[\r\n]/", '', $part);
                $part = trim($part, '<>');
                if ($add_braces) {
                    $text .= '(';
                }

                $part = pack('H*', $part);
                $text .= ($add_braces ? preg_replace('/\\\/s', '\\\\\\', $part) : $part);

                if ($add_braces) {
                    $text .= ')';
                }
            } else {
                $text .= $part;
            }
        }

        return $text;
    }

    /**
     * @param string $text
     *
     * @return string
     */
    public static function decodeOctal($text)
    {
        $parts = preg_split('/(\\\\\d{3})/s', $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
        $text = '';

        foreach ($parts as $part) {
            if (preg_match('/^\\\\\d{3}$/', $part)) {
                $text .= \chr(octdec(trim($part, '\\')));

Bug: octdec(trim($part, '\')) can also be of type double, but parameter $ascii of chr() appears to accept only integer. Consider adding a type check.

If this is a false positive, the issue can be ignored in code via the ignore-type annotation:

                $text .= \chr(/** @scrutinizer ignore-type */ octdec(trim($part, '\\')));

            } else {
                $text .= $part;
            }
        }

        return $text;
    }

    /**
     * @param string $text
     *
     * @return string
     */
    public static function decodeEntities($text)
    {
        $parts = preg_split('/(#\d{2})/s', $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
        $text = '';

        foreach ($parts as $part) {
            if (preg_match('/^#\d{2}$/', $part)) {
                $text .= \chr(hexdec(trim($part, '#')));

Bug: hexdec(trim($part, '#')) can also be of type double, but parameter $ascii of chr() appears to accept only integer. Consider adding a type check.

If this is a false positive, the issue can be ignored in code via the ignore-type annotation:

                $text .= \chr(/** @scrutinizer ignore-type */ hexdec(trim($part, '#')));

            } else {
                $text .= $part;
            }
        }

        return $text;
    }

    /**
     * @param string $text
     *
     * @return string
     */
    public static function decodeUnicode($text)
    {
        if (preg_match('/^\xFE\xFF/i', $text)) {
            // Strip U+FEFF byte order marker.
            $decode = substr($text, 2);
            $text = '';
            $length = \strlen($decode);

            for ($i = 0; $i < $length; $i += 2) {
                $text .= self::uchr(hexdec(bin2hex(substr($decode, $i, 2))));

Bug: hexdec(bin2hex(substr($decode, $i, 2))) can also be of type double, but parameter $code of Smalot\PdfParser\Font::uchr() appears to accept only integer. Consider adding a type check.

If this is a false positive, the issue can be ignored in code via the ignore-type annotation:

                $text .= self::uchr(/** @scrutinizer ignore-type */ hexdec(bin2hex(substr($decode, $i, 2))));

            }
        }

        return $text;
    }

    /**
     * @return int
     */
    protected function getFontSpaceLimit()
    {
        return -50;
    }

    /**
     * @param array $commands
     *
     * @return string
     */
    public function decodeText($commands)
    {
        $text = '';
        $word_position = 0;
        $words = [];
        $unicode = false;

Unused code: The assignment to $unicode is dead and can be removed.

        $font_space = $this->getFontSpaceLimit();

        foreach ($commands as $command) {
            switch ($command[PDFObject::TYPE]) {
                case 'n':
                    if ((float) (trim($command[PDFObject::COMMAND])) < $font_space) {
                        $word_position = \count($words);
                    }
                    continue 2;

                case '<':
                    // Decode hexadecimal.
                    $text = self::decodeHexadecimal('<'.$command[PDFObject::COMMAND].'>');
                    break;

                default:
                    // Decode octal (if necessary).
                    $text = self::decodeOctal($command[PDFObject::COMMAND]);
            }

            // replace escaped chars
            $text = str_replace(
                ['\\\\', '\(', '\)', '\n', '\r', '\t', '\f', '\ '],
                ['\\', '(', ')', "\n", "\r", "\t", "\f", ' '],
                $text
            );

            // add content to result string
            if (isset($words[$word_position])) {
                $words[$word_position] .= $text;
            } else {
                $words[$word_position] = $text;
            }
        }

        foreach ($words as &$word) {
            $word = $this->decodeContent($word);
        }

        return implode(' ', $words);
    }

    /**
     * @param string $text
     * @param bool   $unicode This parameter is deprecated and might be removed in a future release
     *
     * @return string
     */
    public function decodeContent($text, &$unicode = null)
    {
        if ($this->has('ToUnicode')) {
            $bytes = $this->tableSizes['from'];

            if ($bytes) {
                $result = '';
                $length = \strlen($text);

                for ($i = 0; $i < $length; $i += $bytes) {
                    $char = substr($text, $i, $bytes);

                    if (false !== ($decoded = $this->translateChar($char, false))) {
                        $char = $decoded;
                    } elseif ($this->has('DescendantFonts')) {
                        if ($this->get('DescendantFonts') instanceof PDFObject) {
                            $fonts = $this->get('DescendantFonts')->getHeader()->getElements();

Bug: The method getHeader() does not exist on Smalot\PdfParser\Element.

If this is a false positive, the issue can be ignored in code via the ignore-call annotation:

                            $fonts = $this->get('DescendantFonts')->/** @scrutinizer ignore-call */ getHeader()->getElements();

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces. The most likely cause is a typographical error or a renamed method.

                        } else {
                            $fonts = $this->get('DescendantFonts')->getContent();
                        }
                        $decoded = false;

                        foreach ($fonts as $font) {
                            if ($font instanceof self) {
                                if (false !== ($decoded = $font->translateChar($char, false))) {
                                    $decoded = mb_convert_encoding($decoded, 'UTF-8', 'Windows-1252');

Bug: $decoded can also be of type true, but parameter $str of mb_convert_encoding() appears to accept only string. Consider adding a type check.

If this is a false positive, the issue can be ignored in code via the ignore-type annotation:

                                    $decoded = mb_convert_encoding(/** @scrutinizer ignore-type */ $decoded, 'UTF-8', 'Windows-1252');

                                    break;
                                }
                            }
                        }

                        if (false !== $decoded) {
                            $char = $decoded;
                        } else {
                            $char = mb_convert_encoding($char, 'UTF-8', 'Windows-1252');
                        }
                    } else {
                        $char = self::MISSING;
                    }

                    $result .= $char;
                }

                $text = $result;
            }
        } elseif ($this->has('Encoding') && $this->get('Encoding') instanceof Encoding) {

Issue: $this->get('Encoding') is never a sub-type of Smalot\PdfParser\Encoding.

            /** @var Encoding $encoding */
            $encoding = $this->get('Encoding');
            $unicode = mb_check_encoding($text, 'UTF-8');
            $result = '';
            if ($unicode) {
                $chars = preg_split(
                        '//s'.($unicode ? 'u' : ''),
                        $text,
                        -1,
                        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
                );

                foreach ($chars as $char) {
                    $dec_av = hexdec(bin2hex($char));
                    $dec_ap = $encoding->translateChar($dec_av);
                    $result .= self::uchr($dec_ap);
                }
            } else {
                $length = \strlen($text);

                for ($i = 0; $i < $length; ++$i) {
                    $dec_av = hexdec(bin2hex($text[$i]));
                    $dec_ap = $encoding->translateChar($dec_av);
                    $result .= self::uchr($dec_ap);
                }
            }
            $text = $result;
        } elseif ($this->get('Encoding') instanceof Element &&
                  $this->get('Encoding')->equals('MacRomanEncoding')) {
            // mb_convert_encoding does not support MacRoman/macintosh,
            // so we use iconv() here
            $text = iconv('macintosh', 'UTF-8', $text);
        } elseif (!mb_check_encoding($text, 'UTF-8')) {
            // don't double-encode strings already in UTF-8
            $text = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
        }

        return $text;
    }
}
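
Several of the issues flagged in this file come from the same cause: hexdec() and octdec() return a float once the value exceeds PHP_INT_MAX, while uchr() and chr() expect an integer. One possible way to add the suggested type check is a small conversion helper, sketched below. The function name hexToInt and its exception choices are assumptions for illustration. Nothing like it exists in the reviewed code, and the ignore-type annotation remains an alternative where the input length is already bounded.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of PdfParser): convert a hex string to an
 * int, rejecting inputs for which hexdec() would return a float.
 */
function hexToInt(string $hex): int
{
    if ($hex === '' || !ctype_xdigit($hex)) {
        throw new InvalidArgumentException("Not a hex string: '$hex'");
    }

    $value = hexdec($hex);

    // hexdec() switches to float when the value no longer fits in an int.
    if (!is_int($value)) {
        throw new OverflowException("Hex value too large for int: '$hex'");
    }

    return $value;
}

echo hexToInt('0041'), PHP_EOL;
```

A call such as self::uchr(hexdec($part)) could then become self::uchr(hexToInt($part)), which gives the analyzer a declared int return type to work with.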