Passed
Pull Request — master (#362)
by unknown, created 01:49

Font::decodeContent()   D

Complexity

Conditions 20
Paths 7

Size

Total Lines 84
Code Lines 54

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 37
CRAP Score 28.2734

Importance

Changes 5
Bugs 0
Features 1

Metric Value
cc 20
eloc 54
c 5
b 0
f 1
nc 7
nop 2
dl 0
loc 84
ccs 37
cts 51
cp 0.7255
crap 28.2734
rs 4.1666
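
The reported CRAP score can be checked by hand. The sketch below assumes the commonly cited CRAP formula, comp^2 * (1 - cov)^3 + comp, and reads `cc` as the complexity and `cp` as the covered fraction (`ccs / cts`, here 37 / 51). Both readings are inferred from the values above rather than stated by the report, and `crapScore()` is a hypothetical helper name.

```php
<?php

/**
 * CRAP score under the commonly cited formula:
 *   complexity^2 * (1 - coverage)^3 + complexity
 * where coverage is a fraction between 0 and 1.
 */
function crapScore(int $complexity, float $coverage): float
{
    return $complexity ** 2 * (1.0 - $coverage) ** 3 + $complexity;
}

// Coverage fraction as it appears in the metric table (cp), and as ccs / cts.
$reportedCoverage = 0.7255;
$computedCoverage = 37 / 51;

printf("coverage (ccs / cts): %.4f\n", $computedCoverage);
printf("CRAP (cc = 20):       %.2f\n", crapScore(20, $reportedCoverage));
```

With the rounded coverage from the table this lands within rounding distance of the reported 28.2734. Because coverage enters cubed, raising it shrinks the first term quickly, but the score can never drop below the complexity itself, so here it stays at 20 or above until the method is split.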

How to fix: Long Method, Complexity

Long Method

Small methods make your code easier to understand, particularly when combined with a good name. A small method is also usually much easier to name.

For example, if you find yourself adding comments to a method's body, that is usually a good sign that the commented part should be extracted into a new method, with the comment as a starting point for the new method's name.
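
As a minimal sketch of that idea, the following shows a commented method body split into named methods. The functions here (`summarizeBefore()`, `dropEmptyEntries()`, `joinWithSingleSpaces()`, `summarizeAfter()`) are hypothetical and not part of PdfParser; they only illustrate how each comment turns into a method name.

```php
<?php

// Before: one function whose comments mark two separate steps.
function summarizeBefore(array $words): string
{
    // drop empty entries
    $kept = [];
    foreach ($words as $word) {
        $word = trim($word);
        if ('' !== $word) {
            $kept[] = $word;
        }
    }

    // join with single spaces
    return implode(' ', $kept);
}

// After: each commented step becomes a small function named after its comment.
function dropEmptyEntries(array $words): array
{
    $trimmed = array_map('trim', $words);

    return array_values(array_filter($trimmed, static function ($word) {
        return '' !== $word;
    }));
}

function joinWithSingleSpaces(array $words): string
{
    return implode(' ', $words);
}

function summarizeAfter(array $words): string
{
    return joinWithSingleSpaces(dropEmptyEntries($words));
}
```

Both versions return the same string for the same input; the second simply moves each step behind a name, which also lowers the number of conditions in any single function.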

Commonly applied refactorings include:

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 * @date    2017-01-03
 *
 * @license LGPLv3
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser;

/**
 * Class Font
 */
class Font extends PDFObject
{
    const MISSING = '?';

    /**
     * @var array
     */
    protected $table = null;

    /**
     * @var array
     */
    protected $tableSizes = null;

    public function init()
    {
        // Load translate table.
        $this->loadTranslateTable();
    }

    /**
     * @return string
     */
    public function getName()
    {
        return $this->has('BaseFont') ? (string) $this->get('BaseFont') : '[Unknown]';
    }

    /**
     * @return string
     */
    public function getType()
    {
        return (string) $this->header->get('Subtype');
    }

    /**
     * @return array
     */
    public function getDetails($deep = true)
    {
        $details = [];

        $details['Name'] = $this->getName();
        $details['Type'] = $this->getType();
        $details['Encoding'] = ($this->has('Encoding') ? (string) $this->get('Encoding') : 'Ansi');

        $details += parent::getDetails($deep);

        return $details;
    }

    /**
     * @param string $char
     * @param bool   $use_default
     *
     * @return string|bool
     */
    public function translateChar($char, $use_default = true)
    {
        $dec = hexdec(bin2hex($char));

        if (\array_key_exists($dec, $this->table)) {
            return $this->table[$dec];
        }

        // fallback for decoding single-byte ANSI characters that are not in the lookup table
        $fallbackDecoded = $char;
        if (\strlen($char) < 2 && $this->has('Encoding') && 'WinAnsiEncoding' === $this->get('Encoding')->__toString()) {
            $fallbackDecoded = self::uchr($dec);

Bug: It seems like $dec can also be of type double; however, parameter $code of Smalot\PdfParser\Font::uchr() only seems to accept integer. Maybe add an additional type check? (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-type annotation:

            $fallbackDecoded = self::uchr(/** @scrutinizer ignore-type */ $dec);

        }

        return $use_default ? self::MISSING : $fallbackDecoded;
    }

    /**
     * @param int $code
     *
     * @return string
     */
    public static function uchr($code)
    {
        // html_entity_decode() will not work with UTF-16 or UTF-32 char entities,
        // therefore, we use mb_convert_encoding() instead
        return mb_convert_encoding('&#'.((int) $code).';', 'UTF-8', 'HTML-ENTITIES');
    }

    /**
     * @return array
     */
    public function loadTranslateTable()
    {
        if (null !== $this->table) {
            return $this->table;
        }

        $this->table = [];
        $this->tableSizes = [
            'from' => 1,
            'to' => 1,
        ];

        if ($this->has('ToUnicode')) {
            $content = $this->get('ToUnicode')->getContent();
            $matches = [];

            // Support for multiple spacerange sections
            if (preg_match_all('/begincodespacerange(?P<sections>.*?)endcodespacerange/s', $content, $matches)) {
                foreach ($matches['sections'] as $section) {
                    $regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)>[ \r\n]+/is';

                    preg_match_all($regexp, $section, $matches);

                    $this->tableSizes = [
                        'from' => max(1, \strlen(current($matches['from'])) / 2),
                        'to' => max(1, \strlen(current($matches['to'])) / 2),
                    ];

                    break;
                }
            }

            // Support for multiple bfchar sections
            if (preg_match_all('/beginbfchar(?P<sections>.*?)endbfchar/s', $content, $matches)) {
                foreach ($matches['sections'] as $section) {
                    $regexp = '/<(?P<from>[0-9A-F]+)> +<(?P<to>[0-9A-F]+)>[ \r\n]+/is';

                    preg_match_all($regexp, $section, $matches);

                    $this->tableSizes['from'] = max(1, \strlen(current($matches['from'])) / 2);

                    foreach ($matches['from'] as $key => $from) {
                        $parts = preg_split(
                            '/([0-9A-F]{4})/i',
                            $matches['to'][$key],
                            0,
                            PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
                        );
                        $text = '';
                        foreach ($parts as $part) {
                            $text .= self::uchr(hexdec($part));

Bug: It seems like hexdec($part) can also be of type double; however, parameter $code of Smalot\PdfParser\Font::uchr() only seems to accept integer. Maybe add an additional type check? (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-type annotation:

                            $text .= self::uchr(/** @scrutinizer ignore-type */ hexdec($part));

                        }
                        $this->table[hexdec($from)] = $text;
                    }
                }
            }

            // Support for multiple bfrange sections
            if (preg_match_all('/beginbfrange(?P<sections>.*?)endbfrange/s', $content, $matches)) {
                foreach ($matches['sections'] as $section) {
                    // Support for : <srcCode1> <srcCode2> <dstString>
                    $regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *<(?P<offset>[0-9A-F]+)>[ \r\n]+/is';

                    preg_match_all($regexp, $section, $matches);

                    foreach ($matches['from'] as $key => $from) {
                        $char_from = hexdec($from);
                        $char_to = hexdec($matches['to'][$key]);
                        $offset = hexdec($matches['offset'][$key]);

                        for ($char = $char_from; $char <= $char_to; ++$char) {
                            $this->table[$char] = self::uchr($char - $char_from + $offset);
                        }
                    }

                    // Support for : <srcCode1> <srcCodeN> [<dstString1> <dstString2> ... <dstStringN>]
                    // Some PDF file has 2-byte Unicode values on new lines > added \r\n
                    $regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *\[(?P<strings>[\r\n<>0-9A-F ]+)\][ \r\n]+/is';

                    preg_match_all($regexp, $section, $matches);

                    foreach ($matches['from'] as $key => $from) {
                        $char_from = hexdec($from);
                        $strings = [];

                        preg_match_all('/<(?P<string>[0-9A-F]+)> */is', $matches['strings'][$key], $strings);

                        foreach ($strings['string'] as $position => $string) {
                            $parts = preg_split(
                                '/([0-9A-F]{4})/i',
                                $string,
                                0,
                                PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
                            );
                            $text = '';
                            foreach ($parts as $part) {
                                $text .= self::uchr(hexdec($part));
                            }
                            $this->table[$char_from + $position] = $text;
                        }
                    }
                }
            }
        }

        return $this->table;
    }

    /**
     * @param array $table
     */
    public function setTable($table)
    {
        $this->table = $table;
    }

    /**
     * @param string $hexa
     * @param bool   $add_braces
     *
     * @return string
     */
    public static function decodeHexadecimal($hexa, $add_braces = false)
    {
        // Special shortcut for XML content.
        if (false !== stripos($hexa, '<?xml')) {
            return $hexa;
        }

        $text = '';
        $parts = preg_split('/(<[a-f0-9]+>)/si', $hexa, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

        foreach ($parts as $part) {
            if (preg_match('/^<.*>$/s', $part) && false === stripos($part, '<?xml')) {
                // strip line breaks
                $part = preg_replace("/[\r\n]/", '', $part);
                $part = trim($part, '<>');
                if ($add_braces) {
                    $text .= '(';
                }

                $part = pack('H*', $part);
                $text .= ($add_braces ? preg_replace('/\\\/s', '\\\\\\', $part) : $part);

                if ($add_braces) {
                    $text .= ')';
                }
            } else {
                $text .= $part;
            }
        }

        return $text;
    }

    /**
     * @param string $text
     *
     * @return string
     */
    public static function decodeOctal($text)
    {
        $parts = preg_split('/(\\\\\d{3})/s', $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
        $text = '';

        foreach ($parts as $part) {
            if (preg_match('/^\\\\\d{3}$/', $part)) {
                $text .= \chr(octdec(trim($part, '\\')));

Bug: It seems like octdec(trim($part, '\\')) can also be of type double; however, parameter $ascii of chr() only seems to accept integer. Maybe add an additional type check? (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-type annotation:

                $text .= \chr(/** @scrutinizer ignore-type */ octdec(trim($part, '\\')));

            } else {
                $text .= $part;
            }
        }

        return $text;
    }

    /**
     * @param string $text
     *
     * @return string
     */
    public static function decodeEntities($text)
    {
        $parts = preg_split('/(#\d{2})/s', $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
        $text = '';

        foreach ($parts as $part) {
            if (preg_match('/^#\d{2}$/', $part)) {
                $text .= \chr(hexdec(trim($part, '#')));

Bug: It seems like hexdec(trim($part, '#')) can also be of type double; however, parameter $ascii of chr() only seems to accept integer. Maybe add an additional type check? (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-type annotation:

                $text .= \chr(/** @scrutinizer ignore-type */ hexdec(trim($part, '#')));

            } else {
                $text .= $part;
            }
        }

        return $text;
    }

    /**
     * @param string $text
     *
     * @return string
     */
    public static function decodeUnicode($text)
    {
        if (preg_match('/^\xFE\xFF/i', $text)) {
            // Strip U+FEFF byte order marker.
            $decode = substr($text, 2);
            $text = '';
            $length = \strlen($decode);

            for ($i = 0; $i < $length; $i += 2) {
                $text .= self::uchr(hexdec(bin2hex(substr($decode, $i, 2))));

Bug: It seems like hexdec(bin2hex(substr($decode, $i, 2))) can also be of type double; however, parameter $code of Smalot\PdfParser\Font::uchr() only seems to accept integer. Maybe add an additional type check? (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-type annotation:

                $text .= self::uchr(/** @scrutinizer ignore-type */ hexdec(bin2hex(substr($decode, $i, 2))));

            }
        }

        return $text;
    }

    /**
     * @return int
     */
    protected function getFontSpaceLimit()
    {
        return -50;
    }

    /**
     * @param array $commands
     *
     * @return string
     */
    public function decodeText($commands)
    {
        $text = '';
        $word_position = 0;
        $words = [];
        $unicode = false;

Unused Code: The assignment to $unicode is dead and can be removed.

        $font_space = $this->getFontSpaceLimit();

        foreach ($commands as $command) {
            switch ($command[PDFObject::TYPE]) {
                case 'n':
                    if ((float) (trim($command[PDFObject::COMMAND])) < $font_space) {
                        $word_position = \count($words);
                    }
                    continue 2;

                case '<':
                    // Decode hexadecimal.
                    $text = self::decodeHexadecimal('<'.$command[PDFObject::COMMAND].'>');
                    break;

                default:
                    // Decode octal (if necessary).
                    $text = self::decodeOctal($command[PDFObject::COMMAND]);
            }

            // replace escaped chars
            $text = str_replace(
                ['\\\\', '\(', '\)', '\n', '\r', '\t', '\f', '\ '],
                ['\\', '(', ')', "\n", "\r", "\t", "\f", ' '],
                $text
            );

            // add content to result string
            if (isset($words[$word_position])) {
                $words[$word_position] .= $text;
            } else {
                $words[$word_position] = $text;
            }
        }

        foreach ($words as &$word) {
            $word = $this->decodeContent($word);
        }

        return implode(' ', $words);
    }

    /**
     * @param string $text
     * @param bool   $unicode This parameter is deprecated and might be removed in a future release
     *
     * @return string
     */
    public function decodeContent($text, &$unicode = null)
    {
        if ($this->has('ToUnicode')) {
            $bytes = $this->tableSizes['from'];

            if ($bytes) {
                $result = '';
                $length = \strlen($text);

                for ($i = 0; $i < $length; $i += $bytes) {
                    $char = substr($text, $i, $bytes);

                    if (false !== ($decoded = $this->translateChar($char, false))) {
                        $char = $decoded;
                    } elseif ($this->has('DescendantFonts')) {
                        if ($this->get('DescendantFonts') instanceof PDFObject) {
                            $fonts = $this->get('DescendantFonts')->getHeader()->getElements();

Bug: The method getHeader() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

                            $fonts = $this->get('DescendantFonts')->/** @scrutinizer ignore-call */ getHeader()->getElements();

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.

                        } else {
                            $fonts = $this->get('DescendantFonts')->getContent();
                        }
                        $decoded = false;

                        foreach ($fonts as $font) {
                            if ($font instanceof self) {
                                if (false !== ($decoded = $font->translateChar($char, false))) {
                                    $decoded = mb_convert_encoding($decoded, 'UTF-8', 'Windows-1252');

Bug: It seems like $decoded can also be of type true; however, parameter $str of mb_convert_encoding() only seems to accept string. Maybe add an additional type check? (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-type annotation:

                                    $decoded = mb_convert_encoding(/** @scrutinizer ignore-type */ $decoded, 'UTF-8', 'Windows-1252');

                                    break;
                                }
                            }
                        }

                        if (false !== $decoded) {
                            $char = $decoded;
                        } else {
                            $char = mb_convert_encoding($char, 'UTF-8', 'Windows-1252');
                        }
                    } else {
                        $char = self::MISSING;
                    }

                    $result .= $char;
                }

                $text = $result;
            }
        } elseif ($this->has('Encoding') && $this->get('Encoding') instanceof Encoding) {

$this->get('Encoding') is never a sub-type of Smalot\PdfParser\Encoding.

            /** @var Encoding $encoding */
            $encoding = $this->get('Encoding');
            $unicode = mb_check_encoding($text, 'UTF-8');
            $result = '';
            if ($unicode) {
                $chars = preg_split(
                        '//s'.($unicode ? 'u' : ''),
                        $text,
                        -1,
                        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
                );

                foreach ($chars as $char) {
                    $dec_av = hexdec(bin2hex($char));
                    $dec_ap = $encoding->translateChar($dec_av);
                    $result .= self::uchr($dec_ap);
                }
            } else {
                $length = \strlen($text);

                for ($i = 0; $i < $length; ++$i) {
                    $dec_av = hexdec(bin2hex($text[$i]));
                    $dec_ap = $encoding->translateChar($dec_av);
                    $result .= self::uchr($dec_ap);
                }
            }
            $text = $result;
        } elseif ($this->get('Encoding') instanceof Element &&
                  $this->get('Encoding')->equals('MacRomanEncoding')) {
            // mb_convert_encoding does not support MacRoman/macintosh,
            // so we use iconv() here
            $text = iconv('macintosh', 'UTF-8', $text);
        } elseif (!mb_check_encoding($text, 'UTF-8')) {
            // don't double-encode strings already in UTF-8
            $text = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
        }

        return $text;
    }
}
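
Several of the issues flagged in this file come from the same cause: `hexdec()` and `octdec()` return a float when the value does not fit in a platform integer, while `uchr()` and `chr()` expect an integer. One possible form of the suggested "additional type check" is sketched below. `hexToInt()` is a hypothetical helper, not part of PdfParser, and throwing an exception is only one of several reasonable ways to handle the overflow case.

```php
<?php

/**
 * Hypothetical helper: convert a hex string to an int and refuse
 * values that hexdec() can only represent as a float.
 */
function hexToInt(string $hex): int
{
    $value = hexdec($hex);

    if (!\is_int($value)) {
        throw new \OverflowException(
            sprintf('Hex value "%s" does not fit in an integer.', $hex)
        );
    }

    return $value;
}

// A 4-digit code unit from a ToUnicode CMap stays an int.
var_dump(hexToInt('0041'));

// An oversized value is rejected instead of silently becoming a float.
try {
    hexToInt('FFFFFFFFFFFFFFFFFF');
} catch (\OverflowException $e) {
    echo $e->getMessage(), "\n";
}
```

Routing the conversions through a helper like this would give the analyzer a declared `int` return type to work with. For the 2-byte and 4-digit inputs used in this class the float branch should not be reachable in practice, so the check mainly documents that assumption.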