Font   F

Complexity

Total Complexity 95

Size/Duplication

Total Lines 667
Duplicated Lines 0 %

Test Coverage

Coverage 93.45%

Importance

Changes 15
Bugs 2 Features 2
Metric Value
eloc 259
c 15
b 2
f 2
dl 0
loc 667
ccs 257
cts 275
cp 0.9345
rs 2
wmc 95

25 Methods

Rating   Name   Duplication   Size   Complexity  
A getName() 0 3 2
A getType() 0 3 1
A init() 0 4 1
A getDetails() 0 11 2
B translateChar() 0 26 8
A uchr() 0 14 2
A decodeOctal() 0 16 1
A getFontSpaceLimit() 0 3 1
A createEncodingByPdfObject() 0 8 1
A setTable() 0 3 1
A getInitializedEncodingByPdfObject() 0 7 2
B decodeText() 0 67 9
A decodeContentByEncodingEncoding() 0 12 2
B decodeHexadecimal() 0 31 7
A decodeContentByAutodetectIfNecessary() 0 7 2
A decodeContentByEncoding() 0 25 5
A calculateTextWidth() 0 34 5
A decodeUnicode() 0 14 3
A decodeEntities() 0 5 1
A decodeContent() 0 21 5
D loadTranslateTable() 0 114 19
A decodeContentByEncodingElement() 0 9 2
B decodeContentByToUnicodeCMapOrDescendantFonts() 0 46 10
A getIconvEncodingNameOrNullByPdfEncodingName() 0 11 2
A createInitializedEncodingByPdfObject() 0 6 1

How to fix   Complexity   

Complex Class

Complex classes like Font often do a lot of different things. To break such a class down, identify a cohesive component within it. A common way to find such a component is to look for fields or methods that share a prefix or suffix; in Font, for example, 11 of the 25 methods listed above start with decode.
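
The prefix heuristic can be tried mechanically. The following is a minimal sketch, assuming the method names are copied by hand from the table above; groupByPrefix() is a hypothetical helper and not part of PdfParser or of the analysis tool.

```php
<?php
declare(strict_types=1);

// Method names copied from the method table above.
$methods = [
    'getName', 'getType', 'init', 'getDetails', 'translateChar', 'uchr',
    'decodeOctal', 'getFontSpaceLimit', 'createEncodingByPdfObject', 'setTable',
    'getInitializedEncodingByPdfObject', 'decodeText', 'decodeContentByEncodingEncoding',
    'decodeHexadecimal', 'decodeContentByAutodetectIfNecessary', 'decodeContentByEncoding',
    'calculateTextWidth', 'decodeUnicode', 'decodeEntities', 'decodeContent',
    'loadTranslateTable', 'decodeContentByEncodingElement',
    'decodeContentByToUnicodeCMapOrDescendantFonts',
    'getIconvEncodingNameOrNullByPdfEncodingName', 'createInitializedEncodingByPdfObject',
];

/**
 * Group camelCase method names by their leading lowercase word.
 * Hypothetical helper, written only for this illustration.
 *
 * @param string[] $names
 *
 * @return array<string, string[]>
 */
function groupByPrefix(array $names): array
{
    $groups = [];
    foreach ($names as $name) {
        // Leading run of lowercase letters, e.g. "decode" in "decodeOctal".
        $prefix = preg_match('/^[a-z]+/', $name, $m) ? $m[0] : $name;
        $groups[$prefix][] = $name;
    }
    // Largest groups first: they are the strongest extraction candidates.
    uasort($groups, fn (array $a, array $b): int => count($b) <=> count($a));

    return $groups;
}

$groups = groupByPrefix($methods);
foreach ($groups as $prefix => $names) {
    printf("%-10s %d\n", $prefix, count($names));
}
```

A shared prefix is only a hint. Whether the grouped methods also share state (here, mostly $table and $tableSizes) still has to be checked by reading the code.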

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
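
As an illustration of Extract Class, the sketch below moves one stateless helper into its own class and leaves a delegating method behind. PdfStringDecoder and FontSketch are hypothetical names chosen for this example; the decoding logic mirrors decodeOctal() from the listing, with a shortened placeholder string and an explicit int cast.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical class produced by Extract Class; the name is an assumption.
 * The body mirrors the decodeOctal() logic of Font.
 */
final class PdfStringDecoder
{
    public static function decodeOctal(string $text): string
    {
        // Protect escaped backslashes so they are not read as the start of an octal escape.
        $text = strtr($text, ['\\\\' => '[**dblslsh**]']);
        $text = preg_replace_callback('/\\\\([0-7]{1,3})/', static function (array $m): string {
            return \chr((int) octdec($m[1]));
        }, $text);
        // Unescape parentheses, then restore the protected backslashes.
        $text = str_replace(['\\(', '\\)'], ['(', ')'], $text);

        return str_replace('[**dblslsh**]', '\\', $text);
    }
}

/**
 * Simplified stand-in for Font: it keeps its public method and delegates,
 * so existing callers of the static method would keep working.
 */
class FontSketch
{
    public static function decodeOctal(string $text): string
    {
        return PdfStringDecoder::decodeOctal($text);
    }
}

echo FontSketch::decodeOctal('\\110\\151'), "\n"; // prints "Hi"
```

Keeping the delegating method is one possible design choice: it lets the extraction happen without a breaking change, and the delegate can be deprecated later.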

While breaking up the class, it is a good idea to analyze how other classes use Font, and based on these observations, apply Extract Interface, too.
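
Extract Interface could look roughly like the sketch below. CharTranslator, TableCharTranslator, and decodeBytes() are hypothetical names introduced for this example; only the translateChar() signature is taken from the listing, and the Encoding-based fallback of the real method is left out.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical interface; the name is an assumption. It captures the one
 * method a caller needs when it only maps character codes to text.
 */
interface CharTranslator
{
    /** @return string|bool */
    public function translateChar(string $char, bool $use_default = true);
}

/**
 * Minimal table-backed implementation, loosely modeled on the table lookup
 * in Font::translateChar(). The Encoding-based fallback is omitted.
 */
final class TableCharTranslator implements CharTranslator
{
    public const MISSING = '?';

    /** @var array<int, string> */
    private $table;

    public function __construct(array $table)
    {
        $this->table = $table;
    }

    public function translateChar(string $char, bool $use_default = true)
    {
        $dec = hexdec(bin2hex($char));
        if (\array_key_exists($dec, $this->table)) {
            return $this->table[$dec];
        }

        return $use_default ? self::MISSING : $char;
    }
}

// A caller typed against the interface no longer depends on Font itself.
function decodeBytes(CharTranslator $translator, string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $byte) {
        $out .= $translator->translateChar($byte);
    }

    return $out;
}

echo decodeBytes(new TableCharTranslator([0x41 => 'A', 0x42 => 'B']), "AB\x01"), "\n"; // prints "AB?"
```

Which methods belong on such an interface depends on how other classes actually call Font, which this page does not show, so the single-method interface here is only a guess.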

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 *
 * @date    2017-01-03
 *
 * @license LGPLv3
 *
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser;

use Smalot\PdfParser\Encoding\WinAnsiEncoding;
use Smalot\PdfParser\Exception\EncodingNotFoundException;

/**
 * Class Font
 */
class Font extends PDFObject
{
    public const MISSING = '?';

    /**
     * @var array
     */
    protected $table;

    /**
     * @var array
     */
    protected $tableSizes;

    /**
     * Caches results from uchr.
     *
     * @var array
     */
    private static $uchrCache = [];

    /**
     * In some PDF-files encoding could be referenced by object id but object itself does not contain
     * `/Type /Encoding` in its dictionary. These objects wouldn't be initialized as Encoding in
     * \Smalot\PdfParser\PDFObject::factory() during file parsing (they would be just PDFObject).
     *
     * Therefore, we create an instance of Encoding from them during decoding and cache this value in this property.
     *
     * @var Encoding
     *
     * @see https://github.com/smalot/pdfparser/pull/500
     */
    private $initializedEncodingByPdfObject;

    public function init()
    {
        // Load translate table.
        $this->loadTranslateTable();
    }

    public function getName(): string
    {
        return $this->has('BaseFont') ? (string) $this->get('BaseFont') : '[Unknown]';
    }

    public function getType(): string
    {
        return (string) $this->header->get('Subtype');
    }

    public function getDetails(bool $deep = true): array
    {
        $details = [];

        $details['Name'] = $this->getName();
        $details['Type'] = $this->getType();
        $details['Encoding'] = ($this->has('Encoding') ? (string) $this->get('Encoding') : 'Ansi');

        $details += parent::getDetails($deep);

        return $details;
    }

    /**
     * @return string|bool
     */
    public function translateChar(string $char, bool $use_default = true)
    {
        $dec = hexdec(bin2hex($char));

        if (\array_key_exists($dec, $this->table)) {
            return $this->table[$dec];
        }

        // fallback for decoding single-byte ANSI characters that are not in the lookup table
        $fallbackDecoded = $char;
        if (
            \strlen($char) < 2
            && $this->has('Encoding')
            && $this->get('Encoding') instanceof Encoding
        ) {
            try {
                if (WinAnsiEncoding::class === $this->get('Encoding')->__toString()) {
                    $fallbackDecoded = self::uchr($dec);
                }
            } catch (EncodingNotFoundException $e) {
                // Encoding->getEncodingClass() throws EncodingNotFoundException when BaseEncoding doesn't exist
                // See table 5.11 on PDF 1.5 specs for more info
            }
        }

        return $use_default ? self::MISSING : $fallbackDecoded;
    }

    /**
     * Convert unicode character code to "utf-8" encoded string.
     *
     * @param int|float $code Unicode character code. Will be cast to int internally!
     */
    public static function uchr($code): string
    {
        // note:
        // $code was typed as int before, but changed in https://github.com/smalot/pdfparser/pull/623
        // because in some cases uchr was called with a float instead of an integer.
        $code = (int) $code;

        if (!isset(self::$uchrCache[$code])) {
            // html_entity_decode() will not work with UTF-16 or UTF-32 char entities,
            // therefore, we use mb_convert_encoding() instead
            self::$uchrCache[$code] = mb_convert_encoding("&#{$code};", 'UTF-8', 'HTML-ENTITIES');
        }

        return self::$uchrCache[$code];
    }

    /**
     * Init internal chars translation table by ToUnicode CMap.
     */
    public function loadTranslateTable(): array
    {
        if (null !== $this->table) {
            return $this->table;
        }

        $this->table = [];
        $this->tableSizes = [
            'from' => 1,
            'to' => 1,
        ];

        if ($this->has('ToUnicode')) {
            $content = $this->get('ToUnicode')->getContent();
            $matches = [];

            // Support for multiple spacerange sections
            if (preg_match_all('/begincodespacerange(?P<sections>.*?)endcodespacerange/s', $content, $matches)) {
                foreach ($matches['sections'] as $section) {
                    $regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)>[ \r\n]+/is';

                    preg_match_all($regexp, $section, $matches);

                    $this->tableSizes = [
                        'from' => max(1, \strlen(current($matches['from'])) / 2),
                        'to' => max(1, \strlen(current($matches['to'])) / 2),
                    ];

                    break;
                }
            }

            // Support for multiple bfchar sections
            if (preg_match_all('/beginbfchar(?P<sections>.*?)endbfchar/s', $content, $matches)) {
                foreach ($matches['sections'] as $section) {
                    $regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)>[ \r\n]+/is';

                    preg_match_all($regexp, $section, $matches);

                    $this->tableSizes['from'] = max(1, \strlen(current($matches['from'])) / 2);

                    foreach ($matches['from'] as $key => $from) {
                        $parts = preg_split(
                            '/([0-9A-F]{4})/i',
                            $matches['to'][$key],
                            0,
                            \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE
                        );
                        $text = '';
                        foreach ($parts as $part) {
                            $text .= self::uchr(hexdec($part));
                        }
                        $this->table[hexdec($from)] = $text;
                    }
                }
            }

            // Support for multiple bfrange sections
            if (preg_match_all('/beginbfrange(?P<sections>.*?)endbfrange/s', $content, $matches)) {
                foreach ($matches['sections'] as $section) {
                    /**
                     * Regexp to capture <from>, <to>, and either <offset> or [...] items.
                     * - (?P<from>...) Source range's start
                     * - (?P<to>...)   Source range's end
                     * - (?P<dest>...) Destination range's offset or each char code
                     *                 Some PDF files have 2-byte Unicode values on new lines, hence the added \r\n
                     */
                    $regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *(?P<dest><[0-9A-F]+>|\[[\r\n<>0-9A-F ]+\])[ \r\n]+/is';

                    preg_match_all($regexp, $section, $matches);

                    foreach ($matches['from'] as $key => $from) {
                        $char_from = hexdec($from);
                        $char_to = hexdec($matches['to'][$key]);
                        $dest = $matches['dest'][$key];

                        if (1 === preg_match('/^<(?P<offset>[0-9A-F]+)>$/i', $dest, $offset_matches)) {
                            // Support for : <srcCode1> <srcCode2> <dstString>
                            $offset = hexdec($offset_matches['offset']);

                            for ($char = $char_from; $char <= $char_to; ++$char) {
                                $this->table[$char] = self::uchr($char - $char_from + $offset);
                            }
                        } else {
                            // Support for : <srcCode1> <srcCodeN> [<dstString1> <dstString2> ... <dstStringN>]
                            $strings = [];
                            $matched = preg_match_all('/<(?P<string>[0-9A-F]+)> */is', $dest, $strings);
                            if (false === $matched || 0 === $matched) {
                                continue;
                            }

                            foreach ($strings['string'] as $position => $string) {
                                $parts = preg_split(
                                    '/([0-9A-F]{4})/i',
                                    $string,
                                    0,
                                    \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE
                                );
                                if (false === $parts) {
                                    continue;
                                }
                                $text = '';
                                foreach ($parts as $part) {
                                    $text .= self::uchr(hexdec($part));
                                }
                                $this->table[$char_from + $position] = $text;
                            }
                        }
                    }
                }
            }
        }

        return $this->table;
    }

    /**
     * Set custom char translation table where:
     * - key - integer character code;
     * - value - "utf-8" encoded value;
     *
     * @return void
     */
    public function setTable(array $table)
    {
        $this->table = $table;
    }

    /**
     * Calculate text width with data from header 'Widths'. If width of character is not found then character is added to missing array.
     */
    public function calculateTextWidth(string $text, ?array &$missing = null): ?float
    {
        $index_map = array_flip($this->table);
        $details = $this->getDetails();

        // Usually, Widths key is set in $details array, but if it isn't use an empty array instead.
        $widths = $details['Widths'] ?? [];

        /*
         * Widths array is zero indexed but table is not. We must map them based on FirstChar and LastChar
         *
         * Note: Without the change you would see warnings in PHP 8.4 because the values of FirstChar or LastChar
         *       can be null sometimes.
         */
        $width_map = array_flip(range((int) $details['FirstChar'], (int) $details['LastChar']));

        $width = null;
        $missing = [];
        $textLength = mb_strlen($text);
        for ($i = 0; $i < $textLength; ++$i) {
            $char = mb_substr($text, $i, 1);
            if (
                !\array_key_exists($char, $index_map)
                || !\array_key_exists($index_map[$char], $width_map)
                || !\array_key_exists($width_map[$index_map[$char]], $widths)
            ) {
                $missing[] = $char;
                continue;
            }
            $width_index = $width_map[$index_map[$char]];
            $width += $widths[$width_index];
        }

        return $width;
    }

    /**
     * Decode hexadecimal encoded string. If $add_braces is true result value would be wrapped by parentheses.
     */
    public static function decodeHexadecimal(string $hexa, bool $add_braces = false): string
    {
        // Special shortcut for XML content.
        if (false !== stripos($hexa, '<?xml')) {
            return $hexa;
        }

        $text = '';
        $parts = preg_split('/(<[a-f0-9\s]+>)/si', $hexa, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE);

        foreach ($parts as $part) {
            if (preg_match('/^<[a-f0-9\s]+>$/si', $part)) {
                // strip whitespace
                $part = preg_replace("/\s/", '', $part);
                $part = trim($part, '<>');
                if ($add_braces) {
                    $text .= '(';
                }

                $part = pack('H*', $part);
                $text .= ($add_braces ? preg_replace('/\\\/s', '\\\\\\', $part) : $part);

                if ($add_braces) {
                    $text .= ')';
                }
            } else {
                $text .= $part;
            }
        }

        return $text;
    }

    /**
     * Decode string with octal-encoded chunks.
     */
    public static function decodeOctal(string $text): string
    {
        // Replace all double backslashes \\ with a special string
        $text = strtr($text, ['\\\\' => '[**pdfparserdblslsh**]']);

        // Now we can replace all octal codes without worrying about
        // escaped backslashes
        $text = preg_replace_callback('/\\\\([0-7]{1,3})/', function ($m) {
            return \chr(octdec($m[1]));
        }, $text);

        // Unescape any parentheses
        $text = str_replace(['\\(', '\\)'], ['(', ')'], $text);

        // Replace instances of the special string with a single backslash
        return str_replace('[**pdfparserdblslsh**]', '\\', $text);
    }

    /**
     * Decode string with html entity encoded chars.
     */
    public static function decodeEntities(string $text): string
    {
        return preg_replace_callback('/#([0-9a-f]{2})/i', function ($m) {
            return \chr(hexdec($m[1]));
        }, $text);
    }

    /**
     * Check if given string is Unicode text (by BOM);
     * If true - decode to "utf-8" encoded string.
     * Otherwise - return text as is.
     *
     * @todo Rename in next major release to make the name correspond to reality (for ex. decodeIfUnicode())
     */
    public static function decodeUnicode(string $text): string
    {
        if ("\xFE\xFF" === substr($text, 0, 2)) {
            // Strip U+FEFF byte order marker.
            $decode = substr($text, 2);
            $text = '';
            $length = \strlen($decode);

            for ($i = 0; $i < $length; $i += 2) {
                $text .= self::uchr(hexdec(bin2hex(substr($decode, $i, 2))));
            }
        }

        return $text;
    }

    /**
     * @todo Deprecated, use $this->config->getFontSpaceLimit() instead.
     */
    protected function getFontSpaceLimit(): int
    {
        return $this->config->getFontSpaceLimit();
    }

    /**
     * Decode text by commands array.
     */
    public function decodeText(array $commands, float $fontFactor = 4): string
    {
        $word_position = 0;
        $words = [];
        $font_space = $this->getFontSpaceLimit() * abs($fontFactor) / 4;

        foreach ($commands as $command) {
            switch ($command[PDFObject::TYPE]) {
                case 'n':
                    $offset = (float) trim($command[PDFObject::COMMAND]);
                    if ($offset - (float) $font_space < 0) {
                        $word_position = \count($words);
                    }
                    continue 2;
                case '<':
                    // Decode hexadecimal.
                    $text = self::decodeHexadecimal('<'.$command[PDFObject::COMMAND].'>');
                    break;

                default:
                    // Decode octal (if necessary).
                    $text = self::decodeOctal($command[PDFObject::COMMAND]);
            }

            // replace escaped chars
            $text = str_replace(
                ['\\\\', '\(', '\)', '\n', '\r', '\t', '\f', '\ ', '\b'],
                [\chr(92), \chr(40), \chr(41), \chr(10), \chr(13), \chr(9), \chr(12), \chr(32), \chr(8)],
                $text
            );

            // add content to result string
            if (isset($words[$word_position])) {
                $words[$word_position] .= $text;
            } else {
                $words[$word_position] = $text;
            }
        }

        foreach ($words as &$word) {
            $word = $this->decodeContent($word);
            $word = str_replace("\t", ' ', $word);
        }

        // Remove internal "words" that are just spaces, but leave them
        // if they are at either end of the array of words. This fixes,
        // for   example,   lines   that   are   justified   to   fill
        // a whole row.
        for ($x = \count($words) - 2; $x >= 1; --$x) {
            if ('' === trim($words[$x], ' ')) {
                unset($words[$x]);
            }
        }
        $words = array_values($words);

        // Cut down on the number of unnecessary internal spaces by
        // imploding the string on the null byte, and checking if the
        // text includes extra spaces on either side. If so, merge
        // where appropriate.
        $words = implode("\x00\x00", $words);
        $words = str_replace(
            [" \x00\x00 ", "\x00\x00 ", " \x00\x00", "\x00\x00"],
            ['  ', ' ', ' ', ' '],
            $words
        );

        return $words;
    }

    /**
     * Decode given $text to "utf-8" encoded string.
     *
     * @param bool $unicode This parameter is deprecated and might be removed in a future release
     */
    public function decodeContent(string $text, ?bool &$unicode = null): string
    {
        // If this string begins with a UTF-16BE BOM, then decode it
        // directly as Unicode
        if ("\xFE\xFF" === substr($text, 0, 2)) {
            return $this->decodeUnicode($text);
        }

        if ($this->has('ToUnicode')) {
            return $this->decodeContentByToUnicodeCMapOrDescendantFonts($text);
        }

        if ($this->has('Encoding')) {
            $result = $this->decodeContentByEncoding($text);

            if (null !== $result) {
                return $result;
            }
        }

        return $this->decodeContentByAutodetectIfNecessary($text);
    }

    /**
     * First try to decode $text by ToUnicode CMap.
     * If no char translation is found in the ToUnicode CMap:
     *  - If DescendantFonts exists, try to decode the char by one of those fonts.
     *      - If decoding by DescendantFonts fails, interpret $text as a string with "Windows-1252" encoding.
     *  - If DescendantFonts does not exist, just return "?" as the decoded char.
     *
     * @todo This algorithm seems invalid because it does not follow the PDF format specification. Must be rewritten.
     */
    private function decodeContentByToUnicodeCMapOrDescendantFonts(string $text): string
    {
        $bytes = $this->tableSizes['from'];

        if ($bytes) {
            $result = '';
            $length = \strlen($text);

            for ($i = 0; $i < $length; $i += $bytes) {
                $char = substr($text, $i, $bytes);

                if (false !== ($decoded = $this->translateChar($char, false))) {
                    $char = $decoded;
                } elseif ($this->has('DescendantFonts')) {
                    if ($this->get('DescendantFonts') instanceof PDFObject) {
                        $fonts = $this->get('DescendantFonts')->getHeader()->getElements();
                    } else {
                        $fonts = $this->get('DescendantFonts')->getContent();
                    }
                    $decoded = false;

                    foreach ($fonts as $font) {
                        if ($font instanceof self) {
                            if (false !== ($decoded = $font->translateChar($char, false))) {
                                $decoded = mb_convert_encoding($decoded, 'UTF-8', 'Windows-1252');
                                break;
                            }
                        }
                    }

                    if (false !== $decoded) {
                        $char = $decoded;
                    } else {
                        $char = mb_convert_encoding($char, 'UTF-8', 'Windows-1252');
                    }
                } else {
                    $char = self::MISSING;
                }

                $result .= $char;
            }

            $text = $result;
        }

        return $text;
    }

    /**
     * Decode content by any type of Encoding (dictionary's item) instance.
     */
    private function decodeContentByEncoding(string $text): ?string
    {
        $encoding = $this->get('Encoding');

        // When Encoding referenced by object id (/Encoding 520 0 R) but object itself does not contain `/Type /Encoding` in its dictionary.
        if ($encoding instanceof PDFObject) {
            $encoding = $this->getInitializedEncodingByPdfObject($encoding);
        }

        // When Encoding referenced by object id (/Encoding 520 0 R) but object itself contains `/Type /Encoding` in its dictionary.
        if ($encoding instanceof Encoding) {
            return $this->decodeContentByEncodingEncoding($text, $encoding);
        }

        // When Encoding is just string (/Encoding /WinAnsiEncoding)
        if ($encoding instanceof Element) { // todo: ElementString class must be used?
            return $this->decodeContentByEncodingElement($text, $encoding);
        }

        // don't double-encode strings already in UTF-8
        if (!mb_check_encoding($text, 'UTF-8')) {
            return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
        }

        return $text;
    }

    /**
     * Returns the cached Encoding instance for the given PDFObject, creating it first if it does not exist yet.
     */
    private function getInitializedEncodingByPdfObject(PDFObject $PDFObject): Encoding
    {
        if (!$this->initializedEncodingByPdfObject) {
            $this->initializedEncodingByPdfObject = $this->createInitializedEncodingByPdfObject($PDFObject);
        }

        return $this->initializedEncodingByPdfObject;
    }

    /**
     * Decode content when $encoding (given by $this->get('Encoding')) is instance of Encoding.
     */
    private function decodeContentByEncodingEncoding(string $text, Encoding $encoding): string
    {
        $result = '';
        $length = \strlen($text);

        for ($i = 0; $i < $length; ++$i) {
            $dec_av = hexdec(bin2hex($text[$i]));
            $dec_ap = $encoding->translateChar($dec_av);
            $result .= self::uchr($dec_ap ?? $dec_av);
        }

        return $result;
    }

    /**
     * Decode content when $encoding (given by $this->get('Encoding')) is instance of Element.
     */
    private function decodeContentByEncodingElement(string $text, Element $encoding): ?string
    {
        $pdfEncodingName = $encoding->getContent();

        // mb_convert_encoding does not support MacRoman/macintosh,
        // so we use iconv() here
        $iconvEncodingName = $this->getIconvEncodingNameOrNullByPdfEncodingName($pdfEncodingName);

        return $iconvEncodingName ? iconv($iconvEncodingName, 'UTF-8//TRANSLIT//IGNORE', $text) : null;
    }

    /**
     * Convert PDF encoding name to iconv-known encoding name.
     */
    private function getIconvEncodingNameOrNullByPdfEncodingName(string $pdfEncodingName): ?string
    {
        $pdfToIconvEncodingNameMap = [
            'StandardEncoding' => 'ISO-8859-1',
            'MacRomanEncoding' => 'MACINTOSH',
            'WinAnsiEncoding' => 'CP1252',
        ];

        return \array_key_exists($pdfEncodingName, $pdfToIconvEncodingNameMap)
            ? $pdfToIconvEncodingNameMap[$pdfEncodingName]
            : null;
    }

    /**
     * If string seems like "utf-8" encoded string do nothing and just return given string as is.
     * Otherwise, interpret string as "Windows-1252" encoded string.
     *
     * @return string|false
     */
    private function decodeContentByAutodetectIfNecessary(string $text)
    {
        if (mb_check_encoding($text, 'UTF-8')) {
            return $text;
        }

        return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
        // todo: Why exactly `Windows-1252` used?
    }

    /**
     * Create Encoding instance by PDFObject instance and init it.
     */
    private function createInitializedEncodingByPdfObject(PDFObject $PDFObject): Encoding
    {
        $encoding = $this->createEncodingByPdfObject($PDFObject);
        $encoding->init();

        return $encoding;
    }

    /**
     * Create Encoding instance by PDFObject instance (without init).
     */
    private function createEncodingByPdfObject(PDFObject $PDFObject): Encoding
    {
        $document = $PDFObject->getDocument();
        $header = $PDFObject->getHeader();
        $content = $PDFObject->getContent();
        $config = $PDFObject->getConfig();

        return new Encoding($document, $header, $content, $config);
    }
}
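
To make the bfchar branch of loadTranslateTable() easier to follow in isolation, here is a standalone sketch of the same idea. parseBfChar() is a hypothetical helper, not part of PdfParser. It reuses the two regular expressions from the listing but, as a simplification, swaps in str_split() and mb_chr() (mbstring, PHP 7.2+) for preg_split() and Font::uchr(). The sample CMap fragment is made up for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Standalone sketch of the bfchar branch of loadTranslateTable().
 * Hypothetical helper, written only for this illustration.
 *
 * @return array<int, string> source character code => UTF-8 text
 */
function parseBfChar(string $cmap): array
{
    $table = [];
    if (!preg_match_all('/beginbfchar(?P<sections>.*?)endbfchar/s', $cmap, $sections)) {
        return $table;
    }

    foreach ($sections['sections'] as $section) {
        // Each entry is "<srcCode> <dstString>", both written as hex.
        preg_match_all('/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)>[ \r\n]+/is', $section, $pairs);

        foreach ($pairs['from'] as $key => $from) {
            $text = '';
            // The destination is a run of 4-hex-digit code units.
            foreach (str_split($pairs['to'][$key], 4) as $unit) {
                $text .= mb_chr((int) hexdec($unit), 'UTF-8');
            }
            $table[(int) hexdec($from)] = $text;
        }
    }

    return $table;
}

// Made-up CMap fragment: code 0x01 maps to "A", code 0x02 maps to the two letters "fi".
$cmap = "2 beginbfchar\n<01> <0041>\n<02> <00660069>\nendbfchar\n";
$table = parseBfChar($cmap);
var_dump($table);
```

Like the listing, this sketch treats every 4-digit group as an independent code point, so destination strings that use UTF-16 surrogate pairs would not decode correctly.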