Passed
Pull Request — master (#693)
by Konrad
15:20 queued 12:54
created

PDFObject::formatContent()   F

Complexity

Conditions 19
Paths 386

Size

Total Lines 177
Code Lines 83

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 93
CRAP Score 19

Importance

Changes 2
Bugs 1 Features 0
Metric Value
cc 19
eloc 83
c 2
b 1
f 0
nc 386
nop 1
dl 0
loc 177
ccs 93
cts 93
cp 1
crap 19
rs 1.3583

How to fix   Long Method    Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
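As a minimal sketch of that advice, the binary-content check that formatContent() documents with a comment block could be pulled out into its own method. The helper name `looksLikeBinaryStream` is hypothetical and not part of the library; the check itself mirrors the one in the listing and assumes the mbstring extension is available.

```php
<?php

/**
 * Hypothetical helper extracted from the commented "test for binary
 * content" step. The original comment becomes the basis for the name.
 */
function looksLikeBinaryStream(string $content): bool
{
    // Drop everything from the first "(" onward: PDF string bodies may
    // legitimately contain arbitrary bytes, so only the text before the
    // first string is expected to be valid UTF-8.
    $outsideStrings = preg_replace('/\(.*$/s', '', $content);

    // Treat a regex failure (null) the same as invalid UTF-8.
    return null === $outsideStrings || !mb_check_encoding($outsideStrings, 'UTF-8');
}
```

With such a helper, the guard at the top of the long method shrinks to a single self-describing condition, for example `if (looksLikeBinaryStream($content)) { return ''; }`.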

Commonly applied refactorings include Extract Method.

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 *
 * @date    2017-01-03
 *
 * @license LGPLv3
 *
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser;

use Smalot\PdfParser\XObject\Form;
use Smalot\PdfParser\XObject\Image;

/**
 * Class PDFObject
 */
class PDFObject
{
    public const TYPE = 't';

    public const OPERATOR = 'o';

    public const COMMAND = 'c';

    /**
     * The recursion stack.
     *
     * @var array
     */
    public static $recursionStack = [];

    /**
     * @var Document|null
     */
    protected $document;

    /**
     * @var Header
     */
    protected $header;

    /**
     * @var string
     */
    protected $content;

    /**
     * @var Config|null
     */
    protected $config;

    /**
     * @var bool
     */
    protected $addPositionWhitespace = false;

    public function __construct(
        Document $document,
        ?Header $header = null,
        ?string $content = null,
        ?Config $config = null
    ) {
        $this->document = $document;
        $this->header = $header ?? new Header();
        $this->content = $content;
        $this->config = $config;
    }

    public function init()
    {
    }

    public function getDocument(): Document
    {
        return $this->document;
Issue (Bug, Best Practice): The expression "return $this->document" could return null, which is incompatible with the type-hinted return type Smalot\PdfParser\Document. Consider adding an additional type check to rule this out.
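One way to add such a type check is to fail loudly when no document is attached. The sketch below is self-contained and uses stand-in classes (`ExampleDocument`, `ExampleObject`) that are hypothetical and only mimic the nullable property from the listing. Throwing a `\LogicException` is an assumption about the desired behavior, not something the report prescribes.

```php
<?php

// Stand-in for the real Document class (hypothetical, for illustration only).
final class ExampleDocument
{
}

// Stand-in for an object holding a nullable document reference.
final class ExampleObject
{
    /** @var ExampleDocument|null */
    private $document;

    public function __construct(?ExampleDocument $document)
    {
        $this->document = $document;
    }

    public function getDocument(): ExampleDocument
    {
        // Rule out null explicitly so the declared return type always holds.
        if (null === $this->document) {
            throw new \LogicException('No document is attached to this object.');
        }

        return $this->document;
    }
}
```

An alternative, if null is never a legitimate state, would be to narrow the property's documented type instead of checking at every call site.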
    }

    public function getHeader(): ?Header
    {
        return $this->header;
    }

    public function getConfig(): ?Config
    {
        return $this->config;
    }

    /**
     * @return Element|PDFObject|Header
     */
    public function get(string $name)
    {
        return $this->header->get($name);
    }

    public function has(string $name): bool
    {
        return $this->header->has($name);
    }

    public function getDetails(bool $deep = true): array
    {
        return $this->header->getDetails($deep);
    }

    public function getContent(): ?string
    {
        return $this->content;
    }

    /**
     * Creates a duplicate of the document stream with
     * strings and other items replaced by $char. Formerly
     * getSectionsText() used this output to more easily gather offset
     * values to extract text from the *actual* document stream.
     *
     * @deprecated function is no longer used and will be removed in a future release
     *
     * @internal
     */
    public function cleanContent(string $content, string $char = 'X')
    {
        $char = $char[0];
        $content = str_replace(['\\\\', '\\)', '\\('], $char.$char, $content);

        // Remove image block with binary content
        preg_match_all('/\s(BI\s.*?(\sID\s).*?(\sEI))\s/s', $content, $matches, \PREG_OFFSET_CAPTURE);
        foreach ($matches[0] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        // Clean content in square brackets [.....]
        preg_match_all('/\[((\(.*?\)|[0-9\.\-\s]*)*)\]/s', $content, $matches, \PREG_OFFSET_CAPTURE);
Issue (Unused Code): The call to preg_match_all() has too many arguments starting with PREG_OFFSET_CAPTURE. (Ignorable by annotation.)

If this is a false positive, you can ignore this issue in your code via the ignore-call annotation:

        /** @scrutinizer ignore-call */
        preg_match_all('/\[((\(.*?\)|[0-9\.\-\s]*)*)\]/s', $content, $matches, \PREG_OFFSET_CAPTURE);

This check compares calls to functions or methods with their respective definitions. If the call has more arguments than are defined, it raises an issue.

If a function is defined several times with a different number of parameters, the check may pick up the wrong definition and report false positives. One codebase where this has been known to happen is WordPress. Please note the @ignore annotation hint above.
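PHP's documented signature for preg_match_all() includes a fourth `$flags` parameter, so the four-argument call in the listing is valid as written. The sketch below shows what `PREG_OFFSET_CAPTURE` provides and how cleanContent() uses it: each match becomes a `[text, byteOffset]` pair, which lets the loop overwrite the match in place with filler of the same length. The function name `maskParenthesized` is hypothetical.

```php
<?php

/**
 * Replace the body of every "(...)" group with a filler character,
 * keeping the overall string length unchanged (illustrative helper).
 */
function maskParenthesized(string $content, string $char = 'X'): string
{
    // Fourth argument is $flags: with PREG_OFFSET_CAPTURE every entry in
    // $matches[1] is an array of [matched text, byte offset in $content].
    preg_match_all('/\((.*?)\)/s', $content, $matches, \PREG_OFFSET_CAPTURE);

    foreach ($matches[1] as [$text, $offset]) {
        // Same-length replacement, so the offsets captured earlier stay valid.
        $content = substr_replace($content, str_repeat($char, \strlen($text)), $offset, \strlen($text));
    }

    return $content;
}
```

Calling `maskParenthesized('a (bc) d (e)')` yields `a (XX) d (X)`: the parentheses survive and only their contents are masked.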
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        // Clean content in round brackets (.....)
        preg_match_all('/\((.*?)\)/s', $content, $matches, \PREG_OFFSET_CAPTURE);
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        // Clean structure
        if ($parts = preg_split('/(<|>)/s', $content, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE)) {
Issue (Bug): It seems like $content can also be of type array; however, parameter $subject of preg_split() only seems to accept string. Maybe add an additional type check? (Ignorable by annotation.)

If this is a false positive, you can ignore this issue in your code via the ignore-type annotation:

        if ($parts = preg_split('/(<|>)/s', /** @scrutinizer ignore-type */ $content, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE)) {
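The analyzer presumably reaches this conclusion because substr_replace(), which assigns to $content earlier in the method, is declared as returning either a string or an array. A minimal sketch of the suggested type check follows. The wrapper name `splitOnAngleBrackets` is hypothetical, and returning an empty array for non-string input is an assumption about the desired fallback.

```php
<?php

/**
 * Split a stream on "<" and ">" while keeping the delimiters, but only
 * when the subject really is a string (illustrative wrapper).
 *
 * @param mixed $content
 *
 * @return string[]
 */
function splitOnAngleBrackets($content): array
{
    // Explicit type check: preg_split() expects a string subject.
    if (!\is_string($content)) {
        return [];
    }

    $parts = preg_split('/(<|>)/s', $content, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE);

    // preg_split() returns false on failure; normalize that to an empty list.
    return false === $parts ? [] : $parts;
}
```

In the listing every substr_replace() call receives a string, so the warning may well be a false positive, and the ignore-type annotation is a reasonable alternative to a runtime check.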
            $content = '';
            $level = 0;
            foreach ($parts as $part) {
                if ('<' == $part) {
                    ++$level;
                }

                $content .= (0 == $level ? $part : str_repeat($char, \strlen($part)));

                if ('>' == $part) {
                    --$level;
                }
            }
        }

        // Clean BDC and EMC markup
        preg_match_all(
            '/(\/[A-Za-z0-9\_]*\s*'.preg_quote($char).'*BDC)/s',
            $content,
            $matches,
            \PREG_OFFSET_CAPTURE
        );
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        preg_match_all('/\s(EMC)\s/s', $content, $matches, \PREG_OFFSET_CAPTURE);
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        return $content;
    }

    /**
     * Takes a string of PDF document stream text and formats
     * it into a multi-line string with one PDF command on each line,
     * separated by \r\n. If the given string is null, or binary data
     * is detected instead of a document stream then return an empty
     * string.
     */
    private function formatContent(?string $content): string
    {
        if (null === $content) {
            return '';
        }

        // Outside of (String) content in PDF document streams, all
        // text should conform to UTF-8. Test for binary content by
        // deleting everything after the first open-parenthesis ( which
        // indicates the beginning of a string. Then test what remains
        // for valid UTF-8. If it's not UTF-8, return an empty string
        // as this $content is most likely binary.
        if (false === mb_check_encoding(preg_replace('/\(.*$/s', '', $content), 'UTF-8')) {
            return '';
        }

        // Find all inline image content and replace them so they aren't
        // affected by the next steps
        $pdfInlineImages = [];
        $offsetBI = 0;
        while (preg_match('/\sBI\s(\/.+?)\sID\s(.+?)\sEI(?=\s|$)/s', $content, $text, \PREG_OFFSET_CAPTURE, $offsetBI)) {
            // Attempt to determine if this instance of the 'BI' command
            // actually occurred within a (string) using the following
            // steps:

            // Remove any escaped parentheses from the alleged image
            // characteristics data
            $para = str_replace(['\\(', '\\)'], '', $text[1][0]);

            // Remove all correctly ordered and balanced parentheses
            // from (strings)
            do {
                $paraTest = $para;
                $para = preg_replace('/\(([^)]*)\)/', '$1', $paraTest);
            } while ($para != $paraTest);

            $paraOpen = strpos($para, '(');
            $paraClose = strpos($para, ')');

            // If the remaining text contains a close parenthesis ')'
            // AND it occurs before any open parenthesis, then we are
            // almost certain to be inside a (string)
            if (0 < $paraClose && (false === $paraOpen || $paraClose < $paraOpen)) {
                // Bump the search offset forward and match again
                $offsetBI = (int) $text[1][1];
                continue;
            }

            // Double check that this is actually inline image data by
            // parsing the alleged image characteristics as a dictionary
            $dict = $this->parseDictionary('<<'.$text[1][0].'>>');

            // Check if an image Width and Height are set in the dict
            if ((isset($dict['W']) || isset($dict['Width']))
                && (isset($dict['H']) || isset($dict['Height']))) {
                $id = uniqid('IMAGE_', true);
                $pdfInlineImages[$id] = [
                    preg_replace(['/\r\n/', '/\r/', '/\n/'], ' ', $text[1][0]),
                    preg_replace(['/\r\n/', '/\r/', '/\n/'], '', $text[2][0]),
                ];
                $content = preg_replace(
                    '/'.preg_quote($text[0][0], '/').'/',
                    '^^^'.$id.'^^^',
                    $content,
                    1
                );
            }
        }

        // Find all strings () and replace them so they aren't affected
        // by the next steps
        $pdfstrings = [];
        $attempt = '(';
        while (preg_match('/'.preg_quote($attempt, '/').'.*?(?<![^\\\\]\\\\)\)/s', $content, $text)) {
            // PDF strings can contain unescaped parentheses as long as
            // they're balanced, so check for balanced parentheses
            $left = preg_match_all('/(?<![^\\\\]\\\\)\(/', $text[0]);
            $right = preg_match_all('/(?<![^\\\\]\\\\)\)/', $text[0]);

            if ($left == $right) {
                // Replace the string with a unique placeholder
                $id = uniqid('STRING_', true);
                $pdfstrings[$id] = $text[0];
                $content = preg_replace(
                    '/'.preg_quote($text[0], '/').'/',
                    '@@@'.$id.'@@@',
                    $content,
                    1
                );

                // Reset to search for the next string
                $attempt = '(';
            } else {
                // We had unbalanced parentheses, so use the current
                // match as a base to find a longer string
                $attempt = $text[0];
            }
        }

        // Remove all carriage returns and line-feeds from the document stream
        $content = str_replace(["\r", "\n"], ' ', trim($content));

        // Find all dictionary << >> commands and replace them so they
        // aren't affected by the next steps
        $dictstore = [];
        while (preg_match('/(<<.*?>> *)(BDC|BMC|DP|MP)/s', $content, $dicttext)) {
            $dictid = uniqid('DICT_', true);
            $dictstore[$dictid] = $dicttext[1];
            $content = preg_replace(
                '/'.preg_quote($dicttext[0], '/').'/',
                ' ###'.$dictid.'###'.$dicttext[2],
                $content,
                1
            );
        }

        // Normalize white-space in the document stream
        $content = preg_replace('/\s{2,}/', ' ', $content);

        // Find all valid PDF operators and add \r\n after each; this
        // ensures there is just one command on every line
        // Source: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf - Appendix A
        // Source: https://archive.org/download/pdf320002008/PDF32000_2008.pdf - Annex A
        // Note: PDF Reference 1.7 lists 'I' and 'rI' as valid commands, while
        //       PDF 32000:2008 lists them as 'i' and 'ri' respectively. Both versions
        //       appear here in the list for completeness.
        $operators = [
            'b*', 'b', 'BDC', 'BMC', 'B*', 'BI', 'BT', 'BX', 'B', 'cm', 'cs', 'c', 'CS',
            'd0', 'd1', 'd', 'Do', 'DP', 'EMC', 'EI', 'ET', 'EX', 'f*', 'f', 'F', 'gs',
            'g', 'G',  'h', 'i', 'ID', 'I', 'j', 'J', 'k', 'K', 'l', 'm', 'MP', 'M', 'n',
            'q', 'Q', 're', 'rg', 'ri', 'rI', 'RG', 'scn', 'sc', 'sh', 's', 'SCN', 'SC',
            'S', 'T*', 'Tc', 'Td', 'TD', 'Tf', 'TJ', 'Tj', 'TL', 'Tm', 'Tr', 'Ts', 'Tw',
            'Tz', 'v', 'w', 'W*', 'W', 'y', '\'', '"',
        ];
        foreach ($operators as $operator) {
            $content = preg_replace(
                '/(?<!\w|\/)'.preg_quote($operator, '/').'(?![\w10\*])/',
                $operator."\r\n",
                $content
            );
        }

        // Restore the original content of the dictionary << >> commands
        $dictstore = array_reverse($dictstore, true);
        foreach ($dictstore as $id => $dict) {
            $content = str_replace('###'.$id.'###', $dict, $content);
        }

        // Restore the original string content
        $pdfstrings = array_reverse($pdfstrings, true);
        foreach ($pdfstrings as $id => $text) {
            // Strings may contain escaped newlines, or literal newlines
            // and we should clean these up before replacing the string
            // back into the content stream; this ensures no strings are
            // split between two lines (every command must be on one line)
            $text = str_replace(
                ["\\\r\n", "\\\r", "\\\n", "\r", "\n"],
                ['', '', '', '\r', '\n'],
                $text
            );

            $content = str_replace('@@@'.$id.'@@@', $text, $content);
        }

        // Restore the original content of any inline images
        $pdfInlineImages = array_reverse($pdfInlineImages, true);
        foreach ($pdfInlineImages as $id => $image) {
            $content = str_replace(
                '^^^'.$id.'^^^',
                "\r\nBI\r\n".$image[0]." ID\r\n".$image[1]." EI\r\n",
                $content
            );
        }

        $content = trim(preg_replace(['/(\r\n){2,}/', '/\r\n +/'], "\r\n", $content));

        return $content;
    }

    /**
     * getSectionsText() now takes an entire, unformatted
     * document stream as a string, cleans it, then filters out
     * commands that aren't needed for text positioning/extraction. It
     * returns an array of unprocessed PDF commands, one command per
     * element.
     *
     * @internal
     */
    public function getSectionsText(?string $content): array
    {
        $sections = [];

        // A cleaned stream has one command on every line, so split the
        // cleaned stream content on \r\n into an array
        $textCleaned = preg_split(
            '/(\r\n|\n|\r)/',
            $this->formatContent($content),
            -1,
            \PREG_SPLIT_NO_EMPTY
        );

        $inTextBlock = false;
        foreach ($textCleaned as $line) {
            $line = trim($line);

            // Skip empty lines
            if ('' === $line) {
                continue;
            }

            // If a 'BT' is encountered, set the $inTextBlock flag
            if (preg_match('/BT$/', $line)) {
                $inTextBlock = true;
                $sections[] = $line;

            // If an 'ET' is encountered, unset the $inTextBlock flag
            } elseif ('ET' == $line) {
                $inTextBlock = false;
                $sections[] = $line;
            } elseif ($inTextBlock) {
                // If we are inside a BT ... ET text block, save all lines
                $sections[] = trim($line);
            } else {
                // Otherwise, if we are outside of a text block, only
                // save specific, necessary lines. Care should be taken
                // to ensure a command being checked for *only* matches
                // that command. For instance, a simple search for 'c'
                // may also match the 'sc' command. See the command
                // list in the formatContent() method above.
                // Add more commands to save here as you find them in
                // weird PDFs!
                if ('q' == $line[-1] || 'Q' == $line[-1]) {
                    // Save and restore graphics state commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)B[DM]C$/', $line)) {
                    // Begin marked content sequence
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)[DM]P$/', $line)) {
                    // Marked content point
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)EMC$/', $line)) {
                    // End marked content sequence
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)cm$/', $line)) {
                    // Graphics position change commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)Tf$/', $line)) {
                    // Font change commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)Do$/', $line)) {
                    // Invoke named XObject command
                    $sections[] = $line;
                }
            }
        }

        return $sections;
    }

    private function getDefaultFont(?Page $page = null): Font
    {
        $fonts = [];
        if (null !== $page) {
            $fonts = $page->getFonts();
        }

        $firstFont = $this->document->getFirstFont();
Issue (Bug): The method getFirstFont() does not exist on null. (Ignorable by annotation.)

If this is a false positive, you can ignore this issue in your code via the ignore-call annotation:

        /** @scrutinizer ignore-call */
        $firstFont = $this->document->getFirstFont();

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.
        if (null !== $firstFont) {
            $fonts[] = $firstFont;
        }

        if (\count($fonts) > 0) {
            return reset($fonts);
        }

        return new Font($this->document, null, null, $this->config);
Issue (Bug): It seems like $this->document can also be of type null; however, parameter $document of Smalot\PdfParser\Font::__construct() only seems to accept Smalot\PdfParser\Document. Maybe add an additional type check? (Ignorable by annotation.)

If this is a false positive, you can ignore this issue in your code via the ignore-type annotation:

        return new Font(/** @scrutinizer ignore-type */ $this->document, null, null, $this->config);
    }

    /**
     * Decode a '[]TJ' command and attempt to use alternate
     * fonts if the current font results in output that contains
     * Unicode control characters.
     *
     * @internal
     *
     * @param array<int,array<string,string|bool>> $command
     */
    private function getTJUsingFontFallback(Font $font, array $command, ?Page $page = null, float $fontFactor = 4): string
    {
        $orig_text = $font->decodeText($command, $fontFactor);
        $text = $orig_text;

        // If we make this a Config option, we can add a check if it's
        // enabled here.
        if (null !== $page) {
            $font_ids = array_keys($page->getFonts());

            // If the decoded text contains UTF-8 control characters
            // then the font page being used is probably the wrong one.
            // Loop through the rest of the fonts to see if we can get
            // a good decode. Allow x09 to x0d which are whitespace.
            while (preg_match('/[\x00-\x08\x0e-\x1f\x7f]/u', $text) || false !== strpos(bin2hex($text), '00')) {
                // If we're out of font IDs, then give up and use the
                // original string
                if (0 == \count($font_ids)) {
                    return $orig_text;
                }

                // Try the next font ID
                $font = $page->getFont(array_shift($font_ids));
                $text = $font->decodeText($command, $fontFactor);
            }
        }

        return $text;
    }

    /**
     * Expects a string that is a full PDF dictionary object,
     * including the outer enclosing << >> angle brackets
     *
     * @internal
     *
     * @throws \Exception
     */
    public function parseDictionary(string $dictionary): array
    {
        // Normalize whitespace
        $dictionary = preg_replace(['/\r/', '/\n/', '/\s{2,}/'], ' ', trim($dictionary));

        if ('<<' != substr($dictionary, 0, 2)) {
            throw new \Exception('Not a valid dictionary object.');
        }

        $parsed = [];
        $stack = [];
        $currentName = '';
        $arrayTypeNumeric = false;

        // Remove outer layer of dictionary, and split on tokens
        $split = preg_split(
            '/(<<|>>|\[|\]|\/[^\s\/\[\]\(\)<>]*)/',
            trim(preg_replace('/^<<|>>$/', '', $dictionary)),
            -1,
            \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE
        );

        foreach ($split as $token) {
            $token = trim($token);
            switch ($token) {
                case '':
                    break;

                    // Open numeric array
                case '[':
                    $parsed[$currentName] = [];
                    $arrayTypeNumeric = true;

                    // Move up one level in the stack
                    $stack[\count($stack)] = &$parsed;
                    $parsed = &$parsed[$currentName];
                    $currentName = '';
                    break;

                    // Open hashed array
                case '<<':
                    $parsed[$currentName] = [];
                    $arrayTypeNumeric = false;

                    // Move up one level in the stack
                    $stack[\count($stack)] = &$parsed;
                    $parsed = &$parsed[$currentName];
                    $currentName = '';
                    break;

                    // Close numeric array
                case ']':
                    // Revert string type arrays back to a single element
                    if (\is_array($parsed) && 1 == \count($parsed)
                        && isset($parsed[0]) && \is_string($parsed[0])
                        && '' !== $parsed[0] && '/' != $parsed[0][0]) {
                        $parsed = '['.$parsed[0].']';
                    }
                    // Close hashed array
                    // no break
                case '>>':
                    $arrayTypeNumeric = false;

                    // Move down one level in the stack
                    $parsed = &$stack[\count($stack) - 1];
                    unset($stack[\count($stack) - 1]);
                    break;

                default:
                    // If value begins with a slash, then this is a name
                    // Add it to the appropriate array
                    if ('/' == substr($token, 0, 1)) {
                        $currentName = substr($token, 1);
                        if (true == $arrayTypeNumeric) {
Issue (Coding Style, Best Practice): It seems like you are loosely comparing two booleans. Consider using the strict comparison === instead.

When comparing two booleans, it is generally considered safer to use the strict comparison operator.
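The difference between the two operators only shows up when one operand is not actually a boolean. The short sketch below illustrates this with two hypothetical helper functions. In the listing `$arrayTypeNumeric` is only ever assigned `true` or `false`, so switching to `===` there would be a stylistic hardening rather than a behavior change.

```php
<?php

// Loose comparison: PHP first converts $value to a boolean.
function looselyTrue($value): bool
{
    return true == $value;
}

// Strict comparison: $value must already be the boolean true.
function strictlyTrue($value): bool
{
    return true === $value;
}

// A non-empty string is "truthy", so the loose check accepts it
// while the strict check rejects it.
var_dump(looselyTrue('no'));  // bool(true)
var_dump(strictlyTrue('no')); // bool(false)
```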
                            $parsed[] = $currentName;
                            $currentName = '';
                        }
                    } elseif ('' != $currentName) {
                        if (false == $arrayTypeNumeric) {
Issue (Coding Style, Best Practice): It seems like you are loosely comparing two booleans. Consider using the strict comparison === instead.

When comparing two booleans, it is generally considered safer to use the strict comparison operator.
                            $parsed[$currentName] = $token;
                        }
                        $currentName = '';
                    } elseif ('' == $currentName) {
                        $parsed[] = $token;
                    }
            }
        }

        return $parsed;
    }

    /**
     * Returns the text content of a PDF as a string. Attempts to add
     * whitespace for spacing and line-breaks where appropriate.
     *
     * getText() leverages getTextArray() to get the content
     * of the document, setting the addPositionWhitespace flag to true
     * so whitespace is inserted in a logical way for reading by
     * humans.
     */
    public function getText(?Page $page = null): string
    {
        $this->addPositionWhitespace = true;
        $result = $this->getTextArray($page);
        $this->addPositionWhitespace = false;

        return implode('', $result).' ';
    }
644
645
    /**
646
     * Returns the text content of a PDF as an array of strings. No
647
     * extra whitespace is inserted besides what is actually encoded in
648
     * the PDF text.
649
     *
650
     * @throws \Exception
651
     */
652 46
    public function getTextArray(?Page $page = null): array
653
    {
654 46
        $result = [];
655 46
        $text = [];
656
657 46
        $marked_stack = [];
658 46
        $last_written_position = false;
659
660 46
        $sections = $this->getSectionsText($this->content);
661 46
        $current_font = $this->getDefaultFont($page);
662 46
        $current_font_size = 1;
663 46
        $current_text_leading = 0;
664
665 46
        $current_position = ['x' => false, 'y' => false];
666 46
        $current_position_tm = [
667 46
            'a' => 1, 'b' => 0, 'c' => 0,
668 46
            'i' => 0, 'j' => 1, 'k' => 0,
669 46
            'x' => 0, 'y' => 0, 'z' => 1,
670 46
        ];
671 46
        $current_position_td = ['x' => 0, 'y' => 0];
672 46
        $current_position_cm = [
673 46
            'a' => 1, 'b' => 0, 'c' => 0,
674 46
            'i' => 0, 'j' => 1, 'k' => 0,
675 46
            'x' => 0, 'y' => 0, 'z' => 1,
676 46
        ];
677
678 46
        $clipped_font = [];
679 46
        $clipped_position_cm = [];
680
681 46
        self::$recursionStack[] = $this->getUniqueId();

        foreach ($sections as $section) {
            $commands = $this->getCommandsText($section);
            foreach ($commands as $command) {
                switch ($command[self::OPERATOR]) {
                    // Begin text object
                    case 'BT':
                        // Reset text positioning matrices
                        $current_position_tm = [
                            'a' => 1, 'b' => 0, 'c' => 0,
                            'i' => 0, 'j' => 1, 'k' => 0,
                            'x' => 0, 'y' => 0, 'z' => 1,
                        ];
                        $current_position_td = ['x' => 0, 'y' => 0];
                        $current_text_leading = 0;
                        break;

                        // Begin marked content sequence with property list
                    case 'BDC':
                        if (preg_match('/(<<.*>>)$/', $command[self::COMMAND], $match)) {
                            $dict = $this->parseDictionary($match[1]);

                            // Check for ActualText block
                            if (isset($dict['ActualText']) && \is_string($dict['ActualText']) && '' !== $dict['ActualText']) {
                                if ('[' == $dict['ActualText'][0]) {
                                    // Simulate a 'TJ' command on the stack
                                    $marked_stack[] = [
                                        'ActualText' => $this->getCommandsText($dict['ActualText'].'TJ')[0],
                                    ];
                                } elseif ('<' == $dict['ActualText'][0] || '(' == $dict['ActualText'][0]) {
                                    // Simulate a 'Tj' command on the stack
                                    $marked_stack[] = [
                                        'ActualText' => $this->getCommandsText($dict['ActualText'].'Tj')[0],
                                    ];
                                }
                            }
                        }
                        break;

                        // Begin marked content sequence
                    case 'BMC':
                        if ('ReversedChars' == $command[self::COMMAND]) {
                            // Upon encountering a ReversedChars command,
                            // add the characters we've built up so far to
                            // the result array
                            $result = array_merge($result, $text);

                            // Start a fresh $text array that will contain
                            // reversed characters
                            $text = [];

                            // Add the reversed text flag to the stack
                            $marked_stack[] = ['ReversedChars' => true];
                        }
                        break;

                        // set graphics position matrix
                    case 'cm':
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $current_position_cm = [
                            'a' => (float) $args[0], 'b' => (float) $args[1], 'c' => 0,
                            'i' => (float) $args[2], 'j' => (float) $args[3], 'k' => 0,
                            'x' => (float) $args[4], 'y' => (float) $args[5], 'z' => 1,
                        ];
                        break;

                    case 'Do':
                        if (null !== $page) {
                            $args = preg_split('/\s/s', $command[self::COMMAND]);
                            $id = trim(array_pop($args), '/ ');
                            $xobject = $page->getXObject($id);

                            // @todo $xobject could be an ElementXRef object, which would then throw an error
                            if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack, true)) {
                                // Not a circular reference.
                                $text[] = $xobject->getText($page);
                            }
                        }
                        break;

                        // Marked content point with (DP) & without (MP) property list
                    case 'DP':
                    case 'MP':
                        break;

                        // End text object
                    case 'ET':
                        break;

                        // Store current selected font and graphics matrix
                    case 'q':
                        $clipped_font[] = [$current_font, $current_font_size];
                        $clipped_position_cm[] = $current_position_cm;
                        break;

                        // Restore previous selected font and graphics matrix
                    case 'Q':
                        list($current_font, $current_font_size) = array_pop($clipped_font);
                        $current_position_cm = array_pop($clipped_position_cm);
                        break;

                        // End marked content sequence
                    case 'EMC':
                        $data = false;
                        if (\count($marked_stack)) {
                            $marked = array_pop($marked_stack);
                            $action = key($marked);
                            $data = $marked[$action];

                            switch ($action) {
                                // If we are in ReversedChars mode...
                                case 'ReversedChars':
                                    // Reverse the characters we've built up so far
                                    foreach ($text as $key => $t) {
                                        $text[$key] = implode('', array_reverse(
                                            mb_str_split($t, 1, mb_internal_encoding())
                                        ));
                                    }

                                    // Add these characters to the result array
                                    $result = array_merge($result, $text);

                                    // Start a fresh $text array that will contain
                                    // non-reversed characters
                                    $text = [];
                                    break;

                                case 'ActualText':
                                    // Use the content of the ActualText as a command
                                    $command = $data;
                                    break;
                            }
                        }

                        // If this EMC command has been transformed into a 'Tj'
                        // or 'TJ' command because of being ActualText, then bypass
                        // the break to proceed to the writing section below.
                        if ('Tj' != $command[self::OPERATOR] && 'TJ' != $command[self::OPERATOR]) {
                            break;
                        }

                        // no break
                    case "'":
                    case '"':
                        if ("'" == $command[self::OPERATOR] || '"' == $command[self::OPERATOR]) {
                            // Move to next line and write text
                            $current_position['x'] = 0;
                            $current_position_td['x'] = 0;
                            $current_position_td['y'] += $current_text_leading;
                        }
                        // no break
                    case 'Tj':
                        $command[self::COMMAND] = [$command];
                        // no break
                    case 'TJ':
                        // Check the marked content stack for flags
                        $actual_text = false;
                        $reverse_text = false;
                        foreach ($marked_stack as $marked) {
                            if (isset($marked['ActualText'])) {
                                $actual_text = true;
                            }
                            if (isset($marked['ReversedChars'])) {
                                $reverse_text = true;
                            }
                        }

                        // Account for text position ONLY just before we write text
                        if (false === $actual_text && \is_array($last_written_position)) {
                            // If $last_written_position is an array, that
                            // means we have stored text position coordinates
                            // for placing an ActualText
                            $currentX = $last_written_position[0];
                            $currentY = $last_written_position[1];
                            $last_written_position = false;
                        } else {
                            $currentX = $current_position_cm['x'] + $current_position_tm['x'] + $current_position_td['x'];
                            $currentY = $current_position_cm['y'] + $current_position_tm['y'] + $current_position_td['y'];
                        }
                        $whiteSpace = '';

                        $factorX = -$current_font_size * $current_position_tm['a'] - $current_font_size * $current_position_tm['i'];
                        $factorY = $current_font_size * $current_position_tm['b'] + $current_font_size * $current_position_tm['j'];

                        if (true === $this->addPositionWhitespace && false !== $current_position['x']) {
                            $curY = $currentY - $current_position['y'];
                            if (abs($curY) >= abs($factorY) / 4) {
                                $whiteSpace = "\n";
                            } else {
                                if (true === $reverse_text) {
                                    $curX = $current_position['x'] - $currentX;
                                } else {
                                    $curX = $currentX - $current_position['x'];
                                }

                                // In abs($factorX * 7) below, the 7 is chosen arbitrarily
                                // as the number of apparent "spaces" in a document we
                                // would need before considering them a "tab". In the
                                // future, we might offer this value to users as a config
                                // option.
                                if ($curX >= abs($factorX * 7)) {
                                    $whiteSpace = "\t";
                                } elseif ($curX >= abs($factorX * 2)) {
                                    $whiteSpace = ' ';
                                }
                            }
                        }

                        $newtext = $this->getTJUsingFontFallback(
                            $current_font,
                            $command[self::COMMAND],
                            $page,
                            $factorX
                        );

                        // If there is no ActualText pending then write
                        if (false === $actual_text) {
                            $newtext = str_replace(["\r", "\n"], '', $newtext);
                            if (false !== $reverse_text) {
                                // If we are in ReversedChars mode, add the whitespace last
                                $text[] = preg_replace('/  $/', ' ', $newtext.$whiteSpace);
                            } else {
                                // Otherwise add the whitespace first
                                if (' ' === $whiteSpace && isset($text[\count($text) - 1])) {
                                    $text[\count($text) - 1] = preg_replace('/ $/', '', $text[\count($text) - 1]);
                                }
                                $text[] = preg_replace('/^[ \t]{2}/', ' ', $whiteSpace.$newtext);
                            }

                            // Record the position of this inserted text for comparison
                            // with the next text block.
                            // Provide a 'fudge' factor guess on how wide this text block
                            // is based on the number of characters. This helps limit the
                            // number of tabs inserted, but isn't perfect.
                            $factor = $factorX / 2;
                            $current_position = [
                                'x' => $currentX - mb_strlen($newtext) * $factor,
                                'y' => $currentY,
                            ];
                        } elseif (false === $last_written_position) {
                            // If there is an ActualText in the pipeline
                            // store the position this undisplayed text
                            // *would* have been written to, so the
                            // ActualText is displayed in the right spot
                            $last_written_position = [$currentX, $currentY];
                            $current_position['x'] = $currentX;
                        }
                        break;

                        // move to start of next line
                    case 'T*':
                        $current_position['x'] = 0;
                        $current_position_td['x'] = 0;
                        $current_position_td['y'] += $current_text_leading;
                        break;

                        // set character spacing
                    case 'Tc':
                        break;

                        // move text current point and set leading
                    case 'Td':
                    case 'TD':
                        // move text current point
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $y = (float) array_pop($args);
                        $x = (float) array_pop($args);

                        if ('TD' == $command[self::OPERATOR]) {
                            $current_text_leading = -$y * $current_position_tm['b'] - $y * $current_position_tm['j'];
                        }

                        $current_position_td = [
                            'x' => $current_position_td['x'] + $x * $current_position_tm['a'] + $x * $current_position_tm['i'],
                            'y' => $current_position_td['y'] + $y * $current_position_tm['b'] + $y * $current_position_tm['j'],
                        ];
                        break;

                    case 'Tf':
                        $args = preg_split('/\s/s', $command[self::COMMAND]);
                        $size = (float) array_pop($args);
                        $id = trim(array_pop($args), '/');
                        if (null !== $page) {
                            $new_font = $page->getFont($id);
                            // If an invalid font ID is given, do not update the font.
                            // This should theoretically never happen, as the PDF spec states for the Tf operator:
                            // "The specified font value shall match a resource name in the Font entry of the default resource dictionary"
                            // (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, page 435)
                            // But we want to make sure that malformed PDFs do not simply crash.
                            if (null !== $new_font) {
                                $current_font = $new_font;
                                $current_font_size = $size;
                            }
                        }
                        break;

                        // set leading
                    case 'TL':
                        $y = (float) $command[self::COMMAND];
                        $current_text_leading = -$y * $current_position_tm['b'] + -$y * $current_position_tm['j'];
                        break;

                        // set text position matrix
                    case 'Tm':
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $current_position_tm = [
                            'a' => (float) $args[0], 'b' => (float) $args[1], 'c' => 0,
                            'i' => (float) $args[2], 'j' => (float) $args[3], 'k' => 0,
                            'x' => (float) $args[4], 'y' => (float) $args[5], 'z' => 1,
                        ];
                        break;

                        // set text rendering mode
                    case 'Tr':
                        break;

                        // set super/subscripting text rise
                    case 'Ts':
                        break;

                        // set word spacing
                    case 'Tw':
                        break;

                        // set horizontal scaling
                    case 'Tz':
                        break;

                    default:
                }
            }
        }

        $result = array_merge($result, $text);

        return $result;
    }

    /**
     * getCommandsText() expects the content of $text_part to be an
     * already formatted, single-line command from a document stream.
     * The companion function getSectionsText() returns a document
     * stream as an array of single commands for just this purpose.
     * Because of this, the argument $offset is no longer used, and
     * may be removed in a future PdfParser release.
     *
     * A better name for this function would be getCommandText()
     * since it now always works on just one command.
     */
    public function getCommandsText(string $text_part, int &$offset = 0): array
    {
        $commands = $matches = [];

        preg_match('/^(([\/\[\(<])?.*)(?<!\w)([a-z01\'\"*]+)$/i', $text_part, $matches);

        // If no valid command is detected, return an empty array
        if (!isset($matches[1]) || !isset($matches[2]) || !isset($matches[3])) {
            return [];
        }

        $type = $matches[2];
        $operator = $matches[3];
        $command = trim($matches[1]);

        if ('TJ' == $operator) {
            $subcommand = [];
            $command = trim($command, '[]');
            do {
                $oldCommand = $command;

                // Search for parentheses string () format
                if (preg_match('/^ *\((.*?)(?<![^\\\\]\\\\)\) *(-?[\d.]+)?/', $command, $tjmatch)) {
                    $subcommand[] = [
                        self::TYPE => '(',
                        self::OPERATOR => 'TJ',
                        self::COMMAND => $tjmatch[1],
                    ];
                    if (isset($tjmatch[2]) && trim($tjmatch[2])) {
                        $subcommand[] = [
                            self::TYPE => 'n',
                            self::OPERATOR => '',
                            self::COMMAND => $tjmatch[2],
                        ];
                    }
                    $command = substr($command, \strlen($tjmatch[0]));
                }

                // Search for hexadecimal <> format
                if (preg_match('/^ *<([0-9a-f\s]*)> *(-?[\d.]+)?/i', $command, $tjmatch)) {
                    $tjmatch[1] = preg_replace('/\s/', '', $tjmatch[1]);
                    $subcommand[] = [
                        self::TYPE => '<',
                        self::OPERATOR => 'TJ',
                        self::COMMAND => $tjmatch[1],
                    ];
                    if (isset($tjmatch[2]) && trim($tjmatch[2])) {
                        $subcommand[] = [
                            self::TYPE => 'n',
                            self::OPERATOR => '',
                            self::COMMAND => $tjmatch[2],
                        ];
                    }
                    $command = substr($command, \strlen($tjmatch[0]));
                }
            } while ($command != $oldCommand);

            $command = $subcommand;
        } elseif ('Tj' == $operator || "'" == $operator || '"' == $operator) {
            // Depending on the string type, trim the data of the
            // appropriate delimiters
            if ('(' == $type) {
                // Don't use trim() here since a () string may end with
                // a balanced or escaped right parenthesis, and trim()
                // will delete both. Both strings below are valid:
                //   e.g. (String())
                //   e.g. (String\))
                $command = preg_replace('/^\(|\)$/', '', $command);
            } elseif ('<' == $type) {
                $command = trim($command, '<>');
            }
        } elseif ('/' == $type) {
            $command = substr($command, 1);
        }

        $commands[] = [
            self::TYPE => $type,
            self::OPERATOR => $operator,
            self::COMMAND => $command,
        ];

        return $commands;
    }

    public static function factory(
        Document $document,
        Header $header,
        ?string $content,
        ?Config $config = null
    ): self {
        switch ($header->get('Type')->getContent()) {
            case 'XObject':
                switch ($header->get('Subtype')->getContent()) {
                    case 'Image':
                        // $config is nullable, so guard the call. Retaining the
                        // image content when no config is passed is an assumption.
                        $retainContent = null === $config || $config->getRetainImageContent();

                        return new Image($document, $header, $retainContent ? $content : null, $config);

                    case 'Form':
                        return new Form($document, $header, $content, $config);
                }

                return new self($document, $header, $content, $config);

            case 'Pages':
                return new Pages($document, $header, $content, $config);

            case 'Page':
                return new Page($document, $header, $content, $config);

            case 'Encoding':
                return new Encoding($document, $header, $content, $config);

            case 'Font':
                $subtype = $header->get('Subtype')->getContent();
                $classname = '\Smalot\PdfParser\Font\Font'.$subtype;

                if (class_exists($classname)) {
                    return new $classname($document, $header, $content, $config);
                }

                return new Font($document, $header, $content, $config);

            default:
                return new self($document, $header, $content, $config);
        }
    }

    /**
     * Returns unique id identifying the object.
     */
    protected function getUniqueId(): string
    {
        return spl_object_hash($this);
    }
}
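
The operator/operand split performed by `getCommandsText()` can be tried in isolation. The sketch below reuses the same regular expression in a free-standing function. `splitCommand()` and its array keys are hypothetical names chosen for illustration; they are not part of PdfParser, and the real method goes on to post-process the operands for `TJ`, `Tj`, `'` and `"`.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of PdfParser): splits one single-line
 * content-stream command into its operand string and trailing operator,
 * using the same pattern as getCommandsText().
 */
function splitCommand(string $line): array
{
    $matches = [];
    if (1 !== preg_match('/^(([\/\[\(<])?.*)(?<!\w)([a-z01\'\"*]+)$/i', $line, $matches)) {
        // No trailing operator was found on this line
        return [];
    }

    return [
        'type' => $matches[2],           // leading delimiter: '/', '[', '(', '<', or '' if none
        'operator' => $matches[3],       // trailing operator, such as Tf, Tj or BT
        'operands' => trim($matches[1]), // everything before the operator
    ];
}

// Print the operator detected on a few sample lines
foreach (['/F1 12 Tf', '(Hello) Tj', 'BT'] as $line) {
    $parts = splitCommand($line);
    echo $line, ' => ', $parts['operator'], \PHP_EOL;
}
```

The negative lookbehind `(?<!\w)` is what keeps the operator from starting in the middle of a word, so `Tf` is matched as a whole rather than as a bare `f`.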
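
The position-based whitespace rule inside `getTextArray()` is easier to follow when pulled out on its own. The sketch below restates that decision as a pure function. `chooseWhitespace()` and its parameter names are hypothetical; the thresholds (a quarter of the vertical factor, then 7 and 2 times the horizontal factor) are copied from the listing. In the listing this rule only runs when `addPositionWhitespace` is enabled and a previous text position is known.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of PdfParser): picks the whitespace to
 * place between two text chunks from the distance between them.
 *
 * $deltaX and $deltaY are the offsets from the previous chunk; $factorX
 * and $factorY are the font-size-scaled factors computed in getTextArray().
 */
function chooseWhitespace(float $deltaX, float $deltaY, float $factorX, float $factorY): string
{
    // A vertical jump of at least a quarter of the line factor starts a new line
    if (abs($deltaY) >= abs($factorY) / 4) {
        return "\n";
    }

    // A wide horizontal gap (7 apparent spaces, an arbitrary choice) becomes a tab
    if ($deltaX >= abs($factorX * 7)) {
        return "\t";
    }

    // A smaller gap of at least 2 apparent spaces becomes a single space
    if ($deltaX >= abs($factorX * 2)) {
        return ' ';
    }

    // Otherwise the chunks are treated as adjacent
    return '';
}

// Show the choice for a 25-unit horizontal gap at a factor of 10
var_dump(chooseWhitespace(25.0, 0.0, -10.0, 10.0));
```

Keeping the rule as a pure function of four numbers would also make the constants 7 and 2 straightforward to expose as configuration, which the inline comment in the listing mentions as a possible future option.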