Passed
Pull Request — master (#693), created by unknown (02:56)

PDFObject::getSectionsText()   C

Complexity
    Conditions: 14
    Paths: 13

Size
    Total Lines: 69
    Code Lines: 34

Duplication
    Lines: 0
    Ratio: 0 %

Code Coverage
    Tests: 35
    CRAP Score: 14.0042

Importance
    Changes: 0

Metric  Value
cc      14
eloc    34
c       0
b       0
f       0
nc      13
nop     1
dl      0
loc     69
ccs     35
cts     36
cp      0.9722
crap    14.0042
rs      6.2666

How to fix: Long Method, Complexity

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include:
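Extract Method is one refactoring that fits the situation described above. The following is a minimal sketch, not code from PdfParser: the function names (nonEmptyLinesBefore, nonEmptyLinesAfter, isEmptyLine) are hypothetical and exist only to show a commented step turning into a small, named function.

```php
<?php

declare(strict_types=1);

// Hypothetical illustration; none of these names exist in PdfParser.

// Before: the loop body carries a comment that explains one step.
function nonEmptyLinesBefore(string $stream): array
{
    $lines = [];
    foreach (explode("\n", $stream) as $line) {
        // Skip empty lines
        if ('' === trim($line)) {
            continue;
        }
        $lines[] = trim($line);
    }

    return $lines;
}

// After: the commented step is extracted, and the comment
// becomes the starting point for the new function's name.
function isEmptyLine(string $line): bool
{
    return '' === trim($line);
}

function nonEmptyLinesAfter(string $stream): array
{
    $lines = [];
    foreach (explode("\n", $stream) as $line) {
        if (isEmptyLine($line)) {
            continue;
        }
        $lines[] = trim($line);
    }

    return $lines;
}
```

Both versions return the same result; the second simply moves the condition behind a name, which is the effect the paragraph above describes.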

1
<?php
2
3
/**
4
 * @file
5
 *          This file is part of the PdfParser library.
6
 *
7
 * @author  Sébastien MALOT <[email protected]>
8
 *
9
 * @date    2017-01-03
10
 *
11
 * @license LGPLv3
12
 *
13
 * @url     <https://github.com/smalot/pdfparser>
14
 *
15
 *  PdfParser is a pdf library written in PHP, extraction oriented.
16
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
17
 *
18
 *  This program is free software: you can redistribute it and/or modify
19
 *  it under the terms of the GNU Lesser General Public License as published by
20
 *  the Free Software Foundation, either version 3 of the License, or
21
 *  (at your option) any later version.
22
 *
23
 *  This program is distributed in the hope that it will be useful,
24
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
25
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
26
 *  GNU Lesser General Public License for more details.
27
 *
28
 *  You should have received a copy of the GNU Lesser General Public License
29
 *  along with this program.
30
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
31
 */
32
33
namespace Smalot\PdfParser;
34
35
use Smalot\PdfParser\XObject\Form;
36
use Smalot\PdfParser\XObject\Image;
37
38
/**
39
 * Class PDFObject
40
 */
41
class PDFObject
42
{
43
    public const TYPE = 't';
44
45
    public const OPERATOR = 'o';
46
47
    public const COMMAND = 'c';
48
49
    /**
50
     * The recursion stack.
51
     *
52
     * @var array
53
     */
54
    public static $recursionStack = [];
55
56
    /**
57
     * @var Document|null
58
     */
59
    protected $document;
60
61
    /**
62
     * @var Header
63
     */
64
    protected $header;
65
66
    /**
67
     * @var string
68
     */
69
    protected $content;
70
71
    /**
72
     * @var Config|null
73
     */
74
    protected $config;
75
76
    /**
77
     * @var bool
78
     */
79
    protected $addPositionWhitespace = false;
80
81 94
    public function __construct(
82
        Document $document,
83
        ?Header $header = null,
84
        ?string $content = null,
85
        ?Config $config = null
86
    ) {
87 94
        $this->document = $document;
88 94
        $this->header = $header ?? new Header();
89 94
        $this->content = $content;
90 94
        $this->config = $config;
91
    }
92
93 72
    public function init()
94
    {
95 72
    }
96
97 4
    public function getDocument(): Document
98
    {
99 4
        return $this->document;
Bug, Best Practice: The expression return $this->document could return the type null, which is incompatible with the type-hinted return Smalot\PdfParser\Document. Consider adding an additional type check to rule this out.
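One way to add the type check this annotation asks for is to fail early when no document is set. The sketch below uses stand-in classes (SketchDocument, PdfObjectSketch) that are assumptions made for illustration, and the choice of \LogicException is also an assumption; the project may prefer a different fix, such as making the property non-nullable.

```php
<?php

declare(strict_types=1);

// Stand-in classes for illustration only; they are not the real
// Smalot\PdfParser\Document or Smalot\PdfParser\PDFObject.
final class SketchDocument
{
}

final class PdfObjectSketch
{
    /** @var SketchDocument|null */
    private $document;

    public function __construct(?SketchDocument $document = null)
    {
        $this->document = $document;
    }

    public function getDocument(): SketchDocument
    {
        // Rule out null before returning so the declared return type holds.
        if (null === $this->document) {
            throw new \LogicException('No document is attached to this object.');
        }

        return $this->document;
    }
}
```

Without the guard, PHP would raise a TypeError at the return statement when the property is null; the explicit check replaces that with a message that names the actual problem.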
100
    }
101
102 72
    public function getHeader(): ?Header
103
    {
104 72
        return $this->header;
105
    }
106
107 4
    public function getConfig(): ?Config
108
    {
109 4
        return $this->config;
110
    }
111
112
    /**
113
     * @return Element|PDFObject|Header
114
     */
115 74
    public function get(string $name)
116
    {
117 74
        return $this->header->get($name);
118
    }
119
120 73
    public function has(string $name): bool
121
    {
122 73
        return $this->header->has($name);
123
    }
124
125 4
    public function getDetails(bool $deep = true): array
126
    {
127 4
        return $this->header->getDetails($deep);
128
    }
129
130 59
    public function getContent(): ?string
131
    {
132 59
        return $this->content;
133
    }
134
135
    /**
136
     * Creates a duplicate of the document stream with
137
     * strings and other items replaced by $char. Formerly
138
     * getSectionsText() used this output to more easily gather offset
139
     * values to extract text from the *actual* document stream.
140
     *
141
     * @deprecated function is no longer used and will be removed in a future release
142
     *
143
     * @internal
144
     */
145 1
    public function cleanContent(string $content, string $char = 'X')
146
    {
147 1
        $char = $char[0];
148 1
        $content = str_replace(['\\\\', '\\)', '\\('], $char.$char, $content);
149
150
        // Remove image block with binary content
151 1
        preg_match_all('/\s(BI\s.*?(\sID\s).*?(\sEI))\s/s', $content, $matches, \PREG_OFFSET_CAPTURE);
152 1
        foreach ($matches[0] as $part) {
153
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
154
        }
155
156
        // Clean content in square brackets [.....]
157 1
        preg_match_all('/\[((\(.*?\)|[0-9\.\-\s]*)*)\]/s', $content, $matches, \PREG_OFFSET_CAPTURE);
Unused Code: The call to preg_match_all() has too many arguments starting with PREG_OFFSET_CAPTURE. (Ignorable by annotation.)

If this is a false positive, you can ignore this issue in your code via the ignore-call annotation:

        /** @scrutinizer ignore-call */
        preg_match_all('/\[((\(.*?\)|[0-9\.\-\s]*)*)\]/s', $content, $matches, \PREG_OFFSET_CAPTURE);

This check compares calls to functions or methods with their respective definitions. If the call has more arguments than are defined, it raises an issue.

If a function is defined several times with a different number of parameters, the check may pick up the wrong definition and report false positives. One codebase where this has been known to happen is WordPress. Please note the @ignore annotation hint above.
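For context on the false-positive case: PHP's preg_match_all() accepts $flags as its fourth parameter, so a four-argument call with PREG_OFFSET_CAPTURE is valid. The sketch below shows what that flag returns; the helper name captureBracketContents and the simplified pattern are assumptions for illustration and are not part of PdfParser.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: returns the contents of every [...] group
 * together with the byte offset at which each one starts.
 *
 * @return array<int, array{0: string, 1: int}>
 */
function captureBracketContents(string $content): array
{
    // The fourth argument is $flags; PREG_OFFSET_CAPTURE turns each
    // match into a [matched text, byte offset] pair.
    preg_match_all('/\[(.*?)\]/s', $content, $matches, \PREG_OFFSET_CAPTURE);

    return $matches[1];
}

foreach (captureBracketContents('[(ab) 12] Tj [(c)] Tj') as [$text, $offset]) {
    printf("%s starts at byte %d\n", $text, $offset);
}
```

The offsets are what cleanContent() passes to substr_replace() in the loop that follows, which is why the flag is needed there.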
158 1
        foreach ($matches[1] as $part) {
159 1
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
160
        }
161
162
        // Clean content in round brackets (.....)
163 1
        preg_match_all('/\((.*?)\)/s', $content, $matches, \PREG_OFFSET_CAPTURE);
164 1
        foreach ($matches[1] as $part) {
165 1
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
166
        }
167
168
        // Clean structure
169 1
        if ($parts = preg_split('/(<|>)/s', $content, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE)) {
Bug: It seems like $content can also be of type array; however, parameter $subject of preg_split() only seems to accept string. Consider adding an additional type check. (Ignorable by annotation.)

If this is a false positive, you can ignore this issue in your code via the ignore-type annotation:

        if ($parts = preg_split('/(<|>)/s', /** @scrutinizer ignore-type */ $content, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE)) {
170 1
            $content = '';
171 1
            $level = 0;
172 1
            foreach ($parts as $part) {
173 1
                if ('<' == $part) {
174 1
                    ++$level;
175
                }
176
177 1
                $content .= (0 == $level ? $part : str_repeat($char, \strlen($part)));
178
179 1
                if ('>' == $part) {
180 1
                    --$level;
181
                }
182
            }
183
        }
184
185
        // Clean BDC and EMC markup
186 1
        preg_match_all(
187 1
            '/(\/[A-Za-z0-9\_]*\s*'.preg_quote($char).'*BDC)/s',
188 1
            $content,
189 1
            $matches,
190 1
            \PREG_OFFSET_CAPTURE
191 1
        );
192 1
        foreach ($matches[1] as $part) {
193 1
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
194
        }
195
196 1
        preg_match_all('/\s(EMC)\s/s', $content, $matches, \PREG_OFFSET_CAPTURE);
197 1
        foreach ($matches[1] as $part) {
198 1
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
199
        }
200
201 1
        return $content;
202
    }
203
204
    /**
205
     * Takes a string of PDF document stream text and formats
206
     * it into a multi-line string with one PDF command on each line,
207
     * separated by \r\n. If the given string is null, or binary data
208
     * is detected instead of a document stream then return an empty
209
     * string.
210
     */
211 53
    private function formatContent(?string $content): string
212
    {
213 53
        if (null === $content) {
214 3
            return '';
215
        }
216
217
        // Outside of (String) and inline image content in PDF document
218
        // streams, all text should conform to UTF-8. Test for binary
219
        // content by deleting everything after the first open-
220
        // parenthesis ( which indicates the beginning of a string, or
221
        // the first ID command which indicates the beginning of binary
222
        // inline image content. Then test what remains for valid
223
        // UTF-8. If it's not UTF-8, return an empty string as this
224
        // $content is most likely binary. Unfortunately, using
225
        // mb_check_encoding(..., 'UTF-8') is not strict enough, so the
226
        // following regexp, adapted from the W3C, is used. See:
227
        // https://www.w3.org/International/questions/qa-forms-utf-8.en
228
        // We use preg_replace() instead of preg_match() to avoid "JIT
229
        // stack limit exhausted" errors on larger files.
230 50
        $utf8Filter = preg_replace('/(
231
            [\x09\x0A\x0D\x20-\x7E] |            # ASCII
232
            [\xC2-\xDF][\x80-\xBF] |             # non-overlong 2-byte
233
            \xE0[\xA0-\xBF][\x80-\xBF] |         # excluding overlongs
234
            [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} |  # straight 3-byte
235
            \xED[\x80-\x9F][\x80-\xBF] |         # excluding surrogates
236
            \xF0[\x90-\xBF][\x80-\xBF]{2} |      # planes 1-3
237
            [\xF1-\xF3][\x80-\xBF]{3} |          # planes 4-15
238
            \xF4[\x80-\x8F][\x80-\xBF]{2}        # plane 16
239 50
        )/xs', '', preg_replace('/(\(|ID\s).*$/s', '', $content));
240
241 50
        if ('' !== $utf8Filter) {
242 1
            return '';
243
        }
244
245
        // Find all inline image content and replace them so they aren't
246
        // affected by the next steps
247 50
        $pdfInlineImages = [];
248 50
        $offsetBI = 0;
249 50
        while (preg_match('/\sBI\s(\/.+?)\sID\s(.+?)\sEI(?=\s|$)/s', $content, $text, \PREG_OFFSET_CAPTURE, $offsetBI)) {
250
            // Attempt to determine if this instance of the 'BI' command
251
            // actually occurred within a (string) using the following
252
            // steps:
253
254
            // Remove any escaped parentheses from the alleged image
255
            // characteristics data
256 1
            $para = str_replace(['\\(', '\\)'], '', $text[1][0]);
257
258
            // Remove all correctly ordered and balanced parentheses
259
            // from (strings)
260
            do {
261 1
                $paraTest = $para;
262 1
                $para = preg_replace('/\(([^)]*)\)/', '$1', $paraTest);
263 1
            } while ($para != $paraTest);
264
265 1
            $paraOpen = strpos($para, '(');
266 1
            $paraClose = strpos($para, ')');
267
268
            // If the remaining text contains a close parenthesis ')'
269
            // AND it occurs before any open parenthesis, then we are
270
            // almost certain to be inside a (string)
271 1
            if (0 < $paraClose && (false === $paraOpen || $paraClose < $paraOpen)) {
272
                // Bump the search offset forward and match again
273 1
                $offsetBI = (int) $text[1][1];
274 1
                continue;
275
            }
276
277
            // Double check that this is actually inline image data by
278
            // parsing the alleged image characteristics as a dictionary
279 1
            $dict = $this->parseDictionary('<<'.$text[1][0].'>>');
280
281
            // Check if an image Width and Height are set in the dict
282 1
            if ((isset($dict['W']) || isset($dict['Width']))
283 1
                && (isset($dict['H']) || isset($dict['Height']))) {
284 1
                $id = uniqid('IMAGE_', true);
285 1
                $pdfInlineImages[$id] = [
286 1
                    preg_replace(['/\r\n/', '/\r/', '/\n/'], ' ', $text[1][0]),
287 1
                    preg_replace(['/\r\n/', '/\r/', '/\n/'], '', $text[2][0]),
288 1
                ];
289 1
                $content = preg_replace(
290 1
                    '/'.preg_quote($text[0][0], '/').'/',
291 1
                    '^^^'.$id.'^^^',
292 1
                    $content,
293 1
                    1
294 1
                );
295
            } else {
296
                // If there was no valid dictionary, or a height and width
297
                // weren't specified, then we don't know what this is, so
298
                // just leave it alone; bump the search offset forward and
299
                // match again
300
                $offsetBI = (int) $text[1][1];
301
            }
302
        }
303
304
        // Find all strings () and replace them so they aren't affected
305
        // by the next steps
306 50
        $pdfstrings = [];
307 50
        $attempt = '(';
308 50
        while (preg_match('/'.preg_quote($attempt, '/').'.*?(?<![^\\\\]\\\\)\)/s', $content, $text)) {
309
            // PDF strings can contain unescaped parentheses as long as
310
            // they're balanced, so check for balanced parentheses
311 41
            $left = preg_match_all('/(?<![^\\\\]\\\\)\(/', $text[0]);
312 41
            $right = preg_match_all('/(?<![^\\\\]\\\\)\)/', $text[0]);
313
314 41
            if ($left == $right) {
315
                // Replace the string with a unique placeholder
316 41
                $id = uniqid('STRING_', true);
317 41
                $pdfstrings[$id] = $text[0];
318 41
                $content = preg_replace(
319 41
                    '/'.preg_quote($text[0], '/').'/',
320 41
                    '@@@'.$id.'@@@',
321 41
                    $content,
322 41
                    1
323 41
                );
324
325
                // Reset to search for the next string
326 41
                $attempt = '(';
327
            } else {
328
                // We had unbalanced parentheses, so use the current
329
                // match as a base to find a longer string
330 1
                $attempt = $text[0];
331
            }
332
        }
333
334
        // Remove all carriage returns and line-feeds from the document stream
335 50
        $content = str_replace(["\r", "\n"], ' ', trim($content));
336
337
        // Find all dictionary << >> commands and replace them so they
338
        // aren't affected by the next steps
339 50
        $dictstore = [];
340 50
        while (preg_match('/(<<.*?>> *)(BDC|BMC|DP|MP)/s', $content, $dicttext)) {
341 18
            $dictid = uniqid('DICT_', true);
342 18
            $dictstore[$dictid] = $dicttext[1];
343 18
            $content = preg_replace(
344 18
                '/'.preg_quote($dicttext[0], '/').'/',
345 18
                ' ###'.$dictid.'###'.$dicttext[2],
346 18
                $content,
347 18
                1
348 18
            );
349
        }
350
351
        // Normalize white-space in the document stream
352 50
        $content = preg_replace('/\s{2,}/', ' ', $content);
353
354
        // Find all valid PDF operators and add \r\n after each; this
355
        // ensures there is just one command on every line
356
        // Source: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf - Appendix A
357
        // Source: https://archive.org/download/pdf320002008/PDF32000_2008.pdf - Annex A
358
        // Note: PDF Reference 1.7 lists 'I' and 'rI' as valid commands, while
359
        //       PDF 32000:2008 lists them as 'i' and 'ri' respectively. Both versions
360
        //       appear here in the list for completeness.
361 50
        $operators = [
362 50
            'b*', 'b', 'BDC', 'BMC', 'B*', 'BI', 'BT', 'BX', 'B', 'cm', 'cs', 'c', 'CS',
363 50
            'd0', 'd1', 'd', 'Do', 'DP', 'EMC', 'EI', 'ET', 'EX', 'f*', 'f', 'F', 'gs',
364 50
            'g', 'G',  'h', 'i', 'ID', 'I', 'j', 'J', 'k', 'K', 'l', 'm', 'MP', 'M', 'n',
365 50
            'q', 'Q', 're', 'rg', 'ri', 'rI', 'RG', 'scn', 'sc', 'sh', 's', 'SCN', 'SC',
366 50
            'S', 'T*', 'Tc', 'Td', 'TD', 'Tf', 'TJ', 'Tj', 'TL', 'Tm', 'Tr', 'Ts', 'Tw',
367 50
            'Tz', 'v', 'w', 'W*', 'W', 'y', '\'', '"',
368 50
        ];
369 50
        foreach ($operators as $operator) {
370 50
            $content = preg_replace(
371 50
                '/(?<!\w|\/)'.preg_quote($operator, '/').'(?![\w10\*])/',
372 50
                $operator."\r\n",
373 50
                $content
374 50
            );
375
        }
376
377
        // Restore the original content of the dictionary << >> commands
378 50
        $dictstore = array_reverse($dictstore, true);
379 50
        foreach ($dictstore as $id => $dict) {
380 18
            $content = str_replace('###'.$id.'###', $dict, $content);
381
        }
382
383
        // Restore the original string content
384 50
        $pdfstrings = array_reverse($pdfstrings, true);
385 50
        foreach ($pdfstrings as $id => $text) {
386
            // Strings may contain escaped newlines, or literal newlines
387
            // and we should clean these up before replacing the string
388
            // back into the content stream; this ensures no strings are
389
            // split between two lines (every command must be on one line)
390 41
            $text = str_replace(
391 41
                ["\\\r\n", "\\\r", "\\\n", "\r", "\n"],
392 41
                ['', '', '', '\r', '\n'],
393 41
                $text
394 41
            );
395
396 41
            $content = str_replace('@@@'.$id.'@@@', $text, $content);
397
        }
398
399
        // Restore the original content of any inline images
400 50
        $pdfInlineImages = array_reverse($pdfInlineImages, true);
401 50
        foreach ($pdfInlineImages as $id => $image) {
402 1
            $content = str_replace(
403 1
                '^^^'.$id.'^^^',
404 1
                "\r\nBI\r\n".$image[0]." ID\r\n".$image[1]." EI\r\n",
405 1
                $content
406 1
            );
407
        }
408
409 50
        $content = trim(preg_replace(['/(\r\n){2,}/', '/\r\n +/'], "\r\n", $content));
410
411 50
        return $content;
412
    }
413
    /**
     * getSectionsText() now takes an entire, unformatted
     * document stream as a string, cleans it, then filters out
     * commands that aren't needed for text positioning/extraction. It
     * returns an array of unprocessed PDF commands, one command per
     * element.
     *
     * @internal
     */
    public function getSectionsText(?string $content): array
    {
        $sections = [];

        // A cleaned stream has one command on every line, so split the
        // cleaned stream content on \r\n into an array
        $textCleaned = preg_split(
            '/(\r\n|\n|\r)/',
            $this->formatContent($content),
            -1,
            \PREG_SPLIT_NO_EMPTY
        );

        $inTextBlock = false;
        foreach ($textCleaned as $line) {
            $line = trim($line);

            // Skip empty lines
            if ('' === $line) {
                continue;
            }

            // If a 'BT' is encountered, set the $inTextBlock flag
            if (preg_match('/BT$/', $line)) {
                $inTextBlock = true;
                $sections[] = $line;

            // If an 'ET' is encountered, unset the $inTextBlock flag
            } elseif ('ET' == $line) {
                $inTextBlock = false;
                $sections[] = $line;
            } elseif ($inTextBlock) {
                // If we are inside a BT ... ET text block, save all lines
                $sections[] = trim($line);
            } else {
                // Otherwise, if we are outside of a text block, only
                // save specific, necessary lines. Care should be taken
                // to ensure a command being checked for *only* matches
                // that command. For instance, a simple search for 'c'
                // may also match the 'sc' command. See the command
                // list in the formatContent() method above.
                // Add more commands to save here as you find them in
                // weird PDFs!
                if ('q' == $line[-1] || 'Q' == $line[-1]) {
                    // Save and restore graphics state commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)B[DM]C$/', $line)) {
                    // Begin marked content sequence
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)[DM]P$/', $line)) {
                    // Marked content point
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)EMC$/', $line)) {
                    // End marked content sequence
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)cm$/', $line)) {
                    // Graphics position change commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)Tf$/', $line)) {
                    // Font change commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)Do$/', $line)) {
                    // Invoke named XObject command
                    $sections[] = $line;
                }
            }
        }

        return $sections;
    }
493
494 46
    private function getDefaultFont(?Page $page = null): Font
495
    {
496 46
        $fonts = [];
497 46
        if (null !== $page) {
498 44
            $fonts = $page->getFonts();
499
        }
500
501 46
        $firstFont = $this->document->getFirstFont();
Bug: The method getFirstFont() does not exist on null. (Ignorable by annotation.)

If this is a false positive, you can ignore this issue in your code via the ignore-call annotation:

        /** @scrutinizer ignore-call */
        $firstFont = $this->document->getFirstFont();

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.
502 46
        if (null !== $firstFont) {
503 43
            $fonts[] = $firstFont;
504
        }
505
506 46
        if (\count($fonts) > 0) {
507 43
            return reset($fonts);
508
        }
509
510 3
        return new Font($this->document, null, null, $this->config);
Bug: It seems like $this->document can also be of type null; however, parameter $document of Smalot\PdfParser\Font::__construct() only seems to accept Smalot\PdfParser\Document. Consider adding an additional type check. (Ignorable by annotation.)

If this is a false positive, you can ignore this issue in your code via the ignore-type annotation:

        return new Font(/** @scrutinizer ignore-type */ $this->document, null, null, $this->config);
511
    }
512
513
    /**
514
     * Decode a '[]TJ' command and attempt to use alternate
515
     * fonts if the current font results in output that contains
516
     * Unicode control characters.
517
     *
518
     * @internal
519
     *
520
     * @param array<int,array<string,string|bool>> $command
521
     */
522 43
    private function getTJUsingFontFallback(Font $font, array $command, ?Page $page = null, float $fontFactor = 4): string
523
    {
524 43
        $orig_text = $font->decodeText($command, $fontFactor);
525 43
        $text = $orig_text;
526
527
        // If we make this a Config option, we can add a check if it's
528
        // enabled here.
529 43
        if (null !== $page) {
530 43
            $font_ids = array_keys($page->getFonts());
531
532
            // If the decoded text contains UTF-8 control characters
533
            // then the font page being used is probably the wrong one.
534
            // Loop through the rest of the fonts to see if we can get
535
            // a good decode. Allow x09 to x0d which are whitespace.
536 43
            while (preg_match('/[\x00-\x08\x0e-\x1f\x7f]/u', $text) || false !== strpos(bin2hex($text), '00')) {
537
                // If we're out of font IDs, then give up and use the
538
                // original string
539 3
                if (0 == \count($font_ids)) {
540 3
                    return $orig_text;
541
                }
542
543
                // Try the next font ID
544 3
                $font = $page->getFont(array_shift($font_ids));
545 3
                $text = $font->decodeText($command, $fontFactor);
546
            }
547
        }
548
549 43
        return $text;
550
    }
551
552
    /**
553
     * Expects a string that is a full PDF dictionary object,
554
     * including the outer enclosing << >> angle brackets
555
     *
556
     * @internal
557
     *
558
     * @throws \Exception
559
     */
560 18
    public function parseDictionary(string $dictionary): array
561
    {
562
        // Normalize whitespace
563 18
        $dictionary = preg_replace(['/\r/', '/\n/', '/\s{2,}/'], ' ', trim($dictionary));
564
565 18
        if ('<<' != substr($dictionary, 0, 2)) {
566
            throw new \Exception('Not a valid dictionary object.');
567
        }
568
569 18
        $parsed = [];
570 18
        $stack = [];
571 18
        $currentName = '';
572 18
        $arrayTypeNumeric = false;
573
574
        // Remove outer layer of dictionary, and split on tokens
575 18
        $split = preg_split(
576 18
            '/(<<|>>|\[|\]|\/[^\s\/\[\]\(\)<>]*)/',
577 18
            trim(preg_replace('/^<<|>>$/', '', $dictionary)),
578 18
            -1,
579 18
            \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE
580 18
        );
581
582 18
        foreach ($split as $token) {
583 18
            $token = trim($token);
584
            switch ($token) {
585 18
                case '':
586 8
                    break;
587
588
                    // Open numeric array
589 18
                case '[':
590 8
                    $parsed[$currentName] = [];
591 8
                    $arrayTypeNumeric = true;
592
593
                    // Move up one level in the stack
594 8
                    $stack[\count($stack)] = &$parsed;
595 8
                    $parsed = &$parsed[$currentName];
596 8
                    $currentName = '';
597 8
                    break;
598
599
                    // Open hashed array
600 18
                case '<<':
601 1
                    $parsed[$currentName] = [];
602 1
                    $arrayTypeNumeric = false;
603
604
                    // Move up one level in the stack
605 1
                    $stack[\count($stack)] = &$parsed;
606 1
                    $parsed = &$parsed[$currentName];
607 1
                    $currentName = '';
608 1
                    break;
609
610
                    // Close numeric array
611 18
                case ']':
612
                    // Revert string type arrays back to a single element
613 8
                    if (\is_array($parsed) && 1 == \count($parsed)
614 8
                        && isset($parsed[0]) && \is_string($parsed[0])
615 8
                        && '' !== $parsed[0] && '/' != $parsed[0][0]) {
616 6
                        $parsed = '['.$parsed[0].']';
617
                    }
618
                    // Close hashed array
619
                    // no break
620 18
                case '>>':
621 8
                    $arrayTypeNumeric = false;
622
623
                    // Move down one level in the stack
624 8
                    $parsed = &$stack[\count($stack) - 1];
625 8
                    unset($stack[\count($stack) - 1]);
626 8
                    break;
627
628
                default:
629
                    // If value begins with a slash, then this is a name
630
                    // Add it to the appropriate array
631 18
                    if ('/' == substr($token, 0, 1)) {
632 18
                        $currentName = substr($token, 1);
633 18
                        if (true == $arrayTypeNumeric) {
Coding Style, Best Practice: It seems like you are loosely comparing two booleans. Consider using the strict comparison === instead.

When comparing two booleans, it is generally considered safer to use the strict comparison operator.
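To see why the strict operator is considered safer, compare how the two operators treat a non-boolean operand. In the listing, $arrayTypeNumeric is only ever assigned true or false, so switching to === should not change behavior there; the sketch below uses hypothetical helper names (looselyTrue, strictlyTrue) to show the difference that appears once a non-boolean value slips in.

```php
<?php

declare(strict_types=1);

// Hypothetical helpers for illustration; not part of PdfParser.

/** Loose comparison: the operand is converted to bool first. */
function looselyTrue($value): bool
{
    return true == $value;
}

/** Strict comparison: true only for the boolean value true itself. */
function strictlyTrue($value): bool
{
    return true === $value;
}

// The string 'false' is non-empty, so it converts to boolean true.
var_dump(looselyTrue('false'));
var_dump(strictlyTrue('false'));
```

With ==, any truthy value (a non-empty string other than '0', a non-zero number, a non-empty array) compares equal to true, which can hide a wrongly typed variable; === reports the mismatch instead.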
634 7
                            $parsed[] = $currentName;
635 18
                            $currentName = '';
636
                        }
637 18
                    } elseif ('' != $currentName) {
638 18
                        if (false == $arrayTypeNumeric) {
Coding Style, Best Practice: It seems like you are loosely comparing two booleans. Consider using the strict comparison === instead.

When comparing two booleans, it is generally considered safer to use the strict comparison operator.
639 18
                            $parsed[$currentName] = $token;
640
                        }
641 18
                        $currentName = '';
642 5
                    } elseif ('' == $currentName) {
643 5
                        $parsed[] = $token;
644
                    }
645
            }
646
        }
647
648 18
        return $parsed;
649
    }
650
651
    /**
652
     * Returns the text content of a PDF as a string. Attempts to add
653
     * whitespace for spacing and line-breaks where appropriate.
654
     *
655
     * getText() leverages getTextArray() to get the content
656
     * of the document, setting the addPositionWhitespace flag to true
657
     * so whitespace is inserted in a logical way for reading by
658
     * humans.
659
     */
660 37
    public function getText(?Page $page = null): string
661
    {
662 37
        $this->addPositionWhitespace = true;
663 37
        $result = $this->getTextArray($page);
664 37
        $this->addPositionWhitespace = false;
665
666 37
        return implode('', $result).' ';
667
    }

    /**
     * Returns the text content of a PDF as an array of strings. No
     * extra whitespace is inserted besides what is actually encoded in
     * the PDF text.
     *
     * @throws \Exception
     */
    public function getTextArray(?Page $page = null): array
    {
        $result = [];
        $text = [];

        $marked_stack = [];
        $last_written_position = false;

        $sections = $this->getSectionsText($this->content);
        $current_font = $this->getDefaultFont($page);
        $current_font_size = 1;
        $current_text_leading = 0;

        $current_position = ['x' => false, 'y' => false];
        $current_position_tm = [
            'a' => 1, 'b' => 0, 'c' => 0,
            'i' => 0, 'j' => 1, 'k' => 0,
            'x' => 0, 'y' => 0, 'z' => 1,
        ];
        $current_position_td = ['x' => 0, 'y' => 0];
        $current_position_cm = [
            'a' => 1, 'b' => 0, 'c' => 0,
            'i' => 0, 'j' => 1, 'k' => 0,
            'x' => 0, 'y' => 0, 'z' => 1,
        ];

        $clipped_font = [];
        $clipped_position_cm = [];

        self::$recursionStack[] = $this->getUniqueId();

        foreach ($sections as $section) {
            $commands = $this->getCommandsText($section);
            foreach ($commands as $command) {
                switch ($command[self::OPERATOR]) {
                    // Begin text object
                    case 'BT':
                        // Reset text positioning matrices
                        $current_position_tm = [
                            'a' => 1, 'b' => 0, 'c' => 0,
                            'i' => 0, 'j' => 1, 'k' => 0,
                            'x' => 0, 'y' => 0, 'z' => 1,
                        ];
                        $current_position_td = ['x' => 0, 'y' => 0];
                        $current_text_leading = 0;
                        break;

                        // Begin marked content sequence with property list
                    case 'BDC':
                        if (preg_match('/(<<.*>>)$/', $command[self::COMMAND], $match)) {
                            $dict = $this->parseDictionary($match[1]);

                            // Check for ActualText block
                            if (isset($dict['ActualText']) && \is_string($dict['ActualText']) && '' !== $dict['ActualText']) {
                                if ('[' == $dict['ActualText'][0]) {
                                    // Simulate a 'TJ' command on the stack
                                    $marked_stack[] = [
                                        'ActualText' => $this->getCommandsText($dict['ActualText'].'TJ')[0],
                                    ];
                                } elseif ('<' == $dict['ActualText'][0] || '(' == $dict['ActualText'][0]) {
                                    // Simulate a 'Tj' command on the stack
                                    $marked_stack[] = [
                                        'ActualText' => $this->getCommandsText($dict['ActualText'].'Tj')[0],
                                    ];
                                }
                            }
                        }
                        break;

                        // Begin marked content sequence
                    case 'BMC':
                        if ('ReversedChars' == $command[self::COMMAND]) {
                            // Upon encountering a ReversedChars command,
                            // add the characters we've built up so far to
                            // the result array
                            $result = array_merge($result, $text);

                            // Start a fresh $text array that will contain
                            // reversed characters
                            $text = [];

                            // Add the reversed text flag to the stack
                            $marked_stack[] = ['ReversedChars' => true];
                        }
                        break;

                        // set graphics position matrix
                    case 'cm':
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $current_position_cm = [
                            'a' => (float) $args[0], 'b' => (float) $args[1], 'c' => 0,
                            'i' => (float) $args[2], 'j' => (float) $args[3], 'k' => 0,
                            'x' => (float) $args[4], 'y' => (float) $args[5], 'z' => 1,
                        ];
                        break;

                    case 'Do':
                        if (null !== $page) {
                            $args = preg_split('/\s/s', $command[self::COMMAND]);
                            $id = trim(array_pop($args), '/ ');
                            $xobject = $page->getXObject($id);

                            // @todo $xobject could be an ElementXRef object, which would then throw an error
                            if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack, true)) {
                                // Not a circular reference.
                                $text[] = $xobject->getText($page);
                            }
                        }
                        break;

                        // Marked content point with (DP) & without (MP) property list
                    case 'DP':
                    case 'MP':
                        break;

                        // End text object
                    case 'ET':
                        break;

                        // Store current selected font and graphics matrix
                    case 'q':
                        $clipped_font[] = [$current_font, $current_font_size];
                        $clipped_position_cm[] = $current_position_cm;
                        break;

                        // Restore previous selected font and graphics matrix
                    case 'Q':
                        list($current_font, $current_font_size) = array_pop($clipped_font);
                        $current_position_cm = array_pop($clipped_position_cm);
                        break;

                        // End marked content sequence
                    case 'EMC':
                        $data = false;
                        if (\count($marked_stack)) {
                            $marked = array_pop($marked_stack);
                            $action = key($marked);
                            $data = $marked[$action];

                            switch ($action) {
                                // If we are in ReversedChars mode...
                                case 'ReversedChars':
                                    // Reverse the characters we've built up so far
                                    foreach ($text as $key => $t) {
                                        // Scrutinizer (Bug): mb_internal_encoding() can also be of type
                                        // true, but the $encoding parameter of mb_str_split() only seems
                                        // to accept null|string; consider adding a type check. A false
                                        // positive can be silenced by placing
                                        // /** @scrutinizer ignore-type */ before the argument.
                                        $text[$key] = implode('', array_reverse(
                                            mb_str_split($t, 1, mb_internal_encoding())
                                        ));
                                    }

                                    // Add these characters to the result array
                                    $result = array_merge($result, $text);

                                    // Start a fresh $text array that will contain
                                    // non-reversed characters
                                    $text = [];
                                    break;

                                case 'ActualText':
                                    // Use the content of the ActualText as a command
                                    $command = $data;
                                    break;
                            }
                        }

                        // If this EMC command has been transformed into a 'Tj'
                        // or 'TJ' command because of being ActualText, then bypass
                        // the break to proceed to the writing section below.
                        if ('Tj' != $command[self::OPERATOR] && 'TJ' != $command[self::OPERATOR]) {
                            break;
                        }

                        // no break
                    case "'":
                    case '"':
                        if ("'" == $command[self::OPERATOR] || '"' == $command[self::OPERATOR]) {
                            // Move to next line and write text
                            $current_position['x'] = 0;
                            $current_position_td['x'] = 0;
                            $current_position_td['y'] += $current_text_leading;
                        }
                        // no break
                    case 'Tj':
                        $command[self::COMMAND] = [$command];
                        // no break
                    case 'TJ':
                        // Check the marked content stack for flags
                        $actual_text = false;
                        $reverse_text = false;
                        foreach ($marked_stack as $marked) {
                            if (isset($marked['ActualText'])) {
                                $actual_text = true;
                            }
                            if (isset($marked['ReversedChars'])) {
                                $reverse_text = true;
                            }
                        }

                        // Account for text position ONLY just before we write text
                        if (false === $actual_text && \is_array($last_written_position)) {
                            // If $last_written_position is an array, that
                            // means we have stored text position coordinates
                            // for placing an ActualText
                            $currentX = $last_written_position[0];
                            $currentY = $last_written_position[1];
                            $last_written_position = false;
                        } else {
                            $currentX = $current_position_cm['x'] + $current_position_tm['x'] + $current_position_td['x'];
                            $currentY = $current_position_cm['y'] + $current_position_tm['y'] + $current_position_td['y'];
                        }
                        $whiteSpace = '';

                        $factorX = -$current_font_size * $current_position_tm['a'] - $current_font_size * $current_position_tm['i'];
                        $factorY = $current_font_size * $current_position_tm['b'] + $current_font_size * $current_position_tm['j'];

                        if (true === $this->addPositionWhitespace && false !== $current_position['x']) {
                            $curY = $currentY - $current_position['y'];
                            if (abs($curY) >= abs($factorY) / 4) {
                                $whiteSpace = "\n";
                            } else {
                                if (true === $reverse_text) {
                                    $curX = $current_position['x'] - $currentX;
                                } else {
                                    $curX = $currentX - $current_position['x'];
                                }

                                // In abs($factorX * 7) below, the 7 is chosen arbitrarily
                                // as the number of apparent "spaces" in a document we
                                // would need before considering them a "tab". In the
                                // future, we might offer this value to users as a config
                                // option.
                                if ($curX >= abs($factorX * 7)) {
                                    $whiteSpace = "\t";
                                } elseif ($curX >= abs($factorX * 2)) {
                                    $whiteSpace = ' ';
                                }
                            }
                        }

                        $newtext = $this->getTJUsingFontFallback(
                            $current_font,
                            $command[self::COMMAND],
                            $page,
                            $factorX
                        );

                        // If there is no ActualText pending then write
                        if (false === $actual_text) {
                            $newtext = str_replace(["\r", "\n"], '', $newtext);
                            if (false !== $reverse_text) {
                                // If we are in ReversedChars mode, add the whitespace last
                                $text[] = preg_replace('/  $/', ' ', $newtext.$whiteSpace);
                            } else {
                                // Otherwise add the whitespace first
                                if (' ' === $whiteSpace && isset($text[\count($text) - 1])) {
                                    $text[\count($text) - 1] = preg_replace('/ $/', '', $text[\count($text) - 1]);
                                }
                                $text[] = preg_replace('/^[ \t]{2}/', ' ', $whiteSpace.$newtext);
                            }

                            // Record the position of this inserted text for comparison
                            // with the next text block.
                            // Provide a 'fudge' factor guess on how wide this text block
                            // is based on the number of characters. This helps limit the
                            // number of tabs inserted, but isn't perfect.
                            $factor = $factorX / 2;
                            $current_position = [
                                'x' => $currentX - mb_strlen($newtext) * $factor,
                                'y' => $currentY,
                            ];
                        } elseif (false === $last_written_position) {
                            // If there is an ActualText in the pipeline
                            // store the position this undisplayed text
                            // *would* have been written to, so the
                            // ActualText is displayed in the right spot
                            $last_written_position = [$currentX, $currentY];
                            $current_position['x'] = $currentX;
                        }
                        break;

                        // move to start of next line
                    case 'T*':
                        $current_position['x'] = 0;
                        $current_position_td['x'] = 0;
                        $current_position_td['y'] += $current_text_leading;
                        break;

                        // set character spacing
                    case 'Tc':
                        break;

                        // move text current point (Td); TD also sets the leading
                    case 'Td':
                    case 'TD':
                        // move text current point
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $y = (float) array_pop($args);
                        $x = (float) array_pop($args);

                        if ('TD' == $command[self::OPERATOR]) {
                            $current_text_leading = -$y * $current_position_tm['b'] - $y * $current_position_tm['j'];
                        }

                        $current_position_td = [
                            'x' => $current_position_td['x'] + $x * $current_position_tm['a'] + $x * $current_position_tm['i'],
                            'y' => $current_position_td['y'] + $y * $current_position_tm['b'] + $y * $current_position_tm['j'],
                        ];
                        break;

                    case 'Tf':
                        $args = preg_split('/\s/s', $command[self::COMMAND]);
                        $size = (float) array_pop($args);
                        $id = trim(array_pop($args), '/');
                        if (null !== $page) {
                            $new_font = $page->getFont($id);
                            // If an invalid font ID is given, do not update the font.
                            // This should theoretically never happen, as the PDF spec states for the Tf operator:
                            // "The specified font value shall match a resource name in the Font entry of the default resource dictionary"
                            // (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, page 435)
                            // But we want to make sure that malformed PDFs do not simply crash.
                            if (null !== $new_font) {
                                $current_font = $new_font;
                                $current_font_size = $size;
                            }
                        }
                        break;

                        // set leading
                    case 'TL':
                        $y = (float) $command[self::COMMAND];
                        $current_text_leading = -$y * $current_position_tm['b'] + -$y * $current_position_tm['j'];
                        break;

                        // set text position matrix
                    case 'Tm':
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $current_position_tm = [
                            'a' => (float) $args[0], 'b' => (float) $args[1], 'c' => 0,
                            'i' => (float) $args[2], 'j' => (float) $args[3], 'k' => 0,
                            'x' => (float) $args[4], 'y' => (float) $args[5], 'z' => 1,
                        ];
                        break;

                        // set text rendering mode
                    case 'Tr':
                        break;

                        // set super/subscripting text rise
                    case 'Ts':
                        break;

                        // set word spacing
                    case 'Tw':
                        break;

                        // set horizontal scaling
                    case 'Tz':
                        break;

                    default:
                }
            }
        }

        $result = array_merge($result, $text);

        return $result;
    }
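
Review note: the whitespace decision inside the `TJ` branch is easier to follow in isolation. The sketch below copies only the thresholds from the listing (a vertical jump of at least `|factorY| / 4` yields a newline, a horizontal gap of at least `7 * |factorX|` a tab, at least `2 * |factorX|` a space). The function name `chooseWhitespace()` and its parameters are hypothetical; it ignores the ReversedChars case and the text-width fudge factor.

```php
<?php

/**
 * Hypothetical helper mirroring the thresholds used in getTextArray().
 *
 * $dx and $dy are the offsets of the new text from the last written position.
 */
function chooseWhitespace(float $dx, float $dy, float $factorX, float $factorY): string
{
    if (abs($dy) >= abs($factorY) / 4) {
        return "\n";
    }
    if ($dx >= abs($factorX * 7)) {
        return "\t";
    }
    if ($dx >= abs($factorX * 2)) {
        return ' ';
    }

    return '';
}

// With a 12pt font and an identity text matrix, factorX = -12 and factorY = 12.
$sameLineSmallGap = chooseWhitespace(5.0, 0.0, -12.0, 12.0);  // gap below 24
$sameLineWordGap = chooseWhitespace(30.0, 0.0, -12.0, 12.0);  // gap between 24 and 84
$sameLineWideGap = chooseWhitespace(100.0, 1.0, -12.0, 12.0); // gap of at least 84
$nextLine = chooseWhitespace(0.0, -14.0, -12.0, 12.0);        // vertical jump of at least 3
```

Extracting a helper along these lines would also shorten getTextArray(), which is the kind of small-method refactoring the report recommends for long methods.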

    /**
     * getCommandsText() expects the content of $text_part to be an
     * already formatted, single-line command from a document stream.
     * The companion function getSectionsText() returns a document
     * stream as an array of single commands for just this purpose.
     * Because of this, the argument $offset is no longer used, and
     * may be removed in a future PdfParser release.
     *
     * A better name for this function would be getCommandText()
     * since it now always works on just one command.
     *
     * Scrutinizer (Unused Code): the parameter $offset is not used in
     * the method body and could be removed. A false positive can be
     * silenced with the "@scrutinizer ignore-unused" annotation on the
     * parameter.
     */
    public function getCommandsText(string $text_part, int &$offset = 0): array
    {
        $commands = $matches = [];

        preg_match('/^(([\/\[\(<])?.*)(?<!\w)([a-z01\'\"*]+)$/i', $text_part, $matches);

        // If no valid command is detected, return an empty array
        if (!isset($matches[1]) || !isset($matches[2]) || !isset($matches[3])) {
            return [];
        }

        $type = $matches[2];
        $operator = $matches[3];
        $command = trim($matches[1]);

        if ('TJ' == $operator) {
            $subcommand = [];
            $command = trim($command, '[]');
            do {
                $oldCommand = $command;

                // Search for parentheses string () format
                if (preg_match('/^ *\((.*?)(?<![^\\\\]\\\\)\) *(-?[\d.]+)?/', $command, $tjmatch)) {
                    $subcommand[] = [
                        self::TYPE => '(',
                        self::OPERATOR => 'TJ',
                        self::COMMAND => $tjmatch[1],
                    ];
                    if (isset($tjmatch[2]) && trim($tjmatch[2])) {
                        $subcommand[] = [
                            self::TYPE => 'n',
                            self::OPERATOR => '',
                            self::COMMAND => $tjmatch[2],
                        ];
                    }
                    $command = substr($command, \strlen($tjmatch[0]));
                }

                // Search for hexadecimal <> format
                if (preg_match('/^ *<([0-9a-f\s]*)> *(-?[\d.]+)?/i', $command, $tjmatch)) {
                    $tjmatch[1] = preg_replace('/\s/', '', $tjmatch[1]);
                    $subcommand[] = [
                        self::TYPE => '<',
                        self::OPERATOR => 'TJ',
                        self::COMMAND => $tjmatch[1],
                    ];
                    if (isset($tjmatch[2]) && trim($tjmatch[2])) {
                        $subcommand[] = [
                            self::TYPE => 'n',
                            self::OPERATOR => '',
                            self::COMMAND => $tjmatch[2],
                        ];
                    }
                    $command = substr($command, \strlen($tjmatch[0]));
                }
            } while ($command != $oldCommand);

            $command = $subcommand;
        } elseif ('Tj' == $operator || "'" == $operator || '"' == $operator) {
            // Depending on the string type, trim the data of the
            // appropriate delimiters
            if ('(' == $type) {
                // Don't use trim() here since a () string may end with
                // a balanced or escaped right parenthesis, and trim()
                // will delete both. Both strings below are valid:
                //   e.g. (String())
                //   e.g. (String\))
                $command = preg_replace('/^\(|\)$/', '', $command);
            } elseif ('<' == $type) {
                $command = trim($command, '<>');
            }
        } elseif ('/' == $type) {
            $command = substr($command, 1);
        }

        $commands[] = [
            self::TYPE => $type,
            self::OPERATOR => $operator,
            self::COMMAND => $command,
        ];

        return $commands;
    }
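
Review note: to see what the first `preg_match()` in getCommandsText() captures, the sketch below applies the same pattern to a few sample content-stream lines. `splitCommand()` and the array keys it returns are hypothetical names used only for illustration; the sample lines are made up.

```php
<?php

/**
 * Hypothetical wrapper around the regular expression that getCommandsText()
 * uses to split one content-stream line into its operand part, its leading
 * delimiter (if any), and its operator.
 */
function splitCommand(string $line): ?array
{
    if (!preg_match('/^(([\/\[\(<])?.*)(?<!\w)([a-z01\'\"*]+)$/i', $line, $m)) {
        return null;
    }

    return ['operands' => trim($m[1]), 'type' => $m[2], 'operator' => $m[3]];
}

$show = splitCommand('(Hello) Tj'); // literal string followed by the Tj operator
$font = splitCommand('/F1 12 Tf');  // name operand, size, then the Tf operator
$begin = splitCommand('BT');        // operator without operands
$none = splitCommand('12 34');      // no trailing operator, so no match
```

An operator-only line such as `BT` still matches: the optional delimiter group comes back as an empty string, which is why the `isset()` checks in the method pass for it.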

    public static function factory(
        Document $document,
        Header $header,
        ?string $content,
        ?Config $config = null
    ): self {
        switch ($header->get('Type')->getContent()) {
            case 'XObject':
                switch ($header->get('Subtype')->getContent()) {
                    case 'Image':
                        // Scrutinizer (Bug): $config is nullable, and the method
                        // getRetainImageContent() does not exist on null. A false positive
                        // can be silenced by placing /** @scrutinizer ignore-call */ before
                        // the method name.
                        return new Image($document, $header, $config->getRetainImageContent() ? $content : null, $config);

                    case 'Form':
                        return new Form($document, $header, $content, $config);
                }

                return new self($document, $header, $content, $config);

            case 'Pages':
                return new Pages($document, $header, $content, $config);

            case 'Page':
                return new Page($document, $header, $content, $config);

            case 'Encoding':
                return new Encoding($document, $header, $content, $config);

            case 'Font':
                $subtype = $header->get('Subtype')->getContent();
                $classname = '\Smalot\PdfParser\Font\Font'.$subtype;

                if (class_exists($classname)) {
                    return new $classname($document, $header, $content, $config);
                }

                return new Font($document, $header, $content, $config);

            default:
                return new self($document, $header, $content, $config);
        }
    }
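
Review note: factory() accepts `?Config $config = null` but calls `$config->getRetainImageContent()` unconditionally in the `Image` branch. One possible guard is sketched below. `DemoConfig` and `imageContent()` are hypothetical stand-ins, and the sketch assumes that a missing config should behave like a default-constructed one that retains image content; that default is an assumption, not something this listing confirms.

```php
<?php

// Hypothetical stand-in for the library's Config class.
final class DemoConfig
{
    private bool $retainImageContent;

    // Assumption: image content is retained unless configured otherwise.
    public function __construct(bool $retainImageContent = true)
    {
        $this->retainImageContent = $retainImageContent;
    }

    public function getRetainImageContent(): bool
    {
        return $this->retainImageContent;
    }
}

/**
 * Decides which content an Image object would receive.
 * A null config falls back to a default-constructed DemoConfig.
 */
function imageContent(?DemoConfig $config, ?string $content): ?string
{
    $config = $config ?? new DemoConfig();

    return $config->getRetainImageContent() ? $content : null;
}

$withNull = imageContent(null, 'raw-bytes');
$retained = imageContent(new DemoConfig(true), 'raw-bytes');
$dropped = imageContent(new DemoConfig(false), 'raw-bytes');
```

If every caller always passes a config, tightening the signature to a non-nullable parameter would be an alternative to the fallback.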

    /**
     * Returns a unique ID identifying the object.
     */
    protected function getUniqueId(): string
    {
        return spl_object_hash($this);
    }
}
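
Review note: getUniqueId() feeds the recursion guard in getTextArray(), where the `Do` branch skips any XObject whose ID is already on `self::$recursionStack`. The sketch below shows the same idea on a hypothetical `DemoNode` class. It also pops the ID in a `finally` block, which the getTextArray() listing above does not visibly do; treat that as a design option rather than a description of the library.

```php
<?php

// Hypothetical node type standing in for a PDF object that can embed others.
final class DemoNode
{
    /** @var string[] Hashes of the nodes currently being visited. */
    private static array $stack = [];

    /** @var DemoNode[] */
    public array $children = [];

    private string $text;

    public function __construct(string $text)
    {
        $this->text = $text;
    }

    public function collect(): string
    {
        $id = spl_object_hash($this);
        if (\in_array($id, self::$stack, true)) {
            // Already on the current path: a circular reference, so stop here.
            return '';
        }

        self::$stack[] = $id;
        try {
            $out = $this->text;
            foreach ($this->children as $child) {
                $out .= $child->collect();
            }

            return $out;
        } finally {
            // Pop so the same node can be visited again on another path.
            array_pop(self::$stack);
        }
    }
}

$a = new DemoNode('A');
$b = new DemoNode('B');
$a->children[] = $b;
$b->children[] = $a; // cycle: A -> B -> A

$fromA = $a->collect();
$fromB = $b->collect();
```

Because the stack is emptied on the way out, the second call starts from a clean state and yields the nodes in its own traversal order.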