Test Failed
Pull Request: master (#634), by unknown, created 02:08

PDFObject   F

Complexity

Total Complexity 155

Size/Duplication

Total Lines 962
Duplicated Lines 0 %

Test Coverage

Coverage 90.73%

Importance

Changes 7
Bugs 2 Features 0
Metric  Value
eloc    451
c       7
b       2
f       0
dl      0
loc     962
ccs     372
cts     410
cp      0.9073
rs      2
wmc     155

19 Methods

Rating  Name                      Duplication  Size  Complexity
A       getContent()              0            3     1
A       getConfig()               0            3     1
A       getHeader()               0            3     1
A       getDocument()             0            3     1
A       init()                    0            2     1
A       has()                     0            3     1
A       __construct()             0            10    1
A       get()                     0            3     1
A       getDetails()              0            3     1
A       getTJUsingFontFallback()  0            32    5
A       getUniqueId()             0            3     1
B       cleanContent()            0            115   9
C       getCommandsText()         0            77    15
A       getDefaultFont()          0            17    4
F       getTextArray()            0            348   68
A       getText()                 0            7     1
C       getSectionsText()         0            69    14
D       parseDictionary()         0            89    19
B       factory()                 0            39    10

How to fix   Complexity   

Complex Class

Complex classes like PDFObject often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common way to find such a component is to look for fields and methods that share a prefix or suffix. In the method list above, for example, getText(), getSectionsText() and getCommandsText() all end in Text.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

While breaking up the class, it is a good idea to analyze how other classes use PDFObject, and based on these observations, apply Extract Interface, too.
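As a rough illustration of Extract Class, the sketch below moves one narrow responsibility into its own class and leaves a thin delegating method behind. The names StreamCleaner and SlimObject are hypothetical and are not part of PdfParser, and the clean-up step is deliberately much simpler than the real cleanContent() in the listing that follows.

```php
<?php

declare(strict_types=1);

// Hypothetical extracted component: it owns only the white-space clean-up.
final class StreamCleaner
{
    public function clean(?string $content): string
    {
        if (null === $content) {
            return '';
        }

        // Collapse line breaks and runs of white-space into single spaces.
        return trim(preg_replace('/\s+/', ' ', $content));
    }
}

// Hypothetical slimmed-down owner: the public method stays, the work moves out.
final class SlimObject
{
    private StreamCleaner $cleaner;

    public function __construct(?StreamCleaner $cleaner = null)
    {
        // Allow a cleaner to be injected so it can be swapped in tests.
        $this->cleaner = $cleaner ?? new StreamCleaner();
    }

    public function cleanContent(?string $content): string
    {
        return $this->cleaner->clean($content);
    }
}

$object = new SlimObject();
echo $object->cleanContent("BT\r\n  /F1 12 Tf\r\n ET"), "\n"; // prints "BT /F1 12 Tf ET"
```

Because callers still go through cleanContent(), a step like this can be taken without touching the class's public surface; whether the real component should be a collaborator, a subclass, or sit behind an interface depends on how other classes use PDFObject.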

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 *
 * @date    2017-01-03
 *
 * @license LGPLv3
 *
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser;

use Smalot\PdfParser\XObject\Form;
use Smalot\PdfParser\XObject\Image;

/**
 * Class PDFObject
 */
class PDFObject
{
    public const TYPE = 't';

    public const OPERATOR = 'o';

    public const COMMAND = 'c';

    /**
     * The recursion stack.
     *
     * @var array
     */
    public static $recursionStack = [];

    /**
     * @var Document
     */
    protected $document;

    /**
     * @var Header
     */
    protected $header;

    /**
     * @var string
     */
    protected $content;

    /**
     * @var Config
     */
    protected $config;

    /**
     * @var bool
     */
    protected $addPositionWhitespace = false;

    public function __construct(
        Document $document,
        Header $header = null,
        string $content = null,
        Config $config = null
    ) {
        $this->document = $document;
        $this->header = $header ?? new Header();
        $this->content = $content;
        $this->config = $config;
    }

    public function init()
    {
    }

    public function getDocument(): Document
    {
        return $this->document;
    }

    public function getHeader(): ?Header
    {
        return $this->header;
    }

    public function getConfig(): ?Config
    {
        return $this->config;
    }

    /**
     * @return Element|PDFObject|Header
     */
    public function get(string $name)
    {
        return $this->header->get($name);
    }

    public function has(string $name): bool
    {
        return $this->header->has($name);
    }

    public function getDetails(bool $deep = true): array
    {
        return $this->header->getDetails($deep);
    }

    public function getContent(): ?string
    {
        return $this->content;
    }

    /**
     * Takes a string of PDF document stream text and formats it into
     * a multi-line string with one PDF command on each line, separated
     * by \r\n. If the given string is null, or binary data is detected
     * instead of a document stream then return an empty string.
     *
     * @internal For internal use only, not part of the public API
     */
    public function cleanContent(?string $content): string
    {
        if (null === $content) {
            return '';
        }

        // Find all strings () and replace them so they aren't affected
        // by the next steps
        $pdfstrings = [];
        $attempt = '(';
        while (preg_match('/'.preg_quote($attempt, '/').'.*?(?<![^\\\\]\\\\)\)/s', $content, $text)) {
            // PDF strings can contain unescaped parentheses as long as
            // they're balanced, so check for balanced parentheses
            $left = preg_match_all('/(?<![^\\\\]\\\\)\(/', $text[0]);
            $right = preg_match_all('/(?<![^\\\\]\\\\)\)/', $text[0]);

            if ($left == $right) {
                // Replace the string with a unique placeholder
                $id = uniqid('STRING_', true);
                $pdfstrings[$id] = $text[0];
                $content = preg_replace(
                    '/'.preg_quote($text[0], '/').'/',
                    '@@@'.$id.'@@@',
                    $content,
                    1
                );

                // Reset to search for the next string
                $attempt = '(';
            } else {
                // We had unbalanced parentheses, so use the current
                // match as a base to find a longer string
                $attempt = $text[0];
            }
        }

        // Remove all carriage returns and line-feeds from the document stream
        $content = str_replace(["\r", "\n"], ' ', trim($content));

        // Find all dictionary << >> commands and replace them so they
        // aren't affected by the next steps
        $dictstore = [];
        while (preg_match('/(<<.*?>> *)(BDC|BMC|DP|MP)/', $content, $dicttext)) {
            $dictid = uniqid('DICT_', true);
            $dictstore[$dictid] = $dicttext[1];
            $content = preg_replace(
                '/'.preg_quote($dicttext[0], '/').'/',
                ' ###'.$dictid.'###'.$dicttext[2],
                $content,
                1
            );
        }

        // Now that all strings and dictionaries are hidden, the only
        // PDF commands left should all be plain text.
        // Detect text encoding of the current string to prevent reading
        // content streams that are images, etc. This prevents PHP
        // error messages when JPEG content is sent to this function
        // by the sample file '12249.pdf' from:
        // https://github.com/smalot/pdfparser/issues/458
        if (false === mb_detect_encoding($content, null, true)) {
            return '';
        }

        // Normalize white-space in the document stream
        $content = preg_replace('/\s{2,}/', ' ', $content);

        // Find all valid PDF operators and add \r\n after each; this
        // ensures there is just one command on every line
        // Source: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf - Appendix A
        // Source: https://archive.org/download/pdf320002008/PDF32000_2008.pdf - Annex A
        // Note: PDF Reference 1.7 lists 'I' and 'rI' as valid commands, while
        //       PDF 32000:2008 lists them as 'i' and 'ri' respectively. Both versions
        //       appear here in the list for completeness.
        $operators = [
          'b*', 'b', 'BDC', 'BMC', 'B*', 'BI', 'BT', 'BX', 'B', 'cm', 'cs', 'c', 'CS',
          'd0', 'd1', 'd', 'Do', 'DP', 'EMC', 'EI', 'ET', 'EX', 'f*', 'f', 'F', 'gs',
          'g', 'G',  'h', 'i', 'ID', 'I', 'j', 'J', 'k', 'K', 'l', 'm', 'MP', 'M', 'n',
          'q', 'Q', 're', 'rg', 'ri', 'rI', 'RG', 'scn', 'sc', 'sh', 's', 'SCN', 'SC',
          'S', 'T*', 'Tc', 'Td', 'TD', 'Tf', 'TJ', 'Tj', 'TL', 'Tm', 'Tr', 'Ts', 'Tw',
          'Tz', 'v', 'w', 'W*', 'W', 'y', '\'', '"',
        ];
        foreach ($operators as $operator) {
            $content = preg_replace(
                '/(?<!\w|\/)'.preg_quote($operator, '/').'(?![\w10\*])/',
                $operator."\r\n",
                $content
            );
        }

        // Restore the original content of the dictionary << >> commands
        $dictstore = array_reverse($dictstore, true);
        foreach ($dictstore as $id => $dict) {
            $content = str_replace('###'.$id.'###', $dict, $content);
        }

        // Restore the original string content
        $pdfstrings = array_reverse($pdfstrings, true);
        foreach ($pdfstrings as $id => $text) {
            // Strings may contain escaped newlines, or literal newlines
            // and we should clean these up before replacing the string
            // back into the content stream; this ensures no strings are
            // split between two lines (every command must be on one line)
            $text = str_replace(
                ["\\\r\n", "\\\r", "\\\n", "\r", "\n"],
                ['', '', '', '\r', '\n'],
                $text
            );

            $content = str_replace('@@@'.$id.'@@@', $text, $content);
        }

        $content = trim(preg_replace(['/(\r\n){2,}/', '/\r\n +/'], "\r\n", $content));

        return $content;
    }

    /**
     * getSectionsText() now takes an entire, unformatted document
     * stream as a string, cleans it, then filters out commands that
     * aren't needed for text positioning/extraction. It returns an
     * array of unprocessed PDF commands, one command per element.
     */
    public function getSectionsText(?string $content): array
    {
        $sections = [];

        // A cleaned stream has one command on every line, so split the
        // cleaned stream content on \r\n into an array
        $textCleaned = preg_split(
            '/(\r\n|\n|\r)/',
            $this->cleanContent($content),
            -1,
            \PREG_SPLIT_NO_EMPTY
        );

        $inTextBlock = false;
        foreach ($textCleaned as $line) {
            $line = trim($line);

            // Skip empty lines
            if ('' === $line) {
                continue;
            }

            // If a 'BT' is encountered, set the $inTextBlock flag
            if (preg_match('/BT$/', $line)) {
                $inTextBlock = true;
                $sections[] = $line;

                // If an 'ET' is encountered, unset the $inTextBlock flag
            } elseif ('ET' == $line) {
                $inTextBlock = false;
                $sections[] = $line;
            } elseif ($inTextBlock) {
                // If we are inside a BT ... ET text block, save all lines
                $sections[] = trim($line);
            } else {
                // Otherwise, if we are outside of a text block, only
                // save specific, necessary lines. Care should be taken
                // to ensure a command being checked for *only* matches
                // that command. For instance, a simple search for 'c'
                // may also match the 'sc' command. See the command
                // list in the cleanContent() method above.
                // Add more commands to save here as you find them in
                // weird PDFs!
                if ('q' == $line[-1] || 'Q' == $line[-1]) {
                    // Save and restore graphics state commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)B[DM]C$/', $line)) {
                    // Begin marked content sequence
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)[DM]P$/', $line)) {
                    // Marked content point
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)EMC$/', $line)) {
                    // End marked content sequence
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)cm$/', $line)) {
                    // Graphics position change commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)Tf$/', $line)) {
                    // Font change commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)Do$/', $line)) {
                    // Invoke named XObject command
                    $sections[] = $line;
                }
            }
        }

        return $sections;
    }

    private function getDefaultFont(Page $page = null): Font
    {
        $fonts = [];
        if (null !== $page) {
            $fonts = $page->getFonts();
        }

        $firstFont = $this->document->getFirstFont();
        if (null !== $firstFont) {
            $fonts[] = $firstFont;
        }

        if (\count($fonts) > 0) {
            return reset($fonts);
        }

        return new Font($this->document, null, null, $this->config);
    }

    /**
     * Decode a '[]TJ' command and attempt to use alternate fonts if
     * the current font results in output that contains Unicode control
     * characters. See Font::decodeText for a full description of
     * $textMatrix
     *
     * @param array<int,array<string,string|bool>> $command
     * @param array<string,float>                  $textMatrix
     */
    private function getTJUsingFontFallback(
        Font $font,
        array $command,
        array $textMatrix = ['a' => 1, 'b' => 0, 'i' => 0, 'j' => 1],
        Page $page = null
    ): string {
        $orig_text = $font->decodeText($command, $textMatrix);
        $text = $orig_text;

        // If we make this a Config option, we can add a check if it's
        // enabled here.
        if (null !== $page) {
            $font_ids = array_keys($page->getFonts());

            // If the decoded text contains UTF-8 control characters
            // then the font page being used is probably the wrong one.
            // Loop through the rest of the fonts to see if we can get
            // a good decode. Allow x09 to x0d which are whitespace.
            while (preg_match('/[\x00-\x08\x0e-\x1f\x7f]/u', $text) || false !== strpos(bin2hex($text), '00')) {
                // If we're out of font IDs, then give up and use the
                // original string
                if (0 == \count($font_ids)) {
                    return $orig_text;
                }

                // Try the next font ID
                $font = $page->getFont(array_shift($font_ids));
                $text = $font->decodeText($command, $textMatrix);
            }
        }

        return $text;
    }

    /**
     * Expects a string that is a full PDF dictionary object, including
     * the outer enclosing << >> angle brackets.
     *
     * @throws \Exception
     */
    public function parseDictionary(string $dictionary): array
    {
        // Normalize whitespace
        $dictionary = preg_replace(['/\r/', '/\n/', '/\s{2,}/'], ' ', trim($dictionary));

        if ('<<' != substr($dictionary, 0, 2)) {
            throw new \Exception('Not a valid dictionary object.');
        }

        $parsed = [];
        $stack = [];
        $currentName = '';
        $arrayTypeNumeric = false;

        // Remove outer layer of dictionary, and split on tokens
        $split = preg_split(
            '/(<<|>>|\[|\]|\/[^\s\/\[\]\(\)<>]*)/',
            trim(preg_replace('/^<<|>>$/', '', $dictionary)),
            -1,
            \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE
        );

        foreach ($split as $token) {
            $token = trim($token);
            switch ($token) {
                case '':
                    break;

                    // Open numeric array
                case '[':
                    $parsed[$currentName] = [];
                    $arrayTypeNumeric = true;

                    // Move up one level in the stack
                    $stack[\count($stack)] = &$parsed;
                    $parsed = &$parsed[$currentName];
                    $currentName = '';
                    break;

                    // Open hashed array
                case '<<':
                    $parsed[$currentName] = [];
                    $arrayTypeNumeric = false;

                    // Move up one level in the stack
                    $stack[\count($stack)] = &$parsed;
                    $parsed = &$parsed[$currentName];
                    $currentName = '';
                    break;

                    // Close numeric array
                case ']':
                    // Revert string type arrays back to a single element
                    if (\is_array($parsed) && 1 == \count($parsed)
                        && isset($parsed[0]) && \is_string($parsed[0])
                        && '' !== $parsed[0] && '/' != $parsed[0][0]) {
                        $parsed = '['.$parsed[0].']';
                    }
                    // Close hashed array
                    // no break
                case '>>':
                    $arrayTypeNumeric = false;

                    // Move down one level in the stack
                    $parsed = &$stack[\count($stack) - 1];
                    unset($stack[\count($stack) - 1]);
                    break;

                default:
                    // If value begins with a slash, then this is a name
                    // Add it to the appropriate array
                    if ('/' == substr($token, 0, 1)) {
                        $currentName = substr($token, 1);
                        if (true == $arrayTypeNumeric) {

Coding Style / Best Practice: It seems like you are loosely comparing two booleans. Consider using the strict comparison === instead.

When comparing two booleans, it is generally considered safer to use the strict comparison operator.

                            $parsed[] = $currentName;
                            $currentName = '';
                        }
                    } elseif ('' != $currentName) {
                        if (false == $arrayTypeNumeric) {

Coding Style / Best Practice: It seems like you are loosely comparing two booleans. Consider using the strict comparison === instead.

When comparing two booleans, it is generally considered safer to use the strict comparison operator.

                            $parsed[$currentName] = $token;
                        }
                        $currentName = '';
                    } elseif ('' == $currentName) {
                        $parsed[] = $token;
                    }
            }
        }

        return $parsed;
    }

    /**
     * getText() leverages getTextArray() to get the content of the
     * document, setting the addPositionWhitespace flag to true so
     * whitespace is inserted in a logical way for reading by humans.
     */
    public function getText(Page $page = null): string
    {
        $this->addPositionWhitespace = true;
        $result = $this->getTextArray($page);
        $this->addPositionWhitespace = false;

        return implode('', $result).' ';
    }

    /**
     * getTextArray() returns the text objects of a document in an
     * array. By default no positioning whitespace is added to the
     * output unless the addPositionWhitespace flag is set to true.
     *
     * @throws \Exception
     */
    public function getTextArray(Page $page = null): array
    {
        $result = [];
        $text = [];

        $marked_stack = [];
        $last_written_position = false;

        $sections = $this->getSectionsText($this->content);
        $current_font = $this->getDefaultFont($page);

        $current_position = ['x' => false, 'y' => false];
        $current_position_tm = [
            'a' => 1, 'b' => 0, 'c' => 0,
            'i' => 0, 'j' => 1, 'k' => 0,
            'x' => false, 'y' => false, 'z' => 1,
        ];
        $current_position_td = ['x' => 0, 'y' => 0];
        $current_position_cm = [
            'a' => 1, 'b' => 0, 'c' => 0,
            'i' => 0, 'j' => 1, 'k' => 0,
            'x' => 0, 'y' => 0, 'z' => 1,
        ];

        $clipped_font = [];
        $clipped_position_cm = [];

        self::$recursionStack[] = $this->getUniqueId();

        foreach ($sections as $section) {
            $commands = $this->getCommandsText($section);
            foreach ($commands as $command) {
                switch ($command[self::OPERATOR]) {
                    // Begin text object
                    case 'BT':
                        // Reset text positioning matrices
                        $current_position_tm = [
                            'a' => 1, 'b' => 0, 'c' => 0,
                            'i' => 0, 'j' => 1, 'k' => 0,
                            'x' => false, 'y' => false, 'z' => 1,
                        ];
                        $current_position_td = ['x' => 0, 'y' => 0];
                        break;

                        // Begin marked content sequence with property list
                    case 'BDC':
                        if (preg_match('/(<<.*>>)$/', $command[self::COMMAND], $match)) {

Bug: It seems like $command[self::COMMAND] can also be of type array and array<mixed,array<string,mixed|string>>; however, parameter $subject of preg_match() does only seem to accept string, maybe add an additional type check? (Ignorable by Annotation)

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation:

                        if (preg_match('/(<<.*>>)$/', /** @scrutinizer ignore-type */ $command[self::COMMAND], $match)) {

                            $dict = $this->parseDictionary($match[1]);

                            // Check for ActualText block
                            if (isset($dict['ActualText']) && \is_string($dict['ActualText']) && '' !== $dict['ActualText']) {
                                if ('[' == $dict['ActualText'][0]) {
                                    // Simulate a 'TJ' command on the stack
                                    $marked_stack[] = [
                                        'ActualText' => $this->getCommandsText($dict['ActualText'].'TJ')[0],
                                    ];
                                } elseif ('<' == $dict['ActualText'][0] || '(' == $dict['ActualText'][0]) {
                                    // Simulate a 'Tj' command on the stack
                                    $marked_stack[] = [
                                        'ActualText' => $this->getCommandsText($dict['ActualText'].'Tj')[0],
                                    ];
                                }
                            }
                        }
                        break;

                        // Begin marked content sequence
                    case 'BMC':
                        if ('ReversedChars' == $command[self::COMMAND]) {
                            // Upon encountering a ReversedChars command,
                            // add the characters we've built up so far to
                            // the result array
                            $result = array_merge($result, $text);

                            // Start a fresh $text array that will contain
                            // reversed characters
                            $text = [];

                            // Add the reversed text flag to the stack
                            $marked_stack[] = ['ReversedChars' => true];
                        }
                        break;

                        // set graphics position matrix
                    case 'cm':
602 29
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
0 ignored issues
show
Bug introduced by
It seems like $command[self::COMMAND] can also be of type array and array<mixed,array<string,mixed|string>>; however, parameter $subject of preg_split() does only seem to accept string, maybe add an additional type check? ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type  annotation

602
                        $args = preg_split('/\s+/s', /** @scrutinizer ignore-type */ $command[self::COMMAND]);
Loading history...
603 11
                        $current_position_cm = [
604 11
                            'a' => (float) $args[0], 'b' => (float) $args[1], 'c' => 0,
605 11
                            'i' => (float) $args[2], 'j' => (float) $args[3], 'k' => 0,
606
                            'x' => (float) $args[4], 'y' => (float) $args[5], 'z' => 1,
607
                        ];
608
                        break;
609 11
610 11
                    case 'Do':
611 11
                        if (null !== $page) {
612
                            $args = preg_split('/\s/s', $command[self::COMMAND]);
613 29
                            $id = trim(array_pop($args), '/ ');
614
                            $xobject = $page->getXObject($id);
615 29
616 29
                            // @todo $xobject could be a ElementXRef object, which would then throw an error
617
                            if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack)) {
618 25
                                // Not a circular reference.
619 25
                                $text[] = $xobject->getText($page);
620 25
                            }
621
                        }
622 25
                        break;
623
624 25
                        // Marked content point with (DP) & without (MP) property list
625 25
                    case 'DP':
626 25
                    case 'MP':
627
                        break;
628
629 25
                        // End text object
630 25
                    case 'ET':
631
                        break;
632 25
633
                        // Store current selected font and graphics matrix
634 29
                    case 'q':
635 29
                        $clipped_font[] = $current_font;
636
                        $clipped_position_cm[] = $current_position_cm;
637 14
                        break;
638 14
639 14
                        // Restore previous selected font and graphics matrix
640 14
                    case 'Q':
641 14
                        $current_font = array_pop($clipped_font);
642 14
                        $current_position_cm = array_pop($clipped_position_cm);
643
                        break;
644
645 14
                        // End marked content sequence
646 9
                    case 'EMC':
647 9
                        $data = false;
648
                        if (\count($marked_stack)) {
                            $marked = array_pop($marked_stack);
                            $action = key($marked);
                            $data = $marked[$action];

                            switch ($action) {
                                // If we are in ReversedChars mode...
                                case 'ReversedChars':
                                    // Reverse the characters we've built up so far
                                    foreach ($text as $key => $t) {
                                        $text[$key] = implode('', array_reverse(
                                            mb_str_split($t, 1, mb_internal_encoding())
                                        ));
                                    }

                                    // Add these characters to the result array
                                    $result = array_merge($result, $text);

                                    // Start a fresh $text array that will contain
                                    // non-reversed characters
                                    $text = [];
                                    break;

                                case 'ActualText':
                                    // Use the content of the ActualText as a command
                                    $command = $data;
                                    break;
                            }
                        }

                        // If this EMC command has been transformed into a 'Tj'
                        // or 'TJ' command because of being ActualText, then bypass
                        // the break to proceed to the writing section below.
                        if ('Tj' != $command[self::OPERATOR] && 'TJ' != $command[self::OPERATOR]) {
                            break;
                        }

                        // no break
                    case "'":
                    case '"':
                        if ("'" == $command[self::OPERATOR] || '"' == $command[self::OPERATOR]) {
                            // Move to next line and write text
                            $current_position['x'] = 0;
                            $current_position_td['x'] = 0;
                            $current_position_td['y'] += 10;
                        }
                        // no break
                    case 'Tj':
                        $command[self::COMMAND] = [$command];
                        // no break
                    case 'TJ':
                        // Check the marked content stack for flags
                        $actual_text = false;
                        $reverse_text = false;
                        foreach ($marked_stack as $marked) {
                            if (isset($marked['ActualText'])) {
                                $actual_text = true;
                            }
                            if (isset($marked['ReversedChars'])) {
                                $reverse_text = true;
                            }
                        }

                        // Account for text position ONLY just before we write text
                        if (false === $actual_text && \is_array($last_written_position)) {
                            // If $last_written_position is an array, that
                            // means we have stored text position coordinates
                            // for placing an ActualText
                            $currentX = $last_written_position[0];
                            $currentY = $last_written_position[1];
                            $last_written_position = false;
                        } else {
                            $currentX = $current_position_cm['x'] + $current_position_tm['x'] + $current_position_td['x'];
                            $currentY = $current_position_cm['y'] + $current_position_tm['y'] + $current_position_td['y'];
                        }
                        $whiteSpace = '';

                        if (true === $this->addPositionWhitespace && false !== $current_position['x']) {
                            if (abs($currentY - $current_position['y']) > 9) {
                                $whiteSpace = "\n";
                            } else {
                                $curX = $currentX - $current_position['x'];
                                $factorX = 10 * $current_position_tm['a'] + 10 * $current_position_tm['i'];
                                if (true === $reverse_text) {
                                    if ($curX < -abs($factorX * 8)) {
                                        $whiteSpace = "\t";
                                    } elseif ($curX < -abs($factorX)) {
                                        $whiteSpace = ' ';
                                    }
                                } else {
                                    if ($curX > ($factorX * 8)) {
                                        $whiteSpace = "\t";
                                    } elseif ($curX > $factorX) {
                                        $whiteSpace = ' ';
                                    }
                                }
                            }
                        }

                        $newtext = $this->getTJUsingFontFallback(
                            $current_font,
                            $command[self::COMMAND],
                            $current_position_tm,
                            $page
                        );

                        // If there is no ActualText pending then write
                        if (false === $actual_text) {
                            if (false !== $reverse_text) {
                                // If we are in ReversedChars mode, add the whitespace last
                                $text[] = str_replace(["\r", "\n"], '', $newtext).$whiteSpace;
                            } else {
                                // Otherwise add the whitespace first
                                $text[] = $whiteSpace.str_replace(["\r", "\n"], '', $newtext);
                            }

                            // Record the position of this inserted text for comparison
                            // with the next text block.
                            // Provide a 'fudge' factor guess on how wide this text block
                            // is based on the number of characters. This helps limit the
                            // number of tabs inserted, but isn't perfect.
                            $factor = 6;
                            if (true === $reverse_text) {
                                $factor = -$factor;
                            }
                            $current_position = [
                                'x' => $currentX + mb_strlen($newtext) * $factor,
                                'y' => $currentY,
                            ];
                        } elseif (false === $last_written_position) {
                            // If there is an ActualText in the pipeline
                            // store the position this undisplayed text
                            // *would* have been written to, so the
                            // ActualText is displayed in the right spot
                            $last_written_position = [$currentX, $currentY];
                        }
                        break;

                        // move to start of next line
                    case 'T*':
                        $current_position['x'] = 0;
                        $current_position_td['x'] = 0;
                        $current_position_td['y'] += 10;
                        break;

                        // set character spacing
                    case 'Tc':
                        break;

                        // move text current point and set leading
                    case 'Td':
                    case 'TD':
                        // move text current point
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $y = (float) array_pop($args);
                        $x = (float) array_pop($args);

                        $current_position_td = [
                            'x' => $current_position_td['x'] + $x * $current_position_tm['a'] + $x * $current_position_tm['i'],
                            'y' => $current_position_td['y'] + $y * $current_position_tm['b'] + $y * $current_position_tm['j'],
                        ];
                        break;

                    case 'Tf':
                        list($id) = preg_split('/\s/s', $command[self::COMMAND]);
                        $id = trim($id, '/');
                        if (null !== $page) {
                            $new_font = $page->getFont($id);
                            // If an invalid font ID is given, do not update the font.
                            // This should theoretically never happen, as the PDF spec states for the Tf operator:
                            // "The specified font value shall match a resource name in the Font entry of the default resource dictionary"
                            // (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, page 435)
                            // But we want to make sure that malformed PDFs do not simply crash.
                            if (null !== $new_font) {
                                $current_font = $new_font;
                            }
                        }
                        break;

                        // set leading
                    case 'TL':
                        break;

                        // set text position matrix
                    case 'Tm':
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $current_position_tm = [
                            'a' => (float) $args[0], 'b' => (float) $args[1], 'c' => 0,
                            'i' => (float) $args[2], 'j' => (float) $args[3], 'k' => 0,
                            'x' => (float) $args[4], 'y' => (float) $args[5], 'z' => 1,
                        ];
                        break;

                        // set text rendering mode
                    case 'Tr':
                        break;

                        // set super/subscripting text rise
                    case 'Ts':
                        break;

                        // set word spacing
                    case 'Tw':
                        break;

                        // set horizontal scaling
                    case 'Tz':
                        break;

                    default:
                }
            }
        }

        $result = array_merge($result, $text);

        return $result;
    }

    /**
     * getCommandsText() expects the content of $text_part to be an
     * already formatted, single-line command from a document stream.
     * The companion function getSectionsText() returns a document
     * stream as an array of single commands for just this purpose.
     *
     * A better name for this function would be getCommandText()
     * since it now always works on just one command.
     */
    public function getCommandsText(string $text_part): array
    {
        $commands = $matches = [];

        preg_match('/^(([\/\[\(<])?.*)(?<!\w)([a-z01\'\"*]+)$/i', $text_part, $matches);

        // If no valid command is detected, return an empty array
        // instead of reading undefined offsets of $matches
        if (!isset($matches[1], $matches[2], $matches[3])) {
            return [];
        }

        $type = $matches[2];
        $operator = $matches[3];
        $command = trim($matches[1]);

        if ('TJ' == $operator) {
            $subcommand = [];
            $command = trim($command, '[]');
            do {
                $oldCommand = $command;

                // Search for parentheses string () format
                if (preg_match('/^ *\((.*?)(?<![^\\\\]\\\\)\) *(-?[\d.]+)?/', $command, $tjmatch)) {
                    $subcommand[] = [
                        self::TYPE => '(',
                        self::OPERATOR => 'TJ',
                        self::COMMAND => $tjmatch[1],
                    ];
                    if (isset($tjmatch[2]) && trim($tjmatch[2])) {
                        $subcommand[] = [
                            self::TYPE => 'n',
                            self::OPERATOR => '',
                            self::COMMAND => $tjmatch[2],
                        ];
                    }
                    $command = substr($command, \strlen($tjmatch[0]));
                }

                // Search for hexadecimal <> format
                if (preg_match('/^ *<([0-9a-f\s]*)> *(-?[\d.]+)?/i', $command, $tjmatch)) {
                    $tjmatch[1] = preg_replace('/\s/', '', $tjmatch[1]);
                    $subcommand[] = [
                        self::TYPE => '<',
                        self::OPERATOR => 'TJ',
                        self::COMMAND => $tjmatch[1],
                    ];
                    if (isset($tjmatch[2]) && trim($tjmatch[2])) {
                        $subcommand[] = [
                            self::TYPE => 'n',
                            self::OPERATOR => '',
                            self::COMMAND => $tjmatch[2],
                        ];
                    }
                    $command = substr($command, \strlen($tjmatch[0]));
                }
            } while ($command != $oldCommand);

            $command = $subcommand;
        } elseif ('Tj' == $operator || "'" == $operator || '"' == $operator) {
            // Depending on the string type, trim the data of the
            // appropriate delimiters
            if ('(' == $type) {
                // Don't use trim() here since a () string may end with
                // a balanced or escaped right parenthesis, and trim()
                // will delete both. Both strings below are valid:
                //   eg. (String())
                //   eg. (String\))
                $command = preg_replace('/^\(|\)$/', '', $command);
            } elseif ('<' == $type) {
                $command = trim($command, '<>');
            }
        } elseif ('/' == $type) {
            $command = substr($command, 1);
        }

        $commands[] = [
            self::TYPE => $type,
            self::OPERATOR => $operator,
            self::COMMAND => $command,
        ];

        return $commands;
    }

    public static function factory(
        Document $document,
        Header $header,
        ?string $content,
        ?Config $config = null
    ): self {
        switch ($header->get('Type')->getContent()) {
            case 'XObject':
                switch ($header->get('Subtype')->getContent()) {
                    case 'Image':
                        // Keep the raw image data only when a config is present and requests it;
                        // $config is nullable, so guard the call instead of dereferencing null
                        $retainContent = null !== $config && $config->getRetainImageContent();

                        return new Image($document, $header, $retainContent ? $content : null, $config);

                    case 'Form':
                        return new Form($document, $header, $content, $config);
                }

                return new self($document, $header, $content, $config);

            case 'Pages':
                return new Pages($document, $header, $content, $config);

            case 'Page':
                return new Page($document, $header, $content, $config);

            case 'Encoding':
                return new Encoding($document, $header, $content, $config);

            case 'Font':
                $subtype = $header->get('Subtype')->getContent();
                $classname = '\Smalot\PdfParser\Font\Font'.$subtype;

                if (class_exists($classname)) {
                    return new $classname($document, $header, $content, $config);
                }

                return new Font($document, $header, $content, $config);

            default:
                return new self($document, $header, $content, $config);
        }
    }

    /**
     * Returns unique id identifying the object.
     */
    protected function getUniqueId(): string
    {
        return spl_object_hash($this);
    }
}
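
The gap heuristic in the Tj/TJ branch of getTextArray() is easier to follow in isolation. The sketch below restates it as a standalone function. The name guessWhitespace() and its parameter list are hypothetical and not part of PdfParser; the thresholds (9 units for a line break, $factorX and 8 * $factorX for space and tab) are copied from the listing above, and $factorX is assumed to be computed by the caller as 10 * a + 10 * i from the current text matrix.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of PdfParser) mirroring the whitespace
 * heuristic of the Tj/TJ branch: decide which separator to emit between
 * the previously written text block and the current one.
 */
function guessWhitespace(
    float $currentX,
    float $currentY,
    float $lastX,
    float $lastY,
    float $factorX,
    bool $reversed = false
): string {
    // A vertical jump of more than 9 units is treated as a new line.
    if (abs($currentY - $lastY) > 9) {
        return "\n";
    }

    // Horizontal distance between the end of the last block and this one.
    $gap = $currentX - $lastX;

    if ($reversed) {
        // Reversed text advances to the left, so relevant gaps are negative.
        if ($gap < -abs($factorX * 8)) {
            return "\t";
        }
        if ($gap < -abs($factorX)) {
            return ' ';
        }

        return '';
    }

    // Left-to-right text: a large gap becomes a tab, a small one a space.
    if ($gap > $factorX * 8) {
        return "\t";
    }
    if ($gap > $factorX) {
        return ' ';
    }

    return '';
}
```

With an identity text matrix, $factorX is 10, so under this sketch a horizontal gap above 10 units would yield a space and a gap above 80 units a tab. In the listing the result is appended after the text in ReversedChars mode and prepended otherwise.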