Test Failed
Pull Request — master (#634)
by
unknown
02:08
created

PDFObject::parseDictionary() (rating: D)

Complexity: Conditions 19, Paths 26
Size: Total Lines 89, Code Lines 53
Duplication: Lines 0, Ratio 0 %
Code Coverage: Tests 45, CRAP Score 20.2404
Importance: Changes 1, Bugs 0, Features 0

Metric Value
cc 19
eloc 53
c 1
b 0
f 0
nc 26
nop 1
dl 0
loc 89
ccs 45
cts 53
cp 0.8491
crap 20.2404
rs 4.5166
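
The CRAP score in the table can be reproduced from the other metrics. The sketch below assumes the commonly cited formula CRAP = cc^2 * (1 - coverage)^3 + cc, which this page does not state; the function name `crapScore` is hypothetical. With cc = 19 and cp = 0.8491 (ccs 45 of cts 53) it gives the reported value.

```php
<?php

/**
 * Hypothetical helper: computes a CRAP score from cyclomatic complexity
 * and statement coverage (0.0 to 1.0), assuming the formula
 * CRAP = cc^2 * (1 - coverage)^3 + cc.
 */
function crapScore(int $complexity, float $coverage): float
{
    return $complexity ** 2 * (1 - $coverage) ** 3 + $complexity;
}

// Values taken from the metric table above: cc = 19, cp = 0.8491
echo round(crapScore(19, 0.8491), 4), "\n"; // prints 20.2404
```

Under this formula, full coverage would bring the score down to the complexity itself (19), so reducing the number of conditions matters more than adding tests here.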

How to fix   Long Method    Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include Extract Method.
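
As a minimal sketch of the extraction advice above, applied to parseDictionary(): the condition under the comment "Revert string type arrays back to a single element" could move into its own function, with the comment suggesting the name. The name `isSingleStringArray` is hypothetical and not part of the library, and the sketch uses strict comparisons where the original uses loose ones.

```php
<?php

/**
 * Hypothetical helper extracted from the ']' case of parseDictionary().
 * Returns true when $parsed is an array holding exactly one non-empty
 * string that is not a PDF name (does not start with a slash).
 *
 * @param mixed $parsed
 */
function isSingleStringArray($parsed): bool
{
    return \is_array($parsed)
        && 1 === \count($parsed)
        && isset($parsed[0])
        && \is_string($parsed[0])
        && '' !== $parsed[0]
        && '/' !== $parsed[0][0];
}

var_dump(isSingleStringArray(['(Hello)'])); // bool(true)
var_dump(isSingleStringArray(['/Name']));   // bool(false)
```

Inside the `case ']':` branch the check would then read `if (isSingleStringArray($parsed)) { ... }`, which removes five conditions from the method body.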

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 *
 * @date    2017-01-03
 *
 * @license LGPLv3
 *
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser;

use Smalot\PdfParser\XObject\Form;
use Smalot\PdfParser\XObject\Image;

/**
 * Class PDFObject
 */
class PDFObject
{
    public const TYPE = 't';

    public const OPERATOR = 'o';

    public const COMMAND = 'c';

    /**
     * The recursion stack.
     *
     * @var array
     */
    public static $recursionStack = [];

    /**
     * @var Document
     */
    protected $document;

    /**
     * @var Header
     */
    protected $header;

    /**
     * @var string
     */
    protected $content;

    /**
     * @var Config
     */
    protected $config;

    /**
     * @var bool
     */
    protected $addPositionWhitespace = false;

    public function __construct(
        Document $document,
        Header $header = null,
        string $content = null,
        Config $config = null
    ) {
        $this->document = $document;
        $this->header = $header ?? new Header();
        $this->content = $content;
        $this->config = $config;
    }

    public function init()
    {
    }

    public function getDocument(): Document
    {
        return $this->document;
    }

    public function getHeader(): ?Header
    {
        return $this->header;
    }

    public function getConfig(): ?Config
    {
        return $this->config;
    }

    /**
     * @return Element|PDFObject|Header
     */
    public function get(string $name)
    {
        return $this->header->get($name);
    }

    public function has(string $name): bool
    {
        return $this->header->has($name);
    }

    public function getDetails(bool $deep = true): array
    {
        return $this->header->getDetails($deep);
    }

    public function getContent(): ?string
    {
        return $this->content;
    }

    /**
     * Takes a string of PDF document stream text and formats it into
     * a multi-line string with one PDF command on each line, separated
     * by \r\n. If the given string is null, or binary data is detected
     * instead of a document stream then return an empty string.
     *
     * @internal For internal use only, not part of the public API
     */
    public function cleanContent(?string $content): string
    {
        if (null === $content) {
            return '';
        }

        // Find all strings () and replace them so they aren't affected
        // by the next steps
        $pdfstrings = [];
        $attempt = '(';
        while (preg_match('/'.preg_quote($attempt, '/').'.*?(?<![^\\\\]\\\\)\)/s', $content, $text)) {
            // PDF strings can contain unescaped parentheses as long as
            // they're balanced, so check for balanced parentheses
            $left = preg_match_all('/(?<![^\\\\]\\\\)\(/', $text[0]);
            $right = preg_match_all('/(?<![^\\\\]\\\\)\)/', $text[0]);

            if ($left == $right) {
                // Replace the string with a unique placeholder
                $id = uniqid('STRING_', true);
                $pdfstrings[$id] = $text[0];
                $content = preg_replace(
                    '/'.preg_quote($text[0], '/').'/',
                    '@@@'.$id.'@@@',
                    $content,
                    1
                );

                // Reset to search for the next string
                $attempt = '(';
            } else {
                // We had unbalanced parentheses, so use the current
                // match as a base to find a longer string
                $attempt = $text[0];
            }
        }

        // Remove all carriage returns and line-feeds from the document stream
        $content = str_replace(["\r", "\n"], ' ', trim($content));

        // Find all dictionary << >> commands and replace them so they
        // aren't affected by the next steps
        $dictstore = [];
        while (preg_match('/(<<.*?>> *)(BDC|BMC|DP|MP)/', $content, $dicttext)) {
            $dictid = uniqid('DICT_', true);
            $dictstore[$dictid] = $dicttext[1];
            $content = preg_replace(
                '/'.preg_quote($dicttext[0], '/').'/',
                ' ###'.$dictid.'###'.$dicttext[2],
                $content,
                1
            );
        }

        // Now that all strings and dictionaries are hidden, the only
        // PDF commands left should all be plain text.
        // Detect text encoding of the current string to prevent reading
        // content streams that are images, etc. This prevents PHP
        // error messages when JPEG content is sent to this function
        // by the sample file '12249.pdf' from:
        // https://github.com/smalot/pdfparser/issues/458
        if (false === mb_detect_encoding($content, null, true)) {
            return '';
        }

        // Normalize white-space in the document stream
        $content = preg_replace('/\s{2,}/', ' ', $content);

        // Find all valid PDF operators and add \r\n after each; this
        // ensures there is just one command on every line
        // Source: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf - Appendix A
        // Source: https://archive.org/download/pdf320002008/PDF32000_2008.pdf - Annex A
        // Note: PDF Reference 1.7 lists 'I' and 'rI' as valid commands, while
        //       PDF 32000:2008 lists them as 'i' and 'ri' respectively. Both versions
        //       appear here in the list for completeness.
        $operators = [
          'b*', 'b', 'BDC', 'BMC', 'B*', 'BI', 'BT', 'BX', 'B', 'cm', 'cs', 'c', 'CS',
          'd0', 'd1', 'd', 'Do', 'DP', 'EMC', 'EI', 'ET', 'EX', 'f*', 'f', 'F', 'gs',
          'g', 'G',  'h', 'i', 'ID', 'I', 'j', 'J', 'k', 'K', 'l', 'm', 'MP', 'M', 'n',
          'q', 'Q', 're', 'rg', 'ri', 'rI', 'RG', 'scn', 'sc', 'sh', 's', 'SCN', 'SC',
          'S', 'T*', 'Tc', 'Td', 'TD', 'Tf', 'TJ', 'Tj', 'TL', 'Tm', 'Tr', 'Ts', 'Tw',
          'Tz', 'v', 'w', 'W*', 'W', 'y', '\'', '"',
        ];
        foreach ($operators as $operator) {
            $content = preg_replace(
                '/(?<!\w|\/)'.preg_quote($operator, '/').'(?![\w10\*])/',
                $operator."\r\n",
                $content
            );
        }

        // Restore the original content of the dictionary << >> commands
        $dictstore = array_reverse($dictstore, true);
        foreach ($dictstore as $id => $dict) {
            $content = str_replace('###'.$id.'###', $dict, $content);
        }

        // Restore the original string content
        $pdfstrings = array_reverse($pdfstrings, true);
        foreach ($pdfstrings as $id => $text) {
            // Strings may contain escaped newlines, or literal newlines
            // and we should clean these up before replacing the string
            // back into the content stream; this ensures no strings are
            // split between two lines (every command must be on one line)
            $text = str_replace(
                ["\\\r\n", "\\\r", "\\\n", "\r", "\n"],
                ['', '', '', '\r', '\n'],
                $text
            );

            $content = str_replace('@@@'.$id.'@@@', $text, $content);
        }

        $content = trim(preg_replace(['/(\r\n){2,}/', '/\r\n +/'], "\r\n", $content));

        return $content;
    }

    /**
     * getSectionsText() now takes an entire, unformatted document
     * stream as a string, cleans it, then filters out commands that
     * aren't needed for text positioning/extraction. It returns an
     * array of unprocessed PDF commands, one command per element.
     */
    public function getSectionsText(?string $content): array
    {
        $sections = [];

        // A cleaned stream has one command on every line, so split the
        // cleaned stream content on \r\n into an array
        $textCleaned = preg_split(
            '/(\r\n|\n|\r)/',
            $this->cleanContent($content),
            -1,
            \PREG_SPLIT_NO_EMPTY
        );

        $inTextBlock = false;
        foreach ($textCleaned as $line) {
            $line = trim($line);

            // Skip empty lines
            if ('' === $line) {
                continue;
            }

            // If a 'BT' is encountered, set the $inTextBlock flag
            if (preg_match('/BT$/', $line)) {
                $inTextBlock = true;
                $sections[] = $line;

                // If an 'ET' is encountered, unset the $inTextBlock flag
            } elseif ('ET' == $line) {
                $inTextBlock = false;
                $sections[] = $line;
            } elseif ($inTextBlock) {
                // If we are inside a BT ... ET text block, save all lines
                $sections[] = trim($line);
            } else {
                // Otherwise, if we are outside of a text block, only
                // save specific, necessary lines. Care should be taken
                // to ensure a command being checked for *only* matches
                // that command. For instance, a simple search for 'c'
                // may also match the 'sc' command. See the command
                // list in the cleanContent() method above.
                // Add more commands to save here as you find them in
                // weird PDFs!
                if ('q' == $line[-1] || 'Q' == $line[-1]) {
                    // Save and restore graphics state commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)B[DM]C$/', $line)) {
                    // Begin marked content sequence
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)[DM]P$/', $line)) {
                    // Marked content point
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)EMC$/', $line)) {
                    // End marked content sequence
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)cm$/', $line)) {
                    // Graphics position change commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)Tf$/', $line)) {
                    // Font change commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)Do$/', $line)) {
                    // Invoke named XObject command
                    $sections[] = $line;
                }
            }
        }

        return $sections;
    }

    private function getDefaultFont(Page $page = null): Font
    {
        $fonts = [];
        if (null !== $page) {
            $fonts = $page->getFonts();
        }

        $firstFont = $this->document->getFirstFont();
        if (null !== $firstFont) {
            $fonts[] = $firstFont;
        }

        if (\count($fonts) > 0) {
            return reset($fonts);
        }

        return new Font($this->document, null, null, $this->config);
    }

    /**
     * Decode a '[]TJ' command and attempt to use alternate fonts if
     * the current font results in output that contains Unicode control
     * characters. See Font::decodeText for a full description of
     * $textMatrix
     *
     * @param array<int,array<string,string|bool>> $command
     * @param array<string,float>                  $textMatrix
     */
    private function getTJUsingFontFallback(
        Font $font,
        array $command,
        array $textMatrix = ['a' => 1, 'b' => 0, 'i' => 0, 'j' => 1],
        Page $page = null
    ): string {
        $orig_text = $font->decodeText($command, $textMatrix);
        $text = $orig_text;

        // If we make this a Config option, we can add a check if it's
        // enabled here.
        if (null !== $page) {
            $font_ids = array_keys($page->getFonts());

            // If the decoded text contains UTF-8 control characters
            // then the font page being used is probably the wrong one.
            // Loop through the rest of the fonts to see if we can get
            // a good decode. Allow x09 to x0d which are whitespace.
            while (preg_match('/[\x00-\x08\x0e-\x1f\x7f]/u', $text) || false !== strpos(bin2hex($text), '00')) {
                // If we're out of font IDs, then give up and use the
                // original string
                if (0 == \count($font_ids)) {
                    return $orig_text;
                }

                // Try the next font ID
                $font = $page->getFont(array_shift($font_ids));
                $text = $font->decodeText($command, $textMatrix);
            }
        }

        return $text;
    }

    /**
     * Expects a string that is a full PDF dictionary object, including
     * the outer enclosing << >> angle brackets.
     *
     * @throws \Exception
     */
    public function parseDictionary(string $dictionary): array
    {
        // Normalize whitespace
        $dictionary = preg_replace(['/\r/', '/\n/', '/\s{2,}/'], ' ', trim($dictionary));

        if ('<<' != substr($dictionary, 0, 2)) {
            throw new \Exception('Not a valid dictionary object.');
        }

        $parsed = [];
        $stack = [];
        $currentName = '';
        $arrayTypeNumeric = false;

        // Remove outer layer of dictionary, and split on tokens
        $split = preg_split(
            '/(<<|>>|\[|\]|\/[^\s\/\[\]\(\)<>]*)/',
            trim(preg_replace('/^<<|>>$/', '', $dictionary)),
            -1,
            \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE
        );

        foreach ($split as $token) {
            $token = trim($token);
            switch ($token) {
                case '':
                    break;

                    // Open numeric array
                case '[':
                    $parsed[$currentName] = [];
                    $arrayTypeNumeric = true;

                    // Move up one level in the stack
                    $stack[\count($stack)] = &$parsed;
                    $parsed = &$parsed[$currentName];
                    $currentName = '';
                    break;

                    // Open hashed array
                case '<<':
                    $parsed[$currentName] = [];
                    $arrayTypeNumeric = false;

                    // Move up one level in the stack
                    $stack[\count($stack)] = &$parsed;
                    $parsed = &$parsed[$currentName];
                    $currentName = '';
                    break;

                    // Close numeric array
                case ']':
                    // Revert string type arrays back to a single element
                    if (\is_array($parsed) && 1 == \count($parsed)
                        && isset($parsed[0]) && \is_string($parsed[0])
                        && '' !== $parsed[0] && '/' != $parsed[0][0]) {
                        $parsed = '['.$parsed[0].']';
                    }
                    // Close hashed array
                    // no break
                case '>>':
                    $arrayTypeNumeric = false;

                    // Move down one level in the stack
                    $parsed = &$stack[\count($stack) - 1];
                    unset($stack[\count($stack) - 1]);
                    break;

                default:
                    // If value begins with a slash, then this is a name
                    // Add it to the appropriate array
                    if ('/' == substr($token, 0, 1)) {
                        $currentName = substr($token, 1);
                        // Scrutinizer (Coding Style, Best Practice): It seems like you
                        // are loosely comparing two booleans. Consider using the strict
                        // comparison === instead. When comparing two booleans, it is
                        // generally considered safer to use the strict comparison operator.
                        if (true == $arrayTypeNumeric) {
                            $parsed[] = $currentName;
                            $currentName = '';
                        }
                    } elseif ('' != $currentName) {
                        // Scrutinizer (Coding Style, Best Practice): It seems like you
                        // are loosely comparing two booleans. Consider using the strict
                        // comparison === instead. When comparing two booleans, it is
                        // generally considered safer to use the strict comparison operator.
                        if (false == $arrayTypeNumeric) {
                            $parsed[$currentName] = $token;
                        }
                        $currentName = '';
                    } elseif ('' == $currentName) {
                        $parsed[] = $token;
                    }
            }
        }

        return $parsed;
    }

    /**
     * getText() leverages getTextArray() to get the content of the
     * document, setting the addPositionWhitespace flag to true so
     * whitespace is inserted in a logical way for reading by humans.
     */
    public function getText(Page $page = null): string
    {
        $this->addPositionWhitespace = true;
        $result = $this->getTextArray($page);
        $this->addPositionWhitespace = false;

        return implode('', $result).' ';
    }

    /**
     * getTextArray() returns the text objects of a document in an
     * array. By default no positioning whitespace is added to the
     * output unless the addPositionWhitespace flag is set to true.
     *
     * @throws \Exception
     */
    public function getTextArray(Page $page = null): array
    {
        $result = [];
        $text = [];

        $marked_stack = [];
        $last_written_position = false;

        $sections = $this->getSectionsText($this->content);
        $current_font = $this->getDefaultFont($page);

        $current_position = ['x' => false, 'y' => false];
        $current_position_tm = [
            'a' => 1, 'b' => 0, 'c' => 0,
            'i' => 0, 'j' => 1, 'k' => 0,
            'x' => false, 'y' => false, 'z' => 1,
        ];
        $current_position_td = ['x' => 0, 'y' => 0];
        $current_position_cm = [
            'a' => 1, 'b' => 0, 'c' => 0,
            'i' => 0, 'j' => 1, 'k' => 0,
            'x' => 0, 'y' => 0, 'z' => 1,
        ];

        $clipped_font = [];
        $clipped_position_cm = [];

        self::$recursionStack[] = $this->getUniqueId();

        foreach ($sections as $section) {
            $commands = $this->getCommandsText($section);
            foreach ($commands as $command) {
                switch ($command[self::OPERATOR]) {
                    // Begin text object
                    case 'BT':
                        // Reset text positioning matrices
                        $current_position_tm = [
                            'a' => 1, 'b' => 0, 'c' => 0,
                            'i' => 0, 'j' => 1, 'k' => 0,
                            'x' => false, 'y' => false, 'z' => 1,
                        ];
                        $current_position_td = ['x' => 0, 'y' => 0];
                        break;

                        // Begin marked content sequence with property list
                    case 'BDC':
                        // Scrutinizer (Bug): It seems like $command[self::COMMAND] can
                        // also be of type array and array<mixed,array<string,mixed|string>>;
                        // however, parameter $subject of preg_match() only seems to accept
                        // string. Maybe add an additional type check? If this is a false
                        // positive, it can be ignored via the ignore-type annotation:
                        //     if (preg_match('/(<<.*>>)$/', /** @scrutinizer ignore-type */ $command[self::COMMAND], $match)) {
                        if (preg_match('/(<<.*>>)$/', $command[self::COMMAND], $match)) {
                            $dict = $this->parseDictionary($match[1]);

                            // Check for ActualText block
                            if (isset($dict['ActualText']) && \is_string($dict['ActualText']) && '' !== $dict['ActualText']) {
                                if ('[' == $dict['ActualText'][0]) {
                                    // Simulate a 'TJ' command on the stack
                                    $marked_stack[] = [
                                        'ActualText' => $this->getCommandsText($dict['ActualText'].'TJ')[0],
                                    ];
                                } elseif ('<' == $dict['ActualText'][0] || '(' == $dict['ActualText'][0]) {
                                    // Simulate a 'Tj' command on the stack
                                    $marked_stack[] = [
                                        'ActualText' => $this->getCommandsText($dict['ActualText'].'Tj')[0],
                                    ];
                                }
                            }
                        }
                        break;

                        // Begin marked content sequence
                    case 'BMC':
                        if ('ReversedChars' == $command[self::COMMAND]) {
                            // Upon encountering a ReversedChars command,
                            // add the characters we've built up so far to
                            // the result array
                            $result = array_merge($result, $text);

                            // Start a fresh $text array that will contain
                            // reversed characters
                            $text = [];

                            // Add the reversed text flag to the stack
                            $marked_stack[] = ['ReversedChars' => true];
                        }
                        break;

                        // set graphics position matrix
                    case 'cm':
                        // Scrutinizer (Bug): It seems like $command[self::COMMAND] can
                        // also be of type array and array<mixed,array<string,mixed|string>>;
                        // however, parameter $subject of preg_split() only seems to accept
                        // string. Maybe add an additional type check? If this is a false
                        // positive, it can be ignored via the ignore-type annotation:
                        //     $args = preg_split('/\s+/s', /** @scrutinizer ignore-type */ $command[self::COMMAND]);
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $current_position_cm = [
                            'a' => (float) $args[0], 'b' => (float) $args[1], 'c' => 0,
                            'i' => (float) $args[2], 'j' => (float) $args[3], 'k' => 0,
                            'x' => (float) $args[4], 'y' => (float) $args[5], 'z' => 1,
                        ];
                        break;

                    case 'Do':
                        if (null !== $page) {
                            $args = preg_split('/\s/s', $command[self::COMMAND]);
                            $id = trim(array_pop($args), '/ ');
                            $xobject = $page->getXObject($id);

                            // @todo $xobject could be a ElementXRef object, which would then throw an error
                            if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack)) {
                                // Not a circular reference.
                                $text[] = $xobject->getText($page);
                            }
                        }
                        break;

                        // Marked content point with (DP) & without (MP) property list
                    case 'DP':
                    case 'MP':
                        break;

                        // End text object
                    case 'ET':
                        break;

                        // Store current selected font and graphics matrix
                    case 'q':
                        $clipped_font[] = $current_font;
                        $clipped_position_cm[] = $current_position_cm;
                        break;

                        // Restore previous selected font and graphics matrix
                    case 'Q':
                        $current_font = array_pop($clipped_font);
                        $current_position_cm = array_pop($clipped_position_cm);
                        break;

                        // End marked content sequence
646 9
                    case 'EMC':
647 9
                        $data = false;
648
                        if (\count($marked_stack)) {
649 14
                            $marked = array_pop($marked_stack);
650
                            $action = key($marked);
651 29
                            $data = $marked[$action];
652 29
653 22
                            switch ($action) {
654 22
                                // If we are in ReversedChars mode...
655 22
                                case 'ReversedChars':
656 22
                                    // Reverse the characters we've built up so far
657 22
                                    foreach ($text as $key => $t) {
658 22
                                        $text[$key] = implode('', array_reverse(
659 22
                                            mb_str_split($t, 1, mb_internal_encoding())
0 ignored issues
show
Bug introduced by
It seems like mb_internal_encoding() can also be of type true; however, parameter $encoding of mb_str_split() does only seem to accept null|string, maybe add an additional type check? ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type  annotation

659
                                            mb_str_split($t, 1, /** @scrutinizer ignore-type */ mb_internal_encoding())
Loading history...
660
                                        ));
661
                                    }
662 22
663 22
                                    // Add these characters to the result array
664 22
                                    $result = array_merge($result, $text);
665
666
                                    // Start a fresh $text array that will contain
667 16
                                    // non-reversed characters
668 16
                                    $text = [];
669
                                    break;
670 22
671
                                case 'ActualText':
672
                                    // Use the content of the ActualText as a command
673
                                    $command = $data;
674
                                    break;
675 22
                            }
676
                        }
677 22
678 22
                        // If this EMC command has been transformed into a 'Tj'
679
                        // or 'TJ' command because of being ActualText, then bypass
680 22
                        // the break to proceed to the writing section below.
681
                        if ('Tj' != $command[self::OPERATOR] && 'TJ' != $command[self::OPERATOR]) {
682 22
                            break;
683 22
                        }
684
685 22
                        // no break
                    case "'":
                    case '"':
                        if ("'" == $command[self::OPERATOR] || '"' == $command[self::OPERATOR]) {
                            // Move to next line and write text
                            $current_position['x'] = 0;
                            $current_position_td['x'] = 0;
                            $current_position_td['y'] += 10;
                        }
                        // no break
                    case 'Tj':
                        $command[self::COMMAND] = [$command];
                        // no break
                    case 'TJ':
                        // Check the marked content stack for flags
                        $actual_text = false;
                        $reverse_text = false;
                        foreach ($marked_stack as $marked) {
                            if (isset($marked['ActualText'])) {
                                $actual_text = true;
                            }
                            if (isset($marked['ReversedChars'])) {
                                $reverse_text = true;
                            }
                        }

                        // Account for text position ONLY just before we write text
                        if (false === $actual_text && \is_array($last_written_position)) {
                            // If $last_written_position is an array, that
                            // means we have stored text position coordinates
                            // for placing an ActualText
                            $currentX = $last_written_position[0];
                            $currentY = $last_written_position[1];
                            $last_written_position = false;
                        } else {
                            $currentX = $current_position_cm['x'] + $current_position_tm['x'] + $current_position_td['x'];
                            $currentY = $current_position_cm['y'] + $current_position_tm['y'] + $current_position_td['y'];
                        }
                        $whiteSpace = '';

                        if (true === $this->addPositionWhitespace && false !== $current_position['x']) {
                            if (abs($currentY - $current_position['y']) > 9) {
                                $whiteSpace = "\n";
                            } else {
                                $curX = $currentX - $current_position['x'];
                                $factorX = 10 * $current_position_tm['a'] + 10 * $current_position_tm['i'];
                                if (true === $reverse_text) {
                                    if ($curX < -abs($factorX * 8)) {
                                        $whiteSpace = "\t";
                                    } elseif ($curX < -abs($factorX)) {
                                        $whiteSpace = ' ';
                                    }
                                } else {
                                    if ($curX > ($factorX * 8)) {
                                        $whiteSpace = "\t";
                                    } elseif ($curX > $factorX) {
                                        $whiteSpace = ' ';
                                    }
                                }
                            }
                        }

                        $newtext = $this->getTJUsingFontFallback(
                            $current_font,
                            $command[self::COMMAND],
                            $current_position_tm,
                            $page
                        );

                        // If there is no ActualText pending then write
                        if (false === $actual_text) {
                            if (false !== $reverse_text) {
                                // If we are in ReversedChars mode, add the whitespace last
                                $text[] = str_replace(["\r", "\n"], '', $newtext).$whiteSpace;
                            } else {
                                // Otherwise add the whitespace first
                                $text[] = $whiteSpace.str_replace(["\r", "\n"], '', $newtext);
                            }

                            // Record the position of this inserted text for comparison
                            // with the next text block.
                            // Provide a 'fudge' factor guess on how wide this text block
                            // is based on the number of characters. This helps limit the
                            // number of tabs inserted, but isn't perfect.
                            $factor = 6;
                            if (true === $reverse_text) {
                                $factor = -$factor;
                            }
                            $current_position = [
                                'x' => $currentX + mb_strlen($newtext) * $factor,
                                'y' => $currentY,
                            ];
                        } elseif (false === $last_written_position) {
                            // If there is an ActualText in the pipeline
                            // store the position this undisplayed text
                            // *would* have been written to, so the
                            // ActualText is displayed in the right spot
                            $last_written_position = [$currentX, $currentY];
                        }
                        break;

                        // move to start of next line
                    case 'T*':
                        $current_position['x'] = 0;
                        $current_position_td['x'] = 0;
                        $current_position_td['y'] += 10;
                        break;

                        // set character spacing
                    case 'Tc':
                        break;

                        // move text current point and set leading
                    case 'Td':
                    case 'TD':
                        // move text current point
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $y = (float) array_pop($args);
                        $x = (float) array_pop($args);

                        $current_position_td = [
                            'x' => $current_position_td['x'] + $x * $current_position_tm['a'] + $x * $current_position_tm['i'],
                            'y' => $current_position_td['y'] + $y * $current_position_tm['b'] + $y * $current_position_tm['j'],
                        ];
                        break;

                    case 'Tf':
                        list($id) = preg_split('/\s/s', $command[self::COMMAND]);
                        $id = trim($id, '/');
                        if (null !== $page) {
                            $new_font = $page->getFont($id);
                            // If an invalid font ID is given, do not update the font.
                            // This should theoretically never happen, as the PDF spec states for the Tf operator:
                            // "The specified font value shall match a resource name in the Font entry of the default resource dictionary"
                            // (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, page 435)
                            // But we want to make sure that malformed PDFs do not simply crash.
                            if (null !== $new_font) {
                                $current_font = $new_font;
                            }
                        }
                        break;

                        // set leading
                    case 'TL':
                        break;

                        // set text position matrix
                    case 'Tm':
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $current_position_tm = [
                            'a' => (float) $args[0], 'b' => (float) $args[1], 'c' => 0,
                            'i' => (float) $args[2], 'j' => (float) $args[3], 'k' => 0,
                            'x' => (float) $args[4], 'y' => (float) $args[5], 'z' => 1,
                        ];
                        break;

                        // set text rendering mode
                    case 'Tr':
                        break;

                        // set super/subscripting text rise
                    case 'Ts':
                        break;

                        // set word spacing
                    case 'Tw':
                        break;

                        // set horizontal scaling
                    case 'Tz':
                        break;

                    default:
                }
            }
        }

        $result = array_merge($result, $text);

        return $result;
    }

    /**
     * getCommandsText() expects the content of $text_part to be an
     * already formatted, single-line command from a document stream.
     * The companion function getSectionsText() returns a document
     * stream as an array of single commands for just this purpose.
     *
     * A better name for this function would be getCommandText()
     * since it now always works on just one command.
     */
    public function getCommandsText(string $text_part): array
    {
        $commands = $matches = [];

        // If no valid command is detected, return an empty array instead
        // of reading undefined offsets of $matches below.
        if (!preg_match('/^(([\/\[\(<])?.*)(?<!\w)([a-z01\'\"*]+)$/i', $text_part, $matches)) {
            return [];
        }

        $type = $matches[2];
        $operator = $matches[3];
        $command = trim($matches[1]);

        if ('TJ' == $operator) {
            $subcommand = [];
            $command = trim($command, '[]');
            do {
                $oldCommand = $command;

                // Search for parentheses string () format
                if (preg_match('/^ *\((.*?)(?<![^\\\\]\\\\)\) *(-?[\d.]+)?/', $command, $tjmatch)) {
                    $subcommand[] = [
                        self::TYPE => '(',
                        self::OPERATOR => 'TJ',
                        self::COMMAND => $tjmatch[1],
                    ];
                    if (isset($tjmatch[2]) && trim($tjmatch[2])) {
                        $subcommand[] = [
                            self::TYPE => 'n',
                            self::OPERATOR => '',
                            self::COMMAND => $tjmatch[2],
                        ];
                    }
                    $command = substr($command, \strlen($tjmatch[0]));
                }

                // Search for hexadecimal <> format
                if (preg_match('/^ *<([0-9a-f\s]*)> *(-?[\d.]+)?/i', $command, $tjmatch)) {
                    $tjmatch[1] = preg_replace('/\s/', '', $tjmatch[1]);
                    $subcommand[] = [
                        self::TYPE => '<',
                        self::OPERATOR => 'TJ',
                        self::COMMAND => $tjmatch[1],
                    ];
                    if (isset($tjmatch[2]) && trim($tjmatch[2])) {
                        $subcommand[] = [
                            self::TYPE => 'n',
                            self::OPERATOR => '',
                            self::COMMAND => $tjmatch[2],
                        ];
                    }
                    $command = substr($command, \strlen($tjmatch[0]));
                }
            } while ($command != $oldCommand);

            $command = $subcommand;
        } elseif ('Tj' == $operator || "'" == $operator || '"' == $operator) {
            // Depending on the string type, trim the data of the
            // appropriate delimiters
            if ('(' == $type) {
                // Don't use trim() here since a () string may end with
                // a balanced or escaped right parentheses, and trim()
                // will delete both. Both strings below are valid:
                //   eg. (String())
                //   eg. (String\))
                $command = preg_replace('/^\(|\)$/', '', $command);
            } elseif ('<' == $type) {
                $command = trim($command, '<>');
            }
        } elseif ('/' == $type) {
            $command = substr($command, 1);
        }

        $commands[] = [
            self::TYPE => $type,
            self::OPERATOR => $operator,
            self::COMMAND => $command,
        ];

        return $commands;
    }

    public static function factory(
        Document $document,
        Header $header,
        ?string $content,
        ?Config $config = null
    ): self {
        switch ($header->get('Type')->getContent()) {
            case 'XObject':
                switch ($header->get('Subtype')->getContent()) {
                    case 'Image':
                        // $config may be null, so guard the call. Assumption: without
                        // a Config, the image content is retained.
                        $retain = null === $config || $config->getRetainImageContent();

                        return new Image($document, $header, $retain ? $content : null, $config);

                    case 'Form':
                        return new Form($document, $header, $content, $config);
                }

                return new self($document, $header, $content, $config);

            case 'Pages':
                return new Pages($document, $header, $content, $config);

            case 'Page':
                return new Page($document, $header, $content, $config);

            case 'Encoding':
                return new Encoding($document, $header, $content, $config);

            case 'Font':
                $subtype = $header->get('Subtype')->getContent();
                $classname = '\Smalot\PdfParser\Font\Font'.$subtype;

                if (class_exists($classname)) {
                    return new $classname($document, $header, $content, $config);
                }

                return new Font($document, $header, $content, $config);

            default:
                return new self($document, $header, $content, $config);
        }
    }

    /**
     * Returns unique id identifying the object.
     */
    protected function getUniqueId(): string
    {
        return spl_object_hash($this);
    }
}