Test Failed
Pull Request — master (#634)
by
unknown
02:14
created

PDFObject::getCommandsText() (rating: D)

Complexity

Conditions 18
Paths 15

Size

Total Lines 82
Code Lines 49

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 0
CRAP Score 342

Importance

Changes 2
Bugs 0 Features 0
Metric Value
cc 18
eloc 49
c 2
b 0
f 0
nc 15
nop 2
dl 0
loc 82
ccs 0
cts 0
cp 0
crap 342
rs 4.8666
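
The CRAP score in the table is consistent with the commonly cited Change Risk Anti-Patterns formula, CRAP = cc^2 * (1 - coverage)^3 + cc. The sketch below assumes that formula; the helper name `crapScore` is hypothetical and is not taken from the analysis tool itself.

```php
<?php

/**
 * Hypothetical helper: computes a CRAP score from cyclomatic complexity
 * and code coverage expressed as a fraction between 0.0 and 1.0.
 */
function crapScore(int $complexity, float $coverage): float
{
    return $complexity ** 2 * (1.0 - $coverage) ** 3 + $complexity;
}

// Values from the metric table: cc = 18 and no covered statements (cp = 0).
echo crapScore(18, 0.0), PHP_EOL; // 342
```

Under this formula, full coverage would bring the score down to the complexity itself (18), so adding tests lowers the score, but only splitting the method reduces the complexity term.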

How to fix   Long Method    Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include Extract Method.
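
Extract Method is the refactoring that follows most directly from the advice above: a commented step in a method body becomes its own method, named after the comment. The sketch below is a generic illustration with hypothetical function names (`summarizeBefore`, `dropBlankLines`, `summarizeAfter`); it is not code from the PdfParser library.

```php
<?php

// Before: one function with a commented step in its body.
function summarizeBefore(array $lines): string
{
    // Drop blank lines and surrounding whitespace
    $clean = [];
    foreach ($lines as $line) {
        $line = trim($line);
        if ('' !== $line) {
            $clean[] = $line;
        }
    }

    return count($clean).' lines: '.implode(' | ', $clean);
}

// After: the commented step is extracted, and the comment supplies the name.
function dropBlankLines(array $lines): array
{
    $clean = [];
    foreach ($lines as $line) {
        $line = trim($line);
        if ('' !== $line) {
            $clean[] = $line;
        }
    }

    return $clean;
}

function summarizeAfter(array $lines): string
{
    $clean = dropBlankLines($lines);

    return count($clean).' lines: '.implode(' | ', $clean);
}
```

Both versions return the same string for the same input; the second simply moves the detail behind a name that can be tested on its own.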

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 *
 * @date    2017-01-03
 *
 * @license LGPLv3
 *
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser;

use Smalot\PdfParser\XObject\Form;
use Smalot\PdfParser\XObject\Image;

/**
 * Class PDFObject
 */
class PDFObject
{
    public const TYPE = 't';

    public const OPERATOR = 'o';

    public const COMMAND = 'c';

    /**
     * The recursion stack.
     *
     * @var array
     */
    public static $recursionStack = [];

    /**
     * @var Document
     */
    protected $document;

    /**
     * @var Header
     */
    protected $header;

    /**
     * @var string
     */
    protected $content;

    /**
     * @var Config
     */
    protected $config;

    /**
     * @var bool
     */
    protected $addPositionWhitespace = false;

    public function __construct(
        Document $document,
        Header $header = null,
        string $content = null,
        Config $config = null
    ) {
        $this->document = $document;
        $this->header = $header ?? new Header();
        $this->content = $content;
        $this->config = $config;
    }

    public function init()
    {
    }

    public function getDocument(): Document
    {
        return $this->document;
    }

    public function getHeader(): ?Header
    {
        return $this->header;
    }

    public function getConfig(): ?Config
    {
        return $this->config;
    }

    /**
     * @return Element|PDFObject|Header
     */
    public function get(string $name)
    {
        return $this->header->get($name);
    }

    public function has(string $name): bool
    {
        return $this->header->has($name);
    }

    public function getDetails(bool $deep = true): array
    {
        return $this->header->getDetails($deep);
    }

    public function getContent(): ?string
    {
        return $this->content;
    }

    /**
     * Creates a duplicate of the document stream with
     * strings and other items replaced by $char. Formerly
     * getSectionsText() used this output to more easily gather offset
     * values to extract text from the *actual* document stream.
     *
     * @deprecated function is no longer used and will be removed in a future release
     *
     * @internal
     */
    public function cleanContent(string $content, string $char = 'X')
    {
        $char = $char[0];
        $content = str_replace(['\\\\', '\\)', '\\('], $char.$char, $content);

        // Remove image bloc with binary content
        preg_match_all('/\s(BI\s.*?(\sID\s).*?(\sEI))\s/s', $content, $matches, \PREG_OFFSET_CAPTURE);
        foreach ($matches[0] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        // Clean content in square brackets [.....]
        preg_match_all('/\[((\(.*?\)|[0-9\.\-\s]*)*)\]/s', $content, $matches, \PREG_OFFSET_CAPTURE);
Unused Code: The call to preg_match_all() has too many arguments starting with PREG_OFFSET_CAPTURE. (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

        /** @scrutinizer ignore-call */
        preg_match_all('/\[((\(.*?\)|[0-9\.\-\s]*)*)\]/s', $content, $matches, \PREG_OFFSET_CAPTURE);

This check compares calls to functions or methods with their respective definitions. If the call has more arguments than are defined, it raises an issue.

If a function is defined several times with a different number of parameters, the check may pick up the wrong definition and report false positives. One codebase where this has been known to happen is WordPress. Please note the @ignore annotation hint above.
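
For reference, PHP's built-in preg_match_all() takes its flags as the fourth parameter, which suggests the report above is a false positive. A minimal sketch with an invented sample string, showing what PREG_OFFSET_CAPTURE changes in the result:

```php
<?php

// Invented sample input, loosely resembling a PDF content stream.
$content = 'BT (Hello) Tj (World) Tj ET';

// Fourth argument: flags. With PREG_OFFSET_CAPTURE every match becomes
// a [text, byteOffset] pair instead of a plain string.
preg_match_all('/\((.*?)\)/s', $content, $matches, \PREG_OFFSET_CAPTURE);

foreach ($matches[1] as [$text, $offset]) {
    echo $offset, ': ', $text, PHP_EOL;
}
```

The offsets are what cleanContent() passes to substr_replace() to overwrite each match in place.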

        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        // Clean content in round brackets (.....)
        preg_match_all('/\((.*?)\)/s', $content, $matches, \PREG_OFFSET_CAPTURE);
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        // Clean structure
        if ($parts = preg_split('/(<|>)/s', $content, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE)) {
Bug: It seems like $content can also be of type array; however, parameter $subject of preg_split() only seems to accept string. Maybe add an additional type check? (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-type annotation:

        if ($parts = preg_split('/(<|>)/s', /** @scrutinizer ignore-type */ $content, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE)) {
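
An explicit guard is one alternative to the annotation. The sketch below is only an assumption about how such a check could look, written as a hypothetical free function (`splitOnAngleBrackets`) rather than as part of the library's method. substr_replace() is declared as returning either a string or an array, which is presumably where the inferred array type comes from.

```php
<?php

/**
 * Hypothetical helper: splits a stream on angle brackets, rejecting
 * anything that is not a string before it reaches preg_split().
 *
 * @param mixed $content
 *
 * @return string[]
 */
function splitOnAngleBrackets($content): array
{
    if (!\is_string($content)) {
        throw new \InvalidArgumentException('Expected string content.');
    }

    $parts = preg_split('/(<|>)/s', $content, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE);

    // preg_split() returns false on failure; normalize that to an empty list.
    return false === $parts ? [] : $parts;
}
```

Whether to throw or to fall back to an empty result is a design choice; throwing makes an unexpected type visible instead of silently producing no output.
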
            $content = '';
            $level = 0;
            foreach ($parts as $part) {
                if ('<' == $part) {
                    ++$level;
                }

                $content .= (0 == $level ? $part : str_repeat($char, \strlen($part)));

                if ('>' == $part) {
                    --$level;
                }
            }
        }

        // Clean BDC and EMC markup
        preg_match_all(
            '/(\/[A-Za-z0-9\_]*\s*'.preg_quote($char).'*BDC)/s',
            $content,
            $matches,
            \PREG_OFFSET_CAPTURE
        );
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        preg_match_all('/\s(EMC)\s/s', $content, $matches, \PREG_OFFSET_CAPTURE);
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        return $content;
    }

    /**
     * Takes a string of PDF document stream text and formats
     * it into a multi-line string with one PDF command on each line,
     * separated by \r\n. If the given string is null, or binary data
     * is detected instead of a document stream then return an empty
     * string.
     *
     * @internal
     */
    public function formatContent(?string $content): string
    {
        if (null === $content) {
            return '';
        }

        // Find all strings () and replace them so they aren't affected
        // by the next steps
        $pdfstrings = [];
        $attempt = '(';
        while (preg_match('/'.preg_quote($attempt, '/').'.*?(?<![^\\\\]\\\\)\)/s', $content, $text)) {
            // PDF strings can contain unescaped parentheses as long as
            // they're balanced, so check for balanced parentheses
            $left = preg_match_all('/(?<![^\\\\]\\\\)\(/', $text[0]);
            $right = preg_match_all('/(?<![^\\\\]\\\\)\)/', $text[0]);

            if ($left == $right) {
                // Replace the string with a unique placeholder
                $id = uniqid('STRING_', true);
                $pdfstrings[$id] = $text[0];
                $content = preg_replace(
                    '/'.preg_quote($text[0], '/').'/',
                    '@@@'.$id.'@@@',
                    $content,
                    1
                );

                // Reset to search for the next string
                $attempt = '(';
            } else {
                // We had unbalanced parentheses, so use the current
                // match as a base to find a longer string
                $attempt = $text[0];
            }
        }

        // Remove all carriage returns and line-feeds from the document stream
        $content = str_replace(["\r", "\n"], ' ', trim($content));

        // Find all dictionary << >> commands and replace them so they
        // aren't affected by the next steps
        $dictstore = [];
        while (preg_match('/(<<.*?>> *)(BDC|BMC|DP|MP)/', $content, $dicttext)) {
            $dictid = uniqid('DICT_', true);
            $dictstore[$dictid] = $dicttext[1];
            $content = preg_replace(
                '/'.preg_quote($dicttext[0], '/').'/',
                ' ###'.$dictid.'###'.$dicttext[2],
                $content,
                1
            );
        }

        // Now that all strings and dictionaries are hidden, the only
        // PDF commands left should all be plain text.
        // Detect text encoding of the current string to prevent reading
        // content streams that are images, etc. This prevents PHP
        // error messages when JPEG content is sent to this function
        // by the sample file '12249.pdf' from:
        // https://github.com/smalot/pdfparser/issues/458
        if (false === mb_detect_encoding($content, null, true)) {
            return '';
        }

        // Normalize white-space in the document stream
        $content = preg_replace('/\s{2,}/', ' ', $content);

        // Find all valid PDF operators and add \r\n after each; this
        // ensures there is just one command on every line
        // Source: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf - Appendix A
        // Source: https://archive.org/download/pdf320002008/PDF32000_2008.pdf - Annex A
        // Note: PDF Reference 1.7 lists 'I' and 'rI' as valid commands, while
        //       PDF 32000:2008 lists them as 'i' and 'ri' respectively. Both versions
        //       appear here in the list for completeness.
        $operators = [
          'b*', 'b', 'BDC', 'BMC', 'B*', 'BI', 'BT', 'BX', 'B', 'cm', 'cs', 'c', 'CS',
          'd0', 'd1', 'd', 'Do', 'DP', 'EMC', 'EI', 'ET', 'EX', 'f*', 'f', 'F', 'gs',
          'g', 'G',  'h', 'i', 'ID', 'I', 'j', 'J', 'k', 'K', 'l', 'm', 'MP', 'M', 'n',
          'q', 'Q', 're', 'rg', 'ri', 'rI', 'RG', 'scn', 'sc', 'sh', 's', 'SCN', 'SC',
          'S', 'T*', 'Tc', 'Td', 'TD', 'Tf', 'TJ', 'Tj', 'TL', 'Tm', 'Tr', 'Ts', 'Tw',
          'Tz', 'v', 'w', 'W*', 'W', 'y', '\'', '"',
        ];
        foreach ($operators as $operator) {
            $content = preg_replace(
                '/(?<!\w|\/)'.preg_quote($operator, '/').'(?![\w10\*])/',
                $operator."\r\n",
                $content
            );
        }

        // Restore the original content of the dictionary << >> commands
        $dictstore = array_reverse($dictstore, true);
        foreach ($dictstore as $id => $dict) {
            $content = str_replace('###'.$id.'###', $dict, $content);
        }

        // Restore the original string content
        $pdfstrings = array_reverse($pdfstrings, true);
        foreach ($pdfstrings as $id => $text) {
            // Strings may contain escaped newlines, or literal newlines
            // and we should clean these up before replacing the string
            // back into the content stream; this ensures no strings are
            // split between two lines (every command must be on one line)
            $text = str_replace(
                ["\\\r\n", "\\\r", "\\\n", "\r", "\n"],
                ['', '', '', '\r', '\n'],
                $text
            );

            $content = str_replace('@@@'.$id.'@@@', $text, $content);
        }

        $content = trim(preg_replace(['/(\r\n){2,}/', '/\r\n +/'], "\r\n", $content));

        return $content;
    }

    /**
     * getSectionsText() now takes an entire, unformatted
     * document stream as a string, cleans it, then filters out
     * commands that aren't needed for text positioning/extraction. It
     * returns an array of unprocessed PDF commands, one command per
     * element.
     *
     * @internal
     */
    public function getSectionsText(?string $content): array
    {
        $sections = [];

        // A cleaned stream has one command on every line, so split the
        // cleaned stream content on \r\n into an array
        $textCleaned = preg_split(
            '/(\r\n|\n|\r)/',
            $this->formatContent($content),
            -1,
            \PREG_SPLIT_NO_EMPTY
        );

        $inTextBlock = false;
        foreach ($textCleaned as $line) {
            $line = trim($line);

            // Skip empty lines
            if ('' === $line) {
                continue;
            }

            // If a 'BT' is encountered, set the $inTextBlock flag
            if (preg_match('/BT$/', $line)) {
                $inTextBlock = true;
                $sections[] = $line;

                // If an 'ET' is encountered, unset the $inTextBlock flag
            } elseif ('ET' == $line) {
                $inTextBlock = false;
                $sections[] = $line;
            } elseif ($inTextBlock) {
                // If we are inside a BT ... ET text block, save all lines
                $sections[] = trim($line);
            } else {
                // Otherwise, if we are outside of a text block, only
                // save specific, necessary lines. Care should be taken
                // to ensure a command being checked for *only* matches
                // that command. For instance, a simple search for 'c'
                // may also match the 'sc' command. See the command
                // list in the formatContent() method above.
                // Add more commands to save here as you find them in
                // weird PDFs!
                if ('q' == $line[-1] || 'Q' == $line[-1]) {
                    // Save and restore graphics state commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)B[DM]C$/', $line)) {
                    // Begin marked content sequence
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)[DM]P$/', $line)) {
                    // Marked content point
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)EMC$/', $line)) {
                    // End marked content sequence
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)cm$/', $line)) {
                    // Graphics position change commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)Tf$/', $line)) {
                    // Font change commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)Do$/', $line)) {
                    // Invoke named XObject command
                    $sections[] = $line;
                }
            }
        }

        return $sections;
    }

    private function getDefaultFont(Page $page = null): Font
    {
        $fonts = [];
        if (null !== $page) {
            $fonts = $page->getFonts();
        }

        $firstFont = $this->document->getFirstFont();
        if (null !== $firstFont) {
            $fonts[] = $firstFont;
        }

        if (\count($fonts) > 0) {
            return reset($fonts);
        }

        return new Font($this->document, null, null, $this->config);
    }

    /**
     * Decode a '[]TJ' command and attempt to use alternate
     * fonts if the current font results in output that contains
     * Unicode control characters.
     *
     * @internal
     *
     * @param array<int,array<string,string|bool>> $command
     */
    private function getTJUsingFontFallback(Font $font, array $command, Page $page = null, float $fontFactor = 4): string
    {
        $orig_text = $font->decodeText($command, $fontFactor);
        $text = $orig_text;

        // If we make this a Config option, we can add a check if it's
        // enabled here.
        if (null !== $page) {
            $font_ids = array_keys($page->getFonts());

            // If the decoded text contains UTF-8 control characters
            // then the font page being used is probably the wrong one.
            // Loop through the rest of the fonts to see if we can get
            // a good decode. Allow x09 to x0d which are whitespace.
            while (preg_match('/[\x00-\x08\x0e-\x1f\x7f]/u', $text) || false !== strpos(bin2hex($text), '00')) {
                // If we're out of font IDs, then give up and use the
                // original string
                if (0 == \count($font_ids)) {
                    return $orig_text;
                }

                // Try the next font ID
                $font = $page->getFont(array_shift($font_ids));
                $text = $font->decodeText($command, $fontFactor);
            }
        }

        return $text;
    }

    /**
     * Expects a string that is a full PDF dictionary object,
     * including the outer enclosing << >> angle brackets
     *
     * @internal
     *
     * @throws \Exception
     */
    public function parseDictionary(string $dictionary): array
    {
        // Normalize whitespace
        $dictionary = preg_replace(['/\r/', '/\n/', '/\s{2,}/'], ' ', trim($dictionary));

        if ('<<' != substr($dictionary, 0, 2)) {
            throw new \Exception('Not a valid dictionary object.');
        }

        $parsed = [];
        $stack = [];
        $currentName = '';
        $arrayTypeNumeric = false;

        // Remove outer layer of dictionary, and split on tokens
        $split = preg_split(
            '/(<<|>>|\[|\]|\/[^\s\/\[\]\(\)<>]*)/',
            trim(preg_replace('/^<<|>>$/', '', $dictionary)),
            -1,
            \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE
        );

        foreach ($split as $token) {
            $token = trim($token);
            switch ($token) {
                case '':
                    break;

                    // Open numeric array
                case '[':
                    $parsed[$currentName] = [];
                    $arrayTypeNumeric = true;

                    // Move up one level in the stack
                    $stack[\count($stack)] = &$parsed;
                    $parsed = &$parsed[$currentName];
                    $currentName = '';
                    break;

                    // Open hashed array
                case '<<':
                    $parsed[$currentName] = [];
                    $arrayTypeNumeric = false;

                    // Move up one level in the stack
                    $stack[\count($stack)] = &$parsed;
                    $parsed = &$parsed[$currentName];
                    $currentName = '';
                    break;

                    // Close numeric array
                case ']':
                    // Revert string type arrays back to a single element
                    if (\is_array($parsed) && 1 == \count($parsed)
                        && isset($parsed[0]) && \is_string($parsed[0])
                        && '' !== $parsed[0] && '/' != $parsed[0][0]) {
                        $parsed = '['.$parsed[0].']';
                    }
                    // Close hashed array
                    // no break
                case '>>':
                    $arrayTypeNumeric = false;

                    // Move down one level in the stack
                    $parsed = &$stack[\count($stack) - 1];
                    unset($stack[\count($stack) - 1]);
                    break;

                default:
                    // If value begins with a slash, then this is a name
                    // Add it to the appropriate array
                    if ('/' == substr($token, 0, 1)) {
                        $currentName = substr($token, 1);
                        if (true == $arrayTypeNumeric) {
Coding Style (Best Practice): It seems like you are loosely comparing two booleans. Consider using the strict comparison === instead.

When comparing two booleans, it is generally considered safer to use the strict comparison operator.
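
The difference only shows up once one operand is not actually a boolean. A short sketch with invented values (the variable `$arrayTypeNumeric` is borrowed from the listing purely for illustration):

```php
<?php

// Loose comparison converts the other operand to bool first, so
// non-boolean values can slip through unnoticed.
var_dump('0' == false);  // bool(true)
var_dump('0' === false); // bool(false)
var_dump('a' == true);   // bool(true)
var_dump('a' === true);  // bool(false)

// With a real boolean flag both operators agree, but === states the
// intent and keeps behaving the same if the variable's type drifts.
$arrayTypeNumeric = true;
var_dump(true === $arrayTypeNumeric); // bool(true)
```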

                            $parsed[] = $currentName;
                            $currentName = '';
                        }
                    } elseif ('' != $currentName) {
                        if (false == $arrayTypeNumeric) {
Coding Style (Best Practice): It seems like you are loosely comparing two booleans. Consider using the strict comparison === instead.

When comparing two booleans, it is generally considered safer to use the strict comparison operator.

555
                            $parsed[$currentName] = $token;
556 4
                        }
557 4
                        $currentName = '';
558
                    } elseif ('' == $currentName) {
559
                        $parsed[] = $token;
560 4
                    }
561 4
            }
562 2
        }
563
564 2
        return $parsed;
565
    }
566
567 2
    /**
568 2
     * Returns the text content of a PDF as a string. Attempts to add
569
     * whitespace for spacing and line-breaks where appropriate.
570
     *
571
     * getText() leverages getTextArray() to get the content
572
     * of the document, setting the addPositionWhitespace flag to true
573
     * so whitespace is inserted in a logical way for reading by
574
     * humans.
575
     */
576 6
    public function getText(Page $page = null): string
577
    {
578
        $this->addPositionWhitespace = true;
579 29
        $result = $this->getTextArray($page);
580
        $this->addPositionWhitespace = false;
581 29
582
        return implode('', $result).' ';
583 29
    }
584 29
585 29
    /**
586
     * Returns the text content of a PDF as an array of strings. No
587 29
     * extra whitespace is inserted besides what is actually encoded in
588 29
     * the PDF text.
589 29
     *
590
     * @throws \Exception
591 29
     */
592 29
    public function getTextArray(Page $page = null): array
593 29
    {
594 29
        $result = [];
595 29
        $text = [];
596 29
597
        $marked_stack = [];
598
        $last_written_position = false;
599
600 29
        $sections = $this->getSectionsText($this->content);
601 29
        $current_font = $this->getDefaultFont($page);
602 29
        $current_font_size = 1;
603 11
        $current_text_leading = 0;
604 11
605 11
        $current_position = ['x' => false, 'y' => false];
606
        $current_position_tm = [
607
            'a' => 1, 'b' => 0, 'c' => 0,
608
            'i' => 0, 'j' => 1, 'k' => 0,
609 11
            'x' => 0, 'y' => 0, 'z' => 1,
610 11
        ];
611 11
        $current_position_td = ['x' => 0, 'y' => 0];
612
        $current_position_cm = [
613 29
            'a' => 1, 'b' => 0, 'c' => 0,
614
            'i' => 0, 'j' => 1, 'k' => 0,
615 29
            'x' => 0, 'y' => 0, 'z' => 1,
616 29
        ];
617
618 25
        $clipped_font = [];
619 25
        $clipped_position_cm = [];
620 25
621
        self::$recursionStack[] = $this->getUniqueId();
622 25
623
        foreach ($sections as $section) {
624 25
            $commands = $this->getCommandsText($section);
625 25
            foreach ($commands as $command) {
626 25
                switch ($command[self::OPERATOR]) {
627
                    // Begin text object
628
                    case 'BT':
629 25
                        // Reset text positioning matrices
630 25
                        $current_position_tm = [
631
                            'a' => 1, 'b' => 0, 'c' => 0,
632 25
                            'i' => 0, 'j' => 1, 'k' => 0,
633
                            'x' => 0, 'y' => 0, 'z' => 1,
634 29
                        ];
635 29
                        $current_position_td = ['x' => 0, 'y' => 0];
636
                        $current_text_leading = 0;
637 14
                        break;
638 14
639 14
                        // Begin marked content sequence with property list
640 14
                    case 'BDC':
641 14
                        if (preg_match('/(<<.*>>)$/', $command[self::COMMAND], $match)) {
642 14
                            $dict = $this->parseDictionary($match[1]);
643
644
                            // Check for ActualText block
645 14
                            if (isset($dict['ActualText']) && \is_string($dict['ActualText']) && '' !== $dict['ActualText']) {
646 9
                                if ('[' == $dict['ActualText'][0]) {
647 9
                                    // Simulate a 'TJ' command on the stack
648
                                    $marked_stack[] = [
649 14
                                        'ActualText' => $this->getCommandsText($dict['ActualText'].'TJ')[0],
650
                                    ];
651 29
                                } elseif ('<' == $dict['ActualText'][0] || '(' == $dict['ActualText'][0]) {
652 29
                                    // Simulate a 'Tj' command on the stack
653 22
                                    $marked_stack[] = [
654 22
                                        'ActualText' => $this->getCommandsText($dict['ActualText'].'Tj')[0],
655 22
                                    ];
656 22
                                }
657 22
                            }
658 22
                        }
659 22
                        break;
660
661
                        // Begin marked content sequence
662 22
                    case 'BMC':
663 22
                        if ('ReversedChars' == $command[self::COMMAND]) {
664 22
                            // Upon encountering a ReversedChars command,
665
                            // add the characters we've built up so far to
666
                            // the result array
667 16
                            $result = array_merge($result, $text);
668 16
669
                            // Start a fresh $text array that will contain
670 22
                            // reversed characters
671
                            $text = [];
672
673
                            // Add the reversed text flag to the stack
674
                            $marked_stack[] = ['ReversedChars' => true];
675 22
                        }
676
                        break;
677 22
678 22
                        // set graphics position matrix
679
                    case 'cm':
680 22
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
681
                        $current_position_cm = [
682 22
                            'a' => (float) $args[0], 'b' => (float) $args[1], 'c' => 0,
683 22
                            'i' => (float) $args[2], 'j' => (float) $args[3], 'k' => 0,
684
                            'x' => (float) $args[4], 'y' => (float) $args[5], 'z' => 1,
685 22
                        ];
686 18
                        break;
687 18
688
                    case 'Do':
689
                        if (null !== $page) {
690 22
                            $args = preg_split('/\s/s', $command[self::COMMAND]);
691
                            $id = trim(array_pop($args), '/ ');
692
                            $xobject = $page->getXObject($id);
693 29
694 1
                            // @todo $xobject could be a ElementXRef object, which would then throw an error
695 29
                            if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack)) {
696 29
                                // Not a circular reference.
697 29
                                $text[] = $xobject->getText($page);
698
                            }
699
                        }
700
                        break;
701 29
702 29
                        // Marked content point with (DP) & without (MP) property list
703 29
                    case 'DP':
                    case 'MP':
                        break;

                        // End text object
                    case 'ET':
                        break;

                        // Store current selected font and graphics matrix
                    case 'q':
                        $clipped_font[] = [$current_font, $current_font_size];
                        $clipped_position_cm[] = $current_position_cm;
                        break;

                        // Restore previous selected font and graphics matrix
                    case 'Q':
                        list($current_font, $current_font_size) = array_pop($clipped_font);
                        $current_position_cm = array_pop($clipped_position_cm);
                        break;

                        // End marked content sequence
                    case 'EMC':
                        $data = false;
                        if (\count($marked_stack)) {
                            $marked = array_pop($marked_stack);
                            $action = key($marked);
                            $data = $marked[$action];

                            switch ($action) {
                                // If we are in ReversedChars mode...
                                case 'ReversedChars':
                                    // Reverse the characters we've built up so far
                                    foreach ($text as $key => $t) {
                                        $text[$key] = implode('', array_reverse(
                                            mb_str_split($t, 1, mb_internal_encoding())
                                        ));
                                    }

                                    // Add these characters to the result array
                                    $result = array_merge($result, $text);

                                    // Start a fresh $text array that will contain
                                    // non-reversed characters
                                    $text = [];
                                    break;

                                case 'ActualText':
                                    // Use the content of the ActualText as a command
                                    $command = $data;
                                    break;
                            }
                        }

                        // If this EMC command has been transformed into a 'Tj'
                        // or 'TJ' command because of being ActualText, then bypass
                        // the break to proceed to the writing section below.
                        if ('Tj' != $command[self::OPERATOR] && 'TJ' != $command[self::OPERATOR]) {
                            break;
                        }

                        // no break
                    case "'":
                    case '"':
                        if ("'" == $command[self::OPERATOR] || '"' == $command[self::OPERATOR]) {
                            // Move to next line and write text
                            $current_position['x'] = 0;
                            $current_position_td['x'] = 0;
                            $current_position_td['y'] += $current_text_leading;
                        }
                        // no break
                    case 'Tj':
                        $command[self::COMMAND] = [$command];
                        // no break
                    case 'TJ':
                        // Check the marked content stack for flags
                        $actual_text = false;
                        $reverse_text = false;
                        foreach ($marked_stack as $marked) {
                            if (isset($marked['ActualText'])) {
                                $actual_text = true;
                            }
                            if (isset($marked['ReversedChars'])) {
                                $reverse_text = true;
                            }
                        }

                        // Account for text position ONLY just before we write text
                        if (false === $actual_text && \is_array($last_written_position)) {
                            // If $last_written_position is an array, that
                            // means we have stored text position coordinates
                            // for placing an ActualText
                            $currentX = $last_written_position[0];
                            $currentY = $last_written_position[1];
                            $last_written_position = false;
                        } else {
                            $currentX = $current_position_cm['x'] + $current_position_tm['x'] + $current_position_td['x'];
                            $currentY = $current_position_cm['y'] + $current_position_tm['y'] + $current_position_td['y'];
                        }
                        $whiteSpace = '';

                        $factorX = -$current_font_size * $current_position_tm['a'] - $current_font_size * $current_position_tm['i'];
                        $factorY = $current_font_size * $current_position_tm['b'] + $current_font_size * $current_position_tm['j'];

                        if (true === $this->addPositionWhitespace && false !== $current_position['x']) {
                            $curY = $currentY - $current_position['y'];
                            if (abs($curY) >= abs($factorY) / 4) {
                                $whiteSpace = "\n";
                            } else {
                                if (true === $reverse_text) {
                                    $curX = $current_position['x'] - $currentX;
                                } else {
                                    $curX = $currentX - $current_position['x'];
                                }

                                // In abs($factorX * 7) below, the 7 is chosen arbitrarily
                                // as the number of apparent "spaces" in a document we
                                // would need before considering them a "tab". In the
                                // future, we might offer this value to users as a config
                                // option.
                                if ($curX >= abs($factorX * 7)) {
                                    $whiteSpace = "\t";
                                } elseif ($curX >= abs($factorX * 2)) {
                                    $whiteSpace = ' ';
                                }
                            }
                        }

                        $newtext = $this->getTJUsingFontFallback(
                            $current_font,
                            $command[self::COMMAND],
                            $page,
                            $factorX
                        );

                        // If there is no ActualText pending then write
                        if (false === $actual_text) {
                            $newtext = str_replace(["\r", "\n"], '', $newtext);
                            if (false !== $reverse_text) {
                                // If we are in ReversedChars mode, add the whitespace last
                                $text[] = preg_replace('/  $/', ' ', $newtext.$whiteSpace);
                            } else {
                                // Otherwise add the whitespace first
                                if (' ' === $whiteSpace && isset($text[\count($text) - 1])) {
                                    $text[\count($text) - 1] = preg_replace('/ $/', '', $text[\count($text) - 1]);
                                }
                                $text[] = preg_replace('/^[ \t]{2}/', ' ', $whiteSpace.$newtext);
                            }

                            // Record the position of this inserted text for comparison
                            // with the next text block.
                            // Provide a 'fudge' factor guess on how wide this text block
                            // is based on the number of characters. This helps limit the
                            // number of tabs inserted, but isn't perfect.
                            $factor = $factorX / 2;
                            $current_position = [
                                'x' => $currentX - mb_strlen($newtext) * $factor,
                                'y' => $currentY,
                            ];
                        } elseif (false === $last_written_position) {
                            // If there is an ActualText in the pipeline
                            // store the position this undisplayed text
                            // *would* have been written to, so the
                            // ActualText is displayed in the right spot
                            $last_written_position = [$currentX, $currentY];
                            $current_position['x'] = $currentX;
                        }
                        break;

                        // move to start of next line
                    case 'T*':
                        $current_position['x'] = 0;
                        $current_position_td['x'] = 0;
                        $current_position_td['y'] += $current_text_leading;
                        break;

                        // set character spacing
                    case 'Tc':
                        break;

                        // move text current point and set leading
                    case 'Td':
                    case 'TD':
                        // move text current point
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $y = (float) array_pop($args);
                        $x = (float) array_pop($args);

                        if ('TD' == $command[self::OPERATOR]) {
                            $current_text_leading = -$y * $current_position_tm['b'] - $y * $current_position_tm['j'];
                        }

                        $current_position_td = [
                            'x' => $current_position_td['x'] + $x * $current_position_tm['a'] + $x * $current_position_tm['i'],
                            'y' => $current_position_td['y'] + $y * $current_position_tm['b'] + $y * $current_position_tm['j'],
                        ];
                        break;

                    case 'Tf':
                        $args = preg_split('/\s/s', $command[self::COMMAND]);
                        $size = (float) array_pop($args);
                        $id = trim(array_pop($args), '/');
                        if (null !== $page) {
                            $new_font = $page->getFont($id);
                            // If an invalid font ID is given, do not update the font.
                            // This should theoretically never happen, as the PDF spec states for the Tf operator:
                            // "The specified font value shall match a resource name in the Font entry of the default resource dictionary"
                            // (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, page 435)
                            // But we want to make sure that malformed PDFs do not simply crash.
                            if (null !== $new_font) {
                                $current_font = $new_font;
                                $current_font_size = $size;
                            }
                        }
                        break;

                        // set leading
                    case 'TL':
                        $y = (float) $command[self::COMMAND];
                        $current_text_leading = -$y * $current_position_tm['b'] + -$y * $current_position_tm['j'];
                        break;

                        // set text position matrix
                    case 'Tm':
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $current_position_tm = [
                            'a' => (float) $args[0], 'b' => (float) $args[1], 'c' => 0,
                            'i' => (float) $args[2], 'j' => (float) $args[3], 'k' => 0,
                            'x' => (float) $args[4], 'y' => (float) $args[5], 'z' => 1,
                        ];
                        break;

                        // set text rendering mode
                    case 'Tr':
                        break;

                        // set super/subscripting text rise
                    case 'Ts':
                        break;

                        // set word spacing
                    case 'Tw':
                        break;

                        // set horizontal scaling
                    case 'Tz':
                        break;

                    default:
                }
            }
        }

        $result = array_merge($result, $text);

        return $result;
    }

    /**
     * getCommandsText() expects the content of $text_part to be an
     * already formatted, single-line command from a document stream.
     * The companion function getSectionsText() returns a document
     * stream as an array of single commands for just this purpose.
     * Because of this, the argument $offset is no longer used, and
     * may be removed in a future PdfParser release.
     *
     * A better name for this function would be getCommandText()
     * since it now always works on just one command.
     */
    public function getCommandsText(string $text_part, int &$offset = 0): array
    {
        $commands = $matches = [];

        preg_match('/^(([\/\[\(<])?.*)(?<!\w)([a-z01\'\"*]+)$/i', $text_part, $matches);

        // If no valid command is detected, return an empty array
        if (!isset($matches[1]) || !isset($matches[2]) || !isset($matches[3])) {
            return [];
        }

        $type = $matches[2];
        $operator = $matches[3];
        $command = trim($matches[1]);

        if ('TJ' == $operator) {
            $subcommand = [];
            $command = trim($command, '[]');
            do {
                $oldCommand = $command;

                // Search for parentheses string () format
                if (preg_match('/^ *\((.*?)(?<![^\\\\]\\\\)\) *(-?[\d.]+)?/', $command, $tjmatch)) {
                    $subcommand[] = [
                        self::TYPE => '(',
                        self::OPERATOR => 'TJ',
                        self::COMMAND => $tjmatch[1],
                    ];
                    if (isset($tjmatch[2]) && trim($tjmatch[2])) {
                        $subcommand[] = [
                            self::TYPE => 'n',
                            self::OPERATOR => '',
                            self::COMMAND => $tjmatch[2],
                        ];
                    }
                    $command = substr($command, \strlen($tjmatch[0]));
                }

                // Search for hexadecimal <> format
                if (preg_match('/^ *<([0-9a-f\s]*)> *(-?[\d.]+)?/i', $command, $tjmatch)) {
                    $tjmatch[1] = preg_replace('/\s/', '', $tjmatch[1]);
                    $subcommand[] = [
                        self::TYPE => '<',
                        self::OPERATOR => 'TJ',
                        self::COMMAND => $tjmatch[1],
                    ];
                    if (isset($tjmatch[2]) && trim($tjmatch[2])) {
                        $subcommand[] = [
                            self::TYPE => 'n',
                            self::OPERATOR => '',
                            self::COMMAND => $tjmatch[2],
                        ];
                    }
                    $command = substr($command, \strlen($tjmatch[0]));
                }
            } while ($command != $oldCommand);

            $command = $subcommand;
        } elseif ('Tj' == $operator || "'" == $operator || '"' == $operator) {
            // Depending on the string type, trim the data of the
            // appropriate delimiters
            if ('(' == $type) {
                // Don't use trim() here since a () string may end with
                // a balanced or escaped right parenthesis, and trim()
                // will delete both. Both strings below are valid:
                //   eg. (String())
                //   eg. (String\))
                $command = preg_replace('/^\(|\)$/', '', $command);
            } elseif ('<' == $type) {
                $command = trim($command, '<>');
            }
        } elseif ('/' == $type) {
            $command = substr($command, 1);
        }

        $commands[] = [
            self::TYPE => $type,
            self::OPERATOR => $operator,
            self::COMMAND => $command,
        ];

        return $commands;
    }

    public static function factory(
        Document $document,
        Header $header,
        ?string $content,
        ?Config $config = null
    ): self {
        switch ($header->get('Type')->getContent()) {
            case 'XObject':
                switch ($header->get('Subtype')->getContent()) {
                    case 'Image':
                        // Guard against a null $config before asking whether image content should be retained
                        return new Image($document, $header, (null !== $config && $config->getRetainImageContent()) ? $content : null, $config);

                    case 'Form':
                        return new Form($document, $header, $content, $config);
                }

                return new self($document, $header, $content, $config);

            case 'Pages':
                return new Pages($document, $header, $content, $config);

            case 'Page':
                return new Page($document, $header, $content, $config);

            case 'Encoding':
                return new Encoding($document, $header, $content, $config);

            case 'Font':
                $subtype = $header->get('Subtype')->getContent();
                $classname = '\Smalot\PdfParser\Font\Font'.$subtype;

                if (class_exists($classname)) {
                    return new $classname($document, $header, $content, $config);
                }

                return new Font($document, $header, $content, $config);

            default:
                return new self($document, $header, $content, $config);
        }
    }

    /**
     * Returns unique id identifying the object.
     */
    protected function getUniqueId(): string
    {
        return spl_object_hash($this);
    }
}
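
The whitespace decision inside the 'TJ' branch is one candidate for the Extract Method refactoring suggested for long methods. The sketch below pulls that decision into a standalone function so the thresholds can be read and tested on their own. The function name `guessSeparator()` and its parameters are hypothetical and not part of PdfParser; the thresholds (a quarter of `$factorY`, then 7 and 2 times `$factorX`) are copied from the listing.

```php
<?php

/**
 * Hypothetical helper, not a PdfParser API: picks the separator to put
 * between two text runs based on how far the second run has moved.
 *
 * $dx, $dy          distance from the previous run (already sign-adjusted
 *                   for reversed text, as in the listing)
 * $factorX/$factorY font-size-scaled factors, computed as in the listing
 */
function guessSeparator(float $dx, float $dy, float $factorX, float $factorY): string
{
    // A vertical jump of at least a quarter of the Y factor starts a new line
    if (abs($dy) >= abs($factorY) / 4) {
        return "\n";
    }

    // Seven or more apparent "spaces" are treated as a tab
    if ($dx >= abs($factorX * 7)) {
        return "\t";
    }

    // Two or more apparent "spaces" are treated as a single space
    if ($dx >= abs($factorX * 2)) {
        return ' ';
    }

    // Otherwise the runs are considered adjacent
    return '';
}

// Example: a 12pt font with an identity text matrix gives factors of -12 and 12
var_dump(guessSeparator(30.0, 0.0, -12.0, 12.0));
```

Keeping the decision free of object state makes the arbitrary constant 7 easy to turn into a parameter later, which the original comment already anticipates.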
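
The TJ branch of getCommandsText() can also be read in isolation. The following is a minimal sketch that reuses the two regular expressions from the listing to split a TJ operand into literal strings, hexadecimal strings, and numeric adjustments. The function name `splitTjOperand()` and the `type`/`value` array keys are assumptions made for illustration; the library itself stores the pieces under its `self::TYPE`, `self::OPERATOR`, and `self::COMMAND` constants.

```php
<?php

/**
 * Hypothetical standalone function, not a PdfParser API: splits the
 * operand of a TJ operator into its string and number parts.
 *
 * @return array<int, array{type: string, value: string}>
 */
function splitTjOperand(string $operand): array
{
    $parts = [];
    $rest = trim($operand, '[]');

    do {
        $before = $rest;

        // Literal string in parentheses, optionally followed by a number
        if (preg_match('/^ *\((.*?)(?<![^\\\\]\\\\)\) *(-?[\d.]+)?/', $rest, $m)) {
            $parts[] = ['type' => '(', 'value' => $m[1]];
            if (isset($m[2]) && trim($m[2])) {
                $parts[] = ['type' => 'n', 'value' => $m[2]];
            }
            $rest = substr($rest, strlen($m[0]));
        }

        // Hexadecimal string in angle brackets, optionally followed by a number
        if (preg_match('/^ *<([0-9a-f\s]*)> *(-?[\d.]+)?/i', $rest, $m)) {
            // Whitespace inside a hex string carries no meaning, so drop it
            $parts[] = ['type' => '<', 'value' => preg_replace('/\s/', '', $m[1])];
            if (isset($m[2]) && trim($m[2])) {
                $parts[] = ['type' => 'n', 'value' => $m[2]];
            }
            $rest = substr($rest, strlen($m[0]));
        }
        // Stop once a full pass consumes nothing more
    } while ($rest !== $before);

    return $parts;
}

print_r(splitTjOperand('[(Hel) -20 (lo) <41 42>]'));
```

As in the listing, the loop ends as soon as neither pattern matches, so any unrecognised trailing content is silently ignored rather than reported.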