Test Failed
Pull Request — master (#634)
by unknown, 02:06, created

PDFObject::getSectionsText()   C

Complexity

Conditions 14
Paths 13

Size

Total Lines 69
Code Lines 34

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 41
CRAP Score 14.1377

Importance

Changes 0

Metric  Value
cc      14
eloc    34
c       0
b       0
f       0
nc      13
nop     1
dl      0
loc     69
ccs     41
cts     45
cp      0.9111
crap    14.1377
rs      6.2666

How to fix: Long Method, Complexity

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
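
As a rough illustration of that advice applied to the method under review: the final else branch of getSectionsText() is a chain of commented elseif checks that all append the line to $sections, so it looks like a natural candidate for extraction. The sketch below is only one possible shape. The helper name isNeededOutsideTextBlock() is hypothetical and not part of PdfParser, and folding the separate patterns into a single alternation is an assumption about what the maintainers would accept.

```php
<?php

/**
 * Hypothetical helper (not part of PdfParser): reports whether a command
 * line found outside a BT ... ET block should be kept for text extraction.
 */
function isNeededOutsideTextBlock(string $line): bool
{
    // Guard against empty input so the $line[-1] access below is safe
    if ('' === $line) {
        return false;
    }

    // Save and restore graphics state commands end in 'q' or 'Q'
    if ('q' === $line[-1] || 'Q' === $line[-1]) {
        return true;
    }

    // Marked content (BDC, BMC, DP, MP, EMC), graphics position (cm),
    // font change (Tf) and named XObject invocation (Do). The lookbehind
    // keeps longer operators that merely end in these letters from matching.
    return 1 === preg_match('/(?<!\w)(?:B[DM]C|[DM]P|EMC|cm|Tf|Do)$/', $line);
}

var_dump(isNeededOutsideTextBlock('1 0 0 1 50 700 cm'));
var_dump(isNeededOutsideTextBlock('0.5 0.5 0.5 sc'));
```

With a helper along these lines, the else branch could shrink to a single elseif that calls it, which should lower the condition count reported for the method.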

Commonly applied refactorings include:

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 *
 * @date    2017-01-03
 *
 * @license LGPLv3
 *
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser;

use Smalot\PdfParser\XObject\Form;
use Smalot\PdfParser\XObject\Image;

/**
 * Class PDFObject
 */
class PDFObject
{
    public const TYPE = 't';

    public const OPERATOR = 'o';

    public const COMMAND = 'c';

    /**
     * The recursion stack.
     *
     * @var array
     */
    public static $recursionStack = [];

    /**
     * @var Document
     */
    protected $document;

    /**
     * @var Header
     */
    protected $header;

    /**
     * @var string
     */
    protected $content;

    /**
     * @var Config
     */
    protected $config;

    /**
     * @var bool
     */
    protected $addPositionWhitespace = false;

    public function __construct(
        Document $document,
        Header $header = null,
        string $content = null,
        Config $config = null
    ) {
        $this->document = $document;
        $this->header = $header ?? new Header();
        $this->content = $content;
        $this->config = $config;
    }

    public function init()
    {
    }

    public function getDocument(): Document
    {
        return $this->document;
    }

    public function getHeader(): ?Header
    {
        return $this->header;
    }

    public function getConfig(): ?Config
    {
        return $this->config;
    }

    /**
     * @return Element|PDFObject|Header
     */
    public function get(string $name)
    {
        return $this->header->get($name);
    }

    public function has(string $name): bool
    {
        return $this->header->has($name);
    }

    public function getDetails(bool $deep = true): array
    {
        return $this->header->getDetails($deep);
    }

    public function getContent(): ?string
    {
        return $this->content;
    }

    /**
     * Creates a duplicate of the document stream with
     * strings and other items replaced by $char. Formerly
     * getSectionsText() used this output to more easily gather offset
     * values to extract text from the *actual* document stream.
     *
     * @deprecated Function is no longer used and will be removed in a future release.
     * @internal
     */
    public function cleanContent(string $content, string $char = 'X')
    {
        $char = $char[0];
        $content = str_replace(['\\\\', '\\)', '\\('], $char.$char, $content);

        // Remove image bloc with binary content
        preg_match_all('/\s(BI\s.*?(\sID\s).*?(\sEI))\s/s', $content, $matches, \PREG_OFFSET_CAPTURE);
        foreach ($matches[0] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        // Clean content in square brackets [.....]
        preg_match_all('/\[((\(.*?\)|[0-9\.\-\s]*)*)\]/s', $content, $matches, \PREG_OFFSET_CAPTURE);

Unused Code: The call to preg_match_all() has too many arguments starting with PREG_OFFSET_CAPTURE. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

        /** @scrutinizer ignore-call */
        preg_match_all('/\[((\(.*?\)|[0-9\.\-\s]*)*)\]/s', $content, $matches, \PREG_OFFSET_CAPTURE);

This check compares calls to functions or methods with their respective definitions. If the call has more arguments than are defined, it raises an issue.

If a function is defined several times with a different number of parameters, the check may pick up the wrong definition and report false positives. One codebase where this has been known to happen is WordPress. Please note the @ignore annotation hint above.

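The issue reported here looks like one of the false positives this check is known to produce: PHP's built-in preg_match_all() accepts a flags argument in fourth position. A minimal sketch to confirm this, using a made-up sample string rather than a real PDF stream:

```php
<?php

// Made-up sample input, not taken from a real document stream.
$content = '[(Hello) -120 (World)] TJ';

// Four arguments: pattern, subject, matches (filled by reference), flags.
$count = preg_match_all('/\((.*?)\)/s', $content, $matches, \PREG_OFFSET_CAPTURE);

// With PREG_OFFSET_CAPTURE every match is a [text, byte offset] pair,
// which is what the substr_replace() calls in cleanContent() rely on.
foreach ($matches[1] as [$text, $offset]) {
    printf("%s starts at byte %d\n", $text, $offset);
}
```

If that holds for the PHP version in use, the ignore-call annotation shown above is a reasonable way to silence the report.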
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        // Clean content in round brackets (.....)
        preg_match_all('/\((.*?)\)/s', $content, $matches, \PREG_OFFSET_CAPTURE);
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        // Clean structure
        if ($parts = preg_split('/(<|>)/s', $content, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE)) {

Bug: It seems like $content can also be of type array; however, parameter $subject of preg_split() only seems to accept string. Maybe add an additional type check? (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-type annotation:

        if ($parts = preg_split('/(<|>)/s', /** @scrutinizer ignore-type */ $content, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE)) {

            $content = '';
            $level = 0;
            foreach ($parts as $part) {
                if ('<' == $part) {
                    ++$level;
                }

                $content .= (0 == $level ? $part : str_repeat($char, \strlen($part)));

                if ('>' == $part) {
                    --$level;
                }
            }
        }

        // Clean BDC and EMC markup
        preg_match_all(
            '/(\/[A-Za-z0-9\_]*\s*'.preg_quote($char).'*BDC)/s',
            $content,
            $matches,
            \PREG_OFFSET_CAPTURE
        );
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        preg_match_all('/\s(EMC)\s/s', $content, $matches, \PREG_OFFSET_CAPTURE);
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        return $content;
    }

    /**
     * Takes a string of PDF document stream text and formats
     * it into a multi-line string with one PDF command on each line,
     * separated by \r\n. If the given string is null, or binary data
     * is detected instead of a document stream then return an empty
     * string.
     *
     * @internal
     */
    public function formatContent(?string $content): string
    {
        if (null === $content) {
            return '';
        }

        // Find all strings () and replace them so they aren't affected
        // by the next steps
        $pdfstrings = [];
        $attempt = '(';
        while (preg_match('/'.preg_quote($attempt, '/').'.*?(?<![^\\\\]\\\\)\)/s', $content, $text)) {
            // PDF strings can contain unescaped parentheses as long as
            // they're balanced, so check for balanced parentheses
            $left = preg_match_all('/(?<![^\\\\]\\\\)\(/', $text[0]);
            $right = preg_match_all('/(?<![^\\\\]\\\\)\)/', $text[0]);

            if ($left == $right) {
                // Replace the string with a unique placeholder
                $id = uniqid('STRING_', true);
                $pdfstrings[$id] = $text[0];
                $content = preg_replace(
                    '/'.preg_quote($text[0], '/').'/',
                    '@@@'.$id.'@@@',
                    $content,
                    1
                );

                // Reset to search for the next string
                $attempt = '(';
            } else {
                // We had unbalanced parentheses, so use the current
                // match as a base to find a longer string
                $attempt = $text[0];
            }
        }

        // Remove all carriage returns and line-feeds from the document stream
        $content = str_replace(["\r", "\n"], ' ', trim($content));

        // Find all dictionary << >> commands and replace them so they
        // aren't affected by the next steps
        $dictstore = [];
        while (preg_match('/(<<.*?>> *)(BDC|BMC|DP|MP)/', $content, $dicttext)) {
            $dictid = uniqid('DICT_', true);
            $dictstore[$dictid] = $dicttext[1];
            $content = preg_replace(
                '/'.preg_quote($dicttext[0], '/').'/',
                ' ###'.$dictid.'###'.$dicttext[2],
                $content,
                1
            );
        }

        // Now that all strings and dictionaries are hidden, the only
        // PDF commands left should all be plain text.
        // Detect text encoding of the current string to prevent reading
        // content streams that are images, etc. This prevents PHP
        // error messages when JPEG content is sent to this function
        // by the sample file '12249.pdf' from:
        // https://github.com/smalot/pdfparser/issues/458
        if (false === mb_detect_encoding($content, null, true)) {
            return '';
        }

        // Normalize white-space in the document stream
        $content = preg_replace('/\s{2,}/', ' ', $content);

        // Find all valid PDF operators and add \r\n after each; this
        // ensures there is just one command on every line
        // Source: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf - Appendix A
        // Source: https://archive.org/download/pdf320002008/PDF32000_2008.pdf - Annex A
        // Note: PDF Reference 1.7 lists 'I' and 'rI' as valid commands, while
        //       PDF 32000:2008 lists them as 'i' and 'ri' respectively. Both versions
        //       appear here in the list for completeness.
        $operators = [
          'b*', 'b', 'BDC', 'BMC', 'B*', 'BI', 'BT', 'BX', 'B', 'cm', 'cs', 'c', 'CS',
          'd0', 'd1', 'd', 'Do', 'DP', 'EMC', 'EI', 'ET', 'EX', 'f*', 'f', 'F', 'gs',
          'g', 'G',  'h', 'i', 'ID', 'I', 'j', 'J', 'k', 'K', 'l', 'm', 'MP', 'M', 'n',
          'q', 'Q', 're', 'rg', 'ri', 'rI', 'RG', 'scn', 'sc', 'sh', 's', 'SCN', 'SC',
          'S', 'T*', 'Tc', 'Td', 'TD', 'Tf', 'TJ', 'Tj', 'TL', 'Tm', 'Tr', 'Ts', 'Tw',
          'Tz', 'v', 'w', 'W*', 'W', 'y', '\'', '"',
        ];
        foreach ($operators as $operator) {
            $content = preg_replace(
                '/(?<!\w|\/)'.preg_quote($operator, '/').'(?![\w10\*])/',
                $operator."\r\n",
                $content
            );
        }

        // Restore the original content of the dictionary << >> commands
        $dictstore = array_reverse($dictstore, true);
        foreach ($dictstore as $id => $dict) {
            $content = str_replace('###'.$id.'###', $dict, $content);
        }

        // Restore the original string content
        $pdfstrings = array_reverse($pdfstrings, true);
        foreach ($pdfstrings as $id => $text) {
            // Strings may contain escaped newlines, or literal newlines
            // and we should clean these up before replacing the string
            // back into the content stream; this ensures no strings are
            // split between two lines (every command must be on one line)
            $text = str_replace(
                ["\\\r\n", "\\\r", "\\\n", "\r", "\n"],
                ['', '', '', '\r', '\n'],
                $text
            );

            $content = str_replace('@@@'.$id.'@@@', $text, $content);
        }

        $content = trim(preg_replace(['/(\r\n){2,}/', '/\r\n +/'], "\r\n", $content));

        return $content;
    }

    /**
     * getSectionsText() now takes an entire, unformatted
     * document stream as a string, cleans it, then filters out
     * commands that aren't needed for text positioning/extraction. It
     * returns an array of unprocessed PDF commands, one command per
     * element.
     *
     * @internal
     */
    public function getSectionsText(?string $content): array
    {
        $sections = [];

        // A cleaned stream has one command on every line, so split the
        // cleaned stream content on \r\n into an array
        $textCleaned = preg_split(
            '/(\r\n|\n|\r)/',
            $this->formatContent($content),
            -1,
            \PREG_SPLIT_NO_EMPTY
        );

        $inTextBlock = false;
        foreach ($textCleaned as $line) {
            $line = trim($line);

            // Skip empty lines
            if ('' === $line) {
                continue;
            }

            // If a 'BT' is encountered, set the $inTextBlock flag
            if (preg_match('/BT$/', $line)) {
                $inTextBlock = true;
                $sections[] = $line;

                // If an 'ET' is encountered, unset the $inTextBlock flag
            } elseif ('ET' == $line) {
                $inTextBlock = false;
                $sections[] = $line;
            } elseif ($inTextBlock) {
                // If we are inside a BT ... ET text block, save all lines
                $sections[] = trim($line);
            } else {
                // Otherwise, if we are outside of a text block, only
                // save specific, necessary lines. Care should be taken
                // to ensure a command being checked for *only* matches
                // that command. For instance, a simple search for 'c'
                // may also match the 'sc' command. See the command
                // list in the formatContent() method above.
                // Add more commands to save here as you find them in
                // weird PDFs!
                if ('q' == $line[-1] || 'Q' == $line[-1]) {
                    // Save and restore graphics state commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)B[DM]C$/', $line)) {
                    // Begin marked content sequence
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)[DM]P$/', $line)) {
                    // Marked content point
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)EMC$/', $line)) {
                    // End marked content sequence
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)cm$/', $line)) {
                    // Graphics position change commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)Tf$/', $line)) {
                    // Font change commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)Do$/', $line)) {
                    // Invoke named XObject command
                    $sections[] = $line;
                }
            }
        }

        return $sections;
    }

    private function getDefaultFont(Page $page = null): Font
    {
        $fonts = [];
        if (null !== $page) {
            $fonts = $page->getFonts();
        }

        $firstFont = $this->document->getFirstFont();
        if (null !== $firstFont) {
            $fonts[] = $firstFont;
        }

        if (\count($fonts) > 0) {
            return reset($fonts);
        }

        return new Font($this->document, null, null, $this->config);
    }

    /**
     * Decode a '[]TJ' command and attempt to use alternate
     * fonts if the current font results in output that contains
     * Unicode control characters.
     *
     * @internal
     * @param array<int,array<string,string|bool>> $command
     */
    private function getTJUsingFontFallback(Font $font, array $command, Page $page = null, float $fontFactor = 4): string
    {
        $orig_text = $font->decodeText($command, $fontFactor);
        $text = $orig_text;

        // If we make this a Config option, we can add a check if it's
        // enabled here.
        if (null !== $page) {
            $font_ids = array_keys($page->getFonts());

            // If the decoded text contains UTF-8 control characters
            // then the font page being used is probably the wrong one.
            // Loop through the rest of the fonts to see if we can get
            // a good decode. Allow x09 to x0d which are whitespace.
            while (preg_match('/[\x00-\x08\x0e-\x1f\x7f]/u', $text) || false !== strpos(bin2hex($text), '00')) {
                // If we're out of font IDs, then give up and use the
                // original string
                if (0 == \count($font_ids)) {
                    return $orig_text;
                }

                // Try the next font ID
                $font = $page->getFont(array_shift($font_ids));
                $text = $font->decodeText($command, $fontFactor);
            }
        }

        return $text;
    }

    /**
     * Expects a string that is a full PDF dictionary object,
     * including the outer enclosing << >> angle brackets
     *
     * @internal
     * @throws \Exception
     */
    public function parseDictionary(string $dictionary): array
    {
        // Normalize whitespace
        $dictionary = preg_replace(['/\r/', '/\n/', '/\s{2,}/'], ' ', trim($dictionary));

        if ('<<' != substr($dictionary, 0, 2)) {
            throw new \Exception('Not a valid dictionary object.');
        }

        $parsed = [];
        $stack = [];
        $currentName = '';
        $arrayTypeNumeric = false;

        // Remove outer layer of dictionary, and split on tokens
        $split = preg_split(
            '/(<<|>>|\[|\]|\/[^\s\/\[\]\(\)<>]*)/',
            trim(preg_replace('/^<<|>>$/', '', $dictionary)),
            -1,
            \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE
        );

        foreach ($split as $token) {
            $token = trim($token);
            switch ($token) {
                case '':
                    break;

                    // Open numeric array
                case '[':
                    $parsed[$currentName] = [];
                    $arrayTypeNumeric = true;

                    // Move up one level in the stack
                    $stack[\count($stack)] = &$parsed;
                    $parsed = &$parsed[$currentName];
                    $currentName = '';
                    break;

                    // Open hashed array
                case '<<':
                    $parsed[$currentName] = [];
                    $arrayTypeNumeric = false;

                    // Move up one level in the stack
                    $stack[\count($stack)] = &$parsed;
                    $parsed = &$parsed[$currentName];
                    $currentName = '';
                    break;

                    // Close numeric array
                case ']':
                    // Revert string type arrays back to a single element
                    if (\is_array($parsed) && 1 == \count($parsed)
                        && isset($parsed[0]) && \is_string($parsed[0])
                        && '' !== $parsed[0] && '/' != $parsed[0][0]) {
                        $parsed = '['.$parsed[0].']';
                    }
                    // Close hashed array
                    // no break
                case '>>':
                    $arrayTypeNumeric = false;

                    // Move down one level in the stack
                    $parsed = &$stack[\count($stack) - 1];
                    unset($stack[\count($stack) - 1]);
                    break;

                default:
                    // If value begins with a slash, then this is a name
                    // Add it to the appropriate array
                    if ('/' == substr($token, 0, 1)) {
                        $currentName = substr($token, 1);
                        if (true == $arrayTypeNumeric) {

Coding Style Best Practice: It seems like you are loosely comparing two booleans. Consider using the strict comparison === instead.

When comparing two booleans, it is generally considered safer to use the strict comparison operator.

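To illustrate the difference this check is pointing at, here is a small sketch. The helper names isLooselyTrue() and isStrictlyTrue() are made up for this example. In the listing above $arrayTypeNumeric only ever holds a boolean, so switching to === would presumably not change behaviour there; the benefit is defensive.

```php
<?php

// Hypothetical helpers, used only to demonstrate the two operators.
function isLooselyTrue($value): bool
{
    // == converts the other operand to bool before comparing
    return true == $value;
}

function isStrictlyTrue($value): bool
{
    // === requires the same type and the same value
    return true === $value;
}

var_dump(isLooselyTrue('PDF'));   // a non-empty string is truthy
var_dump(isStrictlyTrue('PDF'));  // a string is never identical to true
var_dump(isStrictlyTrue(true));
```
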
                            $parsed[] = $currentName;
                            $currentName = '';
                        }
                    } elseif ('' != $currentName) {
                        if (false == $arrayTypeNumeric) {

Coding Style Best Practice: It seems like you are loosely comparing two booleans. Consider using the strict comparison === instead.

When comparing two booleans, it is generally considered safer to use the strict comparison operator.

                            $parsed[$currentName] = $token;
                        }
                        $currentName = '';
                    } elseif ('' == $currentName) {
                        $parsed[] = $token;
                    }
            }
        }

        return $parsed;
    }

    /**
     * Returns the text content of a PDF as a string. Attempts to add
     * whitespace for spacing and line-breaks where appropriate.
     *
     * getText() leverages getTextArray() to get the content
     * of the document, setting the addPositionWhitespace flag to true
     * so whitespace is inserted in a logical way for reading by
     * humans.
     */
    public function getText(Page $page = null): string
    {
        $this->addPositionWhitespace = true;
        $result = $this->getTextArray($page);
        $this->addPositionWhitespace = false;

        return implode('', $result).' ';
    }

    /**
     * Returns the text content of a PDF as an array of strings. No
     * extra whitespace is inserted besides what is actually encoded in
     * the PDF text.
     *
     * @throws \Exception
     */
    public function getTextArray(Page $page = null): array
    {
        $result = [];
        $text = [];

        $marked_stack = [];
        $last_written_position = false;

        $sections = $this->getSectionsText($this->content);
        $current_font = $this->getDefaultFont($page);
        $current_font_size = 1;
        $current_text_leading = 0;

        $current_position = ['x' => false, 'y' => false];
        $current_position_tm = [
            'a' => 1, 'b' => 0, 'c' => 0,
            'i' => 0, 'j' => 1, 'k' => 0,
            'x' => 0, 'y' => 0, 'z' => 1,
        ];
        $current_position_td = ['x' => 0, 'y' => 0];
        $current_position_cm = [
            'a' => 1, 'b' => 0, 'c' => 0,
            'i' => 0, 'j' => 1, 'k' => 0,
            'x' => 0, 'y' => 0, 'z' => 1,
        ];

        $clipped_font = [];
        $clipped_position_cm = [];

        self::$recursionStack[] = $this->getUniqueId();

        foreach ($sections as $section) {
            $commands = $this->getCommandsText($section);
            foreach ($commands as $command) {
                switch ($command[self::OPERATOR]) {
                    // Begin text object
                    case 'BT':
                        // Reset text positioning matrices
                        $current_position_tm = [
                            'a' => 1, 'b' => 0, 'c' => 0,
                            'i' => 0, 'j' => 1, 'k' => 0,
                            'x' => 0, 'y' => 0, 'z' => 1,
                        ];
                        $current_position_td = ['x' => 0, 'y' => 0];
                        $current_text_leading = 0;
                        break;

                        // Begin marked content sequence with property list
                    case 'BDC':
                        if (preg_match('/(<<.*>>)$/', $command[self::COMMAND], $match)) {

Bug: It seems like $command[self::COMMAND] can also be of type array and array<mixed,array<string,mixed|string>>; however, parameter $subject of preg_match() only seems to accept string. Maybe add an additional type check? (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-type annotation:

                        if (preg_match('/(<<.*>>)$/', /** @scrutinizer ignore-type */ $command[self::COMMAND], $match)) {

                            $dict = $this->parseDictionary($match[1]);

                            // Check for ActualText block
                            if (isset($dict['ActualText']) && \is_string($dict['ActualText']) && '' !== $dict['ActualText']) {
                                if ('[' == $dict['ActualText'][0]) {
                                    // Simulate a 'TJ' command on the stack
                                    $marked_stack[] = [
                                        'ActualText' => $this->getCommandsText($dict['ActualText'].'TJ')[0],
                                    ];
                                } elseif ('<' == $dict['ActualText'][0] || '(' == $dict['ActualText'][0]) {
                                    // Simulate a 'Tj' command on the stack
                                    $marked_stack[] = [
                                        'ActualText' => $this->getCommandsText($dict['ActualText'].'Tj')[0],
                                    ];
                                }
                            }
                        }
                        break;

                        // Begin marked content sequence
                    case 'BMC':
                        if ('ReversedChars' == $command[self::COMMAND]) {
                            // Upon encountering a ReversedChars command,
                            // add the characters we've built up so far to
                            // the result array
                            $result = array_merge($result, $text);

                            // Start a fresh $text array that will contain
                            // reversed characters
                            $text = [];

                            // Add the reversed text flag to the stack
                            $marked_stack[] = ['ReversedChars' => true];
                        }
                        break;

                        // set graphics position matrix
                    case 'cm':
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);

Bug: It seems like $command[self::COMMAND] can also be of type array and array<mixed,array<string,mixed|string>>; however, parameter $subject of preg_split() only seems to accept string. Maybe add an additional type check? (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-type annotation:

                        $args = preg_split('/\s+/s', /** @scrutinizer ignore-type */ $command[self::COMMAND]);

                        $current_position_cm = [
                            'a' => (float) $args[0], 'b' => (float) $args[1], 'c' => 0,
                            'i' => (float) $args[2], 'j' => (float) $args[3], 'k' => 0,
                            'x' => (float) $args[4], 'y' => (float) $args[5], 'z' => 1,
                        ];
                        break;

                    case 'Do':
                        if (null !== $page) {
                            $args = preg_split('/\s/s', $command[self::COMMAND]);
                            $id = trim(array_pop($args), '/ ');
                            $xobject = $page->getXObject($id);

                            // @todo $xobject could be an ElementXRef object, which would then throw an error
                            if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack)) {
                                // Not a circular reference.
                                $text[] = $xobject->getText($page);
                            }
                        }
                        break;

                        // Marked content point with (DP) & without (MP) property list
                    case 'DP':
                    case 'MP':
                        break;

                        // End text object
                    case 'ET':
                        break;

                        // Store current selected font and graphics matrix
                    case 'q':
                        $clipped_font[] = [$current_font, $current_font_size];
                        $clipped_position_cm[] = $current_position_cm;
                        break;

                        // Restore previous selected font and graphics matrix
                    case 'Q':
                        list($current_font, $current_font_size) = array_pop($clipped_font);
                        $current_position_cm = array_pop($clipped_position_cm);
                        break;

                        // End marked content sequence
                    case 'EMC':
                        $data = false;
                        if (\count($marked_stack)) {
                            $marked = array_pop($marked_stack);
                            $action = key($marked);
                            $data = $marked[$action];

                            switch ($action) {
                                // If we are in ReversedChars mode...
                                case 'ReversedChars':
                                    // Reverse the characters we've built up so far
                                    foreach ($text as $key => $t) {
                                        $text[$key] = implode('', array_reverse(
                                            mb_str_split($t, 1, mb_internal_encoding())
Bug: The return value of mb_internal_encoding() can also be of type true, but parameter $encoding of mb_str_split() seems to accept only null|string. Consider adding a type check.

If this is a false positive, you can ignore this issue in your code via the ignore-type annotation:

                                            mb_str_split($t, 1, /** @scrutinizer ignore-type */ mb_internal_encoding())
                                        ));
                                    }

                                    // Add these characters to the result array
                                    $result = array_merge($result, $text);

                                    // Start a fresh $text array that will contain
                                    // non-reversed characters
                                    $text = [];
                                    break;

                                case 'ActualText':
                                    // Use the content of the ActualText as a command
                                    $command = $data;
                                    break;
                            }
                        }

                        // If this EMC command has been transformed into a 'Tj'
                        // or 'TJ' command because of being ActualText, then bypass
                        // the break to proceed to the writing section below.
                        if ('Tj' != $command[self::OPERATOR] && 'TJ' != $command[self::OPERATOR]) {
                            break;
                        }

                        // no break
                    case "'":
                    case '"':
                        if ("'" == $command[self::OPERATOR] || '"' == $command[self::OPERATOR]) {
                            // Move to next line and write text
                            $current_position['x'] = 0;
                            $current_position_td['x'] = 0;
                            $current_position_td['y'] += $current_text_leading;
                        }
                        // no break
                    case 'Tj':
                        $command[self::COMMAND] = [$command];
                        // no break
                    case 'TJ':
                        // Check the marked content stack for flags
                        $actual_text = false;
                        $reverse_text = false;
                        foreach ($marked_stack as $marked) {
                            if (isset($marked['ActualText'])) {
                                $actual_text = true;
                            }
                            if (isset($marked['ReversedChars'])) {
                                $reverse_text = true;
                            }
                        }

                        // Account for text position ONLY just before we write text
                        if (false === $actual_text && \is_array($last_written_position)) {
                            // If $last_written_position is an array, that
                            // means we have stored text position coordinates
                            // for placing an ActualText
                            $currentX = $last_written_position[0];
                            $currentY = $last_written_position[1];
                            $last_written_position = false;
                        } else {
                            $currentX = $current_position_cm['x'] + $current_position_tm['x'] + $current_position_td['x'];
                            $currentY = $current_position_cm['y'] + $current_position_tm['y'] + $current_position_td['y'];
                        }
                        $whiteSpace = '';

                        $factorX = -$current_font_size * $current_position_tm['a'] - $current_font_size * $current_position_tm['i'];
                        $factorY = $current_font_size * $current_position_tm['b'] + $current_font_size * $current_position_tm['j'];

                        if (true === $this->addPositionWhitespace && false !== $current_position['x']) {
                            $curY = $currentY - $current_position['y'];
                            if (abs($curY) >= abs($factorY) / 4) {
                                $whiteSpace = "\n";
                            } else {
                                if (true === $reverse_text) {
                                    $curX = $current_position['x'] - $currentX;
                                } else {
                                    $curX = $currentX - $current_position['x'];
                                }

                                // In abs($factorX * 7) below, the 7 is chosen arbitrarily
                                // as the number of apparent "spaces" in a document we
                                // would need before considering them a "tab". In the
                                // future, we might offer this value to users as a config
                                // option.
                                if ($curX >= abs($factorX * 7)) {
                                    $whiteSpace = "\t";
                                } elseif ($curX >= abs($factorX * 2)) {
                                    $whiteSpace = ' ';
                                }
                            }
                        }

                        $newtext = $this->getTJUsingFontFallback(
                            $current_font,
                            $command[self::COMMAND],
                            $page,
                            $factorX
                        );

                        // If there is no ActualText pending then write
                        if (false === $actual_text) {
                            $newtext = str_replace(["\r", "\n"], '', $newtext);
                            if (false !== $reverse_text) {
                                // If we are in ReversedChars mode, add the whitespace last
                                $text[] = preg_replace('/  $/', ' ', $newtext.$whiteSpace);
                            } else {
                                // Otherwise add the whitespace first
                                if (' ' === $whiteSpace && isset($text[\count($text) - 1])) {
                                    $text[\count($text) - 1] = preg_replace('/ $/', '', $text[\count($text) - 1]);
                                }
                                $text[] = preg_replace('/^[ \t]{2}/', ' ', $whiteSpace.$newtext);
                            }

                            // Record the position of this inserted text for comparison
                            // with the next text block.
                            // Provide a 'fudge' factor guess on how wide this text block
                            // is based on the number of characters. This helps limit the
                            // number of tabs inserted, but isn't perfect.
                            $factor = $factorX / 2;
                            $current_position = [
                                'x' => $currentX - mb_strlen($newtext) * $factor,
                                'y' => $currentY,
                            ];
                        } elseif (false === $last_written_position) {
                            // If there is an ActualText in the pipeline
                            // store the position this undisplayed text
                            // *would* have been written to, so the
                            // ActualText is displayed in the right spot
                            $last_written_position = [$currentX, $currentY];
                            $current_position['x'] = $currentX;
                        }
                        break;

                        // move to start of next line
                    case 'T*':
                        $current_position['x'] = 0;
                        $current_position_td['x'] = 0;
                        $current_position_td['y'] += $current_text_leading;
                        break;

                        // set character spacing
                    case 'Tc':
                        break;

                        // move text current point and set leading
                    case 'Td':
                    case 'TD':
                        // move text current point
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $y = (float) array_pop($args);
                        $x = (float) array_pop($args);

                        if ('TD' == $command[self::OPERATOR]) {
                            $current_text_leading = -$y * $current_position_tm['b'] - $y * $current_position_tm['j'];
                        }

                        $current_position_td = [
                            'x' => $current_position_td['x'] + $x * $current_position_tm['a'] + $x * $current_position_tm['i'],
                            'y' => $current_position_td['y'] + $y * $current_position_tm['b'] + $y * $current_position_tm['j'],
                        ];
                        break;

                    case 'Tf':
                        $args = preg_split('/\s/s', $command[self::COMMAND]);
                        $size = (float) array_pop($args);
                        $id = trim(array_pop($args), '/');
                        if (null !== $page) {
                            $new_font = $page->getFont($id);
                            // If an invalid font ID is given, do not update the font.
                            // This should theoretically never happen, as the PDF spec states for the Tf operator:
                            // "The specified font value shall match a resource name in the Font entry of the default resource dictionary"
                            // (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, page 435)
                            // But we want to make sure that malformed PDFs do not simply crash.
                            if (null !== $new_font) {
                                $current_font = $new_font;
                                $current_font_size = $size;
                            }
                        }
                        break;

                        // set leading
                    case 'TL':
                        $y = (float) $command[self::COMMAND];
                        $current_text_leading = -$y * $current_position_tm['b'] + -$y * $current_position_tm['j'];
                        break;

                        // set text position matrix
                    case 'Tm':
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $current_position_tm = [
                            'a' => (float) $args[0], 'b' => (float) $args[1], 'c' => 0,
                            'i' => (float) $args[2], 'j' => (float) $args[3], 'k' => 0,
                            'x' => (float) $args[4], 'y' => (float) $args[5], 'z' => 1,
                        ];
                        break;

                        // set text rendering mode
                    case 'Tr':
                        break;

                        // set super/subscripting text rise
                    case 'Ts':
                        break;

                        // set word spacing
                    case 'Tw':
                        break;

                        // set horizontal scaling
                    case 'Tz':
                        break;

                    default:
                }
            }
        }

        $result = array_merge($result, $text);

        return $result;
    }
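
The whitespace heuristic in the 'TJ' branch of the method above can be read in isolation. The following is a minimal sketch, not PdfParser code: chooseWhitespace() and its $tabSpaces parameter are hypothetical names, and the function assumes the caller has already computed the two gaps and the scaled font-size factors the way the method above does.

```php
<?php

/**
 * Hypothetical helper (not part of PdfParser): picks the separator to emit
 * between the previous text run and the next one.
 *
 * @param float $gapX      horizontal distance from the previous run
 * @param float $gapY      vertical distance from the previous run
 * @param float $factorX   font size scaled horizontally by the text matrix
 * @param float $factorY   font size scaled vertically by the text matrix
 * @param int   $tabSpaces gap, in multiples of $factorX, treated as a tab
 */
function chooseWhitespace(float $gapX, float $gapY, float $factorX, float $factorY, int $tabSpaces = 7): string
{
    // A vertical jump of at least a quarter of the scaled font size
    // is treated as a line break.
    if (abs($gapY) >= abs($factorY) / 4) {
        return "\n";
    }

    // A wide horizontal gap becomes a tab, a moderate one a single space.
    if ($gapX >= abs($factorX * $tabSpaces)) {
        return "\t";
    }
    if ($gapX >= abs($factorX * 2)) {
        return ' ';
    }

    return '';
}

var_dump(chooseWhitespace(0.0, 12.0, -12.0, 12.0));  // newline
var_dump(chooseWhitespace(100.0, 0.0, -12.0, 12.0)); // tab
var_dump(chooseWhitespace(30.0, 0.0, -12.0, 12.0));  // single space
var_dump(chooseWhitespace(5.0, 0.0, -12.0, 12.0));   // empty string
```

Exposing the tab threshold as a parameter mirrors the comment in the method, which suggests the value 7 might become a configuration option.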

    /**
     * getCommandsText() expects the content of $text_part to be an
     * already formatted, single-line command from a document stream.
     * The companion function getSectionsText() returns a document
     * stream as an array of single commands for just this purpose.
     * Because of this, the argument $offset is no longer used, and
     * may be removed in a future PdfParser release.
     *
     * A better name for this function would be getCommandText()
     * since it now always works on just one command.
     */
    public function getCommandsText(string $text_part, int &$offset = 0): array
Unused Code: The parameter $offset is not used and could be removed. This check looks for parameters that are defined for a function or method but are not used in the method body.

If this is a false positive, you can ignore this issue in your code via the ignore-unused annotation:

    public function getCommandsText(string $text_part, /** @scrutinizer ignore-unused */ int &$offset = 0): array
    {
        $commands = $matches = [];

        preg_match('/^(([\/\[\(<])?.*)(?<!\w)([a-z01\'\"*]+)$/i', $text_part, $matches);

        $type = $matches[2];
        $operator = $matches[3];
        $command = trim($matches[1]);

        if ('TJ' == $operator) {
            $subcommand = [];
            $command = trim($command, '[]');
            do {
                $oldCommand = $command;

                // Search for parentheses string () format
                if (preg_match('/^ *\((.*?)(?<![^\\\\]\\\\)\) *(-?[\d.]+)?/', $command, $tjmatch)) {
                    $subcommand[] = [
                        self::TYPE => '(',
                        self::OPERATOR => 'TJ',
                        self::COMMAND => $tjmatch[1],
                    ];
                    if (isset($tjmatch[2]) && trim($tjmatch[2])) {
                        $subcommand[] = [
                            self::TYPE => 'n',
                            self::OPERATOR => '',
                            self::COMMAND => $tjmatch[2],
                        ];
                    }
                    $command = substr($command, \strlen($tjmatch[0]));
                }

                // Search for hexadecimal <> format
                if (preg_match('/^ *<([0-9a-f\s]*)> *(-?[\d.]+)?/i', $command, $tjmatch)) {
                    $tjmatch[1] = preg_replace('/\s/', '', $tjmatch[1]);
                    $subcommand[] = [
                        self::TYPE => '<',
                        self::OPERATOR => 'TJ',
                        self::COMMAND => $tjmatch[1],
                    ];
                    if (isset($tjmatch[2]) && trim($tjmatch[2])) {
                        $subcommand[] = [
                            self::TYPE => 'n',
                            self::OPERATOR => '',
                            self::COMMAND => $tjmatch[2],
                        ];
                    }
                    $command = substr($command, \strlen($tjmatch[0]));
                }
            } while ($command != $oldCommand);

            $command = $subcommand;
        } elseif ('Tj' == $operator || "'" == $operator || '"' == $operator) {
            // Depending on the string type, trim the data of the
            // appropriate delimiters
            if ('(' == $type) {
                // Don't use trim() here since a () string may end with
                // a balanced or escaped right parenthesis, and trim()
                // will delete both. Both strings below are valid:
                //   e.g. (String())
                //   e.g. (String\))
                $command = preg_replace('/^\(|\)$/', '', $command);
            } elseif ('<' == $type) {
                $command = trim($command, '<>');
            }
        } elseif ('/' == $type) {
            $command = substr($command, 1);
        }

        $commands[] = [
            self::TYPE => $type,
            self::OPERATOR => $operator,
            self::COMMAND => $command,
        ];

        return $commands;
    }
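
To make the single-command contract of getCommandsText() concrete, here is a rough stand-alone sketch of its first step: splitting one command line into delimiter type, operand, and operator. splitCommand() is a hypothetical name and is not part of PdfParser; it reuses the regular expression from the method above and, unlike the method, returns null when the line does not end in an operator.

```php
<?php

/**
 * Hypothetical helper (not part of PdfParser). Applies the same regular
 * expression as getCommandsText() to one command line.
 *
 * @return array{type: string, operand: string, operator: string}|null
 */
function splitCommand(string $line): ?array
{
    if (1 !== preg_match('/^(([\/\[\(<])?.*)(?<!\w)([a-z01\'\"*]+)$/i', $line, $m)) {
        // No trailing operator found.
        return null;
    }

    return [
        'type' => $m[2],          // leading delimiter: '/', '[', '(' or '<', else ''
        'operand' => trim($m[1]), // everything before the operator
        'operator' => $m[3],      // e.g. Tj, TJ, Tf, Tm
    ];
}

var_export(splitCommand('(Hello) Tj'));
// type '(', operand '(Hello)', operator 'Tj'

var_export(splitCommand('/F1 12 Tf'));
// type '/', operand '/F1 12', operator 'Tf'

var_export(splitCommand('(Hello)'));
// NULL, because the line does not end in an operator
```

The method above then post-processes the operand according to the operator, for example stripping the outer parentheses for 'Tj' or the leading slash for name operands.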

    public static function factory(
        Document $document,
        Header $header,
        ?string $content,
        ?Config $config = null
    ): self {
        switch ($header->get('Type')->getContent()) {
            case 'XObject':
                switch ($header->get('Subtype')->getContent()) {
                    case 'Image':
                        return new Image($document, $header, $config->getRetainImageContent() ? $content : null, $config);
Bug: The method getRetainImageContent() does not exist on null. $config defaults to null in the signature of factory(), so this call fails when no Config is passed; consider adding a null check before the call. This check looks for calls to methods that do not seem to exist on a given type, including inherited classes and implemented interfaces.

If this is a false positive, you can ignore this issue in your code via the ignore-call annotation:

                        return new Image($document, $header, $config->/** @scrutinizer ignore-call */ getRetainImageContent() ? $content : null, $config);

                    case 'Form':
                        return new Form($document, $header, $content, $config);
                }

                return new self($document, $header, $content, $config);

            case 'Pages':
                return new Pages($document, $header, $content, $config);

            case 'Page':
                return new Page($document, $header, $content, $config);

            case 'Encoding':
                return new Encoding($document, $header, $content, $config);

            case 'Font':
                $subtype = $header->get('Subtype')->getContent();
                $classname = '\Smalot\PdfParser\Font\Font'.$subtype;

                if (class_exists($classname)) {
                    return new $classname($document, $header, $content, $config);
                }

                return new Font($document, $header, $content, $config);

            default:
                return new self($document, $header, $content, $config);
        }
    }

    /**
     * Returns unique id identifying the object.
     */
    protected function getUniqueId(): string
    {
        return spl_object_hash($this);
    }
}
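
The 'Do' branch earlier in this class skips an XObject whose getUniqueId() value is already in self::$recursionStack. How PdfParser pushes and pops that stack is not shown in this excerpt, so the sketch below only illustrates the general technique under that assumption. The Node class and its members are hypothetical and exist only for this example.

```php
<?php

/**
 * Hypothetical class (not part of PdfParser) showing a recursion guard
 * keyed on spl_object_hash(), similar in spirit to getUniqueId().
 */
final class Node
{
    /** @var string[] hashes of the nodes currently being rendered */
    private static $stack = [];

    /** @var Node[] */
    public $children = [];

    /** @var string */
    private $label;

    public function __construct(string $label)
    {
        $this->label = $label;
    }

    public function render(): string
    {
        $id = spl_object_hash($this);
        if (in_array($id, self::$stack, true)) {
            // Already being rendered further up the call chain:
            // a circular reference, so contribute nothing.
            return '';
        }

        self::$stack[] = $id;
        $out = $this->label;
        foreach ($this->children as $child) {
            $out .= $child->render();
        }
        array_pop(self::$stack);

        return $out;
    }
}

$a = new Node('A');
$b = new Node('B');
$a->children[] = $b;
$b->children[] = $a; // deliberately circular

echo $a->render(), PHP_EOL; // AB
```

spl_object_hash() is only guaranteed to be unique among objects that are alive at the same time, which is sufficient here because every node on the stack is still referenced while it is being rendered.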