Test Failed
Pull Request — master (#634)
by Konrad
02:24
created

PDFObject::getTextArray()   F

Complexity

Conditions 68
Paths > 20000

Size

Total Lines 348
Code Lines 197

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 116
CRAP Score 68.3257

Importance

Changes 2
Bugs 0 Features 0
Metric Value
cc 68
eloc 197
c 2
b 0
f 0
nc 24723
nop 1
dl 0
loc 348
ccs 116
cts 121
cp 0.9587
crap 68.3257
rs 0
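
The CRAP score reported above is consistent with the commonly cited Change Risk Anti-Patterns formula, CRAP = cc^2 * (1 - coverage)^3 + cc. The following is a minimal sketch under that assumption; the helper name crapScore() is hypothetical and is not part of Scrutinizer or PdfParser, and coverage is rounded to four decimals to match the reported cp value.

```php
<?php

// Hypothetical helper: computes a CRAP score from cyclomatic complexity
// and a coverage ratio in [0, 1], using the commonly cited formula.
function crapScore(int $complexity, float $coverage): float
{
    return $complexity ** 2 * (1 - $coverage) ** 3 + $complexity;
}

// Values from the metric table: cc = 68, ccs = 116 covered of cts = 121.
$coverage = round(116 / 121, 4); // 0.9587, the reported cp
$score = crapScore(68, $coverage);

echo round($score, 4), PHP_EOL; // prints 68.3257
```

With coverage this high the cubic term contributes only about 0.33, so the score is dominated by the complexity of 68 itself; reducing complexity, not adding tests, is what would lower it.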

How to fix: Long Method, Complexity

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
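
As an illustration of the advice above, here is a small sketch of extracting a commented block into its own method. All function names are made up for this example and do not come from PdfParser.

```php
<?php

// Before: one function whose body is split up by comments.
function summarizeBefore(array $lines): string
{
    // Remove empty lines
    $kept = [];
    foreach ($lines as $line) {
        if ('' !== trim($line)) {
            $kept[] = trim($line);
        }
    }

    // Join into a single string
    return implode(' ', $kept);
}

// After: the commented part becomes a small function named after the comment.
function removeEmptyLines(array $lines): array
{
    $trimmed = array_map('trim', $lines);

    return array_values(array_filter($trimmed, function (string $line): bool {
        return '' !== $line;
    }));
}

function summarizeAfter(array $lines): string
{
    return implode(' ', removeEmptyLines($lines));
}

echo summarizeAfter([' a ', '', 'b']), PHP_EOL; // prints "a b"
```

The comment "Remove empty lines" served as the starting point for the name removeEmptyLines(), which is the naming approach the paragraph describes.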

Commonly applied refactorings include:

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 *
 * @date    2017-01-03
 *
 * @license LGPLv3
 *
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser;

use Smalot\PdfParser\XObject\Form;
use Smalot\PdfParser\XObject\Image;

/**
 * Class PDFObject
 */
class PDFObject
{
    public const TYPE = 't';

    public const OPERATOR = 'o';

    public const COMMAND = 'c';

    /**
     * The recursion stack.
     *
     * @var array
     */
    public static $recursionStack = [];

    /**
     * @var Document
     */
    protected $document;

    /**
     * @var Header
     */
    protected $header;

    /**
     * @var string
     */
    protected $content;

    /**
     * @var Config
     */
    protected $config;

    /**
     * @var bool
     */
    protected $addPositionWhitespace = false;

    public function __construct(
        Document $document,
        Header $header = null,
        string $content = null,
        Config $config = null
    ) {
        $this->document = $document;
        $this->header = $header ?? new Header();
        $this->content = $content;
        $this->config = $config;
    }

    public function init()
    {
    }

    public function getDocument(): Document
    {
        return $this->document;
    }

    public function getHeader(): ?Header
    {
        return $this->header;
    }

    public function getConfig(): ?Config
    {
        return $this->config;
    }

    /**
     * @return Element|PDFObject|Header
     */
    public function get(string $name)
    {
        return $this->header->get($name);
    }

    public function has(string $name): bool
    {
        return $this->header->has($name);
    }

    public function getDetails(bool $deep = true): array
    {
        return $this->header->getDetails($deep);
    }

    public function getContent(): ?string
    {
        return $this->content;
    }

    /**
     * Creates a duplicate of the document stream with strings and other
     * items replaced by $char. Formerly getSectionsText() used this
     * output to more easily gather offset values to extract text from
     * the *actual* document stream. As getSectionsText() now uses
     * formatContent() instead, this function is no longer used, and
     * could be deleted in a future version of PDFParser.
     *
     * @internal For internal use only, not part of the public API
     */
    public function cleanContent(string $content, string $char = 'X')
    {
        $char = $char[0];
        $content = str_replace(['\\\\', '\\)', '\\('], $char.$char, $content);

        // Remove image bloc with binary content
        preg_match_all('/\s(BI\s.*?(\sID\s).*?(\sEI))\s/s', $content, $matches, \PREG_OFFSET_CAPTURE);
        foreach ($matches[0] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        // Clean content in square brackets [.....]
        preg_match_all('/\[((\(.*?\)|[0-9\.\-\s]*)*)\]/s', $content, $matches, \PREG_OFFSET_CAPTURE);
Unused Code
The call to preg_match_all() has too many arguments starting with PREG_OFFSET_CAPTURE. ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-call  annotation

        /** @scrutinizer ignore-call */
        preg_match_all('/\[((\(.*?\)|[0-9\.\-\s]*)*)\]/s', $content, $matches, \PREG_OFFSET_CAPTURE);

This check compares calls to functions or methods with their respective definitions. If the call has more arguments than are defined, it raises an issue.

If a function is defined several times with a different number of parameters, the check may pick up the wrong definition and report false positives. One codebase where this has been known to happen is Wordpress. Please note the @ignore annotation hint above.
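
For context, the PHP manual documents a fourth $flags parameter for preg_match_all(), so the flagged call looks like the kind of false positive described above. The sketch below only illustrates what PREG_OFFSET_CAPTURE changes in the result; the sample string is made up and is not taken from a real PDF stream.

```php
<?php

// Made-up sample input; not taken from a real PDF stream.
$content = '[(Hello) -20 (World)] TJ';

// With PREG_OFFSET_CAPTURE every match becomes [text, byteOffset].
preg_match_all('/\((.*?)\)/s', $content, $matches, \PREG_OFFSET_CAPTURE);

foreach ($matches[1] as [$text, $offset]) {
    echo $text, ' @ ', $offset, PHP_EOL;
}
// prints:
// Hello @ 2
// World @ 14
```

cleanContent() relies on exactly these offsets when it calls substr_replace() to overwrite each match in place.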

        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        // Clean content in round brackets (.....)
        preg_match_all('/\((.*?)\)/s', $content, $matches, \PREG_OFFSET_CAPTURE);
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        // Clean structure
        if ($parts = preg_split('/(<|>)/s', $content, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE)) {
Bug
It seems like $content can also be of type array; however, parameter $subject of preg_split() does only seem to accept string, maybe add an additional type check? ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type  annotation

        if ($parts = preg_split('/(<|>)/s', /** @scrutinizer ignore-type */ $content, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE)) {
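
The PHP manual documents substr_replace() as returning string|array, which is presumably why the analyzer infers that $content may be an array at this point. As an alternative to the annotation, the additional type check the message suggests could look like the sketch below. The helper name splitOnAngleBrackets() is hypothetical and not part of PdfParser.

```php
<?php

// Hypothetical helper: narrows a string|array value to string before
// handing it to preg_split(), instead of silencing the analyzer.
function splitOnAngleBrackets($content): array
{
    if (!\is_string($content)) {
        throw new \InvalidArgumentException('Expected string content.');
    }

    $parts = preg_split('/(<|>)/s', $content, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE);

    // preg_split() returns false on failure; normalize that to an empty list.
    return false === $parts ? [] : $parts;
}

print_r(splitOnAngleBrackets('a<b>c'));
```

Whether an exception or a silent fallback is the better reaction to non-string input is a design choice this sketch does not settle.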
            $content = '';
            $level = 0;
            foreach ($parts as $part) {
                if ('<' == $part) {
                    ++$level;
                }

                $content .= (0 == $level ? $part : str_repeat($char, \strlen($part)));

                if ('>' == $part) {
                    --$level;
                }
            }
        }

        // Clean BDC and EMC markup
        preg_match_all(
            '/(\/[A-Za-z0-9\_]*\s*'.preg_quote($char).'*BDC)/s',
            $content,
            $matches,
            \PREG_OFFSET_CAPTURE
        );
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        preg_match_all('/\s(EMC)\s/s', $content, $matches, \PREG_OFFSET_CAPTURE);
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        return $content;
    }

    /**
     * Takes a string of PDF document stream text and formats it into
     * a multi-line string with one PDF command on each line, separated
     * by \r\n. If the given string is null, or binary data is detected
     * instead of a document stream then return an empty string.
     */
    public function formatContent(?string $content): string
    {
        if (null === $content) {
            return '';
        }

        // Find all strings () and replace them so they aren't affected
        // by the next steps
        $pdfstrings = [];
        $attempt = '(';
        while (preg_match('/'.preg_quote($attempt, '/').'.*?(?<![^\\\\]\\\\)\)/s', $content, $text)) {
            // PDF strings can contain unescaped parentheses as long as
            // they're balanced, so check for balanced parentheses
            $left = preg_match_all('/(?<![^\\\\]\\\\)\(/', $text[0]);
            $right = preg_match_all('/(?<![^\\\\]\\\\)\)/', $text[0]);

            if ($left == $right) {
                // Replace the string with a unique placeholder
                $id = uniqid('STRING_', true);
                $pdfstrings[$id] = $text[0];
                $content = preg_replace(
                    '/'.preg_quote($text[0], '/').'/',
                    '@@@'.$id.'@@@',
                    $content,
                    1
                );

                // Reset to search for the next string
                $attempt = '(';
            } else {
                // We had unbalanced parentheses, so use the current
                // match as a base to find a longer string
                $attempt = $text[0];
            }
        }

        // Remove all carriage returns and line-feeds from the document stream
        $content = str_replace(["\r", "\n"], ' ', trim($content));

        // Find all dictionary << >> commands and replace them so they
        // aren't affected by the next steps
        $dictstore = [];
        while (preg_match('/(<<.*?>> *)(BDC|BMC|DP|MP)/', $content, $dicttext)) {
            $dictid = uniqid('DICT_', true);
            $dictstore[$dictid] = $dicttext[1];
            $content = preg_replace(
                '/'.preg_quote($dicttext[0], '/').'/',
                ' ###'.$dictid.'###'.$dicttext[2],
                $content,
                1
            );
        }

        // Now that all strings and dictionaries are hidden, the only
        // PDF commands left should all be plain text.
        // Detect text encoding of the current string to prevent reading
        // content streams that are images, etc. This prevents PHP
        // error messages when JPEG content is sent to this function
        // by the sample file '12249.pdf' from:
        // https://github.com/smalot/pdfparser/issues/458
        if (false === mb_detect_encoding($content, null, true)) {
            return '';
        }

        // Normalize white-space in the document stream
        $content = preg_replace('/\s{2,}/', ' ', $content);

        // Find all valid PDF operators and add \r\n after each; this
        // ensures there is just one command on every line
        // Source: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf - Appendix A
        // Source: https://archive.org/download/pdf320002008/PDF32000_2008.pdf - Annex A
        // Note: PDF Reference 1.7 lists 'I' and 'rI' as valid commands, while
        //       PDF 32000:2008 lists them as 'i' and 'ri' respectively. Both versions
        //       appear here in the list for completeness.
        $operators = [
          'b*', 'b', 'BDC', 'BMC', 'B*', 'BI', 'BT', 'BX', 'B', 'cm', 'cs', 'c', 'CS',
          'd0', 'd1', 'd', 'Do', 'DP', 'EMC', 'EI', 'ET', 'EX', 'f*', 'f', 'F', 'gs',
          'g', 'G',  'h', 'i', 'ID', 'I', 'j', 'J', 'k', 'K', 'l', 'm', 'MP', 'M', 'n',
          'q', 'Q', 're', 'rg', 'ri', 'rI', 'RG', 'scn', 'sc', 'sh', 's', 'SCN', 'SC',
          'S', 'T*', 'Tc', 'Td', 'TD', 'Tf', 'TJ', 'Tj', 'TL', 'Tm', 'Tr', 'Ts', 'Tw',
          'Tz', 'v', 'w', 'W*', 'W', 'y', '\'', '"',
        ];
        foreach ($operators as $operator) {
            $content = preg_replace(
                '/(?<!\w|\/)'.preg_quote($operator, '/').'(?![\w10\*])/',
                $operator."\r\n",
                $content
            );
        }

        // Restore the original content of the dictionary << >> commands
        $dictstore = array_reverse($dictstore, true);
        foreach ($dictstore as $id => $dict) {
            $content = str_replace('###'.$id.'###', $dict, $content);
        }

        // Restore the original string content
        $pdfstrings = array_reverse($pdfstrings, true);
        foreach ($pdfstrings as $id => $text) {
            // Strings may contain escaped newlines, or literal newlines
            // and we should clean these up before replacing the string
            // back into the content stream; this ensures no strings are
            // split between two lines (every command must be on one line)
            $text = str_replace(
                ["\\\r\n", "\\\r", "\\\n", "\r", "\n"],
                ['', '', '', '\r', '\n'],
                $text
            );

            $content = str_replace('@@@'.$id.'@@@', $text, $content);
        }

        $content = trim(preg_replace(['/(\r\n){2,}/', '/\r\n +/'], "\r\n", $content));

        return $content;
    }

    /**
     * getSectionsText() now takes an entire, unformatted document
     * stream as a string, cleans it, then filters out commands that
     * aren't needed for text positioning/extraction. It returns an
     * array of unprocessed PDF commands, one command per element.
     */
    public function getSectionsText(?string $content): array
    {
        $sections = [];

        // A cleaned stream has one command on every line, so split the
        // cleaned stream content on \r\n into an array
        $textCleaned = preg_split(
            '/(\r\n|\n|\r)/',
            $this->formatContent($content),
            -1,
            \PREG_SPLIT_NO_EMPTY
        );

        $inTextBlock = false;
        foreach ($textCleaned as $line) {
            $line = trim($line);

            // Skip empty lines
            if ('' === $line) {
                continue;
            }

            // If a 'BT' is encountered, set the $inTextBlock flag
            if (preg_match('/BT$/', $line)) {
                $inTextBlock = true;
                $sections[] = $line;

                // If an 'ET' is encountered, unset the $inTextBlock flag
            } elseif ('ET' == $line) {
                $inTextBlock = false;
                $sections[] = $line;
            } elseif ($inTextBlock) {
                // If we are inside a BT ... ET text block, save all lines
                $sections[] = trim($line);
            } else {
                // Otherwise, if we are outside of a text block, only
                // save specific, necessary lines. Care should be taken
                // to ensure a command being checked for *only* matches
                // that command. For instance, a simple search for 'c'
                // may also match the 'sc' command. See the command
                // list in the formatContent() method above.
                // Add more commands to save here as you find them in
                // weird PDFs!
                if ('q' == $line[-1] || 'Q' == $line[-1]) {
                    // Save and restore graphics state commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)B[DM]C$/', $line)) {
                    // Begin marked content sequence
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)[DM]P$/', $line)) {
                    // Marked content point
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)EMC$/', $line)) {
                    // End marked content sequence
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)cm$/', $line)) {
                    // Graphics position change commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)Tf$/', $line)) {
                    // Font change commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)Do$/', $line)) {
                    // Invoke named XObject command
                    $sections[] = $line;
                }
            }
        }

        return $sections;
    }

    private function getDefaultFont(Page $page = null): Font
    {
        $fonts = [];
        if (null !== $page) {
            $fonts = $page->getFonts();
        }

        $firstFont = $this->document->getFirstFont();
        if (null !== $firstFont) {
            $fonts[] = $firstFont;
        }

        if (\count($fonts) > 0) {
            return reset($fonts);
        }

        return new Font($this->document, null, null, $this->config);
    }

    /**
     * Decode a '[]TJ' command and attempt to use alternate fonts if
     * the current font results in output that contains Unicode control
     * characters. See Font::decodeText for a full description of
     * $textMatrix
     *
     * @param array<int,array<string,string|bool>> $command
     * @param array<string,float>                  $textMatrix
     */
    private function getTJUsingFontFallback(
        Font $font,
        array $command,
        array $textMatrix = ['a' => 1, 'b' => 0, 'i' => 0, 'j' => 1],
        Page $page = null
    ): string {
        $orig_text = $font->decodeText($command, $textMatrix);
        $text = $orig_text;

        // If we make this a Config option, we can add a check if it's
        // enabled here.
        if (null !== $page) {
            $font_ids = array_keys($page->getFonts());

            // If the decoded text contains UTF-8 control characters
            // then the font page being used is probably the wrong one.
            // Loop through the rest of the fonts to see if we can get
            // a good decode. Allow x09 to x0d which are whitespace.
            while (preg_match('/[\x00-\x08\x0e-\x1f\x7f]/u', $text) || false !== strpos(bin2hex($text), '00')) {
                // If we're out of font IDs, then give up and use the
                // original string
                if (0 == \count($font_ids)) {
                    return $orig_text;
                }

                // Try the next font ID
                $font = $page->getFont(array_shift($font_ids));
                $text = $font->decodeText($command, $textMatrix);
            }
        }

        return $text;
    }

    /**
     * Expects a string that is a full PDF dictionary object, including
     * the outer enclosing << >> angle brackets.
     *
     * @throws \Exception
     */
    public function parseDictionary(string $dictionary): array
    {
        // Normalize whitespace
        $dictionary = preg_replace(['/\r/', '/\n/', '/\s{2,}/'], ' ', trim($dictionary));

        if ('<<' != substr($dictionary, 0, 2)) {
            throw new \Exception('Not a valid dictionary object.');
        }

        $parsed = [];
        $stack = [];
        $currentName = '';
        $arrayTypeNumeric = false;

        // Remove outer layer of dictionary, and split on tokens
        $split = preg_split(
            '/(<<|>>|\[|\]|\/[^\s\/\[\]\(\)<>]*)/',
            trim(preg_replace('/^<<|>>$/', '', $dictionary)),
            -1,
            \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE
        );

        foreach ($split as $token) {
            $token = trim($token);
            switch ($token) {
                case '':
                    break;

                    // Open numeric array
                case '[':
                    $parsed[$currentName] = [];
                    $arrayTypeNumeric = true;

                    // Move up one level in the stack
                    $stack[\count($stack)] = &$parsed;
                    $parsed = &$parsed[$currentName];
                    $currentName = '';
                    break;

                    // Open hashed array
                case '<<':
                    $parsed[$currentName] = [];
                    $arrayTypeNumeric = false;

                    // Move up one level in the stack
                    $stack[\count($stack)] = &$parsed;
                    $parsed = &$parsed[$currentName];
                    $currentName = '';
                    break;

                    // Close numeric array
                case ']':
                    // Revert string type arrays back to a single element
                    if (\is_array($parsed) && 1 == \count($parsed)
                        && isset($parsed[0]) && \is_string($parsed[0])
                        && '' !== $parsed[0] && '/' != $parsed[0][0]) {
                        $parsed = '['.$parsed[0].']';
                    }
                    // Close hashed array
                    // no break
                case '>>':
                    $arrayTypeNumeric = false;

                    // Move down one level in the stack
                    $parsed = &$stack[\count($stack) - 1];
                    unset($stack[\count($stack) - 1]);
                    break;

                default:
                    // If value begins with a slash, then this is a name
                    // Add it to the appropriate array
                    if ('/' == substr($token, 0, 1)) {
                        $currentName = substr($token, 1);
                        if (true == $arrayTypeNumeric) {
Coding Style Best Practice
It seems like you are loosely comparing two booleans. Consider using the strict comparison === instead.

When comparing two booleans, it is generally considered safer to use the strict comparison operator.
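
A short sketch of the difference, using made-up values. In parseDictionary() the flag is only ever assigned true or false, so switching to === should not change behavior there; the gain is that the comparison cannot silently coerce some other type.

```php
<?php

$arrayTypeNumeric = false;

// Loose comparison converts operands before comparing, so
// non-boolean values can match a boolean unexpectedly.
var_dump('a' == true);  // bool(true)
var_dump(0 == false);   // bool(true)

// Strict comparison also checks the type.
var_dump('a' === true); // bool(false)
var_dump(0 === false);  // bool(false)

// For a real boolean, both forms agree.
var_dump($arrayTypeNumeric === false); // bool(true)
```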

                            $parsed[] = $currentName;
                            $currentName = '';
                        }
                    } elseif ('' != $currentName) {
                        if (false == $arrayTypeNumeric) {
Coding Style Best Practice
It seems like you are loosely comparing two booleans. Consider using the strict comparison === instead.

When comparing two booleans, it is generally considered safer to use the strict comparison operator.

                            $parsed[$currentName] = $token;
                        }
                        $currentName = '';
                    } elseif ('' == $currentName) {
                        $parsed[] = $token;
                    }
            }
        }

        return $parsed;
    }

    /**
     * getText() leverages getTextArray() to get the content of the
     * document, setting the addPositionWhitespace flag to true so
     * whitespace is inserted in a logical way for reading by humans.
     */
    public function getText(Page $page = null): string
    {
        $this->addPositionWhitespace = true;
        $result = $this->getTextArray($page);
        $this->addPositionWhitespace = false;

        return implode('', $result).' ';
    }

    /**
     * getTextArray() returns the text objects of a document in an
     * array. By default no positioning whitespace is added to the
     * output unless the addPositionWhitespace flag is set to true.
     *
     * @throws \Exception
     */
    public function getTextArray(Page $page = null): array
    {
        $result = [];
        $text = [];

        $marked_stack = [];
        $last_written_position = false;

        $sections = $this->getSectionsText($this->content);
        $current_font = $this->getDefaultFont($page);

        $current_position = ['x' => false, 'y' => false];
        $current_position_tm = [
            'a' => 1, 'b' => 0, 'c' => 0,
            'i' => 0, 'j' => 1, 'k' => 0,
            'x' => false, 'y' => false, 'z' => 1,
        ];
        $current_position_td = ['x' => 0, 'y' => 0];
        $current_position_cm = [
            'a' => 1, 'b' => 0, 'c' => 0,
            'i' => 0, 'j' => 1, 'k' => 0,
            'x' => 0, 'y' => 0, 'z' => 1,
        ];

        $clipped_font = [];
        $clipped_position_cm = [];

        self::$recursionStack[] = $this->getUniqueId();

        foreach ($sections as $section) {
            $commands = $this->getCommandsText($section);
            foreach ($commands as $command) {
                switch ($command[self::OPERATOR]) {
                    // Begin text object
                    case 'BT':
                        // Reset text positioning matrices
                        $current_position_tm = [
                            'a' => 1, 'b' => 0, 'c' => 0,
                            'i' => 0, 'j' => 1, 'k' => 0,
                            'x' => false, 'y' => false, 'z' => 1,
                        ];
                        $current_position_td = ['x' => 0, 'y' => 0];
                        break;

                        // Begin marked content sequence with property list
                    case 'BDC':
                        if (preg_match('/(<<.*>>)$/', $command[self::COMMAND], $match)) {
Bug
It seems like $command[self::COMMAND] can also be of type array and array<mixed,array<string,mixed|string>>; however, parameter $subject of preg_match() does only seem to accept string, maybe add an additional type check? ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type  annotation

                        if (preg_match('/(<<.*>>)$/', /** @scrutinizer ignore-type */ $command[self::COMMAND], $match)) {
                            $dict = $this->parseDictionary($match[1]);

                            // Check for ActualText block
                            if (isset($dict['ActualText']) && \is_string($dict['ActualText']) && '' !== $dict['ActualText']) {
                                if ('[' == $dict['ActualText'][0]) {
                                    // Simulate a 'TJ' command on the stack
                                    $marked_stack[] = [
                                        'ActualText' => $this->getCommandsText($dict['ActualText'].'TJ')[0],
                                    ];
                                } elseif ('<' == $dict['ActualText'][0] || '(' == $dict['ActualText'][0]) {
                                    // Simulate a 'Tj' command on the stack
                                    $marked_stack[] = [
                                        'ActualText' => $this->getCommandsText($dict['ActualText'].'Tj')[0],
                                    ];
                                }
                            }
                        }
                        break;

                        // Begin marked content sequence
                    case 'BMC':
                        if ('ReversedChars' == $command[self::COMMAND]) {
                            // Upon encountering a ReversedChars command,
                            // add the characters we've built up so far to
                            // the result array
                            $result = array_merge($result, $text);

                            // Start a fresh $text array that will contain
                            // reversed characters
                            $text = [];

                            // Add the reversed text flag to the stack
                            $marked_stack[] = ['ReversedChars' => true];
                        }
                        break;

                        // set graphics position matrix
                    case 'cm':
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
Bug
It seems like $command[self::COMMAND] can also be of type array and array<mixed,array<string,mixed|string>>; however, parameter $subject of preg_split() does only seem to accept string, maybe add an additional type check? ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type  annotation

                        $args = preg_split('/\s+/s', /** @scrutinizer ignore-type */ $command[self::COMMAND]);
Loading history...
670 22
                        $current_position_cm = [
671
                            'a' => (float) $args[0], 'b' => (float) $args[1], 'c' => 0,
672
                            'i' => (float) $args[2], 'j' => (float) $args[3], 'k' => 0,
673
                            'x' => (float) $args[4], 'y' => (float) $args[5], 'z' => 1,
674
                        ];
675 22
                        break;
676
                    case 'Do':
                        if (null !== $page) {
                            $args = preg_split('/\s/s', $command[self::COMMAND]);
                            $id = trim(array_pop($args), '/ ');
                            $xobject = $page->getXObject($id);

                            // @todo $xobject could be an ElementXRef object, which would then throw an error
                            if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack)) {
                                // Not a circular reference.
                                $text[] = $xobject->getText($page);
                            }
                        }
                        break;

                        // Marked content point with (DP) & without (MP) property list
                    case 'DP':
                    case 'MP':
                        break;

                        // End text object
                    case 'ET':
                        break;

                        // Store current selected font and graphics matrix
                    case 'q':
                        $clipped_font[] = $current_font;
                        $clipped_position_cm[] = $current_position_cm;
                        break;

                        // Restore previous selected font and graphics matrix
                    case 'Q':
                        $current_font = array_pop($clipped_font);
                        $current_position_cm = array_pop($clipped_position_cm);
                        break;

                        // End marked content sequence
                    case 'EMC':
                        $data = false;
                        if (\count($marked_stack)) {
                            $marked = array_pop($marked_stack);
                            $action = key($marked);
                            $data = $marked[$action];

                            switch ($action) {
                                // If we are in ReversedChars mode...
                                case 'ReversedChars':
                                    // Reverse the characters we've built up so far
                                    foreach ($text as $key => $t) {
                                        // Scrutinizer note: mb_internal_encoding() can also return true,
                                        // but mb_str_split() expects null|string for $encoding.
                                        $text[$key] = implode('', array_reverse(
                                            mb_str_split($t, 1, mb_internal_encoding())
                                        ));
                                    }

                                    // Add these characters to the result array
                                    $result = array_merge($result, $text);

                                    // Start a fresh $text array that will contain
                                    // non-reversed characters
                                    $text = [];
                                    break;

                                case 'ActualText':
                                    // Use the content of the ActualText as a command
                                    $command = $data;
                                    break;
                            }
                        }

                        // If this EMC command has been transformed into a 'Tj'
                        // or 'TJ' command because of being ActualText, then bypass
                        // the break to proceed to the writing section below.
                        if ('Tj' != $command[self::OPERATOR] && 'TJ' != $command[self::OPERATOR]) {
                            break;
                        }

                        // no break
                    case "'":
                    case '"':
                        if ("'" == $command[self::OPERATOR] || '"' == $command[self::OPERATOR]) {
                            // Move to next line and write text
                            $current_position['x'] = 0;
                            $current_position_td['x'] = 0;
                            $current_position_td['y'] += 10;
                        }
                        // no break
                    case 'Tj':
                        $command[self::COMMAND] = [$command];
                        // no break
                    case 'TJ':
                        // Check the marked content stack for flags
                        $actual_text = false;
                        $reverse_text = false;
                        foreach ($marked_stack as $marked) {
                            if (isset($marked['ActualText'])) {
                                $actual_text = true;
                            }
                            if (isset($marked['ReversedChars'])) {
                                $reverse_text = true;
                            }
                        }

                        // Account for text position ONLY just before we write text
                        if (false === $actual_text && \is_array($last_written_position)) {
                            // If $last_written_position is an array, that
                            // means we have stored text position coordinates
                            // for placing an ActualText
                            $currentX = $last_written_position[0];
                            $currentY = $last_written_position[1];
                            $last_written_position = false;
                        } else {
                            $currentX = $current_position_cm['x'] + $current_position_tm['x'] + $current_position_td['x'];
                            $currentY = $current_position_cm['y'] + $current_position_tm['y'] + $current_position_td['y'];
                        }
                        $whiteSpace = '';

                        if (true === $this->addPositionWhitespace && false !== $current_position['x']) {
                            if (abs($currentY - $current_position['y']) > 9) {
                                $whiteSpace = "\n";
                            } else {
                                $curX = $currentX - $current_position['x'];
                                $factorX = 10 * $current_position_tm['a'] + 10 * $current_position_tm['i'];
                                if (true === $reverse_text) {
                                    if ($curX < -abs($factorX * 8)) {
                                        $whiteSpace = "\t";
                                    } elseif ($curX < -abs($factorX)) {
                                        $whiteSpace = ' ';
                                    }
                                } else {
                                    if ($curX > ($factorX * 8)) {
                                        $whiteSpace = "\t";
                                    } elseif ($curX > $factorX) {
                                        $whiteSpace = ' ';
                                    }
                                }
                            }
                        }

                        $newtext = $this->getTJUsingFontFallback(
                            $current_font,
                            $command[self::COMMAND],
                            $current_position_tm,
                            $page
                        );

                        // If there is no ActualText pending then write
                        if (false === $actual_text) {
                            if (false !== $reverse_text) {
                                // If we are in ReversedChars mode, add the whitespace last
                                $text[] = str_replace(["\r", "\n"], '', $newtext).$whiteSpace;
                            } else {
                                // Otherwise add the whitespace first
                                $text[] = $whiteSpace.str_replace(["\r", "\n"], '', $newtext);
                            }

                            // Record the position of this inserted text for comparison
                            // with the next text block.
                            // Provide a 'fudge' factor guess on how wide this text block
                            // is based on the number of characters. This helps limit the
                            // number of tabs inserted, but isn't perfect.
                            $factor = 6;
                            if (true === $reverse_text) {
                                $factor = -$factor;
                            }
                            $current_position = [
                                'x' => $currentX + mb_strlen($newtext) * $factor,
                                'y' => $currentY,
                            ];
                        } elseif (false === $last_written_position) {
                            // If there is an ActualText in the pipeline
                            // store the position this undisplayed text
                            // *would* have been written to, so the
                            // ActualText is displayed in the right spot
                            $last_written_position = [$currentX, $currentY];
                        }
                        break;

                        // move to start of next line
                    case 'T*':
                        $current_position['x'] = 0;
                        $current_position_td['x'] = 0;
                        $current_position_td['y'] += 10;
                        break;

                        // set character spacing
                    case 'Tc':
                        break;

                        // move text current point (Td); TD additionally sets the leading
                    case 'Td':
                    case 'TD':
                        // only the move is tracked here; the leading set by TD is ignored
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $y = (float) array_pop($args);
                        $x = (float) array_pop($args);

                        $current_position_td = [
                            'x' => $current_position_td['x'] + $x * $current_position_tm['a'] + $x * $current_position_tm['i'],
                            'y' => $current_position_td['y'] + $y * $current_position_tm['b'] + $y * $current_position_tm['j'],
                        ];
                        break;

                    case 'Tf':
                        list($id) = preg_split('/\s/s', $command[self::COMMAND]);
                        $id = trim($id, '/');
                        if (null !== $page) {
                            $new_font = $page->getFont($id);
                            // If an invalid font ID is given, do not update the font.
                            // This should theoretically never happen, as the PDF spec states for the Tf operator:
                            // "The specified font value shall match a resource name in the Font entry of the default resource dictionary"
                            // (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, page 435)
                            // But we want to make sure that malformed PDFs do not simply crash.
                            if (null !== $new_font) {
                                $current_font = $new_font;
                            }
                        }
                        break;

                        // set leading
                    case 'TL':
                        break;

                        // set text position matrix
                    case 'Tm':
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $current_position_tm = [
                            'a' => (float) $args[0], 'b' => (float) $args[1], 'c' => 0,
                            'i' => (float) $args[2], 'j' => (float) $args[3], 'k' => 0,
                            'x' => (float) $args[4], 'y' => (float) $args[5], 'z' => 1,
                        ];
                        break;

                        // set text rendering mode
                    case 'Tr':
                        break;

                        // set super/subscripting text rise
                    case 'Ts':
                        break;

                        // set word spacing
                    case 'Tw':
                        break;

                        // set horizontal scaling
                    case 'Tz':
                        break;

                    default:
                }
            }
        }

        $result = array_merge($result, $text);

        return $result;
    }

    /**
     * getCommandsText() expects the content of $text_part to be an
     * already formatted, single-line command from a document stream.
     * The companion function getSectionsText() returns a document
     * stream as an array of single commands for just this purpose.
     *
     * A better name for this function would be getCommandText()
     * since it now always works on just one command.
     */
    public function getCommandsText(string $text_part): array
    {
        $commands = $matches = [];

        // $matches[1]: operands, $matches[2]: leading delimiter of the
        // operands (/, [, ( or <) if present, $matches[3]: trailing operator
        preg_match('/^(([\/\[\(<])?.*)(?<!\w)([a-z01\'\"*]+)$/i', $text_part, $matches);

        $type = $matches[2];
        $operator = $matches[3];
        $command = trim($matches[1]);

        if ('TJ' == $operator) {
954
            $subcommand = [];
955
            $command = trim($command, '[]');
956
            do {
957
                $oldCommand = $command;
958
959
                // Search for parentheses string () format
960
                if (preg_match('/^ *\((.*?)(?<![^\\\\]\\\\)\) *(-?[\d.]+)?/', $command, $tjmatch)) {
961
                    $subcommand[] = [
962
                        self::TYPE => '(',
963
                        self::OPERATOR => 'TJ',
964
                        self::COMMAND => $tjmatch[1],
965
                    ];
966
                    if (isset($tjmatch[2]) && trim($tjmatch[2])) {
967
                        $subcommand[] = [
968
                            self::TYPE => 'n',
969
                            self::OPERATOR => '',
970
                            self::COMMAND => $tjmatch[2],
971
                        ];
972
                    }
973
                    $command = substr($command, \strlen($tjmatch[0]));
974
                }
975
                // Search for hexadecimal <> format
                if (preg_match('/^ *<([0-9a-f\s]*)> *(-?[\d.]+)?/i', $command, $tjmatch)) {
                    $tjmatch[1] = preg_replace('/\s/', '', $tjmatch[1]);
                    $subcommand[] = [
                        self::TYPE => '<',
                        self::OPERATOR => 'TJ',
                        self::COMMAND => $tjmatch[1],
                    ];
                    if (isset($tjmatch[2]) && trim($tjmatch[2])) {
                        $subcommand[] = [
                            self::TYPE => 'n',
                            self::OPERATOR => '',
                            self::COMMAND => $tjmatch[2],
                        ];
                    }
                    $command = substr($command, \strlen($tjmatch[0]));
                }
            } while ($command != $oldCommand);

            $command = $subcommand;
        } elseif ('Tj' == $operator || "'" == $operator || '"' == $operator) {
            // Depending on the string type, trim the data of the
            // appropriate delimiters
            if ('(' == $type) {
                // Don't use trim() here since a () string may end with
                // a balanced or escaped right parenthesis, and trim()
                // will delete both. Both strings below are valid:
                //   e.g. (String())
                //   e.g. (String\))
                $command = preg_replace('/^\(|\)$/', '', $command);
            } elseif ('<' == $type) {
                $command = trim($command, '<>');
            }
        } elseif ('/' == $type) {
            $command = substr($command, 1);
        }

        $commands[] = [
            self::TYPE => $type,
            self::OPERATOR => $operator,
            self::COMMAND => $command,
        ];

        return $commands;
    }

    public static function factory(
        Document $document,
        Header $header,
        ?string $content,
        Config $config = null
    ): self {
        switch ($header->get('Type')->getContent()) {
            case 'XObject':
                switch ($header->get('Subtype')->getContent()) {
                    case 'Image':
                        // Scrutinizer note: $config defaults to null, and
                        // getRetainImageContent() does not exist on null.
                        return new Image($document, $header, $config->getRetainImageContent() ? $content : null, $config);

                    case 'Form':
                        return new Form($document, $header, $content, $config);
                }

                return new self($document, $header, $content, $config);

            case 'Pages':
                return new Pages($document, $header, $content, $config);

            case 'Page':
                return new Page($document, $header, $content, $config);

            case 'Encoding':
                return new Encoding($document, $header, $content, $config);

            case 'Font':
                $subtype = $header->get('Subtype')->getContent();
                $classname = '\Smalot\PdfParser\Font\Font'.$subtype;

                if (class_exists($classname)) {
                    return new $classname($document, $header, $content, $config);
                }

                return new Font($document, $header, $content, $config);

            default:
                return new self($document, $header, $content, $config);
        }
    }

    /**
     * Returns unique id identifying the object.
     */
    protected function getUniqueId(): string
    {
        return spl_object_hash($this);
    }
}