Passed
Pull Request — master (#693)
by
unknown
02:26
created

PDFObject::formatContent()   C

Complexity

Conditions 11
Paths 194

Size

Total Lines 138
Code Lines 67

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 76
CRAP Score 11.0002

Importance

Changes 1
Bugs 0 Features 0
Metric Value
cc 11
eloc 67
c 1
b 0
f 0
nc 194
nop 1
dl 0
loc 138
ccs 76
cts 77
cp 0.987
crap 11.0002
rs 5.9466

How to fix   Long Method    Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
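As a minimal sketch of that advice, the example below turns two commented steps into two small private methods and reuses each comment as the method name. The class `StreamNormalizer` and its method names are hypothetical and are not part of PdfParser; only the two one-line bodies mirror steps that appear in formatContent().

```php
<?php

declare(strict_types=1);

// Hypothetical illustration, not PdfParser code: each commented step of a
// long method becomes a small private method named after its comment.
final class StreamNormalizer
{
    public function normalize(string $content): string
    {
        $content = $this->removeLineBreaks($content);

        return $this->collapseWhitespace($content);
    }

    // Formerly the inline comment "Remove all carriage returns and line-feeds".
    private function removeLineBreaks(string $content): string
    {
        return str_replace(["\r", "\n"], ' ', trim($content));
    }

    // Formerly the inline comment "Normalize white-space".
    private function collapseWhitespace(string $content): string
    {
        // preg_replace() returns null on failure; fall back to the input.
        return preg_replace('/\s{2,}/', ' ', $content) ?? $content;
    }
}

echo (new StreamNormalizer())->normalize("BT\r\n/F1 12 Tf\n  ET"), PHP_EOL;
```

The calling method now reads as a list of named steps, and each step can be tested on its own.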

Commonly applied refactorings include:

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 *
 * @date    2017-01-03
 *
 * @license LGPLv3
 *
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser;

use Smalot\PdfParser\XObject\Form;
use Smalot\PdfParser\XObject\Image;

/**
 * Class PDFObject
 */
class PDFObject
{
    public const TYPE = 't';

    public const OPERATOR = 'o';

    public const COMMAND = 'c';

    /**
     * The recursion stack.
     *
     * @var array
     */
    public static $recursionStack = [];

    /**
     * @var Document|null
     */
    protected $document;

    /**
     * @var Header
     */
    protected $header;

    /**
     * @var string
     */
    protected $content;

    /**
     * @var Config|null
     */
    protected $config;

    /**
     * @var bool
     */
    protected $addPositionWhitespace = false;

    public function __construct(
        Document $document,
        ?Header $header = null,
        ?string $content = null,
        ?Config $config = null
    ) {
        $this->document = $document;
        $this->header = $header ?? new Header();
        $this->content = $content;
        $this->config = $config;
    }

    public function init()
    {
    }

    public function getDocument(): Document
    {
        return $this->document;
Bug (Best Practice): The expression return $this->document could return null, which is incompatible with the type-hinted return type Smalot\PdfParser\Document. Consider adding an additional type check to rule this out.
    }

    public function getHeader(): ?Header
    {
        return $this->header;
    }

    public function getConfig(): ?Config
    {
        return $this->config;
    }

    /**
     * @return Element|PDFObject|Header
     */
    public function get(string $name)
    {
        return $this->header->get($name);
    }

    public function has(string $name): bool
    {
        return $this->header->has($name);
    }

    public function getDetails(bool $deep = true): array
    {
        return $this->header->getDetails($deep);
    }

    public function getContent(): ?string
    {
        return $this->content;
    }

    /**
     * Creates a duplicate of the document stream with
     * strings and other items replaced by $char. Formerly
     * getSectionsText() used this output to more easily gather offset
     * values to extract text from the *actual* document stream.
     *
     * @deprecated function is no longer used and will be removed in a future release
     *
     * @internal
     */
    public function cleanContent(string $content, string $char = 'X')
    {
        $char = $char[0];
        $content = str_replace(['\\\\', '\\)', '\\('], $char.$char, $content);

        // Remove image bloc with binary content
        preg_match_all('/\s(BI\s.*?(\sID\s).*?(\sEI))\s/s', $content, $matches, \PREG_OFFSET_CAPTURE);
        foreach ($matches[0] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        // Clean content in square brackets [.....]
        preg_match_all('/\[((\(.*?\)|[0-9\.\-\s]*)*)\]/s', $content, $matches, \PREG_OFFSET_CAPTURE);
Unused Code: The call to preg_match_all() has too many arguments starting with PREG_OFFSET_CAPTURE. (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

        /** @scrutinizer ignore-call */
        preg_match_all('/\[((\(.*?\)|[0-9\.\-\s]*)*)\]/s', $content, $matches, \PREG_OFFSET_CAPTURE);

This check compares calls to functions or methods with their respective definitions. If the call has more arguments than are defined, it raises an issue.

If a function is defined several times with a different number of parameters, the check may pick up the wrong definition and report false positives. One codebase where this has been known to happen is WordPress. Please note the @ignore annotation hint above.
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        // Clean content in round brackets (.....)
        preg_match_all('/\((.*?)\)/s', $content, $matches, \PREG_OFFSET_CAPTURE);
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        // Clean structure
        if ($parts = preg_split('/(<|>)/s', $content, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE)) {
Bug: It seems like $content can also be of type array; however, parameter $subject of preg_split() only seems to accept string. Consider adding an additional type check. (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-type annotation:

        if ($parts = preg_split('/(<|>)/s', /** @scrutinizer ignore-type */ $content, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE)) {
            $content = '';
            $level = 0;
            foreach ($parts as $part) {
                if ('<' == $part) {
                    ++$level;
                }

                $content .= (0 == $level ? $part : str_repeat($char, \strlen($part)));

                if ('>' == $part) {
                    --$level;
                }
            }
        }

        // Clean BDC and EMC markup
        preg_match_all(
            '/(\/[A-Za-z0-9\_]*\s*'.preg_quote($char).'*BDC)/s',
            $content,
            $matches,
            \PREG_OFFSET_CAPTURE
        );
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        preg_match_all('/\s(EMC)\s/s', $content, $matches, \PREG_OFFSET_CAPTURE);
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        return $content;
    }

    /**
     * Takes a string of PDF document stream text and formats
     * it into a multi-line string with one PDF command on each line,
     * separated by \r\n. If the given string is null, or binary data
     * is detected instead of a document stream then return an empty
     * string.
     */
    private function formatContent(?string $content): string
    {
        if (null === $content) {
            return '';
        }

        // Outside of (String) content in PDF document streams, all
        // text should conform to UTF-8. Test for binary content by
        // deleting everything after the first open-parenthesis ( which
        // indicates the beginning of a string. Then test what remains
        // for valid UTF-8. If it's not UTF-8, return an empty string
        // as this $content is most likely binary.
        if (false === mb_check_encoding(preg_replace('/\(.*$/s', '', $content), 'UTF-8')) {
            return '';
        }

        // Find all strings () and replace them so they aren't affected
        // by the next steps
        $pdfstrings = [];
        $attempt = '(';
        while (preg_match('/'.preg_quote($attempt, '/').'.*?(?<![^\\\\]\\\\)\)/s', $content, $text)) {
            // PDF strings can contain unescaped parentheses as long as
            // they're balanced, so check for balanced parentheses
            $left = preg_match_all('/(?<![^\\\\]\\\\)\(/', $text[0]);
            $right = preg_match_all('/(?<![^\\\\]\\\\)\)/', $text[0]);

            if ($left == $right) {
                // Replace the string with a unique placeholder
                $id = uniqid('STRING_', true);
                $pdfstrings[$id] = $text[0];
                $content = preg_replace(
                    '/'.preg_quote($text[0], '/').'/',
                    '@@@'.$id.'@@@',
                    $content,
                    1
                );

                // Reset to search for the next string
                $attempt = '(';
            } else {
                // We had unbalanced parentheses, so use the current
                // match as a base to find a longer string
                $attempt = $text[0];
            }
        }

        // Find all inline image content and replace them so they aren't
        // affected by the next steps
        $pdfInlineImages = [];
        while (preg_match('/\sBI(.+?)\sID\s(.+?)\sEI(?=\s|$)/', $content, $text)) {
            $id = uniqid('IMAGE_', true);
            $pdfInlineImages[$id] = [$text[1], $text[2]];
            $content = preg_replace(
                '/'.preg_quote($text[0], '/').'/',
                '^^^'.$id.'^^^',
                $content,
                1
            );
        }

        // Remove all carriage returns and line-feeds from the document stream
        $content = str_replace(["\r", "\n"], ' ', trim($content));

        // Find all dictionary << >> commands and replace them so they
        // aren't affected by the next steps
        $dictstore = [];
        while (preg_match('/(<<.*?>> *)(BDC|BMC|DP|MP)/', $content, $dicttext)) {
            $dictid = uniqid('DICT_', true);
            $dictstore[$dictid] = $dicttext[1];
            $content = preg_replace(
                '/'.preg_quote($dicttext[0], '/').'/',
                ' ###'.$dictid.'###'.$dicttext[2],
                $content,
                1
            );
        }

        // Normalize white-space in the document stream
        $content = preg_replace('/\s{2,}/', ' ', $content);

        // Find all valid PDF operators and add \r\n after each; this
        // ensures there is just one command on every line
        // Source: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf - Appendix A
        // Source: https://archive.org/download/pdf320002008/PDF32000_2008.pdf - Annex A
        // Note: PDF Reference 1.7 lists 'I' and 'rI' as valid commands, while
        //       PDF 32000:2008 lists them as 'i' and 'ri' respectively. Both versions
        //       appear here in the list for completeness.
        $operators = [
          'b*', 'b', 'BDC', 'BMC', 'B*', 'BI', 'BT', 'BX', 'B', 'cm', 'cs', 'c', 'CS',
          'd0', 'd1', 'd', 'Do', 'DP', 'EMC', 'EI', 'ET', 'EX', 'f*', 'f', 'F', 'gs',
          'g', 'G',  'h', 'i', 'ID', 'I', 'j', 'J', 'k', 'K', 'l', 'm', 'MP', 'M', 'n',
          'q', 'Q', 're', 'rg', 'ri', 'rI', 'RG', 'scn', 'sc', 'sh', 's', 'SCN', 'SC',
          'S', 'T*', 'Tc', 'Td', 'TD', 'Tf', 'TJ', 'Tj', 'TL', 'Tm', 'Tr', 'Ts', 'Tw',
          'Tz', 'v', 'w', 'W*', 'W', 'y', '\'', '"',
        ];
        foreach ($operators as $operator) {
            $content = preg_replace(
                '/(?<!\w|\/)'.preg_quote($operator, '/').'(?![\w10\*])/',
                $operator."\r\n",
                $content
            );
        }

        // Restore the original content of the dictionary << >> commands
        $dictstore = array_reverse($dictstore, true);
        foreach ($dictstore as $id => $dict) {
            $content = str_replace('###'.$id.'###', $dict, $content);
        }

        // Restore the original content of any inline images
        $pdfInlineImages = array_reverse($pdfInlineImages, true);
        foreach ($pdfInlineImages as $id => $image) {
            $content = str_replace(
                '^^^'.$id.'^^^',
                "\r\nBI\r\n".$image[0]."\r\nID\r\n".$image[1]."\r\nEI\r\n",
                $content
            );
        }

        // Restore the original string content
        $pdfstrings = array_reverse($pdfstrings, true);
        foreach ($pdfstrings as $id => $text) {
            // Strings may contain escaped newlines, or literal newlines
            // and we should clean these up before replacing the string
            // back into the content stream; this ensures no strings are
            // split between two lines (every command must be on one line)
            $text = str_replace(
                ["\\\r\n", "\\\r", "\\\n", "\r", "\n"],
                ['', '', '', '\r', '\n'],
                $text
            );

            $content = str_replace('@@@'.$id.'@@@', $text, $content);
        }

        $content = trim(preg_replace(['/(\r\n){2,}/', '/\r\n +/'], "\r\n", $content));

        return $content;
    }

    /**
     * getSectionsText() now takes an entire, unformatted
     * document stream as a string, cleans it, then filters out
     * commands that aren't needed for text positioning/extraction. It
     * returns an array of unprocessed PDF commands, one command per
     * element.
     *
     * @internal
     */
    public function getSectionsText(?string $content): array
    {
        $sections = [];

        // A cleaned stream has one command on every line, so split the
        // cleaned stream content on \r\n into an array
        $textCleaned = preg_split(
            '/(\r\n|\n|\r)/',
            $this->formatContent($content),
            -1,
            \PREG_SPLIT_NO_EMPTY
        );

        $inTextBlock = false;
        foreach ($textCleaned as $line) {
            $line = trim($line);

            // Skip empty lines
            if ('' === $line) {
                continue;
            }

            // If a 'BT' is encountered, set the $inTextBlock flag
            if (preg_match('/BT$/', $line)) {
                $inTextBlock = true;
                $sections[] = $line;

            // If an 'ET' is encountered, unset the $inTextBlock flag
            } elseif ('ET' == $line) {
                $inTextBlock = false;
                $sections[] = $line;
            } elseif ($inTextBlock) {
                // If we are inside a BT ... ET text block, save all lines
                $sections[] = trim($line);
            } else {
                // Otherwise, if we are outside of a text block, only
                // save specific, necessary lines. Care should be taken
                // to ensure a command being checked for *only* matches
                // that command. For instance, a simple search for 'c'
                // may also match the 'sc' command. See the command
                // list in the formatContent() method above.
                // Add more commands to save here as you find them in
                // weird PDFs!
                if ('q' == $line[-1] || 'Q' == $line[-1]) {
                    // Save and restore graphics state commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)B[DM]C$/', $line)) {
                    // Begin marked content sequence
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)[DM]P$/', $line)) {
                    // Marked content point
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)EMC$/', $line)) {
                    // End marked content sequence
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)cm$/', $line)) {
                    // Graphics position change commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)Tf$/', $line)) {
                    // Font change commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)Do$/', $line)) {
                    // Invoke named XObject command
                    $sections[] = $line;
                }
            }
        }

        return $sections;
    }

    private function getDefaultFont(?Page $page = null): Font
    {
        $fonts = [];
        if (null !== $page) {
            $fonts = $page->getFonts();
        }

        $firstFont = $this->document->getFirstFont();
Bug: The method getFirstFont() does not exist on null. (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

        /** @scrutinizer ignore-call */
        $firstFont = $this->document->getFirstFont();

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.
        if (null !== $firstFont) {
            $fonts[] = $firstFont;
        }

        if (\count($fonts) > 0) {
            return reset($fonts);
        }

        return new Font($this->document, null, null, $this->config);
Bug: It seems like $this->document can also be of type null; however, parameter $document of Smalot\PdfParser\Font::__construct() only seems to accept Smalot\PdfParser\Document. Consider adding an additional type check. (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-type annotation:

        return new Font(/** @scrutinizer ignore-type */ $this->document, null, null, $this->config);
    }

    /**
     * Decode a '[]TJ' command and attempt to use alternate
     * fonts if the current font results in output that contains
     * Unicode control characters.
     *
     * @internal
     *
     * @param array<int,array<string,string|bool>> $command
     */
    private function getTJUsingFontFallback(Font $font, array $command, ?Page $page = null, float $fontFactor = 4): string
    {
        $orig_text = $font->decodeText($command, $fontFactor);
        $text = $orig_text;

        // If we make this a Config option, we can add a check if it's
        // enabled here.
        if (null !== $page) {
            $font_ids = array_keys($page->getFonts());

            // If the decoded text contains UTF-8 control characters
            // then the font page being used is probably the wrong one.
            // Loop through the rest of the fonts to see if we can get
            // a good decode. Allow x09 to x0d which are whitespace.
            while (preg_match('/[\x00-\x08\x0e-\x1f\x7f]/u', $text) || false !== strpos(bin2hex($text), '00')) {
                // If we're out of font IDs, then give up and use the
                // original string
                if (0 == \count($font_ids)) {
                    return $orig_text;
                }

                // Try the next font ID
                $font = $page->getFont(array_shift($font_ids));
                $text = $font->decodeText($command, $fontFactor);
            }
        }

        return $text;
    }

    /**
     * Expects a string that is a full PDF dictionary object,
     * including the outer enclosing << >> angle brackets
     *
     * @internal
     *
     * @throws \Exception
     */
    public function parseDictionary(string $dictionary): array
    {
        // Normalize whitespace
        $dictionary = preg_replace(['/\r/', '/\n/', '/\s{2,}/'], ' ', trim($dictionary));

        if ('<<' != substr($dictionary, 0, 2)) {
            throw new \Exception('Not a valid dictionary object.');
        }

        $parsed = [];
        $stack = [];
        $currentName = '';
        $arrayTypeNumeric = false;

        // Remove outer layer of dictionary, and split on tokens
        $split = preg_split(
            '/(<<|>>|\[|\]|\/[^\s\/\[\]\(\)<>]*)/',
            trim(preg_replace('/^<<|>>$/', '', $dictionary)),
            -1,
            \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE
        );

        foreach ($split as $token) {
            $token = trim($token);
            switch ($token) {
                case '':
                    break;

                    // Open numeric array
                case '[':
                    $parsed[$currentName] = [];
                    $arrayTypeNumeric = true;

                    // Move up one level in the stack
                    $stack[\count($stack)] = &$parsed;
                    $parsed = &$parsed[$currentName];
                    $currentName = '';
                    break;

                    // Open hashed array
                case '<<':
                    $parsed[$currentName] = [];
                    $arrayTypeNumeric = false;

                    // Move up one level in the stack
                    $stack[\count($stack)] = &$parsed;
                    $parsed = &$parsed[$currentName];
                    $currentName = '';
                    break;

                    // Close numeric array
                case ']':
                    // Revert string type arrays back to a single element
                    if (\is_array($parsed) && 1 == \count($parsed)
                        && isset($parsed[0]) && \is_string($parsed[0])
                        && '' !== $parsed[0] && '/' != $parsed[0][0]) {
                        $parsed = '['.$parsed[0].']';
                    }
                    // Close hashed array
                    // no break
                case '>>':
                    $arrayTypeNumeric = false;

                    // Move down one level in the stack
                    $parsed = &$stack[\count($stack) - 1];
                    unset($stack[\count($stack) - 1]);
                    break;

                default:
                    // If value begins with a slash, then this is a name
                    // Add it to the appropriate array
                    if ('/' == substr($token, 0, 1)) {
                        $currentName = substr($token, 1);
                        if (true == $arrayTypeNumeric) {
Coding Style (Best Practice): It seems like you are loosely comparing two booleans. Consider using the strict comparison === instead.

When comparing two booleans, it is generally considered safer to use the strict comparison operator.
                            $parsed[] = $currentName;
                            $currentName = '';
                        }
                    } elseif ('' != $currentName) {
                        if (false == $arrayTypeNumeric) {
Coding Style (Best Practice): It seems like you are loosely comparing two booleans. Consider using the strict comparison === instead.

When comparing two booleans, it is generally considered safer to use the strict comparison operator.
                            $parsed[$currentName] = $token;
                        }
                        $currentName = '';
                    } elseif ('' == $currentName) {
                        $parsed[] = $token;
                    }
            }
        }

        return $parsed;
    }

    /**
     * Returns the text content of a PDF as a string. Attempts to add
     * whitespace for spacing and line-breaks where appropriate.
     *
     * getText() leverages getTextArray() to get the content
     * of the document, setting the addPositionWhitespace flag to true
     * so whitespace is inserted in a logical way for reading by
     * humans.
     */
    public function getText(?Page $page = null): string
    {
        $this->addPositionWhitespace = true;
        $result = $this->getTextArray($page);
        $this->addPositionWhitespace = false;

        return implode('', $result).' ';
    }

    /**
     * Returns the text content of a PDF as an array of strings. No
     * extra whitespace is inserted besides what is actually encoded in
     * the PDF text.
     *
     * @throws \Exception
     */
    public function getTextArray(?Page $page = null): array
    {
        $result = [];
        $text = [];

        $marked_stack = [];
        $last_written_position = false;

        $sections = $this->getSectionsText($this->content);
        $current_font = $this->getDefaultFont($page);
        $current_font_size = 1;
        $current_text_leading = 0;

        $current_position = ['x' => false, 'y' => false];
        $current_position_tm = [
            'a' => 1, 'b' => 0, 'c' => 0,
            'i' => 0, 'j' => 1, 'k' => 0,
            'x' => 0, 'y' => 0, 'z' => 1,
        ];
        $current_position_td = ['x' => 0, 'y' => 0];
        $current_position_cm = [
            'a' => 1, 'b' => 0, 'c' => 0,
            'i' => 0, 'j' => 1, 'k' => 0,
            'x' => 0, 'y' => 0, 'z' => 1,
        ];

        $clipped_font = [];
        $clipped_position_cm = [];

        self::$recursionStack[] = $this->getUniqueId();

        foreach ($sections as $section) {
            $commands = $this->getCommandsText($section);
            foreach ($commands as $command) {
                switch ($command[self::OPERATOR]) {
648
                    // Begin text object
649 42
                    case 'BT':
650
                        // Reset text positioning matrices
651 42
                        $current_position_tm = [
652 42
                            'a' => 1, 'b' => 0, 'c' => 0,
653 42
                            'i' => 0, 'j' => 1, 'k' => 0,
654 42
                            'x' => 0, 'y' => 0, 'z' => 1,
655 42
                        ];
656 42
                        $current_position_td = ['x' => 0, 'y' => 0];
657 42
                        $current_text_leading = 0;
658 42
                        break;
659
660
                        // Begin marked content sequence with property list
661 42
                    case 'BDC':
662 16
                        if (preg_match('/(<<.*>>)$/', $command[self::COMMAND], $match)) {
663 16
                            $dict = $this->parseDictionary($match[1]);
664
665
                            // Check for ActualText block
666 16
                            if (isset($dict['ActualText']) && \is_string($dict['ActualText']) && '' !== $dict['ActualText']) {
667 4
                                if ('[' == $dict['ActualText'][0]) {
668
                                    // Simulate a 'TJ' command on the stack
669
                                    $marked_stack[] = [
670
                                        'ActualText' => $this->getCommandsText($dict['ActualText'].'TJ')[0],
671
                                    ];
672 4
                                } elseif ('<' == $dict['ActualText'][0] || '(' == $dict['ActualText'][0]) {
673
                                    // Simulate a 'Tj' command on the stack
674 4
                                    $marked_stack[] = [
675 4
                                        'ActualText' => $this->getCommandsText($dict['ActualText'].'Tj')[0],
676 4
                                    ];
677
                                }
678
                            }
679
                        }
680 16
                        break;
681
682
                        // Begin marked content sequence
                    case 'BMC':
                        if ('ReversedChars' == $command[self::COMMAND]) {
                            // Upon encountering a ReversedChars command,
                            // add the characters we've built up so far to
                            // the result array
                            $result = array_merge($result, $text);

                            // Start a fresh $text array that will contain
                            // reversed characters
                            $text = [];

                            // Add the reversed text flag to the stack
                            $marked_stack[] = ['ReversedChars' => true];
                        }
                        break;

                        // set graphics position matrix
                    case 'cm':
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $current_position_cm = [
                            'a' => (float) $args[0], 'b' => (float) $args[1], 'c' => 0,
                            'i' => (float) $args[2], 'j' => (float) $args[3], 'k' => 0,
                            'x' => (float) $args[4], 'y' => (float) $args[5], 'z' => 1,
                        ];
                        break;

                    case 'Do':
                        if (null !== $page) {
                            $args = preg_split('/\s/s', $command[self::COMMAND]);
                            $id = trim(array_pop($args), '/ ');
                            $xobject = $page->getXObject($id);

                            // @todo $xobject could be an ElementXRef object, which would then throw an error
                            if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack, true)) {
                                // Not a circular reference.
                                $text[] = $xobject->getText($page);
                            }
                        }
                        break;

                        // Marked content point with (DP) & without (MP) property list
                    case 'DP':
                    case 'MP':
                        break;

                        // End text object
                    case 'ET':
                        break;

                        // Store current selected font and graphics matrix
                    case 'q':
                        $clipped_font[] = [$current_font, $current_font_size];
                        $clipped_position_cm[] = $current_position_cm;
                        break;

                        // Restore previous selected font and graphics matrix
                    case 'Q':
                        list($current_font, $current_font_size) = array_pop($clipped_font);
                        $current_position_cm = array_pop($clipped_position_cm);
                        break;

                        // End marked content sequence
                    case 'EMC':
                        $data = false;
                        if (\count($marked_stack)) {
                            $marked = array_pop($marked_stack);
                            $action = key($marked);
                            $data = $marked[$action];

                            switch ($action) {
                                // If we are in ReversedChars mode...
                                case 'ReversedChars':
                                    // Reverse the characters we've built up so far
                                    foreach ($text as $key => $t) {
                                        $text[$key] = implode('', array_reverse(
                                            mb_str_split($t, 1, mb_internal_encoding())
                                        ));
                                    }

                                    // Add these characters to the result array
                                    $result = array_merge($result, $text);

                                    // Start a fresh $text array that will contain
                                    // non-reversed characters
                                    $text = [];
                                    break;

                                case 'ActualText':
                                    // Use the content of the ActualText as a command
                                    $command = $data;
                                    break;
                            }
                        }

                        // If this EMC command has been transformed into a 'Tj'
                        // or 'TJ' command because of being ActualText, then bypass
                        // the break to proceed to the writing section below.
                        if ('Tj' != $command[self::OPERATOR] && 'TJ' != $command[self::OPERATOR]) {
                            break;
                        }

                        // no break
                    case "'":
                    case '"':
                        if ("'" == $command[self::OPERATOR] || '"' == $command[self::OPERATOR]) {
                            // Move to next line and write text
                            $current_position['x'] = 0;
                            $current_position_td['x'] = 0;
                            $current_position_td['y'] += $current_text_leading;
                        }
                        // no break
                    case 'Tj':
                        $command[self::COMMAND] = [$command];
                        // no break
                    case 'TJ':
                        // Check the marked content stack for flags
                        $actual_text = false;
                        $reverse_text = false;
                        foreach ($marked_stack as $marked) {
                            if (isset($marked['ActualText'])) {
                                $actual_text = true;
                            }
                            if (isset($marked['ReversedChars'])) {
                                $reverse_text = true;
                            }
                        }

                        // Account for text position ONLY just before we write text
                        if (false === $actual_text && \is_array($last_written_position)) {
                            // If $last_written_position is an array, that
                            // means we have stored text position coordinates
                            // for placing an ActualText
                            $currentX = $last_written_position[0];
                            $currentY = $last_written_position[1];
                            $last_written_position = false;
                        } else {
                            $currentX = $current_position_cm['x'] + $current_position_tm['x'] + $current_position_td['x'];
                            $currentY = $current_position_cm['y'] + $current_position_tm['y'] + $current_position_td['y'];
                        }
                        $whiteSpace = '';

                        $factorX = -$current_font_size * $current_position_tm['a'] - $current_font_size * $current_position_tm['i'];
                        $factorY = $current_font_size * $current_position_tm['b'] + $current_font_size * $current_position_tm['j'];

                        if (true === $this->addPositionWhitespace && false !== $current_position['x']) {
                            $curY = $currentY - $current_position['y'];
                            if (abs($curY) >= abs($factorY) / 4) {
                                $whiteSpace = "\n";
                            } else {
                                if (true === $reverse_text) {
                                    $curX = $current_position['x'] - $currentX;
                                } else {
                                    $curX = $currentX - $current_position['x'];
                                }

                                // In abs($factorX * 7) below, the 7 is chosen arbitrarily
                                // as the number of apparent "spaces" in a document we
                                // would need before considering them a "tab". In the
                                // future, we might offer this value to users as a config
                                // option.
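                                //
                                // Worked example (illustrative only, assuming a 12pt
                                // font and an identity text matrix, i.e. 'a' = 1,
                                // 'b' = 0, 'i' = 0, 'j' = 1): $factorX is -12 and
                                // $factorY is 12. A vertical move of at least
                                // 12 / 4 = 3 units then yields "\n"; otherwise a
                                // horizontal gap of at least 12 * 7 = 84 units yields
                                // "\t", and a gap of at least 12 * 2 = 24 units yields
                                // a single space.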
                                if ($curX >= abs($factorX * 7)) {
                                    $whiteSpace = "\t";
                                } elseif ($curX >= abs($factorX * 2)) {
                                    $whiteSpace = ' ';
                                }
                            }
                        }

                        $newtext = $this->getTJUsingFontFallback(
                            $current_font,
                            $command[self::COMMAND],
                            $page,
                            $factorX
                        );

                        // If there is no ActualText pending then write
                        if (false === $actual_text) {
                            $newtext = str_replace(["\r", "\n"], '', $newtext);
                            if (false !== $reverse_text) {
                                // If we are in ReversedChars mode, add the whitespace last
                                $text[] = preg_replace('/  $/', ' ', $newtext.$whiteSpace);
                            } else {
                                // Otherwise add the whitespace first
                                if (' ' === $whiteSpace && isset($text[\count($text) - 1])) {
                                    $text[\count($text) - 1] = preg_replace('/ $/', '', $text[\count($text) - 1]);
                                }
                                $text[] = preg_replace('/^[ \t]{2}/', ' ', $whiteSpace.$newtext);
                            }

                            // Record the position of this inserted text for comparison
                            // with the next text block.
                            // Provide a 'fudge' factor guess on how wide this text block
                            // is based on the number of characters. This helps limit the
                            // number of tabs inserted, but isn't perfect.
                            $factor = $factorX / 2;
                            $current_position = [
                                'x' => $currentX - mb_strlen($newtext) * $factor,
                                'y' => $currentY,
                            ];
                        } elseif (false === $last_written_position) {
                            // If there is an ActualText in the pipeline
                            // store the position this undisplayed text
                            // *would* have been written to, so the
                            // ActualText is displayed in the right spot
                            $last_written_position = [$currentX, $currentY];
                            $current_position['x'] = $currentX;
                        }
                        break;

                        // move to start of next line
                    case 'T*':
                        $current_position['x'] = 0;
                        $current_position_td['x'] = 0;
                        $current_position_td['y'] += $current_text_leading;
                        break;

                        // set character spacing
                    case 'Tc':
                        break;

                        // move text current point and set leading
                    case 'Td':
                    case 'TD':
                        // move text current point
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $y = (float) array_pop($args);
                        $x = (float) array_pop($args);

                        if ('TD' == $command[self::OPERATOR]) {
                            $current_text_leading = -$y * $current_position_tm['b'] - $y * $current_position_tm['j'];
                        }

                        $current_position_td = [
                            'x' => $current_position_td['x'] + $x * $current_position_tm['a'] + $x * $current_position_tm['i'],
                            'y' => $current_position_td['y'] + $y * $current_position_tm['b'] + $y * $current_position_tm['j'],
                        ];
                        break;

                    case 'Tf':
                        $args = preg_split('/\s/s', $command[self::COMMAND]);
                        $size = (float) array_pop($args);
                        $id = trim(array_pop($args), '/');
                        if (null !== $page) {
                            $new_font = $page->getFont($id);
                            // If an invalid font ID is given, do not update the font.
                            // This should theoretically never happen, as the PDF spec states for the Tf operator:
                            // "The specified font value shall match a resource name in the Font entry of the default resource dictionary"
                            // (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, page 435)
                            // But we want to make sure that malformed PDFs do not simply crash.
                            if (null !== $new_font) {
                                $current_font = $new_font;
                                $current_font_size = $size;
                            }
                        }
                        break;

                        // set leading
                    case 'TL':
                        $y = (float) $command[self::COMMAND];
                        $current_text_leading = -$y * $current_position_tm['b'] + -$y * $current_position_tm['j'];
                        break;

                        // set text position matrix
                    case 'Tm':
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $current_position_tm = [
                            'a' => (float) $args[0], 'b' => (float) $args[1], 'c' => 0,
                            'i' => (float) $args[2], 'j' => (float) $args[3], 'k' => 0,
                            'x' => (float) $args[4], 'y' => (float) $args[5], 'z' => 1,
                        ];
                        break;

                        // set text rendering mode
                    case 'Tr':
                        break;

                        // set super/subscripting text rise
                    case 'Ts':
                        break;

                        // set word spacing
                    case 'Tw':
                        break;

                        // set horizontal scaling
                    case 'Tz':
                        break;

                    default:
                }
            }
        }

        $result = array_merge($result, $text);

        return $result;
    }

    /**
     * getCommandsText() expects the content of $text_part to be an
     * already formatted, single-line command from a document stream.
     * The companion function getSectionsText() returns a document
     * stream as an array of single commands for just this purpose.
     * Because of this, the argument $offset is no longer used, and
     * may be removed in a future PdfParser release.
     *
     * A better name for this function would be getCommandText()
     * since it now always works on just one command.
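     *
     * Illustrative examples, traced by hand through the regular
     * expressions in the method body (a sketch, not an exhaustive
     * specification of the return format):
     *
     *     $this->getCommandsText('(Hello) Tj');
     *     // [[self::TYPE => '(', self::OPERATOR => 'Tj', self::COMMAND => 'Hello']]
     *
     *     $this->getCommandsText('[(Hel) -20 (lo)] TJ');
     *     // One command with self::TYPE '[' and self::OPERATOR 'TJ', whose
     *     // self::COMMAND holds three sub-commands: the string 'Hel', the
     *     // numeric offset '-20' (self::TYPE 'n'), and the string 'lo'.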
     */
    public function getCommandsText(string $text_part, int &$offset = 0): array
    {
        $commands = $matches = [];

        preg_match('/^(([\/\[\(<])?.*)(?<!\w)([a-z01\'\"*]+)$/i', $text_part, $matches);

        // If no valid command is detected, return an empty array
        if (!isset($matches[1]) || !isset($matches[2]) || !isset($matches[3])) {
            return [];
        }

        $type = $matches[2];
        $operator = $matches[3];
        $command = trim($matches[1]);

        if ('TJ' == $operator) {
            $subcommand = [];
            $command = trim($command, '[]');
            do {
                $oldCommand = $command;

                // Search for parentheses string () format
                if (preg_match('/^ *\((.*?)(?<![^\\\\]\\\\)\) *(-?[\d.]+)?/', $command, $tjmatch)) {
                    $subcommand[] = [
                        self::TYPE => '(',
                        self::OPERATOR => 'TJ',
                        self::COMMAND => $tjmatch[1],
                    ];
                    if (isset($tjmatch[2]) && trim($tjmatch[2])) {
                        $subcommand[] = [
                            self::TYPE => 'n',
                            self::OPERATOR => '',
                            self::COMMAND => $tjmatch[2],
                        ];
                    }
                    $command = substr($command, \strlen($tjmatch[0]));
                }

                // Search for hexadecimal <> format
                if (preg_match('/^ *<([0-9a-f\s]*)> *(-?[\d.]+)?/i', $command, $tjmatch)) {
                    $tjmatch[1] = preg_replace('/\s/', '', $tjmatch[1]);
                    $subcommand[] = [
                        self::TYPE => '<',
                        self::OPERATOR => 'TJ',
                        self::COMMAND => $tjmatch[1],
                    ];
                    if (isset($tjmatch[2]) && trim($tjmatch[2])) {
                        $subcommand[] = [
                            self::TYPE => 'n',
                            self::OPERATOR => '',
                            self::COMMAND => $tjmatch[2],
                        ];
                    }
                    $command = substr($command, \strlen($tjmatch[0]));
                }
            } while ($command != $oldCommand);

            $command = $subcommand;
        } elseif ('Tj' == $operator || "'" == $operator || '"' == $operator) {
            // Depending on the string type, trim the data of the
            // appropriate delimiters
            if ('(' == $type) {
                // Don't use trim() here since a () string may end with
                // a balanced or escaped right parenthesis, and trim()
                // will delete both. Both strings below are valid:
                //   e.g. (String())
                //   e.g. (String\))
                $command = preg_replace('/^\(|\)$/', '', $command);
            } elseif ('<' == $type) {
                $command = trim($command, '<>');
            }
        } elseif ('/' == $type) {
            $command = substr($command, 1);
        }

        $commands[] = [
            self::TYPE => $type,
            self::OPERATOR => $operator,
            self::COMMAND => $command,
        ];

        return $commands;
    }

    public static function factory(
        Document $document,
        Header $header,
        ?string $content,
        ?Config $config = null
    ): self {
        switch ($header->get('Type')->getContent()) {
            case 'XObject':
                switch ($header->get('Subtype')->getContent()) {
                    case 'Image':
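                        // $config is nullable, so guard the call. Assumption: when no
                        // Config is passed, the image content is retained.
                        return new Image($document, $header, (null === $config || $config->getRetainImageContent()) ? $content : null, $config);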

                    case 'Form':
                        return new Form($document, $header, $content, $config);
                }

                return new self($document, $header, $content, $config);

            case 'Pages':
                return new Pages($document, $header, $content, $config);

            case 'Page':
                return new Page($document, $header, $content, $config);

            case 'Encoding':
                return new Encoding($document, $header, $content, $config);

            case 'Font':
                $subtype = $header->get('Subtype')->getContent();
                $classname = '\Smalot\PdfParser\Font\Font'.$subtype;

                if (class_exists($classname)) {
                    return new $classname($document, $header, $content, $config);
                }

                return new Font($document, $header, $content, $config);

            default:
                return new self($document, $header, $content, $config);
        }
    }

    /**
     * Returns unique id identifying the object.
     */
    protected function getUniqueId(): string
    {
        return spl_object_hash($this);
    }
}