Passed
Pull Request — master (#682)
by Konrad
03:15
created

PDFObject::cleanContent()   B

Complexity

Conditions 11
Paths 64

Size

Total Lines 57
Code Lines 31

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 32
CRAP Score 11.0033

Importance

Changes 0
Metric Value
cc 11
eloc 31
c 0
b 0
f 0
nc 64
nop 2
dl 0
loc 57
ccs 32
cts 33
cp 0.9697
crap 11.0033
rs 7.3166
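
The CRAP score in the table can be reproduced from the complexity and coverage figures. A minimal sketch, assuming the commonly cited Change Risk Anti-Patterns formula (complexity squared, times the cube of the uncovered fraction, plus complexity) and reading ccs/cts as covered/total statements; crapScore() is a hypothetical helper, not part of Scrutinizer or PdfParser:

```php
<?php

/**
 * Hypothetical helper: CRAP score from cyclomatic complexity and
 * coverage expressed as a fraction between 0 and 1.
 */
function crapScore(int $complexity, float $coverage): float
{
    return $complexity ** 2 * (1 - $coverage) ** 3 + $complexity;
}

// cc = 11, with 32 of 33 statements covered (ccs / cts from the table).
printf("%.4f\n", crapScore(11, 32 / 33));
```

With these inputs the result is roughly 11.003, in line with the reported 11.0033; because coverage is high, nearly all of the score comes from the complexity term.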

How to fix: Long Method, Complexity

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign that the commented part should be extracted into a new method, with the comment as a starting point for that method's name.

Commonly applied refactorings include Extract Method.
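
A minimal sketch of what extracting a method could look like for cleanContent(), which repeats the same "find matches, then overwrite each match with a filler character" step several times. The helper name maskMatches() and its signature are hypothetical; this is not the library's code:

```php
<?php

/**
 * Hypothetical helper: overwrite every match of capture group $group
 * with $char, keeping the overall string length unchanged.
 */
function maskMatches(string $pattern, string $content, string $char, int $group = 1): string
{
    preg_match_all($pattern, $content, $matches, PREG_OFFSET_CAPTURE);
    foreach ($matches[$group] as [$text, $offset]) {
        $length = strlen($text);
        $content = substr_replace($content, str_repeat($char, $length), $offset, $length);
    }

    return $content;
}

// The "Clean content in round brackets" step then becomes a single call.
echo maskMatches('/\((.*?)\)/s', '(Hello) Tj', 'X'), PHP_EOL;
```

Each commented step in the original method ("Remove image bloc...", "Clean content in square brackets...") would map onto one such call, which would likely also reduce the number of execution paths counted for the method.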

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 *
 * @date    2017-01-03
 *
 * @license LGPLv3
 *
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser;

use Smalot\PdfParser\XObject\Form;
use Smalot\PdfParser\XObject\Image;

/**
 * Class PDFObject
 */
class PDFObject
{
    public const TYPE = 't';

    public const OPERATOR = 'o';

    public const COMMAND = 'c';

    /**
     * The recursion stack.
     *
     * @var array
     */
    public static $recursionStack = [];

    /**
     * @var Document|null
     */
    protected $document;

    /**
     * @var Header
     */
    protected $header;

    /**
     * @var string
     */
    protected $content;

    /**
     * @var Config|null
     */
    protected $config;

    /**
     * @var bool
     */
    protected $addPositionWhitespace = false;

    public function __construct(
        Document $document,
        ?Header $header = null,
        ?string $content = null,
        ?Config $config = null
    ) {
        $this->document = $document;
        $this->header = $header ?? new Header();
        $this->content = $content;
        $this->config = $config;
    }

    public function init()
    {
    }

    public function getDocument(): Document
    {
        return $this->document;
Bug Best Practice
The expression return $this->document could return the type null which is incompatible with the type-hinted return Smalot\PdfParser\Document. Consider adding an additional type-check to rule them out.
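
One way to add the suggested type check is sketched below, under the assumption that a missing document should be treated as a programming error. The classes are simplified stand-ins so the example runs on its own; the class name PDFObjectSketch and the choice of \LogicException are illustrative, not the library's behaviour:

```php
<?php

// Stand-in for Smalot\PdfParser\Document, so the sketch is self-contained.
class Document
{
}

class PDFObjectSketch
{
    /** @var Document|null */
    protected $document;

    public function __construct(?Document $document = null)
    {
        $this->document = $document;
    }

    public function getDocument(): Document
    {
        // Explicit check: fail with a clear message instead of a TypeError
        // raised by the non-nullable return type.
        if (null === $this->document) {
            throw new \LogicException('No document is attached to this object.');
        }

        return $this->document;
    }
}

$object = new PDFObjectSketch(new Document());
var_dump($object->getDocument() instanceof Document);
```

In the class as listed, the constructor already requires a Document, so the alternative of narrowing the property's @var annotation to Document may be enough to satisfy the check.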
    }

    public function getHeader(): ?Header
    {
        return $this->header;
    }

    public function getConfig(): ?Config
    {
        return $this->config;
    }

    /**
     * @return Element|PDFObject|Header
     */
    public function get(string $name)
    {
        return $this->header->get($name);
    }

    public function has(string $name): bool
    {
        return $this->header->has($name);
    }

    public function getDetails(bool $deep = true): array
    {
        return $this->header->getDetails($deep);
    }

    public function getContent(): ?string
    {
        return $this->content;
    }

    /**
     * Creates a duplicate of the document stream with
     * strings and other items replaced by $char. Formerly
     * getSectionsText() used this output to more easily gather offset
     * values to extract text from the *actual* document stream.
     *
     * @deprecated function is no longer used and will be removed in a future release
     *
     * @internal
     */
    public function cleanContent(string $content, string $char = 'X')
    {
        $char = $char[0];
        $content = str_replace(['\\\\', '\\)', '\\('], $char.$char, $content);

        // Remove image bloc with binary content
        preg_match_all('/\s(BI\s.*?(\sID\s).*?(\sEI))\s/s', $content, $matches, \PREG_OFFSET_CAPTURE);
        foreach ($matches[0] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        // Clean content in square brackets [.....]
        preg_match_all('/\[((\(.*?\)|[0-9\.\-\s]*)*)\]/s', $content, $matches, \PREG_OFFSET_CAPTURE);
Unused Code
The call to preg_match_all() has too many arguments starting with PREG_OFFSET_CAPTURE. (Ignorable by Annotation)

If this is a false-positive, you can also ignore this issue in your code via the ignore-call annotation:

        /** @scrutinizer ignore-call */
        preg_match_all('/\[((\(.*?\)|[0-9\.\-\s]*)*)\]/s', $content, $matches, \PREG_OFFSET_CAPTURE);

This check compares calls to functions or methods with their respective definitions. If the call has more arguments than are defined, it raises an issue.

If a function is defined several times with a different number of parameters, the check may pick up the wrong definition and report false positives. One codebase where this has been known to happen is WordPress. Please note the @ignore annotation hint above.
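
For context, PHP's built-in preg_match_all() does accept a flags argument in fourth position, so this report looks like the kind of false positive the note describes. A small sketch of what \PREG_OFFSET_CAPTURE changes in the result, using a made-up input string:

```php
<?php

$content = '[(AB) -20 (C)] TJ';

// With PREG_OFFSET_CAPTURE every match becomes a [text, byte offset] pair
// instead of a plain string.
preg_match_all('/\((.*?)\)/s', $content, $matches, PREG_OFFSET_CAPTURE);

foreach ($matches[1] as [$text, $offset]) {
    printf("%s at offset %d\n", $text, $offset);
}
```

These pairs are what cleanContent() reads as $part[0] (the text) and $part[1] (the offset) when it overwrites each match in place.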
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        // Clean content in round brackets (.....)
        preg_match_all('/\((.*?)\)/s', $content, $matches, \PREG_OFFSET_CAPTURE);
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        // Clean structure
        if ($parts = preg_split('/(<|>)/s', $content, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE)) {
Bug
It seems like $content can also be of type array; however, parameter $subject of preg_split() does only seem to accept string, maybe add an additional type check? (Ignorable by Annotation)

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation:

        if ($parts = preg_split('/(<|>)/s', /** @scrutinizer ignore-type */ $content, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE)) {
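
The array type presumably comes from substr_replace(), which is declared as returning string|array (an array only comes back when its first argument is an array). A hedged sketch of the extra type check the message asks for, on a made-up input; the exception type is an arbitrary choice:

```php
<?php

$content = '(x) <<a>>';

// Mask the single character inside the parentheses.
$masked = substr_replace($content, 'X', 1, 1);

// Narrow string|array to string before handing the value to preg_split().
if (!is_string($masked)) {
    throw new \UnexpectedValueException('Expected a string result.');
}

$parts = preg_split('/(<|>)/s', $masked, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
print_r($parts);
```

Since cleanContent() declares $content as string and never passes an array to substr_replace(), the guard would never fire in practice; it only makes the type explicit for the analyzer.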
            $content = '';
            $level = 0;
            foreach ($parts as $part) {
                if ('<' == $part) {
                    ++$level;
                }

                $content .= (0 == $level ? $part : str_repeat($char, \strlen($part)));

                if ('>' == $part) {
                    --$level;
                }
            }
        }

        // Clean BDC and EMC markup
        preg_match_all(
            '/(\/[A-Za-z0-9\_]*\s*'.preg_quote($char).'*BDC)/s',
            $content,
            $matches,
            \PREG_OFFSET_CAPTURE
        );
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        preg_match_all('/\s(EMC)\s/s', $content, $matches, \PREG_OFFSET_CAPTURE);
        foreach ($matches[1] as $part) {
            $content = substr_replace($content, str_repeat($char, \strlen($part[0])), $part[1], \strlen($part[0]));
        }

        return $content;
    }

    /**
     * Takes a string of PDF document stream text and formats
     * it into a multi-line string with one PDF command on each line,
     * separated by \r\n. If the given string is null, or binary data
     * is detected instead of a document stream then return an empty
     * string.
     */
    private function formatContent(?string $content): string
    {
        if (null === $content) {
            return '';
        }

        // Find all strings () and replace them so they aren't affected
        // by the next steps
        $pdfstrings = [];
        $attempt = '(';
        while (preg_match('/'.preg_quote($attempt, '/').'.*?(?<![^\\\\]\\\\)\)/s', $content, $text)) {
            // PDF strings can contain unescaped parentheses as long as
            // they're balanced, so check for balanced parentheses
            $left = preg_match_all('/(?<![^\\\\]\\\\)\(/', $text[0]);
            $right = preg_match_all('/(?<![^\\\\]\\\\)\)/', $text[0]);

            if ($left == $right) {
                // Replace the string with a unique placeholder
                $id = uniqid('STRING_', true);
                $pdfstrings[$id] = $text[0];
                $content = preg_replace(
                    '/'.preg_quote($text[0], '/').'/',
                    '@@@'.$id.'@@@',
                    $content,
                    1
                );

                // Reset to search for the next string
                $attempt = '(';
            } else {
                // We had unbalanced parentheses, so use the current
                // match as a base to find a longer string
                $attempt = $text[0];
            }
        }

        // Remove all carriage returns and line-feeds from the document stream
        $content = str_replace(["\r", "\n"], ' ', trim($content));

        // Find all dictionary << >> commands and replace them so they
        // aren't affected by the next steps
        $dictstore = [];
        while (preg_match('/(<<.*?>> *)(BDC|BMC|DP|MP)/', $content, $dicttext)) {
            $dictid = uniqid('DICT_', true);
            $dictstore[$dictid] = $dicttext[1];
            $content = preg_replace(
                '/'.preg_quote($dicttext[0], '/').'/',
                ' ###'.$dictid.'###'.$dicttext[2],
                $content,
                1
            );
        }

        // Now that all strings and dictionaries are hidden, the only
        // PDF commands left should all be plain text.
        // Detect text encoding of the current string to prevent reading
        // content streams that are images, etc. This prevents PHP
        // error messages when JPEG content is sent to this function
        // by the sample file '12249.pdf' from:
        // https://github.com/smalot/pdfparser/issues/458
        if (false === mb_detect_encoding($content, null, true)) {
            return '';
        }

        // Normalize white-space in the document stream
        $content = preg_replace('/\s{2,}/', ' ', $content);

        // Find all valid PDF operators and add \r\n after each; this
        // ensures there is just one command on every line
        // Source: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf - Appendix A
        // Source: https://archive.org/download/pdf320002008/PDF32000_2008.pdf - Annex A
        // Note: PDF Reference 1.7 lists 'I' and 'rI' as valid commands, while
        //       PDF 32000:2008 lists them as 'i' and 'ri' respectively. Both versions
        //       appear here in the list for completeness.
        $operators = [
          'b*', 'b', 'BDC', 'BMC', 'B*', 'BI', 'BT', 'BX', 'B', 'cm', 'cs', 'c', 'CS',
          'd0', 'd1', 'd', 'Do', 'DP', 'EMC', 'EI', 'ET', 'EX', 'f*', 'f', 'F', 'gs',
          'g', 'G',  'h', 'i', 'ID', 'I', 'j', 'J', 'k', 'K', 'l', 'm', 'MP', 'M', 'n',
          'q', 'Q', 're', 'rg', 'ri', 'rI', 'RG', 'scn', 'sc', 'sh', 's', 'SCN', 'SC',
          'S', 'T*', 'Tc', 'Td', 'TD', 'Tf', 'TJ', 'Tj', 'TL', 'Tm', 'Tr', 'Ts', 'Tw',
          'Tz', 'v', 'w', 'W*', 'W', 'y', '\'', '"',
        ];
        foreach ($operators as $operator) {
            $content = preg_replace(
                '/(?<!\w|\/)'.preg_quote($operator, '/').'(?![\w10\*])/',
                $operator."\r\n",
                $content
            );
        }

        // Restore the original content of the dictionary << >> commands
        $dictstore = array_reverse($dictstore, true);
        foreach ($dictstore as $id => $dict) {
            $content = str_replace('###'.$id.'###', $dict, $content);
        }

        // Restore the original string content
        $pdfstrings = array_reverse($pdfstrings, true);
        foreach ($pdfstrings as $id => $text) {
            // Strings may contain escaped newlines, or literal newlines
            // and we should clean these up before replacing the string
            // back into the content stream; this ensures no strings are
            // split between two lines (every command must be on one line)
            $text = str_replace(
                ["\\\r\n", "\\\r", "\\\n", "\r", "\n"],
                ['', '', '', '\r', '\n'],
                $text
            );

            $content = str_replace('@@@'.$id.'@@@', $text, $content);
        }

        $content = trim(preg_replace(['/(\r\n){2,}/', '/\r\n +/'], "\r\n", $content));

        return $content;
    }

    /**
     * getSectionsText() now takes an entire, unformatted
     * document stream as a string, cleans it, then filters out
     * commands that aren't needed for text positioning/extraction. It
     * returns an array of unprocessed PDF commands, one command per
     * element.
     *
     * @internal
     */
    public function getSectionsText(?string $content): array
    {
        $sections = [];

        // A cleaned stream has one command on every line, so split the
        // cleaned stream content on \r\n into an array
        $textCleaned = preg_split(
            '/(\r\n|\n|\r)/',
            $this->formatContent($content),
            -1,
            \PREG_SPLIT_NO_EMPTY
        );

        $inTextBlock = false;
        foreach ($textCleaned as $line) {
            $line = trim($line);

            // Skip empty lines
            if ('' === $line) {
                continue;
            }

            // If a 'BT' is encountered, set the $inTextBlock flag
            if (preg_match('/BT$/', $line)) {
                $inTextBlock = true;
                $sections[] = $line;

            // If an 'ET' is encountered, unset the $inTextBlock flag
            } elseif ('ET' == $line) {
                $inTextBlock = false;
                $sections[] = $line;
            } elseif ($inTextBlock) {
                // If we are inside a BT ... ET text block, save all lines
                $sections[] = trim($line);
            } else {
                // Otherwise, if we are outside of a text block, only
                // save specific, necessary lines. Care should be taken
                // to ensure a command being checked for *only* matches
                // that command. For instance, a simple search for 'c'
                // may also match the 'sc' command. See the command
                // list in the formatContent() method above.
                // Add more commands to save here as you find them in
                // weird PDFs!
                if ('q' == $line[-1] || 'Q' == $line[-1]) {
                    // Save and restore graphics state commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)B[DM]C$/', $line)) {
                    // Begin marked content sequence
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)[DM]P$/', $line)) {
                    // Marked content point
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)EMC$/', $line)) {
                    // End marked content sequence
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)cm$/', $line)) {
                    // Graphics position change commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)Tf$/', $line)) {
                    // Font change commands
                    $sections[] = $line;
                } elseif (preg_match('/(?<!\w)Do$/', $line)) {
                    // Invoke named XObject command
                    $sections[] = $line;
                }
            }
        }

        return $sections;
    }

    private function getDefaultFont(?Page $page = null): Font
    {
        $fonts = [];
        if (null !== $page) {
            $fonts = $page->getFonts();
        }

        $firstFont = $this->document->getFirstFont();
Bug
The method getFirstFont() does not exist on null. (Ignorable by Annotation)

If this is a false-positive, you can also ignore this issue in your code via the ignore-call annotation:

        /** @scrutinizer ignore-call */
        $firstFont = $this->document->getFirstFont();

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.
        if (null !== $firstFont) {
            $fonts[] = $firstFont;
        }

        if (\count($fonts) > 0) {
            return reset($fonts);
        }

        return new Font($this->document, null, null, $this->config);
Bug
It seems like $this->document can also be of type null; however, parameter $document of Smalot\PdfParser\Font::__construct() does only seem to accept Smalot\PdfParser\Document, maybe add an additional type check? (Ignorable by Annotation)

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation:

        return new Font(/** @scrutinizer ignore-type */ $this->document, null, null, $this->config);
    }

    /**
     * Decode a '[]TJ' command and attempt to use alternate
     * fonts if the current font results in output that contains
     * Unicode control characters.
     *
     * @internal
     *
     * @param array<int,array<string,string|bool>> $command
     */
    private function getTJUsingFontFallback(Font $font, array $command, ?Page $page = null, float $fontFactor = 4): string
    {
        $orig_text = $font->decodeText($command, $fontFactor);
        $text = $orig_text;

        // If we make this a Config option, we can add a check if it's
        // enabled here.
        if (null !== $page) {
            $font_ids = array_keys($page->getFonts());

            // If the decoded text contains UTF-8 control characters
            // then the font page being used is probably the wrong one.
            // Loop through the rest of the fonts to see if we can get
            // a good decode. Allow x09 to x0d which are whitespace.
            while (preg_match('/[\x00-\x08\x0e-\x1f\x7f]/u', $text) || false !== strpos(bin2hex($text), '00')) {
                // If we're out of font IDs, then give up and use the
                // original string
                if (0 == \count($font_ids)) {
                    return $orig_text;
                }

                // Try the next font ID
                $font = $page->getFont(array_shift($font_ids));
                $text = $font->decodeText($command, $fontFactor);
            }
        }

        return $text;
    }

    /**
     * Expects a string that is a full PDF dictionary object,
     * including the outer enclosing << >> angle brackets
     *
     * @internal
     *
     * @throws \Exception
     */
    public function parseDictionary(string $dictionary): array
    {
        // Normalize whitespace
        $dictionary = preg_replace(['/\r/', '/\n/', '/\s{2,}/'], ' ', trim($dictionary));

        if ('<<' != substr($dictionary, 0, 2)) {
            throw new \Exception('Not a valid dictionary object.');
        }

        $parsed = [];
        $stack = [];
        $currentName = '';
        $arrayTypeNumeric = false;

        // Remove outer layer of dictionary, and split on tokens
        $split = preg_split(
            '/(<<|>>|\[|\]|\/[^\s\/\[\]\(\)<>]*)/',
            trim(preg_replace('/^<<|>>$/', '', $dictionary)),
            -1,
            \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE
        );

        foreach ($split as $token) {
            $token = trim($token);
            switch ($token) {
                case '':
                    break;

                    // Open numeric array
                case '[':
                    $parsed[$currentName] = [];
                    $arrayTypeNumeric = true;

                    // Move up one level in the stack
                    $stack[\count($stack)] = &$parsed;
                    $parsed = &$parsed[$currentName];
                    $currentName = '';
                    break;

                    // Open hashed array
                case '<<':
                    $parsed[$currentName] = [];
                    $arrayTypeNumeric = false;

                    // Move up one level in the stack
                    $stack[\count($stack)] = &$parsed;
                    $parsed = &$parsed[$currentName];
                    $currentName = '';
                    break;

                    // Close numeric array
                case ']':
                    // Revert string type arrays back to a single element
                    if (\is_array($parsed) && 1 == \count($parsed)
                        && isset($parsed[0]) && \is_string($parsed[0])
                        && '' !== $parsed[0] && '/' != $parsed[0][0]) {
                        $parsed = '['.$parsed[0].']';
                    }
                    // Close hashed array
                    // no break
                case '>>':
                    $arrayTypeNumeric = false;

                    // Move down one level in the stack
                    $parsed = &$stack[\count($stack) - 1];
                    unset($stack[\count($stack) - 1]);
                    break;

                default:
                    // If value begins with a slash, then this is a name
                    // Add it to the appropriate array
                    if ('/' == substr($token, 0, 1)) {
                        $currentName = substr($token, 1);
                        if (true == $arrayTypeNumeric) {
Coding Style Best Practice
It seems like you are loosely comparing two booleans. Consider using the strict comparison === instead.

When comparing two booleans, it is generally considered safer to use the strict comparison operator.
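
A short illustration of the difference, using values made up for this example. For two real booleans == and === give the same answer; the two only diverge once one operand is not a boolean, which is the risk the check is guarding against:

```php
<?php

$arrayTypeNumeric = false;

// Loose comparison coerces the string to a boolean first.
var_dump(true == 'numeric');

// Strict comparison also compares types, so a string never equals true.
var_dump(true === 'numeric');

// With a real boolean both operators agree; === documents the intent.
var_dump(true == $arrayTypeNumeric, true === $arrayTypeNumeric);
```

Since $arrayTypeNumeric is only ever assigned true or false in parseDictionary(), switching to === there would not be expected to change behaviour.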
                            $parsed[] = $currentName;
                            $currentName = '';
                        }
                    } elseif ('' != $currentName) {
                        if (false == $arrayTypeNumeric) {
Coding Style Best Practice
It seems like you are loosely comparing two booleans. Consider using the strict comparison === instead.

When comparing two booleans, it is generally considered safer to use the strict comparison operator.
                            $parsed[$currentName] = $token;
                        }
                        $currentName = '';
                    } elseif ('' == $currentName) {
                        $parsed[] = $token;
                    }
            }
        }

        return $parsed;
    }

    /**
     * Returns the text content of a PDF as a string. Attempts to add
     * whitespace for spacing and line-breaks where appropriate.
     *
     * getText() leverages getTextArray() to get the content
     * of the document, setting the addPositionWhitespace flag to true
     * so whitespace is inserted in a logical way for reading by
     * humans.
     */
    public function getText(?Page $page = null): string
    {
        $this->addPositionWhitespace = true;
        $result = $this->getTextArray($page);
        $this->addPositionWhitespace = false;

        return implode('', $result).' ';
    }
582
583
    /**
584
     * Returns the text content of a PDF as an array of strings. No
585
     * extra whitespace is inserted besides what is actually encoded in
586
     * the PDF text.
587
     *
588
     * @throws \Exception
589
     */
590 44
    public function getTextArray(?Page $page = null): array
591
    {
592 44
        $result = [];
593 44
        $text = [];
594
595 44
        $marked_stack = [];
596 44
        $last_written_position = false;
597
598 44
        $sections = $this->getSectionsText($this->content);
599 44
        $current_font = $this->getDefaultFont($page);
600 44
        $current_font_size = 1;
601 44
        $current_text_leading = 0;
602
603 44
        $current_position = ['x' => false, 'y' => false];
604 44
        $current_position_tm = [
605 44
            'a' => 1, 'b' => 0, 'c' => 0,
606 44
            'i' => 0, 'j' => 1, 'k' => 0,
607 44
            'x' => 0, 'y' => 0, 'z' => 1,
608 44
        ];
609 44
        $current_position_td = ['x' => 0, 'y' => 0];
610 44
        $current_position_cm = [
611 44
            'a' => 1, 'b' => 0, 'c' => 0,
612 44
            'i' => 0, 'j' => 1, 'k' => 0,
613 44
            'x' => 0, 'y' => 0, 'z' => 1,
614 44
        ];
615
616 44
        $clipped_font = [];
617 44
        $clipped_position_cm = [];
618
619 44
        self::$recursionStack[] = $this->getUniqueId();
620
621 44
        foreach ($sections as $section) {
622 41
            $commands = $this->getCommandsText($section);
623 41
            foreach ($commands as $command) {
624 41
                switch ($command[self::OPERATOR]) {
625
                    // Begin text object
626 41
                    case 'BT':
627
                        // Reset text positioning matrices
628 41
                        $current_position_tm = [
629 41
                            'a' => 1, 'b' => 0, 'c' => 0,
630 41
                            'i' => 0, 'j' => 1, 'k' => 0,
631 41
                            'x' => 0, 'y' => 0, 'z' => 1,
632 41
                        ];
633 41
                        $current_position_td = ['x' => 0, 'y' => 0];
634 41
                        $current_text_leading = 0;
635 41
                        break;
636
637
                        // Begin marked content sequence with property list
638 41
                    case 'BDC':
639 16
                        if (preg_match('/(<<.*>>)$/', $command[self::COMMAND], $match)) {
640 16
                            $dict = $this->parseDictionary($match[1]);
641
642
                            // Check for ActualText block
643 16
                            if (isset($dict['ActualText']) && \is_string($dict['ActualText']) && '' !== $dict['ActualText']) {
644 4
                                if ('[' == $dict['ActualText'][0]) {
645
                                    // Simulate a 'TJ' command on the stack
646
                                    $marked_stack[] = [
647
                                        'ActualText' => $this->getCommandsText($dict['ActualText'].'TJ')[0],
648
                                    ];
649 4
                                } elseif ('<' == $dict['ActualText'][0] || '(' == $dict['ActualText'][0]) {
650
                                    // Simulate a 'Tj' command on the stack
651 4
                                    $marked_stack[] = [
652 4
                                        'ActualText' => $this->getCommandsText($dict['ActualText'].'Tj')[0],
653 4
                                    ];
654
                                }
655
                            }
656
                        }
657 16
                        break;
658
659
                        // Begin marked content sequence
660 41
                    case 'BMC':
661 2
                        if ('ReversedChars' == $command[self::COMMAND]) {
662
                            // Upon encountering a ReversedChars command,
663
                            // add the characters we've built up so far to
664
                            // the result array
665 1
                            $result = array_merge($result, $text);
666
667
                            // Start a fresh $text array that will contain
668
                            // reversed characters
669 1
                            $text = [];
670
671
                            // Add the reversed text flag to the stack
672 1
                            $marked_stack[] = ['ReversedChars' => true];
673
                        }
674 2
                        break;
675
676
                        // set graphics position matrix
                    case 'cm':
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $current_position_cm = [
                            'a' => (float) $args[0], 'b' => (float) $args[1], 'c' => 0,
                            'i' => (float) $args[2], 'j' => (float) $args[3], 'k' => 0,
                            'x' => (float) $args[4], 'y' => (float) $args[5], 'z' => 1,
                        ];
                        break;

                    case 'Do':
                        if (null !== $page) {
                            $args = preg_split('/\s/s', $command[self::COMMAND]);
                            $id = trim(array_pop($args), '/ ');
                            $xobject = $page->getXObject($id);

                            // @todo $xobject could be an ElementXRef object, which would then throw an error
                            if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack)) {
                                // Not a circular reference.
                                $text[] = $xobject->getText($page);
                            }
                        }
                        break;

                        // Marked content point with (DP) & without (MP) property list
                    case 'DP':
                    case 'MP':
                        break;

                        // End text object
                    case 'ET':
                        break;

                        // Store current selected font and graphics matrix
                    case 'q':
                        $clipped_font[] = [$current_font, $current_font_size];
                        $clipped_position_cm[] = $current_position_cm;
                        break;

                        // Restore previous selected font and graphics matrix
                    case 'Q':
                        list($current_font, $current_font_size) = array_pop($clipped_font);
                        $current_position_cm = array_pop($clipped_position_cm);
                        break;

                        // End marked content sequence
                    case 'EMC':
                        $data = false;
                        if (\count($marked_stack)) {
                            $marked = array_pop($marked_stack);
                            $action = key($marked);
                            $data = $marked[$action];

                            switch ($action) {
                                // If we are in ReversedChars mode...
                                case 'ReversedChars':
                                    // Reverse the characters we've built up so far
                                    foreach ($text as $key => $t) {
                                        $text[$key] = implode('', array_reverse(
                                            mb_str_split($t, 1, mb_internal_encoding())
Bug: It seems like mb_internal_encoding() can also be of type true; however, parameter $encoding of mb_str_split() does only seem to accept null|string, maybe add an additional type check?
If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation:
                                            mb_str_split($t, 1, /** @scrutinizer ignore-type */ mb_internal_encoding())
                                        ));
                                    }

                                    // Add these characters to the result array
                                    $result = array_merge($result, $text);

                                    // Start a fresh $text array that will contain
                                    // non-reversed characters
                                    $text = [];
                                    break;

                                case 'ActualText':
                                    // Use the content of the ActualText as a command
                                    $command = $data;
                                    break;
                            }
                        }

                        // If this EMC command has been transformed into a 'Tj'
                        // or 'TJ' command because of being ActualText, then bypass
                        // the break to proceed to the writing section below.
                        if ('Tj' != $command[self::OPERATOR] && 'TJ' != $command[self::OPERATOR]) {
                            break;
                        }

                        // no break
                    case "'":
                    case '"':
                        if ("'" == $command[self::OPERATOR] || '"' == $command[self::OPERATOR]) {
                            // Move to next line and write text
                            $current_position['x'] = 0;
                            $current_position_td['x'] = 0;
                            $current_position_td['y'] += $current_text_leading;
                        }
                        // no break
                    case 'Tj':
                        $command[self::COMMAND] = [$command];
                        // no break
                    case 'TJ':
                        // Check the marked content stack for flags
                        $actual_text = false;
                        $reverse_text = false;
                        foreach ($marked_stack as $marked) {
                            if (isset($marked['ActualText'])) {
                                $actual_text = true;
                            }
                            if (isset($marked['ReversedChars'])) {
                                $reverse_text = true;
                            }
                        }

                        // Account for text position ONLY just before we write text
                        if (false === $actual_text && \is_array($last_written_position)) {
                            // If $last_written_position is an array, that
                            // means we have stored text position coordinates
                            // for placing an ActualText
                            $currentX = $last_written_position[0];
                            $currentY = $last_written_position[1];
                            $last_written_position = false;
                        } else {
                            $currentX = $current_position_cm['x'] + $current_position_tm['x'] + $current_position_td['x'];
                            $currentY = $current_position_cm['y'] + $current_position_tm['y'] + $current_position_td['y'];
                        }
                        $whiteSpace = '';

                        $factorX = -$current_font_size * $current_position_tm['a'] - $current_font_size * $current_position_tm['i'];
                        $factorY = $current_font_size * $current_position_tm['b'] + $current_font_size * $current_position_tm['j'];

                        if (true === $this->addPositionWhitespace && false !== $current_position['x']) {
                            $curY = $currentY - $current_position['y'];
                            if (abs($curY) >= abs($factorY) / 4) {
                                $whiteSpace = "\n";
                            } else {
                                if (true === $reverse_text) {
                                    $curX = $current_position['x'] - $currentX;
                                } else {
                                    $curX = $currentX - $current_position['x'];
                                }

                                // In abs($factorX * 7) below, the 7 is chosen arbitrarily
                                // as the number of apparent "spaces" in a document we
                                // would need before considering them a "tab". In the
                                // future, we might offer this value to users as a config
                                // option.
                                if ($curX >= abs($factorX * 7)) {
                                    $whiteSpace = "\t";
                                } elseif ($curX >= abs($factorX * 2)) {
                                    $whiteSpace = ' ';
                                }
                            }
                        }

                        $newtext = $this->getTJUsingFontFallback(
                            $current_font,
                            $command[self::COMMAND],
                            $page,
                            $factorX
                        );

                        // If there is no ActualText pending then write
                        if (false === $actual_text) {
                            $newtext = str_replace(["\r", "\n"], '', $newtext);
                            if (false !== $reverse_text) {
                                // If we are in ReversedChars mode, add the whitespace last
                                $text[] = preg_replace('/  $/', ' ', $newtext.$whiteSpace);
                            } else {
                                // Otherwise add the whitespace first
                                if (' ' === $whiteSpace && isset($text[\count($text) - 1])) {
                                    $text[\count($text) - 1] = preg_replace('/ $/', '', $text[\count($text) - 1]);
                                }
                                $text[] = preg_replace('/^[ \t]{2}/', ' ', $whiteSpace.$newtext);
                            }

                            // Record the position of this inserted text for comparison
                            // with the next text block.
                            // Provide a 'fudge' factor guess on how wide this text block
                            // is based on the number of characters. This helps limit the
                            // number of tabs inserted, but isn't perfect.
                            $factor = $factorX / 2;
                            $current_position = [
                                'x' => $currentX - mb_strlen($newtext) * $factor,
                                'y' => $currentY,
                            ];
                        } elseif (false === $last_written_position) {
                            // If there is an ActualText in the pipeline
                            // store the position this undisplayed text
                            // *would* have been written to, so the
                            // ActualText is displayed in the right spot
                            $last_written_position = [$currentX, $currentY];
                            $current_position['x'] = $currentX;
                        }
                        break;

                        // move to start of next line
                    case 'T*':
                        $current_position['x'] = 0;
                        $current_position_td['x'] = 0;
                        $current_position_td['y'] += $current_text_leading;
                        break;

                        // set character spacing
                    case 'Tc':
                        break;

                        // move text current point (Td), or move it and set leading (TD)
                    case 'Td':
                    case 'TD':
                        // move text current point
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $y = (float) array_pop($args);
                        $x = (float) array_pop($args);

                        if ('TD' == $command[self::OPERATOR]) {
                            $current_text_leading = -$y * $current_position_tm['b'] - $y * $current_position_tm['j'];
                        }

                        $current_position_td = [
                            'x' => $current_position_td['x'] + $x * $current_position_tm['a'] + $x * $current_position_tm['i'],
                            'y' => $current_position_td['y'] + $y * $current_position_tm['b'] + $y * $current_position_tm['j'],
                        ];
                        break;

                    case 'Tf':
                        $args = preg_split('/\s/s', $command[self::COMMAND]);
                        $size = (float) array_pop($args);
                        $id = trim(array_pop($args), '/');
                        if (null !== $page) {
                            $new_font = $page->getFont($id);
                            // If an invalid font ID is given, do not update the font.
                            // This should theoretically never happen, as the PDF spec states for the Tf operator:
                            // "The specified font value shall match a resource name in the Font entry of the default resource dictionary"
                            // (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, page 435)
                            // But we want to make sure that malformed PDFs do not simply crash.
                            if (null !== $new_font) {
                                $current_font = $new_font;
                                $current_font_size = $size;
                            }
                        }
                        break;

                        // set leading
                    case 'TL':
                        $y = (float) $command[self::COMMAND];
                        $current_text_leading = -$y * $current_position_tm['b'] + -$y * $current_position_tm['j'];
                        break;

                        // set text position matrix
                    case 'Tm':
                        $args = preg_split('/\s+/s', $command[self::COMMAND]);
                        $current_position_tm = [
                            'a' => (float) $args[0], 'b' => (float) $args[1], 'c' => 0,
                            'i' => (float) $args[2], 'j' => (float) $args[3], 'k' => 0,
                            'x' => (float) $args[4], 'y' => (float) $args[5], 'z' => 1,
                        ];
                        break;

                        // set text rendering mode
                    case 'Tr':
                        break;

                        // set super/subscripting text rise
                    case 'Ts':
                        break;

                        // set word spacing
                    case 'Tw':
                        break;

                        // set horizontal scaling
                    case 'Tz':
                        break;

                    default:
                }
            }
        }

        $result = array_merge($result, $text);

        return $result;
    }
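
The whitespace decision inside the `TJ` branch above can be read in isolation. The following is a minimal sketch, not part of PdfParser: `chooseWhitespace()` and its parameter names are hypothetical, and only the three thresholds (a quarter of `$factorY`, seven and two times `$factorX`) are taken from the listing.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of PdfParser): picks the separator to put
 * in front of a text chunk, given its offset from the last written chunk.
 *
 * $dx, $dy           offset of the new chunk from the recorded position
 * $factorX, $factorY font-size scale factors, computed as in the listing
 */
function chooseWhitespace(float $dx, float $dy, float $factorX, float $factorY): string
{
    // A vertical jump of at least a quarter of the scaled font size
    // is treated as a new line.
    if (abs($dy) >= abs($factorY) / 4) {
        return "\n";
    }

    // About seven apparent "spaces" of horizontal gap count as a tab.
    if ($dx >= abs($factorX * 7)) {
        return "\t";
    }

    // About two apparent "spaces" of horizontal gap count as one space.
    if ($dx >= abs($factorX * 2)) {
        return ' ';
    }

    // Otherwise the chunks are considered adjacent.
    return '';
}
```

Pulling the thresholds into one function like this would also be a natural place to expose the "7" as a config option, which the comment in the listing mentions as a possible future change.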

    /**
     * getCommandsText() expects the content of $text_part to be an
     * already formatted, single-line command from a document stream.
     * The companion function getSectionsText() returns a document
     * stream as an array of single commands for just this purpose.
     * Because of this, the argument $offset is no longer used, and
     * may be removed in a future PdfParser release.
     *
     * A better name for this function would be getCommandText()
     * since it now always works on just one command.
     */
    public function getCommandsText(string $text_part, int &$offset = 0): array
Unused Code: The parameter $offset is not used and could be removed.
If this is a false-positive, you can also ignore this issue in your code via the ignore-unused annotation:
    public function getCommandsText(string $text_part, /** @scrutinizer ignore-unused */ int &$offset = 0): array
This check looks for parameters that have been defined for a function or method, but which are not used in the method body.
    {
        $commands = $matches = [];

        preg_match('/^(([\/\[\(<])?.*)(?<!\w)([a-z01\'\"*]+)$/i', $text_part, $matches);

        // If no valid command is detected, return an empty array
        if (!isset($matches[1]) || !isset($matches[2]) || !isset($matches[3])) {
            return [];
        }

        $type = $matches[2];
        $operator = $matches[3];
        $command = trim($matches[1]);

        if ('TJ' == $operator) {
            $subcommand = [];
            $command = trim($command, '[]');
            do {
                $oldCommand = $command;

                // Search for parentheses string () format
                if (preg_match('/^ *\((.*?)(?<![^\\\\]\\\\)\) *(-?[\d.]+)?/', $command, $tjmatch)) {
                    $subcommand[] = [
                        self::TYPE => '(',
                        self::OPERATOR => 'TJ',
                        self::COMMAND => $tjmatch[1],
                    ];
                    if (isset($tjmatch[2]) && trim($tjmatch[2])) {
                        $subcommand[] = [
                            self::TYPE => 'n',
                            self::OPERATOR => '',
                            self::COMMAND => $tjmatch[2],
                        ];
                    }
                    $command = substr($command, \strlen($tjmatch[0]));
                }

                // Search for hexadecimal <> format
                if (preg_match('/^ *<([0-9a-f\s]*)> *(-?[\d.]+)?/i', $command, $tjmatch)) {
                    $tjmatch[1] = preg_replace('/\s/', '', $tjmatch[1]);
                    $subcommand[] = [
                        self::TYPE => '<',
                        self::OPERATOR => 'TJ',
                        self::COMMAND => $tjmatch[1],
                    ];
                    if (isset($tjmatch[2]) && trim($tjmatch[2])) {
                        $subcommand[] = [
                            self::TYPE => 'n',
                            self::OPERATOR => '',
                            self::COMMAND => $tjmatch[2],
                        ];
                    }
                    $command = substr($command, \strlen($tjmatch[0]));
                }
            } while ($command != $oldCommand);

            $command = $subcommand;
        } elseif ('Tj' == $operator || "'" == $operator || '"' == $operator) {
            // Depending on the string type, trim the data of the
            // appropriate delimiters
            if ('(' == $type) {
                // Don't use trim() here since a () string may end with
                // a balanced or escaped right parenthesis, and trim()
                // will delete both. Both strings below are valid:
                //   eg. (String())
                //   eg. (String\))
                $command = preg_replace('/^\(|\)$/', '', $command);
            } elseif ('<' == $type) {
                $command = trim($command, '<>');
            }
        } elseif ('/' == $type) {
            $command = substr($command, 1);
        }

        $commands[] = [
            self::TYPE => $type,
            self::OPERATOR => $operator,
            self::COMMAND => $command,
        ];

        return $commands;
    }
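
To see what the first `preg_match()` in `getCommandsText()` captures, the split can be tried outside the class. This is only a sketch: `splitCommand()` and its array keys are hypothetical names, and the regular expression is copied from the listing. The later trimming of delimiters and the `TJ` sub-command loop are deliberately left out.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical standalone helper: splits one single-line content-stream
 * command into its leading delimiter, its operand text and its operator.
 *
 * @return array{type: string, operator: string, operands: string}|null
 */
function splitCommand(string $line): ?array
{
    // Same pattern as in the listing: group 2 is an optional leading
    // delimiter (/ [ ( <), group 3 is the trailing operator, which must
    // not be directly preceded by a word character.
    $pattern = '/^(([\/\[\(<])?.*)(?<!\w)([a-z01\'\"*]+)$/i';

    if (1 !== preg_match($pattern, $line, $m)) {
        return null;
    }

    return [
        'type' => $m[2] ?? '',
        'operator' => $m[3],
        'operands' => trim($m[1]),
    ];
}
```

For `(Hello) Tj` this yields type `(`, operator `Tj` and operands `(Hello)`; a line that does not end in an operator-like token returns `null`, mirroring the empty-array return in the listing.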

    public static function factory(
        Document $document,
        Header $header,
        ?string $content,
        ?Config $config = null
    ): self {
        switch ($header->get('Type')->getContent()) {
            case 'XObject':
                switch ($header->get('Subtype')->getContent()) {
                    case 'Image':
                        return new Image($document, $header, $config->getRetainImageContent() ? $content : null, $config);
Bug: The method getRetainImageContent() does not exist on null.
If this is a false-positive, you can also ignore this issue in your code via the ignore-call annotation:
                        return new Image($document, $header, $config->/** @scrutinizer ignore-call */ getRetainImageContent() ? $content : null, $config);
This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.
This is most likely a typographical error or the method has been renamed.

                    case 'Form':
                        return new Form($document, $header, $content, $config);
                }

                return new self($document, $header, $content, $config);

            case 'Pages':
                return new Pages($document, $header, $content, $config);

            case 'Page':
                return new Page($document, $header, $content, $config);

            case 'Encoding':
                return new Encoding($document, $header, $content, $config);

            case 'Font':
                $subtype = $header->get('Subtype')->getContent();
                $classname = '\Smalot\PdfParser\Font\Font'.$subtype;

                if (class_exists($classname)) {
                    return new $classname($document, $header, $content, $config);
                }

                return new Font($document, $header, $content, $config);

            default:
                return new self($document, $header, $content, $config);
        }
    }
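
Since `factory()` declares `?Config $config = null`, the `Image` branch calls a method on a value that may be null. One possible guard is sketched below. `DemoConfig` is a hypothetical stand-in for the library's `Config` class, `imageContentFor()` is a hypothetical helper, and treating a missing config as "retain the content" is an assumption made for this example, not something the listing confirms.

```php
<?php

declare(strict_types=1);

/** Hypothetical stand-in for the library's Config class, for illustration only. */
final class DemoConfig
{
    private bool $retainImageContent;

    public function __construct(bool $retainImageContent = true)
    {
        $this->retainImageContent = $retainImageContent;
    }

    public function getRetainImageContent(): bool
    {
        return $this->retainImageContent;
    }
}

/**
 * Decides which content to hand to an Image object.
 * Assumption: a missing config means "retain the content".
 */
function imageContentFor(?DemoConfig $config, ?string $content): ?string
{
    // Only drop the content when a config exists and explicitly opts out.
    if (null !== $config && !$config->getRetainImageContent()) {
        return null;
    }

    return $content;
}
```

If the opposite default is wanted (drop the content when no config is given), the condition would be inverted. Either way the null check happens before the method call.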

    /**
     * Returns unique id identifying the object.
     */
    protected function getUniqueId(): string
    {
        return spl_object_hash($this);
    }
}
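
The `ReversedChars` branch of the `EMC` case reverses each collected string character by character with `mb_str_split()`. A standalone sketch of that step follows. It also narrows the encoding to a string first, which is one way to address the type concern raised by the analyzer. `reverseMultibyte()` is a hypothetical name, and the `'UTF-8'` fallback is an assumption.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: reverses a string one character (not one byte)
 * at a time, so multibyte characters stay intact.
 */
function reverseMultibyte(string $text): string
{
    // Called without arguments, mb_internal_encoding() returns the current
    // encoding name. The guard covers the non-string case the analyzer
    // warns about.
    $encoding = mb_internal_encoding();
    if (!\is_string($encoding)) {
        $encoding = 'UTF-8'; // assumed fallback
    }

    return implode('', array_reverse(mb_str_split($text, 1, $encoding)));
}
```

A plain `strrev()` would not be a safe replacement here, because it reverses bytes and would break any multibyte character in the text.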