Passed
Push — master ( 7964d2...ac8e66 )
by Konrad
12:58
created

RawDataParser::decodeXref() (rating: D)

Complexity

Conditions 16
Paths 200

Size

Total Lines 67
Code Lines 39

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 37
CRAP Score 16.0046

Importance

Changes 3
Bugs 1
Features 1

Metric Value
cc 16
eloc 39
c 3
b 1
f 1
nc 200
nop 3
dl 0
loc 67
ccs 37
cts 38
cp 0.9737
crap 16.0046
rs 4.7333
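
The CRAP score in the table can be reproduced from the complexity and coverage values. The sketch below uses the commonly cited definition, comp^2 * (1 - cov)^3 + comp. It assumes that `ccs` and `cts` are the covered and total statement counts (37 / 38 matches the reported `cp` of 0.9737). The function name `crapScore` is illustrative and not part of any tool's API.

```php
<?php

/**
 * CRAP score as commonly defined: comp^2 * (1 - cov)^3 + comp.
 *
 * @param int   $complexity cyclomatic complexity of the method ("cc" in the table)
 * @param float $coverage   covered fraction between 0.0 and 1.0 ("cp" in the table)
 */
function crapScore(int $complexity, float $coverage): float
{
    return $complexity ** 2 * (1.0 - $coverage) ** 3 + $complexity;
}

// values taken from the table: cc = 16, ccs = 37 (assumed covered), cts = 38 (assumed total)
$score = crapScore(16, 37 / 38);
printf("%.6f\n", $score);
```

With these inputs the result is roughly 16.00467, which agrees with the reported 16.0046 up to rounding of the last digit. At this coverage level the score is almost entirely the complexity term, so lowering it means reducing the 16 conditions rather than adding tests.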

How to fix

Long Method

Small methods make your code easier to understand, particularly when they have a good name. A small method is also usually much easier to name well.

For example, if you find yourself adding comments to a method's body, that is usually a good sign that the commented part should be extracted into a new method, with the comment as a starting point for the new method's name.

Commonly applied refactorings include Extract Method.
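
As a minimal sketch of that extraction, the block of decodeXref() commented "parse trailer_data" could become its own function. The name `parseTrailerData` is hypothetical and does not exist in the library. The regular expressions are the ones used in the listing, and the loop over Root, Encrypt and Info is a simplification introduced here, not the author's code.

```php
<?php

/**
 * Hypothetical helper extracted from decodeXref(): parses the text found
 * between "trailer <<" and ">>" into the keys the original method fills.
 *
 * @return array<string,mixed> trailer entries that were present in the input
 */
function parseTrailerData(string $trailerData): array
{
    $trailer = [];

    if (preg_match('/Size[\s]+([0-9]+)/i', $trailerData, $matches) > 0) {
        $trailer['size'] = (int) $matches[1];
    }

    // Root, Encrypt and Info are all indirect references of the form "N G R"
    foreach (['Root', 'Encrypt', 'Info'] as $key) {
        if (preg_match('/'.$key.'[\s]+([0-9]+)[\s]+([0-9]+)[\s]+R/i', $trailerData, $matches) > 0) {
            $trailer[strtolower($key)] = (int) $matches[1].'_'.(int) $matches[2];
        }
    }

    if (preg_match('/ID[\s]*[\[][\s]*[<]([^>]*)[>][\s]*[<]([^>]*)[>]/i', $trailerData, $matches) > 0) {
        $trailer['id'] = [$matches[1], $matches[2]];
    }

    return $trailer;
}

// usage: the caller keeps only the decision about whether to fill the trailer
$trailer = parseTrailerData('/Size 22 /Root 1 0 R /Info 2 0 R /ID [<abc> <def>]');
```

Moving the five trailer checks out of decodeXref() would remove five of its 16 conditions, leaving the method with the entry loop and the /Prev handling.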

<?php

/**
 * This file is based on code of tecnickcom/TCPDF PDF library.
 *
 * Original author Nicola Asuni ([email protected]) and
 * contributors (https://github.com/tecnickcom/TCPDF/graphs/contributors).
 *
 * @see https://github.com/tecnickcom/TCPDF
 *
 * Original code was licensed on the terms of the LGPL v3.
 *
 * ------------------------------------------------------------------------------
 *
 * @file This file is part of the PdfParser library.
 *
 * @author  Konrad Abicht <[email protected]>
 *
 * @date    2020-01-06
 *
 * @license LGPLv3
 *
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser\RawData;

use Smalot\PdfParser\Config;

class RawDataParser
{
    /**
     * @var Config
     */
    private $config;

    /**
     * Configuration array.
     *
     * @var array<string,bool>
     */
    protected $cfg = [
        // if `true` ignore filter decoding errors
        'ignore_filter_decoding_errors' => true,
        // if `true` ignore missing filter decoding errors
        'ignore_missing_filter_decoders' => true,
    ];

    protected $filterHelper;
    protected $objects;

    /**
     * @param array $cfg Configuration array, default is []
     */
    public function __construct($cfg = [], ?Config $config = null)
    {
        // merge given array with default values
        $this->cfg = array_merge($this->cfg, $cfg);

        $this->filterHelper = new FilterHelper();
        $this->config = $config ?: new Config();
    }

    /**
     * Decode the specified stream.
     *
     * @param string $pdfData PDF data
     * @param array  $sdic    Stream's dictionary array
     * @param string $stream  Stream to decode
     *
     * @return array containing decoded stream data and remaining filters
     *
     * @throws \Exception
     */
    protected function decodeStream(string $pdfData, array $xref, array $sdic, string $stream): array
    {
        // get stream length and filters
        $slength = \strlen($stream);
        if ($slength <= 0) {
            return ['', []];
        }
        $filters = [];
        foreach ($sdic as $k => $v) {
            if ('/' == $v[0]) {
                if (('Length' == $v[1]) && (isset($sdic[$k + 1])) && ('numeric' == $sdic[$k + 1][0])) {
                    // get declared stream length
                    $declength = (int) $sdic[$k + 1][1];
                    if ($declength < $slength) {
                        $stream = substr($stream, 0, $declength);
                        $slength = $declength;
                    }
                } elseif (('Filter' == $v[1]) && (isset($sdic[$k + 1]))) {
                    // resolve indirect object
                    $objval = $this->getObjectVal($pdfData, $xref, $sdic[$k + 1]);
                    if ('/' == $objval[0]) {
                        // single filter
                        $filters[] = $objval[1];
                    } elseif ('[' == $objval[0]) {
                        // array of filters
                        foreach ($objval[1] as $flt) {
                            if ('/' == $flt[0]) {
                                $filters[] = $flt[1];
                            }
                        }
                    }
                }
            }
        }

        // decode the stream
        $remaining_filters = [];
        foreach ($filters as $filter) {
            if (\in_array($filter, $this->filterHelper->getAvailableFilters(), true)) {
                try {
                    $stream = $this->filterHelper->decodeFilter($filter, $stream, $this->config->getDecodeMemoryLimit());
                } catch (\Exception $e) {
                    $emsg = $e->getMessage();
                    if ((('~' == $emsg[0]) && !$this->cfg['ignore_missing_filter_decoders'])
                        || (('~' != $emsg[0]) && !$this->cfg['ignore_filter_decoding_errors'])
                    ) {
                        throw new \Exception($e->getMessage());
                    }
                }
            } else {
                // add missing filter to array
                $remaining_filters[] = $filter;
            }
        }

        return [$stream, $remaining_filters];
    }

    /**
     * Decode the Cross-Reference section
     *
     * @param string $pdfData   PDF data
     * @param int    $startxref Offset at which the xref section starts (position of the 'xref' keyword)
     * @param array  $xref      Previous xref array (if any)
     *
     * @return array containing xref and trailer data
     *
     * @throws \Exception
     */
    protected function decodeXref(string $pdfData, int $startxref, array $xref = []): array
    {
        $startxref += 4; // 4 is the length of the word 'xref'
        // skip initial white space chars
        $offset = $startxref + strspn($pdfData, $this->config->getPdfWhitespaces(), $startxref);
        // initialize object number
        $obj_num = 0;
        // search for cross-reference entries or subsection
        while (preg_match('/([0-9]+)[\x20]([0-9]+)[\x20]?([nf]?)(\r\n|[\x20]?[\r\n])/', $pdfData, $matches, \PREG_OFFSET_CAPTURE, $offset) > 0) {
            if ($matches[0][1] != $offset) {
                // we are on another section
                break;
            }
            $offset += \strlen($matches[0][0]);
            if ('n' == $matches[3][0]) {
                // create unique object index: [object number]_[generation number]
                $index = $obj_num.'_'.(int) $matches[2][0];
                // check if object already exist
                if (!isset($xref['xref'][$index])) {
                    // store object offset position
                    $xref['xref'][$index] = (int) $matches[1][0];
                }
                ++$obj_num;
            } elseif ('f' == $matches[3][0]) {
                ++$obj_num;
            } else {
                // object number (index)
                $obj_num = (int) $matches[1][0];
            }
        }
        // get trailer data
        if (preg_match('/trailer[\s]*<<(.*)>>/isU', $pdfData, $matches, \PREG_OFFSET_CAPTURE, $offset) > 0) {
            $trailer_data = $matches[1][0];
            if (!isset($xref['trailer']) || empty($xref['trailer'])) {
                // get only the last updated version
                $xref['trailer'] = [];
                // parse trailer_data
                if (preg_match('/Size[\s]+([0-9]+)/i', $trailer_data, $matches) > 0) {
                    $xref['trailer']['size'] = (int) $matches[1];
                }
                if (preg_match('/Root[\s]+([0-9]+)[\s]+([0-9]+)[\s]+R/i', $trailer_data, $matches) > 0) {
                    $xref['trailer']['root'] = (int) $matches[1].'_'.(int) $matches[2];
                }
                if (preg_match('/Encrypt[\s]+([0-9]+)[\s]+([0-9]+)[\s]+R/i', $trailer_data, $matches) > 0) {
                    $xref['trailer']['encrypt'] = (int) $matches[1].'_'.(int) $matches[2];
                }
                if (preg_match('/Info[\s]+([0-9]+)[\s]+([0-9]+)[\s]+R/i', $trailer_data, $matches) > 0) {
                    $xref['trailer']['info'] = (int) $matches[1].'_'.(int) $matches[2];
                }
                if (preg_match('/ID[\s]*[\[][\s]*[<]([^>]*)[>][\s]*[<]([^>]*)[>]/i', $trailer_data, $matches) > 0) {
                    $xref['trailer']['id'] = [];
                    $xref['trailer']['id'][0] = $matches[1];
                    $xref['trailer']['id'][1] = $matches[2];
                }
            }
            if (preg_match('/Prev[\s]+([0-9]+)/i', $trailer_data, $matches) > 0) {
                $offset = (int) $matches[1];
                if (0 != $offset) {
                    // get previous xref
                    $xref = $this->getXrefData($pdfData, $offset, $xref);
                }
            }
        } else {
            throw new \Exception('Unable to find trailer');
        }

        return $xref;
    }

    /**
     * Decode the Cross-Reference Stream section
     *
     * @param string $pdfData   PDF data
     * @param int    $startxref Offset at which the xref section starts
     * @param array  $xref      Previous xref array (if any)
     *
     * @return array containing xref and trailer data
     *
     * @throws \Exception if unknown PNG predictor detected
     */
    protected function decodeXrefStream(string $pdfData, int $startxref, array $xref = []): array
    {
        // try to read Cross-Reference Stream
        $xrefobj = $this->getRawObject($pdfData, $startxref);
        $xrefcrs = $this->getIndirectObject($pdfData, $xref, $xrefobj[1], $startxref, true);
        if (!isset($xref['trailer']) || empty($xref['trailer'])) {
            // get only the last updated version
            $xref['trailer'] = [];
            $filltrailer = true;
        } else {
            $filltrailer = false;
        }
        if (!isset($xref['xref'])) {
            $xref['xref'] = [];
        }
        $valid_crs = false;
        $columns = 0;
        $predictor = null;
        $sarr = $xrefcrs[0][1];
        if (!\is_array($sarr)) {
            $sarr = [];
        }

        $wb = [];

        foreach ($sarr as $k => $v) {
            if (
                ('/' == $v[0])
                && ('Type' == $v[1])
                && (isset($sarr[$k + 1])
                    && '/' == $sarr[$k + 1][0]
                    && 'XRef' == $sarr[$k + 1][1]
                )
            ) {
                $valid_crs = true;
            } elseif (('/' == $v[0]) && ('Index' == $v[1]) && (isset($sarr[$k + 1]))) {
                // initialize list for: first object number in the subsection / number of objects
                $index_blocks = [];
                for ($m = 0; $m < \count($sarr[$k + 1][1]); $m += 2) {

Issue (Performance, Best Practice): It seems like you are calling the size function count() as part of the test condition. You might want to compute the size beforehand, and not on each iteration.

If the size of the collection does not change during the iteration, it is generally a good practice to compute it beforehand, and not on each iteration:

for ($i=0; $i<count($array); $i++) { // calls count() on each iteration
}

// Better
for ($i=0, $c=count($array); $i<$c; $i++) { // calls count() just once
}

                    $index_blocks[] = [$sarr[$k + 1][1][$m][1], $sarr[$k + 1][1][$m + 1][1]];
                }
            } elseif (('/' == $v[0]) && ('Prev' == $v[1]) && (isset($sarr[$k + 1]) && ('numeric' == $sarr[$k + 1][0]))) {
                // get previous xref offset
                $prevxref = (int) $sarr[$k + 1][1];
            } elseif (('/' == $v[0]) && ('W' == $v[1]) && (isset($sarr[$k + 1]))) {
                // number of bytes (in the decoded stream) of the corresponding field
                $wb[0] = (int) $sarr[$k + 1][1][0][1];
                $wb[1] = (int) $sarr[$k + 1][1][1][1];
                $wb[2] = (int) $sarr[$k + 1][1][2][1];
            } elseif (('/' == $v[0]) && ('DecodeParms' == $v[1]) && (isset($sarr[$k + 1][1]))) {
                $decpar = $sarr[$k + 1][1];
                foreach ($decpar as $kdc => $vdc) {
                    if (
                        '/' == $vdc[0]
                        && 'Columns' == $vdc[1]
                        && (isset($decpar[$kdc + 1])
                            && 'numeric' == $decpar[$kdc + 1][0]
                        )
                    ) {
                        $columns = (int) $decpar[$kdc + 1][1];
                    } elseif (
                        '/' == $vdc[0]
                        && 'Predictor' == $vdc[1]
                        && (isset($decpar[$kdc + 1])
                            && 'numeric' == $decpar[$kdc + 1][0]
                        )
                    ) {
                        $predictor = (int) $decpar[$kdc + 1][1];
                    }
                }
            } elseif ($filltrailer) {
                if (('/' == $v[0]) && ('Size' == $v[1]) && (isset($sarr[$k + 1]) && ('numeric' == $sarr[$k + 1][0]))) {
                    $xref['trailer']['size'] = $sarr[$k + 1][1];
                } elseif (('/' == $v[0]) && ('Root' == $v[1]) && (isset($sarr[$k + 1]) && ('objref' == $sarr[$k + 1][0]))) {
                    $xref['trailer']['root'] = $sarr[$k + 1][1];
                } elseif (('/' == $v[0]) && ('Info' == $v[1]) && (isset($sarr[$k + 1]) && ('objref' == $sarr[$k + 1][0]))) {
                    $xref['trailer']['info'] = $sarr[$k + 1][1];
                } elseif (('/' == $v[0]) && ('Encrypt' == $v[1]) && (isset($sarr[$k + 1]) && ('objref' == $sarr[$k + 1][0]))) {
                    $xref['trailer']['encrypt'] = $sarr[$k + 1][1];
                } elseif (('/' == $v[0]) && ('ID' == $v[1]) && (isset($sarr[$k + 1]))) {
                    $xref['trailer']['id'] = [];
                    $xref['trailer']['id'][0] = $sarr[$k + 1][1][0][1];
                    $xref['trailer']['id'][1] = $sarr[$k + 1][1][1][1];
                }
            }
        }

        // decode data
        if ($valid_crs && isset($xrefcrs[1][3][0])) {
            if (null !== $predictor) {
                // number of bytes in a row
                $rowlen = ($columns + 1);
                // convert the stream into an array of integers
                /** @var array<int> */
                $sdata = unpack('C*', $xrefcrs[1][3][0]);
                // TODO: Handle the case when unpack returns false

                // split the rows
                $sdata = array_chunk($sdata, $rowlen);

                // initialize decoded array
                $ddata = [];
                // initialize first row with zeros
                $prev_row = array_fill(0, $rowlen, 0);
                // for each row apply PNG unpredictor
                foreach ($sdata as $k => $row) {
                    // initialize new row
                    $ddata[$k] = [];
                    // get PNG predictor value
                    $predictor = (10 + $row[0]);
                    // for each byte on the row
                    for ($i = 1; $i <= $columns; ++$i) {
                        // new index
                        $j = ($i - 1);
                        $row_up = $prev_row[$j];
                        if (1 == $i) {
                            $row_left = 0;
                            $row_upleft = 0;
                        } else {
                            $row_left = $row[$i - 1];
                            $row_upleft = $prev_row[$j - 1];
                        }
                        switch ($predictor) {
                            case 10:  // PNG prediction (on encoding, PNG None on all rows)
                                $ddata[$k][$j] = $row[$i];
                                break;

                            case 11:  // PNG prediction (on encoding, PNG Sub on all rows)
                                $ddata[$k][$j] = (($row[$i] + $row_left) & 0xFF);
                                break;

                            case 12:  // PNG prediction (on encoding, PNG Up on all rows)
                                $ddata[$k][$j] = (($row[$i] + $row_up) & 0xFF);
                                break;

                            case 13:  // PNG prediction (on encoding, PNG Average on all rows)
                                $ddata[$k][$j] = (($row[$i] + (($row_left + $row_up) / 2)) & 0xFF);
                                break;

                            case 14:  // PNG prediction (on encoding, PNG Paeth on all rows)
                                // initial estimate
                                $p = ($row_left + $row_up - $row_upleft);
                                // distances
                                $pa = abs($p - $row_left);
                                $pb = abs($p - $row_up);
                                $pc = abs($p - $row_upleft);
                                $pmin = min($pa, $pb, $pc);
                                // return minimum distance
                                switch ($pmin) {
                                    case $pa:
                                        $ddata[$k][$j] = (($row[$i] + $row_left) & 0xFF);
                                        break;

                                    case $pb:
                                        $ddata[$k][$j] = (($row[$i] + $row_up) & 0xFF);
                                        break;

                                    case $pc:
                                        $ddata[$k][$j] = (($row[$i] + $row_upleft) & 0xFF);
                                        break;
                                }
                                break;

                            default:  // PNG prediction (on encoding, PNG optimum)
                                throw new \Exception('Unknown PNG predictor: '.$predictor);
                        }
                    }
                    $prev_row = $ddata[$k];
                } // end for each row
            // complete decoding
            } else {
                // number of bytes in a row
                $rowlen = array_sum($wb);
                if (0 < $rowlen) {
                    // convert the stream into an array of integers
                    $sdata = unpack('C*', $xrefcrs[1][3][0]);
                    // split the rows
                    $ddata = array_chunk($sdata, $rowlen);

Issue (Bug): It seems like $rowlen can also be of type double; however, parameter $length of array_chunk() only seems to accept integer. Maybe add an additional type check? (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-type annotation:

                    $ddata = array_chunk($sdata, /** @scrutinizer ignore-type */ $rowlen);

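
As an alternative to the ignore-type annotation, the type check suggested above could be an explicit cast before the call. This is only a sketch: `chunkRows` is a hypothetical helper, not a method of RawDataParser, and it folds the existing zero-length guard into the same place.

```php
<?php

/**
 * Hypothetical helper: splits decoded bytes into rows of array_sum($widths) bytes.
 *
 * @param array<int> $bytes  decoded stream bytes
 * @param array<int> $widths field widths from the /W entry
 *
 * @return array<array<int>> one entry per row, empty if the row length is zero
 */
function chunkRows(array $bytes, array $widths): array
{
    // array_sum() is typed int|float, so cast before passing the value on
    $rowLen = (int) array_sum($widths);

    // array_chunk() rejects a length below 1, so treat that case as "no rows"
    return $rowLen > 0 ? array_chunk($bytes, $rowLen) : [];
}
```

Since the /W values are already cast to int when they are read, the cast should never change the value here. It only makes the integer type visible to static analysis.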
                } else {
                    // if the row length is zero, $ddata should be an empty array as well
                    $ddata = [];
                }
            }

            $sdata = [];

            // for every row
            foreach ($ddata as $k => $row) {
                // initialize new row
                $sdata[$k] = [0, 0, 0];
                if (0 == $wb[0]) {
                    // default type field
                    $sdata[$k][0] = 1;
                }
                $i = 0; // count bytes in the row
                // for every column
                for ($c = 0; $c < 3; ++$c) {
                    // for every byte on the column
                    for ($b = 0; $b < $wb[$c]; ++$b) {
                        if (isset($row[$i])) {
                            $sdata[$k][$c] += ($row[$i] << (($wb[$c] - 1 - $b) * 8));
                        }
                        ++$i;
                    }
                }
            }

            // fill xref
            if (isset($index_blocks)) {
                // load the first object number of the first /Index entry
                $obj_num = $index_blocks[0][0];
            } else {
                $obj_num = 0;
            }
            foreach ($sdata as $k => $row) {
                switch ($row[0]) {
                    case 0:  // (f) linked list of free objects
                        break;

                    case 1:  // (n) objects that are in use but are not compressed
                        // create unique object index: [object number]_[generation number]
                        $index = $obj_num.'_'.$row[2];
                        // check if object already exist
                        if (!isset($xref['xref'][$index])) {
                            // store object offset position
                            $xref['xref'][$index] = $row[1];
                        }
                        break;

                    case 2:  // compressed objects
                        // $row[1] = object number of the object stream in which this object is stored
                        // $row[2] = index of this object within the object stream
                        $index = $row[1].'_0_'.$row[2];
                        $xref['xref'][$index] = -1;
                        break;

                    default:  // null objects
                        break;
                }
                ++$obj_num;
                if (isset($index_blocks)) {
                    // reduce the number of remaining objects
                    --$index_blocks[0][1];
                    if (0 == $index_blocks[0][1]) {

Issue (Comprehensibility, Best Practice): The variable $index_blocks does not seem to be defined for all execution paths leading up to this point.

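
One way to address this issue is to keep the list of /Index blocks always defined and test it for emptiness instead of using isset() and unset(). The sketch below isolates only the object-numbering logic. `assignObjectNumbers` is a hypothetical function, and it assumes the block values have already been converted to integers.

```php
<?php

/**
 * Hypothetical helper: maps each decoded xref-stream row to its object number.
 *
 * @param int               $rowCount    number of decoded rows
 * @param array<array<int>> $indexBlocks pairs of [first object number, count] from /Index
 *
 * @return array<int> object number for every row, in order
 */
function assignObjectNumbers(int $rowCount, array $indexBlocks = []): array
{
    // always defined: an empty list simply means "no /Index entry left"
    $objNum = [] !== $indexBlocks ? $indexBlocks[0][0] : 0;
    $numbers = [];

    for ($row = 0; $row < $rowCount; ++$row) {
        $numbers[] = $objNum;
        ++$objNum;

        if ([] === $indexBlocks) {
            continue;
        }

        // reduce the number of remaining objects in the current block
        --$indexBlocks[0][1];
        if (0 === $indexBlocks[0][1]) {
            // drop the exhausted block and jump to the next one, if any
            array_shift($indexBlocks);
            if ([] !== $indexBlocks) {
                $objNum = $indexBlocks[0][0];
            }
        }
    }

    return $numbers;
}
```

For example, five rows with the blocks [[10, 2], [20, 3]] are numbered 10, 11, 20, 21, 22, and rows without any /Index entry are numbered from 0.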
                        // remove the actual used /Index entry
                        array_shift($index_blocks);
                        if (0 < \count($index_blocks)) {
                            // load the first object number of the following /Index entry
                            $obj_num = $index_blocks[0][0];
                        } else {
                            // if there are no more entries, remove $index_blocks to avoid actions on an empty array
                            unset($index_blocks);
                        }
                    }
                }
            }
        } // end decoding data
        if (isset($prevxref)) {
            // get previous xref
            $xref = $this->getXrefData($pdfData, $prevxref, $xref);
        }

        return $xref;
    }

    protected function getObjectHeaderPattern(array $objRefs): string
    {
        // consider all whitespace character (PDF specifications)
        return '/'.$objRefs[0].$this->config->getPdfWhitespacesRegex().$objRefs[1].$this->config->getPdfWhitespacesRegex().'obj/';
    }

    protected function getObjectHeaderLen(array $objRefs): int
    {
        // "4 0 obj"
        // 2 whitespaces + strlen("obj") = 5
        return 5 + \strlen($objRefs[0]) + \strlen($objRefs[1]);
    }

    /**
     * Get content of indirect object.
     *
     * @param string $pdfData  PDF data
     * @param string $objRef   Object number and generation number separated by underscore character
     * @param int    $offset   Object offset
     * @param bool   $decoding If true decode streams
     *
     * @return array containing object data
     *
     * @throws \Exception if invalid object reference found
     */
    protected function getIndirectObject(string $pdfData, array $xref, string $objRef, int $offset = 0, bool $decoding = true): array
    {
        /*
         * build indirect object header
         */
        // $objHeader = "[object number] [generation number] obj"
        $objRefArr = explode('_', $objRef);
        if (2 !== \count($objRefArr)) {
            throw new \Exception('Invalid object reference for $obj.');
        }

        $objHeaderLen = $this->getObjectHeaderLen($objRefArr);

        /*
         * check if we are in position
         */
        // ignore whitespace characters at offset
        $offset += strspn($pdfData, $this->config->getPdfWhitespaces(), $offset);
        // ignore leading zeros for object number
        $offset += strspn($pdfData, '0', $offset);
        if (0 == preg_match($this->getObjectHeaderPattern($objRefArr), substr($pdfData, $offset, $objHeaderLen))) {
            // an indirect reference to an undefined object shall be considered a reference to the null object
            return ['null', 'null', $offset];
        }

        /*
         * get content
         */
        // starting position of object content
        $offset += $objHeaderLen;
        $objContentArr = [];
        $i = 0; // object main index
        $header = null;
        do {
            $oldOffset = $offset;
            // get element
            $element = $this->getRawObject($pdfData, $offset, null != $header ? $header[1] : null);
            $offset = $element[2];
            // decode stream using stream's dictionary information
            if ($decoding && ('stream' === $element[0]) && null != $header) {
                $element[3] = $this->decodeStream($pdfData, $xref, $header[1], $element[1]);
            }
            $objContentArr[$i] = $element;
            $header = isset($element[0]) && '<<' === $element[0] ? $element : null;
            ++$i;
        } while (('endobj' !== $element[0]) && ($offset !== $oldOffset));
        // remove closing delimiter
        array_pop($objContentArr);

        /*
         * return raw object content
         */
        return $objContentArr;
    }

    /**
     * Get the content of object, resolving indirect object reference if necessary.
     *
     * @param string $pdfData PDF data
     * @param array  $obj     Object value
     *
     * @return array containing object data
     *
     * @throws \Exception
     */
    protected function getObjectVal(string $pdfData, $xref, array $obj): array
    {
        if ('objref' == $obj[0]) {
            // reference to indirect object
            if (isset($this->objects[$obj[1]])) {
                // this object has been already parsed
                return $this->objects[$obj[1]];
            } elseif (isset($xref[$obj[1]])) {
                // parse new object
                $this->objects[$obj[1]] = $this->getIndirectObject($pdfData, $xref, $obj[1], $xref[$obj[1]], false);

                return $this->objects[$obj[1]];
            }
        }

        return $obj;
    }
613
614
    /**
615
     * Get object type, raw value and offset to next object
616
     *
617
     * @param int        $offset    Object offset
618
     * @param array|null $headerDic obj header's dictionary, parsed by getRawObject. Used for stream parsing optimization
619
     *
620
     * @return array containing object type, raw value and offset to next object
621
     */
622 68
    protected function getRawObject(string $pdfData, int $offset = 0, ?array $headerDic = null): array
623
    {
624 68
        $objtype = ''; // object type to be returned
625 68
        $objval = ''; // object value to be returned
626
627
        // skip initial white space chars
628 68
        $offset += strspn($pdfData, $this->config->getPdfWhitespaces(), $offset);
629
630
        // get first char
631 68
        $char = $pdfData[$offset];
632
        // get object type
633
        switch ($char) {
634 68
            case '%':  // \x25 PERCENT SIGN
635
                // skip comment and search for next token
636 3
                $next = strcspn($pdfData, "\r\n", $offset);
637 3
                if ($next > 0) {
638 3
                    $offset += $next;
639
640 3
                    return $this->getRawObject($pdfData, $offset);
641
                }
642
                break;
643
644 68
            case '/':  // \x2F SOLIDUS
645
                // name object
646 68
                $objtype = $char;
647 68
                ++$offset;
648 68
                $span = strcspn($pdfData, "\x00\x09\x0a\x0c\x0d\x20\n\t\r\v\f\x28\x29\x3c\x3e\x5b\x5d\x7b\x7d\x2f\x25", $offset, 256);
649 68
                if ($span > 0) {
650 68
                    $objval = substr($pdfData, $offset, $span); // unescaped value
651 68
                    $offset += $span;
652
                }
653 68
                break;

            case '(':  // \x28 LEFT PARENTHESIS
            case ')':  // \x29 RIGHT PARENTHESIS
                // literal string object
                $objtype = $char;
                ++$offset;
                $strpos = $offset;
                if ('(' == $char) {
                    $open_bracket = 1;
                    while ($open_bracket > 0) {
                        if (!isset($pdfData[$strpos])) {
                            break;
                        }
                        $ch = $pdfData[$strpos];
                        switch ($ch) {
                            case '\\':  // REVERSE SOLIDUS (5Ch) (Backslash)
                                // skip next character
                                ++$strpos;
                                break;

                            case '(':  // LEFT PARENTHESIS (28h)
                                ++$open_bracket;
                                break;

                            case ')':  // RIGHT PARENTHESIS (29h)
                                --$open_bracket;
                                break;
                        }
                        ++$strpos;
                    }
                    $objval = substr($pdfData, $offset, $strpos - $offset - 1);
                    $offset = $strpos;
                }
                break;

            case '[':  // \x5B LEFT SQUARE BRACKET
            case ']':  // \x5D RIGHT SQUARE BRACKET
                // array object
                $objtype = $char;
                ++$offset;
                if ('[' == $char) {
                    // get array content
                    $objval = [];
                    do {
                        $oldOffset = $offset;
                        // get element
                        $element = $this->getRawObject($pdfData, $offset);
                        $offset = $element[2];
                        $objval[] = $element;
                    } while ((']' != $element[0]) && ($offset != $oldOffset));
                    // remove closing delimiter
                    array_pop($objval);
                }
                break;

            case '<':  // \x3C LESS-THAN SIGN
            case '>':  // \x3E GREATER-THAN SIGN
                if (isset($pdfData[$offset + 1]) && ($pdfData[$offset + 1] == $char)) {
                    // dictionary object
                    $objtype = $char.$char;
                    $offset += 2;
                    if ('<' == $char) {
                        // get dictionary content
                        $objval = [];
                        do {
                            $oldOffset = $offset;
                            // get element
                            $element = $this->getRawObject($pdfData, $offset);
                            $offset = $element[2];
                            $objval[] = $element;
                        } while (('>>' != $element[0]) && ($offset != $oldOffset));
                        // remove closing delimiter
                        array_pop($objval);
                    }
                } else {
                    // hexadecimal string object
                    $objtype = $char;
                    ++$offset;

                    $span = strspn($pdfData, "0123456789abcdefABCDEF\x09\x0a\x0c\x0d\x20", $offset);
                    $dataToCheck = $pdfData[$offset + $span] ?? null;
                    if ('<' == $char && $span > 0 && '>' == $dataToCheck) {
                        // remove white space characters
                        $objval = strtr(substr($pdfData, $offset, $span), $this->config->getPdfWhitespaces(), '');
                        $offset += $span + 1;
                    } elseif (false !== ($endpos = strpos($pdfData, '>', $offset))) {
                        $offset = $endpos + 1;
                    }
                }
                break;

            default:
                if ('endobj' == substr($pdfData, $offset, 6)) {
                    // end of indirect object
                    $objtype = 'endobj';
                    $offset += 6;
                } elseif ('null' == substr($pdfData, $offset, 4)) {
                    // null object
                    $objtype = 'null';
                    $offset += 4;
                    $objval = 'null';
                } elseif ('true' == substr($pdfData, $offset, 4)) {
                    // boolean true object
                    $objtype = 'boolean';
                    $offset += 4;
                    $objval = 'true';
                } elseif ('false' == substr($pdfData, $offset, 5)) {
                    // boolean false object
                    $objtype = 'boolean';
                    $offset += 5;
                    $objval = 'false';
                } elseif ('stream' == substr($pdfData, $offset, 6)) {
                    // start stream object
                    $objtype = 'stream';
                    $offset += 6;
                    if (1 == preg_match('/^( *[\r]?[\n])/isU', substr($pdfData, $offset, 4), $matches)) {
                        $offset += \strlen($matches[0]);

                        // read the stream length up front so the preg_match below has less data to scan
                        $streamLen = (int) $this->getHeaderValue($headerDic, 'Length', 'numeric', 0);
                        $skip = false === $this->config->getRetainImageContent() && 'XObject' == $this->getHeaderValue($headerDic, 'Type', '/') && 'Image' == $this->getHeaderValue($headerDic, 'Subtype', '/');

                        $pregResult = preg_match(
                            '/(endstream)[\x09\x0a\x0c\x0d\x20]/isU',
                            $pdfData,
                            $matches,
                            \PREG_OFFSET_CAPTURE,
                            $offset + $streamLen
                        );

                        if (1 == $pregResult) {
                            $objval = $skip ? '' : substr($pdfData, $offset, $matches[0][1] - $offset);
                            $offset = $matches[1][1];
                        }
                    }
                } elseif ('endstream' == substr($pdfData, $offset, 9)) {
                    // end stream object
                    $objtype = 'endstream';
                    $offset += 9;
                } elseif (1 == preg_match('/^([0-9]+)[\s]+([0-9]+)[\s]+R/iU', substr($pdfData, $offset, 33), $matches)) {
                    // indirect object reference
                    $objtype = 'objref';
                    $offset += \strlen($matches[0]);
                    $objval = (int) $matches[1].'_'.(int) $matches[2];
                } elseif (1 == preg_match('/^([0-9]+)[\s]+([0-9]+)[\s]+obj/iU', substr($pdfData, $offset, 33), $matches)) {
                    // object start
                    $objtype = 'obj';
                    $objval = (int) $matches[1].'_'.(int) $matches[2];
                    $offset += \strlen($matches[0]);
                } elseif (($numlen = strspn($pdfData, '+-.0123456789', $offset)) > 0) {
                    // numeric object
                    $objtype = 'numeric';
                    $objval = substr($pdfData, $offset, $numlen);
                    $offset += $numlen;
                }
                break;
        }

        return [$objtype, $objval, $offset];
    }

    /**
     * Get value of an object header's section (obj << YYY >> part).
     *
     * It is similar to Header::get('...')->getContent(); the only difference is that it can be used during the
     * parsing process, when no Smalot\PdfParser\Header objects have been created yet.
     *
     * @param array|null        $headerDic obj header's dictionary, as returned by getRawObject
     * @param string            $key       header's section name
     * @param string            $type      type of the section (e.g. 'numeric', '/', '<<')
     * @param string|array|null $default   default value for header's section
     *
     * @return string|array|null value of obj header's section, or default value if none found, or its type doesn't match $type param
     */
    private function getHeaderValue(?array $headerDic, string $key, string $type, $default = '')
    {
        // Scrutinizer issue (as reported by the analyzer): The condition false === is_array($headerDic) is always false.
        if (false === \is_array($headerDic)) {
            return $default;
        }

        /*
         * It receives the dictionary of header fields, as returned by RawDataParser::getRawObject,
         * and iterates over it, searching for a section of type '/' with the requested key.
         * If such a section is found, it tries to retrieve its value (the next object in the dictionary),
         * returning it if it matches the requested type, or the default value otherwise.
         */
        foreach ($headerDic as $i => $val) {
            $isSectionName = \is_array($val) && 3 == \count($val) && '/' == $val[0];
            if (
                $isSectionName
                && $val[1] == $key
                && isset($headerDic[$i + 1])
            ) {
                $isSectionValue = \is_array($headerDic[$i + 1]) && 1 < \count($headerDic[$i + 1]);

                return $isSectionValue && $type == $headerDic[$i + 1][0]
                    ? $headerDic[$i + 1][1]
                    : $default;
            }
        }

        return $default;
    }

    /**
     * Get Cross-Reference (xref) table and trailer data from PDF document data.
     *
     * @param string $pdfData PDF data
     * @param int    $offset  xref offset (if known)
     * @param array  $xref    previous xref array (if any)
     *
     * @return array containing xref and trailer data
     *
     * @throws \Exception if it was unable to find startxref
     * @throws \Exception if it was unable to find xref
     */
    protected function getXrefData(string $pdfData, int $offset = 0, array $xref = []): array
    {
        // If $offset currently points at whitespace, bump it forward
        // until it doesn't; this handles loosely targeted offsets
        // for the 'xref' keyword
        // See: https://github.com/smalot/pdfparser/issues/673
        $bumpOffset = $offset;
        while (preg_match('/\s/', substr($pdfData, $bumpOffset, 1))) {
            ++$bumpOffset;
        }

        // Find all startxref tables from this $offset forward
        $startxrefPreg = preg_match_all(
            '/(?<=[\r\n])startxref[\s]*[\r\n]+([0-9]+)[\s]*[\r\n]+%%EOF/i',
            $pdfData,
            $startxrefMatches,
            \PREG_SET_ORDER,
            $offset
        );

        // Scrutinizer issue (Bug / Best Practice): $startxrefPreg of type integer|null is loosely compared to 0;
        // this is ambiguous because null == 0 is also true. Consider using a strict comparison (===).
        if (0 == $startxrefPreg) {
            // No startxref tables were found
            throw new \Exception('Unable to find startxref');
        } elseif (0 == $offset) {
            // Use the last startxref in the document
            $startxref = (int) $startxrefMatches[\count($startxrefMatches) - 1][1];
        } elseif (strpos($pdfData, 'xref', $bumpOffset) == $bumpOffset) {
            // Already pointing at the xref table
            $startxref = $bumpOffset;
        } elseif (preg_match('/([0-9]+[\s][0-9]+[\s]obj)/i', $pdfData, $matches, 0, $bumpOffset)) {
            // Cross-Reference Stream object
            $startxref = $bumpOffset;
        } else {
            // Use the next startxref from this $offset
            $startxref = (int) $startxrefMatches[0][1];
        }

        if ($startxref > \strlen($pdfData)) {
            throw new \Exception('Unable to find xref (PDF corrupted?)');
        }

        // check xref position
        if (strpos($pdfData, 'xref', $startxref) == $startxref) {
            // Cross-Reference
            $xref = $this->decodeXref($pdfData, $startxref, $xref);
        } else {
            // Check if the $pdfData might have the wrong line-endings
            $pdfDataUnix = str_replace("\r\n", "\n", $pdfData);
            if ($startxref < \strlen($pdfDataUnix) && strpos($pdfDataUnix, 'xref', $startxref) == $startxref) {
                // Return Unix-line-ending flag
                $xref = ['Unix' => true];
            } else {
                // Cross-Reference Stream
                $xref = $this->decodeXrefStream($pdfData, $startxref, $xref);
            }
        }
        if (empty($xref)) {
            throw new \Exception('Unable to find xref');
        }

        return $xref;
    }

    /**
     * Parses PDF data and returns extracted data as array.
     *
     * @param string $data PDF data to parse
     *
     * @return array array of parsed PDF document objects
     *
     * @throws \Exception if empty PDF data given
     * @throws \Exception if PDF data missing %PDF header
     */
    public function parseData(string $data): array
    {
        if (empty($data)) {
            throw new \Exception('Empty PDF data given.');
        }
        // find the pdf header starting position
        if (false === ($trimpos = strpos($data, '%PDF-'))) {
            throw new \Exception('Invalid PDF data: missing %PDF header.');
        }

        // get PDF content string
        $pdfData = $trimpos > 0 ? substr($data, $trimpos) : $data;

        // get xref and trailer data
        $xref = $this->getXrefData($pdfData);

        // If getXrefData flagged that the xref offsets only match with Unix line-endings,
        // normalize the data and look up the xref again
        if (isset($xref['Unix'])) {
            $pdfData = str_replace("\r\n", "\n", $pdfData);
            $xref = $this->getXrefData($pdfData);
        }

        // parse all document objects
        $objects = [];
        foreach ($xref['xref'] as $obj => $offset) {
            if (!isset($objects[$obj]) && ($offset > 0)) {
                // decode objects with positive offset
                $objects[$obj] = $this->getIndirectObject($pdfData, $xref, $obj, $offset, true);
            }
        }

        return [$xref, $objects];
    }
}
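
The header lookup in getHeaderValue() walks the flat list of `[type, value, offset]` triples that getRawObject() returns for a dictionary, where each name entry of type '/' is followed by its value entry. The following is a minimal standalone sketch of that lookup, not the library's code: the function name `lookupHeaderValue` is hypothetical, the sample array is a hand-written stand-in for the parsed form of `<< /Type /XObject /Length 42 >>`, and its offsets are illustrative only.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical standalone helper that mirrors the lookup described for getHeaderValue().
 *
 * @param array|null $headerDic flat list of [type, value, offset] triples
 * @param mixed      $default   returned when the key is missing or the value type differs
 *
 * @return mixed
 */
function lookupHeaderValue(?array $headerDic, string $key, string $type, $default = '')
{
    if (!is_array($headerDic)) {
        return $default;
    }

    foreach ($headerDic as $i => $entry) {
        // A section name is a triple whose type is '/'.
        $isName = is_array($entry) && 3 === count($entry) && '/' === $entry[0];

        if ($isName && $entry[1] === $key && isset($headerDic[$i + 1])) {
            // The value is the entry that directly follows the name.
            $next = $headerDic[$i + 1];
            $typeMatches = is_array($next) && count($next) > 1 && $next[0] === $type;

            return $typeMatches ? $next[1] : $default;
        }
    }

    return $default;
}

// Hand-written sample data (assumption): names and values alternate in one flat list.
$headerDic = [
    ['/', 'Type', 8],
    ['/', 'XObject', 17],
    ['/', 'Length', 25],
    ['numeric', '42', 28],
];

$length = lookupHeaderValue($headerDic, 'Length', 'numeric', 0); // value of /Length as a string
$type = lookupHeaderValue($headerDic, 'Type', '/');              // value of /Type
$mismatch = lookupHeaderValue($headerDic, 'Length', '/', 'n/a'); // wrong type requested, falls back to default
```

Because names and values share one flat list, a value that is itself a name (here `XObject`) also looks like a section name to the loop; the sketch keeps that behaviour rather than pairing entries, since the listing above does the same.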