Test Failed
Pull Request — master (#582), by unknown, created 02:39

RawDataParser   F

Complexity

Total Complexity 192

Size/Duplication

Total Lines 853
Duplicated Lines 0 %

Test Coverage

Coverage 86.94%

Importance

Changes 7
Bugs 2 Features 1
Metric Value
eloc 446
c 7
b 2
f 1
dl 0
loc 853
ccs 366
cts 421
cp 0.8694
rs 2
wmc 192

11 Methods

Rating   Name   Duplication   Size   Complexity  
D decodeStream() 0 56 21
C decodeXref() 0 64 15
A __construct() 0 7 2
A getObjectHeaderLen() 0 5 1
A getObjectHeaderPattern() 0 4 1
A getObjectVal() 0 16 4
B getIndirectObject() 0 51 9
F decodeXrefStream() 0 261 83
F getRawObject() 0 184 40
B getXrefData() 0 53 9
B parseData() 0 26 7

How to fix   Complexity   

Complex Class

Complex classes like RawDataParser often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to finding such a component is to look for fields or methods that share the same prefixes or suffixes.
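
As a rough illustration of the prefix heuristic, the sketch below buckets the method names from the table above by their leading lowercase word. The helper `groupByPrefix()` is hypothetical and not part of PdfParser; it only suggests where a cohesive component might be hiding and does not prove that the grouped methods belong together.

```php
<?php
// Method names copied from the report's method table (constructor omitted).
$methods = [
    'decodeStream', 'decodeXref', 'decodeXrefStream',
    'getObjectHeaderLen', 'getObjectHeaderPattern', 'getObjectVal',
    'getIndirectObject', 'getRawObject', 'getXrefData', 'parseData',
];

/**
 * Hypothetical helper: bucket camelCase names by their leading lowercase word.
 *
 * @param string[] $names
 *
 * @return array<string, string[]> prefix => names sharing that prefix
 */
function groupByPrefix(array $names): array
{
    $groups = [];
    foreach ($names as $name) {
        // The prefix is the run of lowercase letters before the first capital.
        preg_match('/^[a-z]+/', $name, $m);
        // Names without a lowercase prefix form their own bucket.
        $groups[$m[0] ?? $name][] = $name;
    }

    return $groups;
}

$groups = groupByPrefix($methods);
foreach ($groups as $prefix => $names) {
    printf("%s: %s\n", $prefix, implode(', ', $names));
}
```

Here the three `decode*` methods end up in one bucket, which hints that stream and xref decoding could be a candidate for extraction.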

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
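
A minimal sketch of Extract Class applied to one such component, the PNG row un-prediction embedded in decodeXrefStream(). The class name `PngUnpredictor` and its `decode()` method are assumptions made for illustration and do not exist in PdfParser. Two deliberate differences from the listing: the row's first byte is taken as the PNG filter type 0..4 (the listing adds 10 to it), and the left neighbour is the already-decoded byte, as the PNG specification describes, whereas the listing reads it from the raw row. One byte per pixel is assumed.

```php
<?php
/**
 * Hypothetical class, not part of PdfParser: reverses PNG row filters.
 */
final class PngUnpredictor
{
    /**
     * @param int[][] $rows    rows of bytes; byte 0 of each row is the PNG filter type (0..4)
     * @param int     $columns number of data bytes per row
     *
     * @return int[][] decoded rows without the filter-type byte
     */
    public function decode(array $rows, int $columns): array
    {
        $decoded = [];
        // the row above the first row is treated as all zeros
        $prev = array_fill(0, $columns, 0);
        foreach ($rows as $k => $row) {
            $out = [];
            for ($j = 0; $j < $columns; ++$j) {
                $left = $j > 0 ? $out[$j - 1] : 0;
                $up = $prev[$j];
                $upLeft = $j > 0 ? $prev[$j - 1] : 0;
                switch ($row[0]) {
                    case 0: // None
                        $pred = 0;
                        break;
                    case 1: // Sub
                        $pred = $left;
                        break;
                    case 2: // Up
                        $pred = $up;
                        break;
                    case 3: // Average, rounded down
                        $pred = intdiv($left + $up, 2);
                        break;
                    case 4: // Paeth
                        $pred = self::paeth($left, $up, $upLeft);
                        break;
                    default:
                        throw new \InvalidArgumentException('Unknown PNG filter type: '.$row[0]);
                }
                $out[$j] = ($row[$j + 1] + $pred) & 0xFF;
            }
            $decoded[$k] = $out;
            $prev = $out;
        }

        return $decoded;
    }

    /** Pick the neighbour closest to left + up - upLeft; ties prefer left, then up. */
    private static function paeth(int $a, int $b, int $c): int
    {
        $p = $a + $b - $c;
        $pa = abs($p - $a);
        $pb = abs($p - $b);
        $pc = abs($p - $c);
        if ($pa <= $pb && $pa <= $pc) {
            return $a;
        }

        return $pb <= $pc ? $b : $c;
    }
}

// Two rows using the Up filter (type 2), three data bytes each.
$decoded = (new PngUnpredictor())->decode([[2, 1, 2, 3], [2, 1, 1, 1]], 3);
```

With the predictor logic isolated like this, it can be unit-tested without building a PDF, and decodeXrefStream() would shrink to dictionary parsing plus a call into the extracted class.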

While breaking up the class, it is a good idea to analyze how other classes use RawDataParser, and based on these observations, apply Extract Interface, too.
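
Extract Interface might look like the sketch below. Everything in it is hypothetical: the interface name, the stand-in implementation and the consumer are invented for illustration, and the `parseData(string): array` signature is an assumption, since the listing does not show that method's declaration.

```php
<?php
/** Hypothetical interface; the real signature of parseData() is an assumption. */
interface RawDataParserInterface
{
    public function parseData(string $data): array;
}

/** Stand-in implementation for tests; returns empty xref and object tables. */
final class NullRawDataParser implements RawDataParserInterface
{
    public function parseData(string $data): array
    {
        return ['xref' => [], 'objects' => []];
    }
}

/** Hypothetical consumer that depends on the interface, not the concrete class. */
final class DocumentLoader
{
    /** @var RawDataParserInterface */
    private $parser;

    public function __construct(RawDataParserInterface $parser)
    {
        $this->parser = $parser;
    }

    public function countObjects(string $data): int
    {
        return \count($this->parser->parseData($data)['objects']);
    }
}

$loader = new DocumentLoader(new NullRawDataParser());
echo $loader->countObjects('%PDF-1.4'), "\n";
```

Callers written against the interface keep working while the concrete parser is split into smaller classes behind it.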

<?php

/**
 * This file is based on code of tecnickcom/TCPDF PDF library.
 *
 * Original author Nicola Asuni ([email protected]) and
 * contributors (https://github.com/tecnickcom/TCPDF/graphs/contributors).
 *
 * @see https://github.com/tecnickcom/TCPDF
 *
 * Original code was licensed on the terms of the LGPL v3.
 *
 * ------------------------------------------------------------------------------
 *
 * @file This file is part of the PdfParser library.
 *
 * @author  Konrad Abicht <[email protected]>
 *
 * @date    2020-01-06
 *
 * @license LGPLv3
 *
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser\RawData;

use Smalot\PdfParser\Config;

class RawDataParser
{
    /**
     * @var \Smalot\PdfParser\Config
     */
    private $config;

    /**
     * Configuration array.
     */
    protected $cfg = [
        // if `true` ignore filter decoding errors
        'ignore_filter_decoding_errors' => true,
        // if `true` ignore missing filter decoders
        'ignore_missing_filter_decoders' => true,
    ];

    protected $filterHelper;
    protected $objects;

    /**
     * @param array       $cfg    Configuration array, default is []
     * @param Config|null $config Parser configuration; a default Config is created when null
     */
    public function __construct($cfg = [], ?Config $config = null)
    {
        // merge given array with default values
        $this->cfg = array_merge($this->cfg, $cfg);

        $this->filterHelper = new FilterHelper();
        $this->config = $config ?: new Config();
    }

    /**
     * Decode the specified stream.
     *
     * @param string $pdfData PDF data
     * @param array  $xref    Xref data, passed on when resolving an indirect /Filter entry
     * @param array  $sdic    Stream's dictionary array
     * @param string $stream  Stream to decode
     *
     * @return array containing decoded stream data and remaining filters
     *
     * @throws \Exception
     */
    protected function decodeStream(string $pdfData, array $xref, array $sdic, string $stream): array
    {
        // get stream length and filters
        $slength = \strlen($stream);
        if ($slength <= 0) {
            return ['', []];
        }
        $filters = [];
        foreach ($sdic as $k => $v) {
            if ('/' == $v[0]) {
                if (('Length' == $v[1]) && (isset($sdic[$k + 1])) && ('numeric' == $sdic[$k + 1][0])) {
                    // get declared stream length
                    $declength = (int) $sdic[$k + 1][1];
                    if ($declength < $slength) {
                        $stream = substr($stream, 0, $declength);
                        $slength = $declength;
                    }
                } elseif (('Filter' == $v[1]) && (isset($sdic[$k + 1]))) {
                    // resolve indirect object
                    $objval = $this->getObjectVal($pdfData, $xref, $sdic[$k + 1]);
                    if ('/' == $objval[0]) {
                        // single filter
                        $filters[] = $objval[1];
                    } elseif ('[' == $objval[0]) {
                        // array of filters
                        foreach ($objval[1] as $flt) {
                            if ('/' == $flt[0]) {
                                $filters[] = $flt[1];
                            }
                        }
                    }
                }
            }
        }

        // decode the stream
        $remaining_filters = [];
        foreach ($filters as $filter) {
            if (\in_array($filter, $this->filterHelper->getAvailableFilters())) {
                try {
                    $stream = $this->filterHelper->decodeFilter($filter, $stream, $this->config->getDecodeMemoryLimit());
                } catch (\Exception $e) {
                    $emsg = $e->getMessage();
                    if ((('~' == $emsg[0]) && !$this->cfg['ignore_missing_filter_decoders'])
                        || (('~' != $emsg[0]) && !$this->cfg['ignore_filter_decoding_errors'])
                    ) {
                        throw new \Exception($e->getMessage());
                    }
                }
            } else {
                // add missing filter to array
                $remaining_filters[] = $filter;
            }
        }

        return [$stream, $remaining_filters];
    }

    /**
     * Decode the Cross-Reference section
     *
     * @param string $pdfData   PDF data
     * @param int    $startxref Offset at which the xref section starts (position of the 'xref' keyword)
     * @param array  $xref      Previous xref array (if any)
     *
     * @return array containing xref and trailer data
     *
     * @throws \Exception if no trailer is found
     */
    protected function decodeXref(string $pdfData, int $startxref, array $xref = []): array
    {
        $startxref += 4; // 4 is the length of the word 'xref'
        // skip initial white space chars
        $offset = $startxref + strspn($pdfData, $this->config->getPdfWhitespaces(), $startxref);
        // initialize object number
        $obj_num = 0;
        // search for cross-reference entries or subsection
        while (preg_match('/([0-9]+)[\x20]([0-9]+)[\x20]?([nf]?)(\r\n|[\x20]?[\r\n])/', $pdfData, $matches, \PREG_OFFSET_CAPTURE, $offset) > 0) {
            if ($matches[0][1] != $offset) {
                // we are on another section
                break;
            }
            $offset += \strlen($matches[0][0]);
            if ('n' == $matches[3][0]) {
                // create unique object index: [object number]_[generation number]
                $index = $obj_num.'_'.(int) $matches[2][0];
                // check if object already exists
                if (!isset($xref['xref'][$index])) {
                    // store object offset position
                    $xref['xref'][$index] = (int) $matches[1][0];
                }
                ++$obj_num;
            } elseif ('f' == $matches[3][0]) {
                ++$obj_num;
            } else {
                // object number (index)
                $obj_num = (int) $matches[1][0];
            }
        }
        // get trailer data
        if (preg_match('/trailer[\s]*<<(.*)>>/isU', $pdfData, $matches, \PREG_OFFSET_CAPTURE, $offset) > 0) {
            $trailer_data = $matches[1][0];
            if (!isset($xref['trailer']) || empty($xref['trailer'])) {
                // get only the last updated version
                $xref['trailer'] = [];
                // parse trailer_data
                if (preg_match('/Size[\s]+([0-9]+)/i', $trailer_data, $matches) > 0) {
                    $xref['trailer']['size'] = (int) $matches[1];
                }
                if (preg_match('/Root[\s]+([0-9]+)[\s]+([0-9]+)[\s]+R/i', $trailer_data, $matches) > 0) {
                    $xref['trailer']['root'] = (int) $matches[1].'_'.(int) $matches[2];
                }
                if (preg_match('/Encrypt[\s]+([0-9]+)[\s]+([0-9]+)[\s]+R/i', $trailer_data, $matches) > 0) {
                    $xref['trailer']['encrypt'] = (int) $matches[1].'_'.(int) $matches[2];
                }
                if (preg_match('/Info[\s]+([0-9]+)[\s]+([0-9]+)[\s]+R/i', $trailer_data, $matches) > 0) {
                    $xref['trailer']['info'] = (int) $matches[1].'_'.(int) $matches[2];
                }
                if (preg_match('/ID[\s]*[\[][\s]*[<]([^>]*)[>][\s]*[<]([^>]*)[>]/i', $trailer_data, $matches) > 0) {
                    $xref['trailer']['id'] = [];
                    $xref['trailer']['id'][0] = $matches[1];
                    $xref['trailer']['id'][1] = $matches[2];
                }
            }
            if (preg_match('/Prev[\s]+([0-9]+)/i', $trailer_data, $matches) > 0) {
                // get previous xref
                $xref = $this->getXrefData($pdfData, (int) $matches[1], $xref);
            }
        } else {
            throw new \Exception('Unable to find trailer');
        }

        return $xref;
    }

    /**
     * Decode the Cross-Reference Stream section
     *
     * @param string $pdfData   PDF data
     * @param int    $startxref Offset at which the xref section starts
     * @param array  $xref      Previous xref array (if any)
     *
     * @return array containing xref and trailer data
     *
     * @throws \Exception if unknown PNG predictor detected
     */
    protected function decodeXrefStream(string $pdfData, int $startxref, array $xref = []): array
    {
        // try to read Cross-Reference Stream
        $xrefobj = $this->getRawObject($pdfData, $startxref);
        $xrefcrs = $this->getIndirectObject($pdfData, $xref, $xrefobj[1], $startxref, true);
        if (!isset($xref['trailer']) || empty($xref['trailer'])) {
            // get only the last updated version
            $xref['trailer'] = [];
            $filltrailer = true;
        } else {
            $filltrailer = false;
        }
        if (!isset($xref['xref'])) {
            $xref['xref'] = [];
        }
        $valid_crs = false;
        $columns = 0;
        $predictor = null;
        $sarr = $xrefcrs[0][1];
        if (!\is_array($sarr)) {
            $sarr = [];
        }

        $wb = [];

        foreach ($sarr as $k => $v) {
            if (
                ('/' == $v[0])
                && ('Type' == $v[1])
                && (
                    isset($sarr[$k + 1])
                    && '/' == $sarr[$k + 1][0]
                    && 'XRef' == $sarr[$k + 1][1]
                )
            ) {
                $valid_crs = true;
            } elseif (('/' == $v[0]) && ('Index' == $v[1]) && (isset($sarr[$k + 1]))) {
                // initialize list for: first object number in the subsection / number of objects
                $index_blocks = [];
                // Scrutinizer flagged (Performance, Best Practice) calling count() in the loop
                // condition on every iteration; the size is computed once up front instead
                for ($m = 0, $indexCount = \count($sarr[$k + 1][1]); $m < $indexCount; $m += 2) {
                    $index_blocks[] = [$sarr[$k + 1][1][$m][1], $sarr[$k + 1][1][$m + 1][1]];
                }
            } elseif (('/' == $v[0]) && ('Prev' == $v[1]) && (isset($sarr[$k + 1]) && ('numeric' == $sarr[$k + 1][0]))) {
                // get previous xref offset
                $prevxref = (int) $sarr[$k + 1][1];
            } elseif (('/' == $v[0]) && ('W' == $v[1]) && (isset($sarr[$k + 1]))) {
                // number of bytes (in the decoded stream) of the corresponding field
                $wb[0] = (int) $sarr[$k + 1][1][0][1];
                $wb[1] = (int) $sarr[$k + 1][1][1][1];
                $wb[2] = (int) $sarr[$k + 1][1][2][1];
            } elseif (('/' == $v[0]) && ('DecodeParms' == $v[1]) && (isset($sarr[$k + 1][1]))) {
                $decpar = $sarr[$k + 1][1];
                foreach ($decpar as $kdc => $vdc) {
                    if (
                        '/' == $vdc[0]
                        && 'Columns' == $vdc[1]
                        && (
                            isset($decpar[$kdc + 1])
                            && 'numeric' == $decpar[$kdc + 1][0]
                        )
                    ) {
                        $columns = (int) $decpar[$kdc + 1][1];
                    } elseif (
                        '/' == $vdc[0]
                        && 'Predictor' == $vdc[1]
                        && (
                            isset($decpar[$kdc + 1])
                            && 'numeric' == $decpar[$kdc + 1][0]
                        )
                    ) {
                        $predictor = (int) $decpar[$kdc + 1][1];
                    }
                }
            } elseif ($filltrailer) {
                if (('/' == $v[0]) && ('Size' == $v[1]) && (isset($sarr[$k + 1]) && ('numeric' == $sarr[$k + 1][0]))) {
                    $xref['trailer']['size'] = $sarr[$k + 1][1];
                } elseif (('/' == $v[0]) && ('Root' == $v[1]) && (isset($sarr[$k + 1]) && ('objref' == $sarr[$k + 1][0]))) {
                    $xref['trailer']['root'] = $sarr[$k + 1][1];
                } elseif (('/' == $v[0]) && ('Info' == $v[1]) && (isset($sarr[$k + 1]) && ('objref' == $sarr[$k + 1][0]))) {
                    $xref['trailer']['info'] = $sarr[$k + 1][1];
                } elseif (('/' == $v[0]) && ('Encrypt' == $v[1]) && (isset($sarr[$k + 1]) && ('objref' == $sarr[$k + 1][0]))) {
                    $xref['trailer']['encrypt'] = $sarr[$k + 1][1];
                } elseif (('/' == $v[0]) && ('ID' == $v[1]) && (isset($sarr[$k + 1]))) {
                    $xref['trailer']['id'] = [];
                    $xref['trailer']['id'][0] = $sarr[$k + 1][1][0][1];
                    $xref['trailer']['id'][1] = $sarr[$k + 1][1][1][1];
                }
            }
        }

        // decode data
        if ($valid_crs && isset($xrefcrs[1][3][0])) {
            if (null !== $predictor) {
                // number of bytes in a row
                $rowlen = ($columns + 1);
                // convert the stream into an array of integers
                /** @var array<int> */
                $sdata = unpack('C*', $xrefcrs[1][3][0]);
                // TODO: Handle the case when unpack returns false

                // split the rows
                $sdata = array_chunk($sdata, $rowlen);

                // initialize decoded array
                $ddata = [];
                // initialize first row with zeros
                $prev_row = array_fill(0, $rowlen, 0);
                // for each row apply PNG unpredictor
                foreach ($sdata as $k => $row) {
                    // initialize new row
                    $ddata[$k] = [];
                    // get PNG predictor value
                    $predictor = (10 + $row[0]);
                    // for each byte on the row
                    for ($i = 1; $i <= $columns; ++$i) {
                        // new index
                        $j = ($i - 1);
                        $row_up = $prev_row[$j];
                        if (1 == $i) {
                            $row_left = 0;
                            $row_upleft = 0;
                        } else {
                            $row_left = $row[$i - 1];
                            $row_upleft = $prev_row[$j - 1];
                        }
                        switch ($predictor) {
                            case 10:  // PNG prediction (on encoding, PNG None on all rows)
                                $ddata[$k][$j] = $row[$i];
                                break;

                            case 11:  // PNG prediction (on encoding, PNG Sub on all rows)
                                $ddata[$k][$j] = (($row[$i] + $row_left) & 0xFF);
                                break;

                            case 12:  // PNG prediction (on encoding, PNG Up on all rows)
                                $ddata[$k][$j] = (($row[$i] + $row_up) & 0xFF);
                                break;

                            case 13:  // PNG prediction (on encoding, PNG Average on all rows)
                                // the average is rounded down; intdiv() avoids a float operand for the bitwise AND
                                $ddata[$k][$j] = (($row[$i] + intdiv($row_left + $row_up, 2)) & 0xFF);
                                break;

                            case 14:  // PNG prediction (on encoding, PNG Paeth on all rows)
                                // initial estimate
                                $p = ($row_left + $row_up - $row_upleft);
                                // distances
                                $pa = abs($p - $row_left);
                                $pb = abs($p - $row_up);
                                $pc = abs($p - $row_upleft);
                                $pmin = min($pa, $pb, $pc);
                                // return minimum distance
                                switch ($pmin) {
                                    case $pa:
                                        $ddata[$k][$j] = (($row[$i] + $row_left) & 0xFF);
                                        break;

                                    case $pb:
                                        $ddata[$k][$j] = (($row[$i] + $row_up) & 0xFF);
                                        break;

                                    case $pc:
                                        $ddata[$k][$j] = (($row[$i] + $row_upleft) & 0xFF);
                                        break;
                                }
                                break;

                            default:  // PNG prediction (on encoding, PNG optimum)
                                throw new \Exception('Unknown PNG predictor: '.$predictor);
                        }
                    }
                    $prev_row = $ddata[$k];
                } // end for each row
            // complete decoding
            } else {
                // number of bytes in a row
                // Scrutinizer flagged (Bug) that $rowlen may be a float while array_chunk()
                // expects an integer length, hence the explicit cast
                $rowlen = (int) array_sum($wb);
                // convert the stream into an array of integers
                $sdata = unpack('C*', $xrefcrs[1][3][0]);
                // split the rows
                $ddata = array_chunk($sdata, $rowlen);
            }

            $sdata = [];

            // for every row
            foreach ($ddata as $k => $row) {
                // initialize new row
                $sdata[$k] = [0, 0, 0];
                if (0 == $wb[0]) {
                    // default type field
                    $sdata[$k][0] = 1;
                }
                $i = 0; // count bytes in the row
                // for every column
                for ($c = 0; $c < 3; ++$c) {
                    // for every byte on the column
                    for ($b = 0; $b < $wb[$c]; ++$b) {
                        if (isset($row[$i])) {
                            $sdata[$k][$c] += ($row[$i] << (($wb[$c] - 1 - $b) * 8));
                        }
                        ++$i;
                    }
                }
            }

            // fill xref
            if (isset($index_blocks)) {
                // load the first object number of the first /Index entry
                $obj_num = $index_blocks[0][0];
            } else {
                $obj_num = 0;
            }
            foreach ($sdata as $k => $row) {
                switch ($row[0]) {
                    case 0:  // (f) linked list of free objects
                        break;

                    case 1:  // (n) objects that are in use but are not compressed
                        // create unique object index: [object number]_[generation number]
                        $index = $obj_num.'_'.$row[2];
                        // check if object already exists
                        if (!isset($xref['xref'][$index])) {
                            // store object offset position
                            $xref['xref'][$index] = $row[1];
                        }
                        break;

                    case 2:  // compressed objects
                        // $row[1] = object number of the object stream in which this object is stored
                        // $row[2] = index of this object within the object stream
                        $index = $row[1].'_0_'.$row[2];
                        $xref['xref'][$index] = -1;
                        break;

                    default:  // null objects
                        break;
                }
                ++$obj_num;
                if (isset($index_blocks)) {
                    // reduce the number of remaining objects
                    --$index_blocks[0][1];
                    // Scrutinizer flagged (Comprehensibility, Best Practice) that $index_blocks is not
                    // defined on all execution paths; the isset() check above guards this access
                    if (0 == $index_blocks[0][1]) {
                        // remove the actual used /Index entry
                        array_shift($index_blocks);
                        if (0 < \count($index_blocks)) {
                            // load the first object number of the following /Index entry
                            $obj_num = $index_blocks[0][0];
                        } else {
                            // if there are no more entries, remove $index_blocks to avoid actions on an empty array
                            unset($index_blocks);
                        }
                    }
                }
            }
        } // end decoding data
        if (isset($prevxref)) {
            // get previous xref
            $xref = $this->getXrefData($pdfData, $prevxref, $xref);
        }

        return $xref;
    }

    protected function getObjectHeaderPattern(array $objRefs): string
    {
        // consider all whitespace characters (PDF specifications)
        return '/'.$objRefs[0].$this->config->getPdfWhitespacesRegex().$objRefs[1].$this->config->getPdfWhitespacesRegex().'obj/';
    }

    protected function getObjectHeaderLen(array $objRefs): int
    {
        // "4 0 obj"
        // 2 whitespaces + strlen("obj") = 5
        return 5 + \strlen($objRefs[0]) + \strlen($objRefs[1]);
    }

    /**
     * Get content of indirect object.
     *
     * @param string $pdfData  PDF data
     * @param array  $xref     Xref data, passed on to decodeStream()
     * @param string $objRef   Object number and generation number separated by underscore character
     * @param int    $offset   Object offset
     * @param bool   $decoding If true decode streams
     *
     * @return array containing object data
     *
     * @throws \Exception if invalid object reference found
     */
    protected function getIndirectObject(string $pdfData, array $xref, string $objRef, int $offset = 0, bool $decoding = true): array
    {
        /*
         * build indirect object header
         */
        // $objHeader = "[object number] [generation number] obj"
        $objRefArr = explode('_', $objRef);
        if (2 !== \count($objRefArr)) {
            throw new \Exception('Invalid object reference for $obj.');
        }

        $objHeaderLen = $this->getObjectHeaderLen($objRefArr);

        /*
         * check if we are in position
         */
        // ignore whitespace characters at offset
        $offset += strspn($pdfData, $this->config->getPdfWhitespaces(), $offset);
        // ignore leading zeros for object number
        $offset += strspn($pdfData, '0', $offset);
        if (0 == preg_match($this->getObjectHeaderPattern($objRefArr), substr($pdfData, $offset, $objHeaderLen))) {
            // an indirect reference to an undefined object shall be considered a reference to the null object
            return ['null', 'null', $offset];
        }

        /*
         * get content
         */
        // starting position of object content
        $offset += $objHeaderLen;
        $objContentArr = [];
        $i = 0; // object main index
        do {
            $oldOffset = $offset;
            // get element
            $element = $this->getRawObject($pdfData, $offset);
            $offset = $element[2];
            // decode stream using stream's dictionary information
            if ($decoding && ('stream' === $element[0]) && (isset($objContentArr[$i - 1][0])) && ('<<' === $objContentArr[$i - 1][0])) {
                $element[3] = $this->decodeStream($pdfData, $xref, $objContentArr[$i - 1][1], $element[1]);
            }
            $objContentArr[$i] = $element;
            ++$i;
        } while (('endobj' !== $element[0]) && ($offset !== $oldOffset));
        // remove closing delimiter
        array_pop($objContentArr);

        /*
         * return raw object content
         */
        return $objContentArr;
    }

    /**
     * Get the content of object, resolving indirect object reference if necessary.
     *
     * @param string $pdfData PDF data
     * @param array  $xref    Xref data, used to look up the offset of a referenced object
     * @param array  $obj     Object value
     *
     * @return array containing object data
     *
     * @throws \Exception
     */
    protected function getObjectVal(string $pdfData, $xref, array $obj): array
    {
        if ('objref' == $obj[0]) {
            // reference to indirect object
            if (isset($this->objects[$obj[1]])) {
                // this object has been already parsed
                return $this->objects[$obj[1]];
            } elseif (isset($xref[$obj[1]])) {
                // parse new object
                $this->objects[$obj[1]] = $this->getIndirectObject($pdfData, $xref, $obj[1], $xref[$obj[1]], false);

                return $this->objects[$obj[1]];
            }
        }

        return $obj;
    }

605
    /**
606
     * Get object type, raw value and offset to next object
607
     *
608
     * @param int $offset Object offset
609
     *
610
     * @return array containing object type, raw value and offset to next object
611
     */
612 42
    protected function getRawObject(string $pdfData, int $offset = 0): array
613
    {
614 42
        $objtype = ''; // object type to be returned
615 42
        $objval = ''; // object value to be returned
616
617
        // skip initial white space chars
618 42
        $offset += strspn($pdfData, $this->config->getPdfWhitespaces(), $offset);
619
620
        // get first char
621 42
        $char = $pdfData[$offset];
622
        // get object type
623 42
        switch ($char) {
624 42
            case '%':  // \x25 PERCENT SIGN
625
                // skip comment and search for next token
626 1
                $next = strcspn($pdfData, "\r\n", $offset);
627 1
                if ($next > 0) {
628 1
                    $offset += $next;
629
630 1
                    return $this->getRawObject($pdfData, $offset);
631
                }
632
                break;
633
634 42
            case '/':  // \x2F SOLIDUS
635
                // name object
636 42
                $objtype = $char;
637 42
                ++$offset;
638 42
                $span = strcspn($pdfData, "\x00\x09\x0a\x0c\x0d\x20\n\t\r\v\f\x28\x29\x3c\x3e\x5b\x5d\x7b\x7d\x2f\x25", $offset, 256);
639 42
                if ($span > 0) {
640 42
                    $objval = substr($pdfData, $offset, $span); // unescaped value
641
                    $offset += $span;
642
                }
643 42
                break;
644 42
645 42
            case '(':   // \x28 LEFT PARENTHESIS
            case ')':  // \x29 RIGHT PARENTHESIS
                // literal string object
                $objtype = $char;
                ++$offset;
                $strpos = $offset;
                if ('(' == $char) {
                    $open_bracket = 1;
                    while ($open_bracket > 0) {
                        if (!isset($pdfData[$strpos])) {
                            break;
                        }
                        $ch = $pdfData[$strpos];
                        switch ($ch) {
                            case '\\':  // REVERSE SOLIDUS (5Ch) (Backslash)
                                // skip next character
                                ++$strpos;
                                break;

                            case '(':  // LEFT PARENTHESIS (28h)
                                ++$open_bracket;
                                break;

                            case ')':  // RIGHT PARENTHESIS (29h)
                                --$open_bracket;
                                break;
                        }
                        ++$strpos;
                    }
                    $objval = substr($pdfData, $offset, $strpos - $offset - 1);
                    $offset = $strpos;
                }
                break;

            case '[':  // \x5B LEFT SQUARE BRACKET
            case ']':  // \x5D RIGHT SQUARE BRACKET
                // array object
                $objtype = $char;
                ++$offset;
                if ('[' == $char) {
                    // get array content
                    $objval = [];
                    do {
                        $oldOffset = $offset;
                        // get element
                        $element = $this->getRawObject($pdfData, $offset);
                        $offset = $element[2];
                        $objval[] = $element;
                    } while ((']' != $element[0]) && ($offset != $oldOffset));
                    // remove closing delimiter
                    array_pop($objval);
                }
                break;

            case '<':  // \x3C LESS-THAN SIGN
            case '>':  // \x3E GREATER-THAN SIGN
                if (isset($pdfData[$offset + 1]) && ($pdfData[$offset + 1] == $char)) {
                    // dictionary object
                    $objtype = $char.$char;
                    $offset += 2;
                    if ('<' == $char) {
                        // get dictionary content
                        $objval = [];
                        do {
                            $oldOffset = $offset;
                            // get element
                            $element = $this->getRawObject($pdfData, $offset);
                            $offset = $element[2];
                            $objval[] = $element;
                        } while (('>>' != $element[0]) && ($offset != $oldOffset));
                        // remove closing delimiter
                        array_pop($objval);
                    }
                } else {
                    // hexadecimal string object
                    $objtype = $char;
                    ++$offset;

                    $span = strspn($pdfData, "0123456789abcdefABCDEF\x09\x0a\x0c\x0d\x20", $offset);
                    if (('<' == $char) && $span > 0 && isset($pdfData[$offset + $span]) && '>' == $pdfData[$offset + $span]) {
                        // remove white space characters (strtr() with an empty replacement is a no-op)
                        $objval = str_replace(str_split($this->config->getPdfWhitespaces()), '', substr($pdfData, $offset, $span));
                        $offset += $span + 1;
                    } elseif (false !== ($endpos = strpos($pdfData, '>', $offset))) {
                        $offset = $endpos + 1;
                    }
                }
                break;

            default:
                if ('endobj' == substr($pdfData, $offset, 6)) {
                    // end of indirect object
                    $objtype = 'endobj';
                    $offset += 6;
                } elseif ('null' == substr($pdfData, $offset, 4)) {
                    // null object
                    $objtype = 'null';
                    $offset += 4;
                    $objval = 'null';
                } elseif ('true' == substr($pdfData, $offset, 4)) {
                    // boolean true object
                    $objtype = 'boolean';
                    $offset += 4;
                    $objval = 'true';
                } elseif ('false' == substr($pdfData, $offset, 5)) {
                    // boolean false object
                    $objtype = 'boolean';
                    $offset += 5;
                    $objval = 'false';
                } elseif ('stream' == substr($pdfData, $offset, 6)) {
                    // start stream object
                    $objtype = 'stream';
                    $offset += 6;
                    if (1 == preg_match('/^([\r]?[\n])/isU', substr($pdfData, $offset, 4), $matches)) {
                        $offset += \strlen($matches[0]);
                        $pregResult = preg_match(
                            '/(endstream)[\x09\x0a\x0c\x0d\x20]/isU',
                            $pdfData,
                            $matches,
                            \PREG_OFFSET_CAPTURE,
                            $offset
                        );
                        if (1 == $pregResult) {
                            $objval = substr($pdfData, $offset, $matches[0][1] - $offset);
                            $offset = $matches[1][1];
                        }
                    }
                } elseif ('endstream' == substr($pdfData, $offset, 9)) {
                    // end stream object
                    $objtype = 'endstream';
                    $offset += 9;
                } elseif (1 == preg_match('/^([0-9]+)[\s]+([0-9]+)[\s]+R/iU', substr($pdfData, $offset, 33), $matches)) {
                    // indirect object reference
                    $objtype = 'objref';
                    $offset += \strlen($matches[0]);
                    $objval = (int) $matches[1].'_'.(int) $matches[2];
                } elseif (1 == preg_match('/^([0-9]+)[\s]+([0-9]+)[\s]+obj/iU', substr($pdfData, $offset, 33), $matches)) {
                    // object start
                    $objtype = 'obj';
                    $objval = (int) $matches[1].'_'.(int) $matches[2];
                    $offset += \strlen($matches[0]);
                } elseif (($numlen = strspn($pdfData, '+-.0123456789', $offset)) > 0) {
                    // numeric object
                    $objtype = 'numeric';
                    $objval = substr($pdfData, $offset, $numlen);
                    $offset += $numlen;
                }
                break;
        }

        return [$objtype, $objval, $offset];
    }

    /**
     * Get Cross-Reference (xref) table and trailer data from PDF document data.
     *
     * @param string $pdfData PDF document data
     * @param int    $offset  xref offset (if known)
     * @param array  $xref    previous xref array (if any)
     *
     * @return array containing xref and trailer data
     *
     * @throws \Exception if it was unable to find startxref
     * @throws \Exception if it was unable to find xref
     */
    protected function getXrefData(string $pdfData, int $offset = 0, array $xref = []): array
    {
        // kept in its own variable because the preg_match() calls below overwrite $matches
        $startxrefPreg = preg_match(
            '/[\r\n]startxref[\s]*[\r\n]+([0-9]+)[\s]*[\r\n]+%%EOF/i',
            $pdfData,
            $startxrefMatches,
            \PREG_OFFSET_CAPTURE,
            $offset
        );

        if (0 == $offset) {
            // find last startxref
            $pregResult = preg_match_all(
                '/[\r\n]startxref[\s]*[\r\n]+([0-9]+)[\s]*[\r\n]+%%EOF/i',
                $pdfData,
                $matches,
                \PREG_SET_ORDER,
                $offset
            );
            // Scrutinizer (Bug, Best Practice): $pregResult of type integer|null is loosely
            // compared to 0; this is ambiguous because null == 0 is true as well as 0 == 0.
            // Consider using a strict comparison (===).
            if (0 == $pregResult) {
                throw new \Exception('Unable to find startxref');
            }
            $matches = array_pop($matches);
            $startxref = $matches[1];
        } elseif (strpos($pdfData, 'xref', $offset) == $offset) {
            // Already pointing at the xref table
            $startxref = $offset;
        } elseif (preg_match('/([0-9]+[\s][0-9]+[\s]obj)/i', $pdfData, $matches, \PREG_OFFSET_CAPTURE, $offset)) {
            // Cross-Reference Stream object
            $startxref = $offset;
        } elseif ($startxrefPreg) {
            // startxref found
            $startxref = $startxrefMatches[1][0];
        } else {
            throw new \Exception('Unable to find startxref');
        }

        if ($startxref > \strlen($pdfData)) {
            throw new \Exception('Unable to find xref (PDF corrupted?)');
        }

        // check xref position
        if (strpos($pdfData, 'xref', $startxref) == $startxref) {
            // Cross-Reference
            $xref = $this->decodeXref($pdfData, $startxref, $xref);
        } else {
            // Cross-Reference Stream
            $xref = $this->decodeXrefStream($pdfData, $startxref, $xref);
        }
        if (empty($xref)) {
            throw new \Exception('Unable to find xref');
        }

        return $xref;
    }

    /**
     * Parses PDF data and returns extracted data as array.
     *
     * @param string $data PDF data to parse
     *
     * @return array containing the xref data and the parsed PDF document objects
     *
     * @throws \Exception if empty PDF data given
     * @throws \Exception if PDF data missing %PDF header
     */
    public function parseData(string $data): array
    {
        if (empty($data)) {
            throw new \Exception('Empty PDF data given.');
        }
        // find the pdf header starting position
        if (false === ($trimpos = strpos($data, '%PDF-'))) {
            throw new \Exception('Invalid PDF data: missing %PDF header.');
        }

        // get PDF content string
        $pdfData = $trimpos > 0 ? substr($data, $trimpos) : $data;

        // get xref and trailer data
        $xref = $this->getXrefData($pdfData);

        // parse all document objects
        $objects = [];
        foreach ($xref['xref'] as $obj => $offset) {
            if (!isset($objects[$obj]) && ($offset > 0)) {
                // decode objects with positive offset
                $objects[$obj] = $this->getIndirectObject($pdfData, $xref, $obj, $offset, true);
            }
        }

        return [$xref, $objects];
    }
}
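
The literal-string branch of getRawObject() is one of the pieces that could be pulled out when applying Extract Class. The following is only a minimal sketch of that branch as a standalone function, not code from the library: the name `scanLiteralString()` is hypothetical, and returning the remaining input for an unterminated string is an assumption that differs slightly from the method above.

```php
<?php

/**
 * Hypothetical helper (not part of RawDataParser): read a PDF literal string
 * that starts with "(" at $offset and return [value, offset after the string].
 */
function scanLiteralString(string $data, int $offset): array
{
    if (!isset($data[$offset]) || '(' !== $data[$offset]) {
        throw new InvalidArgumentException('Expected "(" at offset '.$offset);
    }

    $start = $offset + 1; // first byte of the string content
    $pos = $start;
    $depth = 1;           // number of currently open parentheses
    $length = strlen($data);

    while ($depth > 0 && $pos < $length) {
        switch ($data[$pos]) {
            case '\\':
                // a backslash escapes the next byte, so skip it
                ++$pos;
                break;
            case '(':
                ++$depth;
                break;
            case ')':
                --$depth;
                break;
        }
        ++$pos;
    }

    $next = min($pos, $length);
    // when balanced, $pos is one past the closing ")"; otherwise keep the remainder
    $end = (0 === $depth) ? $pos - 1 : $next;

    return [substr($data, $start, $end - $start), $next];
}
```

For example, `scanLiteralString('(a(b)c) rest', 0)` returns `['a(b)c', 7]`: nested parentheses stay part of the value, and the returned offset points just past the closing parenthesis.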
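
The "find last startxref" step of getXrefData() can also be read in isolation. The sketch below reuses the regular expression from that method in a hypothetical free function, `findLastStartxref()`, which is not part of the library. It treats both `0` and `false` from `preg_match_all()` as "not found", which is an assumption about the desired behaviour rather than what the reviewed code does.

```php
<?php

/**
 * Hypothetical helper (not part of RawDataParser): return the byte offset
 * announced by the last "startxref ... %%EOF" block in $pdfData.
 */
function findLastStartxref(string $pdfData): int
{
    $count = preg_match_all(
        '/[\r\n]startxref[\s]*[\r\n]+([0-9]+)[\s]*[\r\n]+%%EOF/i',
        $pdfData,
        $matches,
        PREG_SET_ORDER
    );

    // preg_match_all() returns 0 when nothing matched and false on error
    if (!$count) {
        throw new RuntimeException('Unable to find startxref');
    }

    // with PREG_SET_ORDER each entry is one full match; index 1 is the captured offset
    $last = $matches[$count - 1];

    return (int) $last[1];
}
```

A PDF with incremental updates contains several startxref blocks; the last one is the entry point, which is why the sketch takes the final match instead of the first.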