Passed
Push — feature/php-8.4-support ( fabd62...304cd3 )
by Konrad
14:44 queued 16s
created

RawDataParser   F

Complexity

Total Complexity 213

Size/Duplication

Total Lines 929
Duplicated Lines 0 %

Test Coverage

Coverage 89.04%

Importance

Changes 11
Bugs 2 Features 1
Metric Value
eloc 471
c 11
b 2
f 1
dl 0
loc 929
ccs 398
cts 447
cp 0.8904
rs 2
wmc 213

12 Methods

Rating   Name   Duplication   Size   Complexity  
A __construct() 0 7 2
D decodeStream() 0 56 21
D decodeXref() 0 67 16
A getObjectHeaderLen() 0 5 1
A getObjectHeaderPattern() 0 4 1
F getRawObject() 0 191 43
A getObjectVal() 0 16 4
F decodeXrefStream() 0 266 84
B getXrefData() 0 61 11
B getHeaderValue() 0 28 11
B parseData() 0 32 8
B getIndirectObject() 0 53 11

How to fix   Complexity   

Complex Class

Complex classes like RawDataParser often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to finding such a component is to look for fields and methods that share the same prefixes or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

While breaking up the class, it is a good idea to analyze how other classes use RawDataParser, and based on these observations, apply Extract Interface, too.
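
As a rough illustration of Extract Class for this file, the byte-to-field conversion inside decodeXrefStream() could move into a small class of its own. The sketch below is only one possible cut: XrefFieldReader and read() are hypothetical names that do not exist in PdfParser, and the logic simply mirrors the nested loop over $wb in the listing that follows.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of PdfParser): turns one decoded
 * cross-reference stream row into its three integer fields.
 */
final class XrefFieldReader
{
    /** @var int[] byte widths of the three fields, as given by /W */
    private $widths;

    /**
     * @param int[] $widths exactly three non-negative byte widths
     */
    public function __construct(array $widths)
    {
        if (3 !== \count($widths)) {
            throw new \InvalidArgumentException('Expected exactly three field widths.');
        }
        $this->widths = array_values($widths);
    }

    /**
     * @param int[] $row bytes of one row (values 0..255)
     *
     * @return int[] the three field values, type field first
     */
    public function read(array $row): array
    {
        $fields = [0, 0, 0];
        if (0 === $this->widths[0]) {
            // a zero-width type field defaults to type 1
            $fields[0] = 1;
        }
        $i = 0; // position within the row
        foreach ($this->widths as $c => $width) {
            for ($b = 0; $b < $width; ++$b) {
                if (isset($row[$i])) {
                    // big-endian: the first byte is the most significant
                    $fields[$c] += $row[$i] << (($width - 1 - $b) * 8);
                }
                ++$i;
            }
        }

        return $fields;
    }
}

$reader = new XrefFieldReader([1, 2, 1]);
var_dump($reader->read([1, 0x01, 0x2C, 0]) === [1, 300, 0]); // bool(true)
```

RawDataParser could then hold such an object and delegate to it, which would shorten decodeXrefStream() by the size of the moved loop. Whether this is the right boundary depends on how the rest of the library uses the parser.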

<?php

/**
 * This file is based on code of tecnickcom/TCPDF PDF library.
 *
 * Original author Nicola Asuni ([email protected]) and
 * contributors (https://github.com/tecnickcom/TCPDF/graphs/contributors).
 *
 * @see https://github.com/tecnickcom/TCPDF
 *
 * Original code was licensed on the terms of the LGPL v3.
 *
 * ------------------------------------------------------------------------------
 *
 * @file This file is part of the PdfParser library.
 *
 * @author  Konrad Abicht <[email protected]>
 *
 * @date    2020-01-06
 *
 * @license LGPLv3
 *
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser\RawData;

use Smalot\PdfParser\Config;

class RawDataParser
{
    /**
     * @var Config
     */
    private $config;

    /**
     * Configuration array.
     *
     * @var array<string,bool>
     */
    protected $cfg = [
        // if `true` ignore filter decoding errors
        'ignore_filter_decoding_errors' => true,
        // if `true` ignore missing filter decoding errors
        'ignore_missing_filter_decoders' => true,
    ];

    protected $filterHelper;
    protected $objects;

    /**
     * @param array $cfg Configuration array, default is []
     */
    public function __construct($cfg = [], ?Config $config = null)
    {
        // merge given array with default values
        $this->cfg = array_merge($this->cfg, $cfg);

        $this->filterHelper = new FilterHelper();
        $this->config = $config ?: new Config();
    }

    /**
     * Decode the specified stream.
     *
     * @param string $pdfData PDF data
     * @param array  $sdic    Stream's dictionary array
     * @param string $stream  Stream to decode
     *
     * @return array containing decoded stream data and remaining filters
     *
     * @throws \Exception
     */
    protected function decodeStream(string $pdfData, array $xref, array $sdic, string $stream): array
    {
        // get stream length and filters
        $slength = \strlen($stream);
        if ($slength <= 0) {
            return ['', []];
        }
        $filters = [];
        foreach ($sdic as $k => $v) {
            if ('/' == $v[0]) {
                if (('Length' == $v[1]) && (isset($sdic[$k + 1])) && ('numeric' == $sdic[$k + 1][0])) {
                    // get declared stream length
                    $declength = (int) $sdic[$k + 1][1];
                    if ($declength < $slength) {
                        $stream = substr($stream, 0, $declength);
                        $slength = $declength;
                    }
                } elseif (('Filter' == $v[1]) && (isset($sdic[$k + 1]))) {
                    // resolve indirect object
                    $objval = $this->getObjectVal($pdfData, $xref, $sdic[$k + 1]);
                    if ('/' == $objval[0]) {
                        // single filter
                        $filters[] = $objval[1];
                    } elseif ('[' == $objval[0]) {
                        // array of filters
                        foreach ($objval[1] as $flt) {
                            if ('/' == $flt[0]) {
                                $filters[] = $flt[1];
                            }
                        }
                    }
                }
            }
        }

        // decode the stream
        $remaining_filters = [];
        foreach ($filters as $filter) {
            if (\in_array($filter, $this->filterHelper->getAvailableFilters(), true)) {
                try {
                    $stream = $this->filterHelper->decodeFilter($filter, $stream, $this->config->getDecodeMemoryLimit());
                } catch (\Exception $e) {
                    $emsg = $e->getMessage();
                    if ((('~' == $emsg[0]) && !$this->cfg['ignore_missing_filter_decoders'])
                        || (('~' != $emsg[0]) && !$this->cfg['ignore_filter_decoding_errors'])
                    ) {
                        throw new \Exception($e->getMessage());
                    }
                }
            } else {
                // add missing filter to array
                $remaining_filters[] = $filter;
            }
        }

        return [$stream, $remaining_filters];
    }

    /**
     * Decode the Cross-Reference section
     *
     * @param string $pdfData   PDF data
     * @param int    $startxref Offset at which the xref section starts (position of the 'xref' keyword)
     * @param array  $xref      Previous xref array (if any)
     *
     * @return array containing xref and trailer data
     *
     * @throws \Exception
     */
    protected function decodeXref(string $pdfData, int $startxref, array $xref = []): array
    {
        $startxref += 4; // 4 is the length of the word 'xref'
        // skip initial white space chars
        $offset = $startxref + strspn($pdfData, $this->config->getPdfWhitespaces(), $startxref);
        // initialize object number
        $obj_num = 0;
        // search for cross-reference entries or subsection
        while (preg_match('/([0-9]+)[\x20]([0-9]+)[\x20]?([nf]?)(\r\n|[\x20]?[\r\n])/', $pdfData, $matches, \PREG_OFFSET_CAPTURE, $offset) > 0) {
            if ($matches[0][1] != $offset) {
                // we are on another section
                break;
            }
            $offset += \strlen($matches[0][0]);
            if ('n' == $matches[3][0]) {
                // create unique object index: [object number]_[generation number]
                $index = $obj_num.'_'.(int) $matches[2][0];
                // check if object already exists
                if (!isset($xref['xref'][$index])) {
                    // store object offset position
                    $xref['xref'][$index] = (int) $matches[1][0];
                }
                ++$obj_num;
            } elseif ('f' == $matches[3][0]) {
                ++$obj_num;
            } else {
                // object number (index)
                $obj_num = (int) $matches[1][0];
            }
        }
        // get trailer data
        if (preg_match('/trailer[\s]*<<(.*)>>/isU', $pdfData, $matches, \PREG_OFFSET_CAPTURE, $offset) > 0) {
            $trailer_data = $matches[1][0];
            if (!isset($xref['trailer']) || empty($xref['trailer'])) {
                // get only the last updated version
                $xref['trailer'] = [];
                // parse trailer_data
                if (preg_match('/Size[\s]+([0-9]+)/i', $trailer_data, $matches) > 0) {
                    $xref['trailer']['size'] = (int) $matches[1];
                }
                if (preg_match('/Root[\s]+([0-9]+)[\s]+([0-9]+)[\s]+R/i', $trailer_data, $matches) > 0) {
                    $xref['trailer']['root'] = (int) $matches[1].'_'.(int) $matches[2];
                }
                if (preg_match('/Encrypt[\s]+([0-9]+)[\s]+([0-9]+)[\s]+R/i', $trailer_data, $matches) > 0) {
                    $xref['trailer']['encrypt'] = (int) $matches[1].'_'.(int) $matches[2];
                }
                if (preg_match('/Info[\s]+([0-9]+)[\s]+([0-9]+)[\s]+R/i', $trailer_data, $matches) > 0) {
                    $xref['trailer']['info'] = (int) $matches[1].'_'.(int) $matches[2];
                }
                if (preg_match('/ID[\s]*[\[][\s]*[<]([^>]*)[>][\s]*[<]([^>]*)[>]/i', $trailer_data, $matches) > 0) {
                    $xref['trailer']['id'] = [];
                    $xref['trailer']['id'][0] = $matches[1];
                    $xref['trailer']['id'][1] = $matches[2];
                }
            }
            if (preg_match('/Prev[\s]+([0-9]+)/i', $trailer_data, $matches) > 0) {
                $offset = (int) $matches[1];
                if (0 != $offset) {
                    // get previous xref
                    $xref = $this->getXrefData($pdfData, $offset, $xref);
                }
            }
        } else {
            throw new \Exception('Unable to find trailer');
        }

        return $xref;
    }

    /**
     * Decode the Cross-Reference Stream section
     *
     * @param string $pdfData   PDF data
     * @param int    $startxref Offset at which the xref section starts
     * @param array  $xref      Previous xref array (if any)
     *
     * @return array containing xref and trailer data
     *
     * @throws \Exception if unknown PNG predictor detected
     */
    protected function decodeXrefStream(string $pdfData, int $startxref, array $xref = []): array
    {
        // try to read Cross-Reference Stream
        $xrefobj = $this->getRawObject($pdfData, $startxref);
        $xrefcrs = $this->getIndirectObject($pdfData, $xref, $xrefobj[1], $startxref, true);
        if (!isset($xref['trailer']) || empty($xref['trailer'])) {
            // get only the last updated version
            $xref['trailer'] = [];
            $filltrailer = true;
        } else {
            $filltrailer = false;
        }
        if (!isset($xref['xref'])) {
            $xref['xref'] = [];
        }
        $valid_crs = false;
        $columns = 0;
        $predictor = null;
        $sarr = $xrefcrs[0][1];
        if (!\is_array($sarr)) {
            $sarr = [];
        }

        $wb = [];

        foreach ($sarr as $k => $v) {
            if (
                ('/' == $v[0])
                && ('Type' == $v[1])
                && (
                    isset($sarr[$k + 1])
                    && '/' == $sarr[$k + 1][0]
                    && 'XRef' == $sarr[$k + 1][1]
                )
            ) {
                $valid_crs = true;
            } elseif (('/' == $v[0]) && ('Index' == $v[1]) && (isset($sarr[$k + 1]))) {
                // initialize list for: first object number in the subsection / number of objects
                $index_blocks = [];
                for ($m = 0; $m < \count($sarr[$k + 1][1]); $m += 2) {

Issue (Performance, Best Practice): It seems like you are calling the size function count() as part of the test condition. You might want to compute the size beforehand, and not on each iteration.

If the size of the collection does not change during the iteration, it is generally good practice to compute it beforehand, and not on each iteration:

for ($i=0; $i<count($array); $i++) { // calls count() on each iteration
}

// Better
for ($i=0, $c=count($array); $i<$c; $i++) { // calls count() just once
}

                    $index_blocks[] = [$sarr[$k + 1][1][$m][1], $sarr[$k + 1][1][$m + 1][1]];
                }
            } elseif (('/' == $v[0]) && ('Prev' == $v[1]) && (isset($sarr[$k + 1]) && ('numeric' == $sarr[$k + 1][0]))) {
                // get previous xref offset
                $prevxref = (int) $sarr[$k + 1][1];
            } elseif (('/' == $v[0]) && ('W' == $v[1]) && (isset($sarr[$k + 1]))) {
                // number of bytes (in the decoded stream) of the corresponding field
                $wb[0] = (int) $sarr[$k + 1][1][0][1];
                $wb[1] = (int) $sarr[$k + 1][1][1][1];
                $wb[2] = (int) $sarr[$k + 1][1][2][1];
            } elseif (('/' == $v[0]) && ('DecodeParms' == $v[1]) && (isset($sarr[$k + 1][1]))) {
                $decpar = $sarr[$k + 1][1];
                foreach ($decpar as $kdc => $vdc) {
                    if (
                        '/' == $vdc[0]
                        && 'Columns' == $vdc[1]
                        && (
                            isset($decpar[$kdc + 1])
                            && 'numeric' == $decpar[$kdc + 1][0]
                        )
                    ) {
                        $columns = (int) $decpar[$kdc + 1][1];
                    } elseif (
                        '/' == $vdc[0]
                        && 'Predictor' == $vdc[1]
                        && (
                            isset($decpar[$kdc + 1])
                            && 'numeric' == $decpar[$kdc + 1][0]
                        )
                    ) {
                        $predictor = (int) $decpar[$kdc + 1][1];
                    }
                }
            } elseif ($filltrailer) {
                if (('/' == $v[0]) && ('Size' == $v[1]) && (isset($sarr[$k + 1]) && ('numeric' == $sarr[$k + 1][0]))) {
                    $xref['trailer']['size'] = $sarr[$k + 1][1];
                } elseif (('/' == $v[0]) && ('Root' == $v[1]) && (isset($sarr[$k + 1]) && ('objref' == $sarr[$k + 1][0]))) {
                    $xref['trailer']['root'] = $sarr[$k + 1][1];
                } elseif (('/' == $v[0]) && ('Info' == $v[1]) && (isset($sarr[$k + 1]) && ('objref' == $sarr[$k + 1][0]))) {
                    $xref['trailer']['info'] = $sarr[$k + 1][1];
                } elseif (('/' == $v[0]) && ('Encrypt' == $v[1]) && (isset($sarr[$k + 1]) && ('objref' == $sarr[$k + 1][0]))) {
                    $xref['trailer']['encrypt'] = $sarr[$k + 1][1];
                } elseif (('/' == $v[0]) && ('ID' == $v[1]) && (isset($sarr[$k + 1]))) {
                    $xref['trailer']['id'] = [];
                    $xref['trailer']['id'][0] = $sarr[$k + 1][1][0][1];
                    $xref['trailer']['id'][1] = $sarr[$k + 1][1][1][1];
                }
            }
        }

        // decode data
        if ($valid_crs && isset($xrefcrs[1][3][0])) {
            if (null !== $predictor) {
                // number of bytes in a row
                $rowlen = ($columns + 1);
                // convert the stream into an array of integers
                /** @var array<int> */
                $sdata = unpack('C*', $xrefcrs[1][3][0]);
                // TODO: Handle the case when unpack returns false

                // split the rows
                $sdata = array_chunk($sdata, $rowlen);

                // initialize decoded array
                $ddata = [];
                // initialize first row with zeros
                $prev_row = array_fill(0, $rowlen, 0);
                // for each row apply PNG unpredictor
                foreach ($sdata as $k => $row) {
                    // initialize new row
                    $ddata[$k] = [];
                    // get PNG predictor value
                    $predictor = (10 + $row[0]);
                    // for each byte on the row
                    for ($i = 1; $i <= $columns; ++$i) {
                        // new index
                        $j = ($i - 1);
                        $row_up = $prev_row[$j];
                        if (1 == $i) {
                            $row_left = 0;
                            $row_upleft = 0;
                        } else {
                            $row_left = $row[$i - 1];
                            $row_upleft = $prev_row[$j - 1];
                        }
                        switch ($predictor) {
                            case 10:  // PNG prediction (on encoding, PNG None on all rows)
                                $ddata[$k][$j] = $row[$i];
                                break;

                            case 11:  // PNG prediction (on encoding, PNG Sub on all rows)
                                $ddata[$k][$j] = (($row[$i] + $row_left) & 0xFF);
                                break;

                            case 12:  // PNG prediction (on encoding, PNG Up on all rows)
                                $ddata[$k][$j] = (($row[$i] + $row_up) & 0xFF);
                                break;

                            case 13:  // PNG prediction (on encoding, PNG Average on all rows)
                                $ddata[$k][$j] = (($row[$i] + (($row_left + $row_up) / 2)) & 0xFF);
                                break;

                            case 14:  // PNG prediction (on encoding, PNG Paeth on all rows)
                                // initial estimate
                                $p = ($row_left + $row_up - $row_upleft);
                                // distances
                                $pa = abs($p - $row_left);
                                $pb = abs($p - $row_up);
                                $pc = abs($p - $row_upleft);
                                $pmin = min($pa, $pb, $pc);
                                // return minimum distance
                                switch ($pmin) {
                                    case $pa:
                                        $ddata[$k][$j] = (($row[$i] + $row_left) & 0xFF);
                                        break;

                                    case $pb:
                                        $ddata[$k][$j] = (($row[$i] + $row_up) & 0xFF);
                                        break;

                                    case $pc:
                                        $ddata[$k][$j] = (($row[$i] + $row_upleft) & 0xFF);
                                        break;
                                }
                                break;

                            default:  // PNG prediction (on encoding, PNG optimum)
                                throw new \Exception('Unknown PNG predictor: '.$predictor);
                        }
                    }
                    $prev_row = $ddata[$k];
                } // end for each row
                // complete decoding
            } else {
                // number of bytes in a row
                $rowlen = array_sum($wb);
                if (0 < $rowlen) {
                    // convert the stream into an array of integers
                    $sdata = unpack('C*', $xrefcrs[1][3][0]);
                    // split the rows
                    $ddata = array_chunk($sdata, $rowlen);

Issue (Bug): It seems like $rowlen can also be of type double; however, parameter $length of array_chunk() only seems to accept integer. Maybe add an additional type check? (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-type annotation:

                    $ddata = array_chunk($sdata, /** @scrutinizer ignore-type */ $rowlen);

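
One way to add the type check the analyzer asks for is to validate the sum before it reaches array_chunk(). The following is a minimal sketch under that assumption; rowLength() is a hypothetical helper, not something the library defines.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: sum the /W widths and guarantee an int result.
 *
 * @param array<int|float> $widths
 */
function rowLength(array $widths): int
{
    // array_sum() returns an int or a float depending on its input
    $sum = array_sum($widths);
    if (\is_float($sum)) {
        if (!is_finite($sum) || $sum !== floor($sum)) {
            throw new \UnexpectedValueException('Row length is not an integer.');
        }
        $sum = (int) $sum;
    }

    return $sum;
}

var_dump(rowLength([1, 2, 1])); // int(4)
```

Since the listing already casts each $wb entry to int, a float can only appear if the sum overflows, so a plain (int) cast at the call site may be enough in practice.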
                } else {
                    // if the row length is zero, $ddata should be an empty array as well
                    $ddata = [];
                }
            }

            $sdata = [];

            // for every row
            foreach ($ddata as $k => $row) {
                // initialize new row
                $sdata[$k] = [0, 0, 0];
                if (0 == $wb[0]) {
                    // default type field
                    $sdata[$k][0] = 1;
                }
                $i = 0; // count bytes in the row
                // for every column
                for ($c = 0; $c < 3; ++$c) {
                    // for every byte on the column
                    for ($b = 0; $b < $wb[$c]; ++$b) {
                        if (isset($row[$i])) {
                            $sdata[$k][$c] += ($row[$i] << (($wb[$c] - 1 - $b) * 8));
                        }
                        ++$i;
                    }
                }
            }

            // fill xref
            if (isset($index_blocks)) {
                // load the first object number of the first /Index entry
                $obj_num = $index_blocks[0][0];
            } else {
                $obj_num = 0;
            }
            foreach ($sdata as $k => $row) {
                switch ($row[0]) {
                    case 0:  // (f) linked list of free objects
                        break;

                    case 1:  // (n) objects that are in use but are not compressed
                        // create unique object index: [object number]_[generation number]
                        $index = $obj_num.'_'.$row[2];
                        // check if object already exists
                        if (!isset($xref['xref'][$index])) {
                            // store object offset position
                            $xref['xref'][$index] = $row[1];
                        }
                        break;

                    case 2:  // compressed objects
                        // $row[1] = object number of the object stream in which this object is stored
                        // $row[2] = index of this object within the object stream
                        $index = $row[1].'_0_'.$row[2];
                        $xref['xref'][$index] = -1;
                        break;

                    default:  // null objects
                        break;
                }
                ++$obj_num;
                if (isset($index_blocks)) {
                    // reduce the number of remaining objects
                    --$index_blocks[0][1];
                    if (0 == $index_blocks[0][1]) {

Issue (Comprehensibility, Best Practice): The variable $index_blocks does not seem to be defined for all execution paths leading up to this point.

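
A common way to address this kind of warning is to give the variable a defined value on every path and test it explicitly instead of relying on isset() and unset(). The sketch below isolates the object-numbering logic under that assumption; objectNumbers() is a hypothetical function, and it expects integer pairs, whereas the listing reads the values straight from the parsed dictionary.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: compute the object number for each xref stream row.
 *
 * @param int                    $rowCount    number of decoded rows
 * @param array<int, int[]>|null $indexBlocks /Index pairs [first object number, count], or null
 *
 * @return int[] one object number per row
 */
function objectNumbers(int $rowCount, ?array $indexBlocks = null): array
{
    // always an array from here on, so no path leaves it undefined
    $indexBlocks = $indexBlocks ?? [];
    $objNum = $indexBlocks[0][0] ?? 0;
    $numbers = [];
    for ($k = 0; $k < $rowCount; ++$k) {
        $numbers[] = $objNum;
        ++$objNum;
        if ([] === $indexBlocks) {
            continue;
        }
        // one object less remains in the current /Index entry
        --$indexBlocks[0][1];
        if (0 === $indexBlocks[0][1]) {
            array_shift($indexBlocks);
            if ([] !== $indexBlocks) {
                // jump to the first object number of the next entry
                $objNum = $indexBlocks[0][0];
            }
        }
    }

    return $numbers;
}

var_dump(objectNumbers(5, [[3, 2], [10, 3]]) === [3, 4, 10, 11, 12]); // bool(true)
```

Keeping the numbering separate from the switch on the row type would also make it testable on its own.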
                        // remove the actual used /Index entry
                        array_shift($index_blocks);
                        if (0 < \count($index_blocks)) {
                            // load the first object number of the following /Index entry
                            $obj_num = $index_blocks[0][0];
                        } else {
                            // if there are no more entries, remove $index_blocks to avoid actions on an empty array
                            unset($index_blocks);
                        }
                    }
                }
            }
        } // end decoding data
        if (isset($prevxref)) {
            // get previous xref
            $xref = $this->getXrefData($pdfData, $prevxref, $xref);
        }

        return $xref;
    }

    protected function getObjectHeaderPattern(array $objRefs): string
    {
        // consider all whitespace characters (PDF specifications)
        return '/'.$objRefs[0].$this->config->getPdfWhitespacesRegex().$objRefs[1].$this->config->getPdfWhitespacesRegex().'obj/';
    }

    protected function getObjectHeaderLen(array $objRefs): int
    {
        // "4 0 obj"
        // 2 whitespaces + strlen("obj") = 5
        return 5 + \strlen($objRefs[0]) + \strlen($objRefs[1]);
    }

    /**
     * Get content of indirect object.
     *
     * @param string $pdfData  PDF data
     * @param string $objRef   Object number and generation number separated by underscore character
     * @param int    $offset   Object offset
     * @param bool   $decoding If true decode streams
     *
     * @return array containing object data
     *
     * @throws \Exception if invalid object reference found
     */
    protected function getIndirectObject(string $pdfData, array $xref, string $objRef, int $offset = 0, bool $decoding = true): array
    {
        /*
         * build indirect object header
         */
        // $objHeader = "[object number] [generation number] obj"
        $objRefArr = explode('_', $objRef);
        if (2 !== \count($objRefArr)) {
            throw new \Exception('Invalid object reference for $obj.');
        }

        $objHeaderLen = $this->getObjectHeaderLen($objRefArr);

        /*
         * check if we are in position
         */
        // ignore whitespace characters at offset
        $offset += strspn($pdfData, $this->config->getPdfWhitespaces(), $offset);
        // ignore leading zeros for object number
        $offset += strspn($pdfData, '0', $offset);
        if (0 == preg_match($this->getObjectHeaderPattern($objRefArr), substr($pdfData, $offset, $objHeaderLen))) {
            // an indirect reference to an undefined object shall be considered a reference to the null object
            return ['null', 'null', $offset];
        }

        /*
         * get content
         */
        // starting position of object content
        $offset += $objHeaderLen;
        $objContentArr = [];
        $i = 0; // object main index
        $header = null;
        do {
            $oldOffset = $offset;
            // get element
            $element = $this->getRawObject($pdfData, $offset, null != $header ? $header[1] : null);
            $offset = $element[2];
            // decode stream using stream's dictionary information
            if ($decoding && ('stream' === $element[0]) && null != $header) {
                $element[3] = $this->decodeStream($pdfData, $xref, $header[1], $element[1]);
            }
            $objContentArr[$i] = $element;
            $header = isset($element[0]) && '<<' === $element[0] ? $element : null;
            ++$i;
        } while (('endobj' !== $element[0]) && ($offset !== $oldOffset));
        // remove closing delimiter
        array_pop($objContentArr);

        /*
         * return raw object content
         */
        return $objContentArr;
    }

    /**
     * Get the content of object, resolving indirect object reference if necessary.
     *
     * @param string $pdfData PDF data
     * @param array  $obj     Object value
     *
     * @return array containing object data
     *
     * @throws \Exception
     */
    protected function getObjectVal(string $pdfData, $xref, array $obj): array
    {
        if ('objref' == $obj[0]) {
            // reference to indirect object
            if (isset($this->objects[$obj[1]])) {
                // this object has been already parsed
                return $this->objects[$obj[1]];
            } elseif (isset($xref[$obj[1]])) {
                // parse new object
                $this->objects[$obj[1]] = $this->getIndirectObject($pdfData, $xref, $obj[1], $xref[$obj[1]], false);

                return $this->objects[$obj[1]];
            }
        }

        return $obj;
    }

617
    /**
618
     * Get object type, raw value and offset to next object
619
     *
620
     * @param int        $offset    Object offset
621
     * @param array|null $headerDic obj header's dictionary, parsed by getRawObject. Used for stream parsing optimization
622
     *
623
     * @return array containing object type, raw value and offset to next object
624
     */
625 68
    protected function getRawObject(string $pdfData, int $offset = 0, ?array $headerDic = null): array
    {
        $objtype = ''; // object type to be returned
        $objval = ''; // object value to be returned

        // skip initial white space chars
        $offset += strspn($pdfData, $this->config->getPdfWhitespaces(), $offset);

        // get first char
        $char = $pdfData[$offset];
        // get object type
        switch ($char) {
            case '%':  // \x25 PERCENT SIGN
                // skip comment and search for next token
                $next = strcspn($pdfData, "\r\n", $offset);
                if ($next > 0) {
                    $offset += $next;

                    return $this->getRawObject($pdfData, $offset);
                }
                break;

            case '/':  // \x2F SOLIDUS
                // name object
                $objtype = $char;
                ++$offset;
                $span = strcspn($pdfData, "\x00\x09\x0a\x0c\x0d\x20\n\t\r\v\f\x28\x29\x3c\x3e\x5b\x5d\x7b\x7d\x2f\x25", $offset, 256);
                if ($span > 0) {
                    $objval = substr($pdfData, $offset, $span); // unescaped value
                    $offset += $span;
                }
                break;

            case '(':   // \x28 LEFT PARENTHESIS
            case ')':  // \x29 RIGHT PARENTHESIS
                // literal string object
                $objtype = $char;
                ++$offset;
                $strpos = $offset;
                if ('(' == $char) {
                    $open_bracket = 1;
                    while ($open_bracket > 0) {
                        if (!isset($pdfData[$strpos])) {
                            break;
                        }
                        $ch = $pdfData[$strpos];
                        switch ($ch) {
                            case '\\':  // REVERSE SOLIDUS (5Ch) (Backslash)
                                // skip next character
                                ++$strpos;
                                break;

                            case '(':  // LEFT PARENTHESIS (28h)
                                ++$open_bracket;
                                break;

                            case ')':  // RIGHT PARENTHESIS (29h)
                                --$open_bracket;
                                break;
                        }
                        ++$strpos;
                    }
                    $objval = substr($pdfData, $offset, $strpos - $offset - 1);
                    $offset = $strpos;
                }
                break;

            case '[':   // \x5B LEFT SQUARE BRACKET
            case ']':  // \x5D RIGHT SQUARE BRACKET
                // array object
                $objtype = $char;
                ++$offset;
                if ('[' == $char) {
                    // get array content
                    $objval = [];
                    do {
                        $oldOffset = $offset;
                        // get element
                        $element = $this->getRawObject($pdfData, $offset);
                        $offset = $element[2];
                        $objval[] = $element;
                    } while ((']' != $element[0]) && ($offset != $oldOffset));
                    // remove closing delimiter
                    array_pop($objval);
                }
                break;

            case '<':  // \x3C LESS-THAN SIGN
            case '>':  // \x3E GREATER-THAN SIGN
                if (isset($pdfData[$offset + 1]) && ($pdfData[$offset + 1] == $char)) {
                    // dictionary object
                    $objtype = $char.$char;
                    $offset += 2;
                    if ('<' == $char) {
                        // get dictionary content
                        $objval = [];
                        do {
                            $oldOffset = $offset;
                            // get element
                            $element = $this->getRawObject($pdfData, $offset);
                            $offset = $element[2];
                            $objval[] = $element;
                        } while (('>>' != $element[0]) && ($offset != $oldOffset));
                        // remove closing delimiter
                        array_pop($objval);
                    }
                } else {
                    // hexadecimal string object
                    $objtype = $char;
                    ++$offset;

                    $span = strspn($pdfData, "0123456789abcdefABCDEF\x09\x0a\x0c\x0d\x20", $offset);
                    $dataToCheck = $pdfData[$offset + $span] ?? null;
                    if ('<' == $char && $span > 0 && '>' == $dataToCheck) {
                        // remove white space characters
                        $objval = str_replace(str_split($this->config->getPdfWhitespaces()), '', substr($pdfData, $offset, $span));
                        $offset += $span + 1;
                    } elseif (false !== ($endpos = strpos($pdfData, '>', $offset))) {
                        $offset = $endpos + 1;
                    }
                }
                break;

            default:
                if ('endobj' == substr($pdfData, $offset, 6)) {
                    // indirect object
                    $objtype = 'endobj';
                    $offset += 6;
                } elseif ('null' == substr($pdfData, $offset, 4)) {
                    // null object
                    $objtype = 'null';
                    $offset += 4;
                    $objval = 'null';
                } elseif ('true' == substr($pdfData, $offset, 4)) {
                    // boolean true object
                    $objtype = 'boolean';
                    $offset += 4;
                    $objval = 'true';
                } elseif ('false' == substr($pdfData, $offset, 5)) {
                    // boolean false object
                    $objtype = 'boolean';
                    $offset += 5;
                    $objval = 'false';
                } elseif ('stream' == substr($pdfData, $offset, 6)) {
                    // start stream object
                    $objtype = 'stream';
                    $offset += 6;
                    if (1 == preg_match('/^( *[\r]?[\n])/isU', substr($pdfData, $offset, 4), $matches)) {
                        $offset += \strlen($matches[0]);

                        // we get stream length here to later help preg_match test less data
                        $streamLen = (int) $this->getHeaderValue($headerDic, 'Length', 'numeric', 0);
                        $skip = false === $this->config->getRetainImageContent() && 'XObject' == $this->getHeaderValue($headerDic, 'Type', '/') && 'Image' == $this->getHeaderValue($headerDic, 'Subtype', '/');

                        $pregResult = preg_match(
                            '/(endstream)[\x09\x0a\x0c\x0d\x20]/isU',
                            $pdfData,
                            $matches,
                            \PREG_OFFSET_CAPTURE,
                            $offset + $streamLen
                        );

                        if (1 == $pregResult) {
                            $objval = $skip ? '' : substr($pdfData, $offset, $matches[0][1] - $offset);
                            $offset = $matches[1][1];
                        }
                    }
                } elseif ('endstream' == substr($pdfData, $offset, 9)) {
                    // end stream object
                    $objtype = 'endstream';
                    $offset += 9;
                } elseif (1 == preg_match('/^([0-9]+)[\s]+([0-9]+)[\s]+R/iU', substr($pdfData, $offset, 33), $matches)) {
                    // indirect object reference
                    $objtype = 'objref';
                    $offset += \strlen($matches[0]);
                    $objval = (int) $matches[1].'_'.(int) $matches[2];
                } elseif (1 == preg_match('/^([0-9]+)[\s]+([0-9]+)[\s]+obj/iU', substr($pdfData, $offset, 33), $matches)) {
                    // object start
                    $objtype = 'obj';
                    $objval = (int) $matches[1].'_'.(int) $matches[2];
                    $offset += \strlen($matches[0]);
                } elseif (($numlen = strspn($pdfData, '+-.0123456789', $offset)) > 0) {
                    // numeric object
                    $objtype = 'numeric';
                    $objval = substr($pdfData, $offset, $numlen);
                    $offset += $numlen;
                }
                break;
        }

        return [$objtype, $objval, $offset];
    }

    /**
     * Get value of an object header's section (obj << YYY >> part).
     *
     * It is similar to Header::get('...')->getContent(), the only difference is it can be used during the parsing process,
     * when no Smalot\PdfParser\Header objects have been created yet.
     *
     * @param string            $key     header's section name
     * @param string            $type    type of the section (e.g. 'numeric', '/', '<<', etc.)
     * @param string|array|null $default default value for header's section
     *
     * @return string|array|null value of obj header's section, or default value if none found, or its type doesn't match $type param
     */
    private function getHeaderValue(?array $headerDic, string $key, string $type, $default = '')
    {
        if (false === \is_array($headerDic)) {
            return $default;
        }

        /*
         * It receives the dictionary of header fields, as returned by RawDataParser::getRawObject,
         * and iterates over it, searching for a section of type '/' with the requested key.
         * If such a section is found, it tries to retrieve its value (the next object in the dictionary),
         * returning it if it matches the requested type, or the default value otherwise.
         */
        foreach ($headerDic as $i => $val) {
            $isSectionName = \is_array($val) && 3 == \count($val) && '/' == $val[0];
            if (
                $isSectionName
                && $val[1] == $key
                && isset($headerDic[$i + 1])
            ) {
                $isSectionValue = \is_array($headerDic[$i + 1]) && 1 < \count($headerDic[$i + 1]);

                return $isSectionValue && $type == $headerDic[$i + 1][0]
                    ? $headerDic[$i + 1][1]
                    : $default;
            }
        }

        return $default;
    }

    /**
     * Get Cross-Reference (xref) table and trailer data from PDF document data.
     *
     * @param int   $offset xref offset (if known)
     * @param array $xref   previous xref array (if any)
     *
     * @return array containing xref and trailer data
     *
     * @throws \Exception if it was unable to find startxref
     * @throws \Exception if it was unable to find xref
     */
    protected function getXrefData(string $pdfData, int $offset = 0, array $xref = []): array
    {
        // If the $offset is currently pointed at whitespace, bump it
        // forward until it isn't; affects loosely targeted offsets
        // for the 'xref' keyword
        // See: https://github.com/smalot/pdfparser/issues/673
        $bumpOffset = $offset;
        while (preg_match('/\s/', substr($pdfData, $bumpOffset, 1))) {
            ++$bumpOffset;
        }

        // Find all startxref tables from this $offset forward
        $startxrefPreg = preg_match_all(
            '/(?<=[\r\n])startxref[\s]*[\r\n]+([0-9]+)[\s]*[\r\n]+%%EOF/i',
            $pdfData,
            $startxrefMatches,
            \PREG_SET_ORDER,
            $offset
        );

        if (0 == $startxrefPreg) {
            // No startxref tables were found
            throw new \Exception('Unable to find startxref');
        } elseif (0 == $offset) {
            // Use the last startxref in the document
            $startxref = (int) $startxrefMatches[\count($startxrefMatches) - 1][1];
        } elseif (strpos($pdfData, 'xref', $bumpOffset) == $bumpOffset) {
            // Already pointing at the xref table
            $startxref = $bumpOffset;
        } elseif (preg_match('/([0-9]+[\s][0-9]+[\s]obj)/i', $pdfData, $matches, 0, $bumpOffset)) {
            // Cross-Reference Stream object
            $startxref = $bumpOffset;
        } else {
            // Use the next startxref from this $offset
            $startxref = (int) $startxrefMatches[0][1];
        }

        if ($startxref > \strlen($pdfData)) {
            throw new \Exception('Unable to find xref (PDF corrupted?)');
        }

        // check xref position
        if (strpos($pdfData, 'xref', $startxref) == $startxref) {
            // Cross-Reference
            $xref = $this->decodeXref($pdfData, $startxref, $xref);
        } else {
            // Check if the $pdfData might have the wrong line-endings
            $pdfDataUnix = str_replace("\r\n", "\n", $pdfData);
            if ($startxref < \strlen($pdfDataUnix) && strpos($pdfDataUnix, 'xref', $startxref) == $startxref) {
                // Return Unix-line-ending flag
                $xref = ['Unix' => true];
            } else {
                // Cross-Reference Stream
                $xref = $this->decodeXrefStream($pdfData, $startxref, $xref);
            }
        }
        if (empty($xref)) {
            throw new \Exception('Unable to find xref');
        }

        return $xref;
    }

    /**
     * Parses PDF data and returns extracted data as array.
     *
     * @param string $data PDF data to parse
     *
     * @return array the xref data, followed by the array of parsed PDF document objects
     *
     * @throws \Exception if empty PDF data given
     * @throws \Exception if PDF data missing %PDF header
     */
    public function parseData(string $data): array
    {
        if (empty($data)) {
            throw new \Exception('Empty PDF data given.');
        }
        // find the pdf header starting position
        if (false === ($trimpos = strpos($data, '%PDF-'))) {
            throw new \Exception('Invalid PDF data: missing %PDF header.');
        }

        // get PDF content string
        $pdfData = $trimpos > 0 ? substr($data, $trimpos) : $data;

        // get xref and trailer data
        $xref = $this->getXrefData($pdfData);

        // If we found Unix line-endings
        if (isset($xref['Unix'])) {
            $pdfData = str_replace("\r\n", "\n", $pdfData);
            $xref = $this->getXrefData($pdfData);
        }

        // parse all document objects
        $objects = [];
        foreach ($xref['xref'] as $obj => $offset) {
            if (!isset($objects[$obj]) && ($offset > 0)) {
                // decode objects with positive offset
                $objects[$obj] = $this->getIndirectObject($pdfData, $xref, $obj, $offset, true);
            }
        }

        return [$xref, $objects];
    }
}