Passed
Push — feature/remove-tcpdf-lib ( d32e07...6276a3 )
by Konrad
01:47
created

RawDataParser::decodeStream()   D

Complexity

Conditions 21
Paths 51

Size

Total Lines 56
Code Lines 32

Duplication

Lines 0
Ratio 0 %

Importance

Changes 2
Bugs 1 Features 1
Metric Value
cc 21
eloc 32
c 2
b 1
f 1
nc 51
nop 3
dl 0
loc 56
rs 4.1666

How to fix: Long Method, Complexity

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
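As a rough sketch of that advice (not the project's actual refactoring), the two commented steps at the top of a method like decodeStream() could each become a small function named after its comment. The names truncateToDeclaredLength() and collectFilterNames() are hypothetical, the dictionary is assumed to be a flat list of [type, value] tokens as in the listing on this page, and resolving indirect objects is deliberately left out to keep the example self-contained.

```php
<?php

/**
 * Hypothetical helper for the "get declared stream length" step: cut the
 * stream down to the /Length declared in its dictionary, if that is shorter.
 */
function truncateToDeclaredLength(array $sdic, string $stream): string
{
    foreach ($sdic as $k => $v) {
        $next = $sdic[$k + 1] ?? null;
        if ('/' === $v[0] && 'Length' === $v[1] && null !== $next && 'numeric' === $next[0]) {
            $declared = (int) $next[1];
            if ($declared < \strlen($stream)) {
                $stream = substr($stream, 0, $declared);
            }
        }
    }

    return $stream;
}

/**
 * Hypothetical helper for the "get filters" step: collect the filter names
 * that follow a /Filter key, given either as one name or as an array of names.
 */
function collectFilterNames(array $sdic): array
{
    $filters = [];
    foreach ($sdic as $k => $v) {
        $next = $sdic[$k + 1] ?? null;
        if ('/' !== $v[0] || 'Filter' !== $v[1] || null === $next) {
            continue;
        }
        if ('/' === $next[0]) {
            // single filter
            $filters[] = $next[1];
        } elseif ('[' === $next[0]) {
            // array of filters
            foreach ($next[1] as $flt) {
                if ('/' === $flt[0]) {
                    $filters[] = $flt[1];
                }
            }
        }
    }

    return $filters;
}

// Usage with a made-up stream dictionary.
$sdic = [
    ['/', 'Length'], ['numeric', '3'],
    ['/', 'Filter'], ['[', [['/', 'ASCIIHexDecode'], ['/', 'FlateDecode']]],
];
$stream = truncateToDeclaredLength($sdic, 'abcdef');
$filters = collectFilterNames($sdic);
```

With helpers like these, the calling method shrinks to a few lines that read like its former comments, and each helper can be tested on its own.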

Commonly applied refactorings include:

<?php

/**
 * This file is based on code of tecnickcom/TCPDF PDF library.
 *
 * Original author Nicola Asuni ([email protected]) and
 * contributors (https://github.com/tecnickcom/TCPDF/graphs/contributors).
 *
 * @see https://github.com/tecnickcom/TCPDF
 *
 * Original code was licensed on the terms of the LGPL v3.
 *
 * ------------------------------------------------------------------------------
 *
 * @file This file is part of the PdfParser library.
 *
 * @author  Konrad Abicht <[email protected]>
 * @date    2020-01-06
 *
 * @license LGPLv3
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser\RawData;

use Exception;

class RawDataParser
{
    /**
     * Configuration array.
     */
    protected $cfg = [
        // if `true` ignore filter decoding errors
        'ignore_filter_decoding_errors' => true,
        // if `true` ignore missing filter decoding errors
        'ignore_missing_filter_decoders' => true,
    ];

    protected $filterHelper;

    /**
     * @param array $cfg Configuration array, default is []
     */
    public function __construct($cfg = [])
    {
        // merge given array with default values
        $this->cfg = array_merge($this->cfg, $cfg);

        $this->filterHelper = new FilterHelper();
    }

    /**
     * Decode the specified stream.
     *
     * @param string $pdfData PDF data
     * @param array  $sdic    Stream's dictionary array
     * @param string $stream  Stream to decode
     *
     * @return array containing decoded stream data and remaining filters
     */
    public function decodeStream($pdfData, $sdic, $stream)
    {
        // get stream length and filters
        $slength = \strlen($stream);
        if ($slength <= 0) {
            return ['', []];
        }
        $filters = [];
        foreach ($sdic as $k => $v) {
            if ('/' == $v[0]) {
                if (('Length' == $v[1]) and (isset($sdic[($k + 1)])) and ('numeric' == $sdic[($k + 1)][0])) {
                    // get declared stream length
                    $declength = (int) ($sdic[($k + 1)][1]);
                    if ($declength < $slength) {
                        $stream = substr($stream, 0, $declength);
                        $slength = $declength;
                    }
                } elseif (('Filter' == $v[1]) and (isset($sdic[($k + 1)]))) {
                    // resolve indirect object
                    $objval = $this->getObjectVal($pdfData, $sdic[($k + 1)]);
                    if ('/' == $objval[0]) {
                        // single filter
                        $filters[] = $objval[1];
                    } elseif ('[' == $objval[0]) {
                        // array of filters
                        foreach ($objval[1] as $flt) {
                            if ('/' == $flt[0]) {
                                $filters[] = $flt[1];
                            }
                        }
                    }
                }
            }
        }

        // decode the stream
        $remaining_filters = [];
        foreach ($filters as $filter) {
            if (\in_array($filter, $this->filterHelper->getAvailableFilters())) {
                try {
                    $stream = $this->filterHelper->decodeFilter($filter, $stream);
                } catch (Exception $e) {
                    $emsg = $e->getMessage();
                    if ((('~' == $emsg[0]) && !$this->cfg['ignore_missing_filter_decoders'])
                        || (('~' != $emsg[0]) && !$this->cfg['ignore_filter_decoding_errors'])
                    ) {
                        throw new Exception($e->getMessage());
                    }
                }
            } else {
                // add missing filter to array
                $remaining_filters[] = $filter;
            }
        }

        return [$stream, $remaining_filters];
    }

    /**
     * Decode the Cross-Reference section
     *
     * @param string $pdfData   PDF data
     * @param int    $startxref Offset at which the xref section starts (position of the 'xref' keyword)
     * @param array  $xref      Previous xref array (if any)
     *
     * @return array containing xref and trailer data
     */
    public function decodeXref($pdfData, $startxref, $xref = [])
    {
        $startxref += 4; // 4 is the length of the word 'xref'
        // skip initial white space chars: \x00 null (NUL), \x09 horizontal tab (HT), \x0A line feed (LF), \x0C form feed (FF), \x0D carriage return (CR), \x20 space (SP)
        $offset = $startxref + strspn($pdfData, "\x00\x09\x0a\x0c\x0d\x20", $startxref);
        // initialize object number
        $obj_num = 0;
        // search for cross-reference entries or subsection
        while (preg_match('/([0-9]+)[\x20]([0-9]+)[\x20]?([nf]?)(\r\n|[\x20]?[\r\n])/', $pdfData, $matches, PREG_OFFSET_CAPTURE, $offset) > 0) {
            if ($matches[0][1] != $offset) {
                // we are on another section
                break;
            }
            $offset += \strlen($matches[0][0]);
            if ('n' == $matches[3][0]) {
                // create unique object index: [object number]_[generation number]
                $index = $obj_num.'_'.(int) ($matches[2][0]);
                // check if object already exists
                if (!isset($xref['xref'][$index])) {
                    // store object offset position
                    $xref['xref'][$index] = (int) ($matches[1][0]);
                }
                ++$obj_num;
            } elseif ('f' == $matches[3][0]) {
                ++$obj_num;
            } else {
                // object number (index)
                $obj_num = (int) ($matches[1][0]);
            }
        }
        // get trailer data
        if (preg_match('/trailer[\s]*<<(.*)>>/isU', $pdfData, $matches, PREG_OFFSET_CAPTURE, $offset) > 0) {
            $trailer_data = $matches[1][0];
            if (!isset($xref['trailer']) or empty($xref['trailer'])) {
                // get only the last updated version
                $xref['trailer'] = [];
                // parse trailer_data
                if (preg_match('/Size[\s]+([0-9]+)/i', $trailer_data, $matches) > 0) {
                    $xref['trailer']['size'] = (int) ($matches[1]);
                }
                if (preg_match('/Root[\s]+([0-9]+)[\s]+([0-9]+)[\s]+R/i', $trailer_data, $matches) > 0) {
                    $xref['trailer']['root'] = (int) ($matches[1]).'_'.(int) ($matches[2]);
                }
                if (preg_match('/Encrypt[\s]+([0-9]+)[\s]+([0-9]+)[\s]+R/i', $trailer_data, $matches) > 0) {
                    $xref['trailer']['encrypt'] = (int) ($matches[1]).'_'.(int) ($matches[2]);
                }
                if (preg_match('/Info[\s]+([0-9]+)[\s]+([0-9]+)[\s]+R/i', $trailer_data, $matches) > 0) {
                    $xref['trailer']['info'] = (int) ($matches[1]).'_'.(int) ($matches[2]);
                }
                if (preg_match('/ID[\s]*[\[][\s]*[<]([^>]*)[>][\s]*[<]([^>]*)[>]/i', $trailer_data, $matches) > 0) {
                    $xref['trailer']['id'] = [];
                    $xref['trailer']['id'][0] = $matches[1];
                    $xref['trailer']['id'][1] = $matches[2];
                }
            }
            if (preg_match('/Prev[\s]+([0-9]+)/i', $trailer_data, $matches) > 0) {
                // get previous xref
                $xref = $this->getXrefData($pdfData, (int) ($matches[1]), $xref);
            }
        } else {
            throw new Exception('Unable to find trailer');
        }

        return $xref;
    }

    /**
     * Decode the Cross-Reference Stream section
     *
     * @param string $pdfData   PDF data
     * @param int    $startxref Offset at which the xref section starts
     * @param array  $xref      Previous xref array (if any)
     *
     * @return array containing xref and trailer data
     *
     * @throws Exception if unknown PNG predictor detected
     */
    public function decodeXrefStream($pdfData, $startxref, $xref = [])
    {
        // try to read Cross-Reference Stream
        $xrefobj = $this->getRawObject($pdfData, $startxref);
        $xrefcrs = $this->getIndirectObject($pdfData, $xrefobj[1], $startxref, true);
        if (!isset($xref['trailer']) or empty($xref['trailer'])) {
            // get only the last updated version
            $xref['trailer'] = [];
            $filltrailer = true;
        } else {
            $filltrailer = false;
        }
        if (!isset($xref['xref'])) {
            $xref['xref'] = [];
        }
        $valid_crs = false;
        $columns = 0;
        $sarr = $xrefcrs[0][1];
        if (!\is_array($sarr)) {
            $sarr = [];
        }

        $wb = [];

        foreach ($sarr as $k => $v) {
            if (
                ('/' == $v[0])
                && ('Type' == $v[1])
                && (
                    isset($sarr[($k + 1)])
                    && '/' == $sarr[($k + 1)][0]
                    && 'XRef' == $sarr[($k + 1)][1]
                )
            ) {
                $valid_crs = true;
            } elseif (('/' == $v[0]) and ('Index' == $v[1]) and (isset($sarr[($k + 1)]))) {
                // first object number in the subsection
                $index_first = (int) ($sarr[($k + 1)][1][0][1]);
            } elseif (('/' == $v[0]) and ('Prev' == $v[1]) and (isset($sarr[($k + 1)]) and ('numeric' == $sarr[($k + 1)][0]))) {
                // get previous xref offset
                $prevxref = (int) ($sarr[($k + 1)][1]);
            } elseif (('/' == $v[0]) and ('W' == $v[1]) and (isset($sarr[($k + 1)]))) {
                // number of bytes (in the decoded stream) of the corresponding field
                $wb[0] = (int) ($sarr[($k + 1)][1][0][1]);
                $wb[1] = (int) ($sarr[($k + 1)][1][1][1]);
                $wb[2] = (int) ($sarr[($k + 1)][1][2][1]);
            } elseif (('/' == $v[0]) and ('DecodeParms' == $v[1]) and (isset($sarr[($k + 1)][1]))) {
                $decpar = $sarr[($k + 1)][1];
                foreach ($decpar as $kdc => $vdc) {
                    if (
                        '/' == $vdc[0]
                        && 'Columns' == $vdc[1]
                        && (
                            isset($decpar[($kdc + 1)])
                            && 'numeric' == $decpar[($kdc + 1)][0]
                        )
                    ) {
                        $columns = (int) ($decpar[($kdc + 1)][1]);
                    } elseif (
                        '/' == $vdc[0]
                        && 'Predictor' == $vdc[1]
                        && (
                            isset($decpar[($kdc + 1)])
                            && 'numeric' == $decpar[($kdc + 1)][0]
                        )
                    ) {
                        $predictor = (int) ($decpar[($kdc + 1)][1]);
Unused Code: The assignment to $predictor is dead and can be removed.
                    }
                }
            } elseif ($filltrailer) {
                if (('/' == $v[0]) and ('Size' == $v[1]) and (isset($sarr[($k + 1)]) and ('numeric' == $sarr[($k + 1)][0]))) {
                    $xref['trailer']['size'] = $sarr[($k + 1)][1];
                } elseif (('/' == $v[0]) and ('Root' == $v[1]) and (isset($sarr[($k + 1)]) and ('objref' == $sarr[($k + 1)][0]))) {
                    $xref['trailer']['root'] = $sarr[($k + 1)][1];
                } elseif (('/' == $v[0]) and ('Info' == $v[1]) and (isset($sarr[($k + 1)]) and ('objref' == $sarr[($k + 1)][0]))) {
                    $xref['trailer']['info'] = $sarr[($k + 1)][1];
                } elseif (('/' == $v[0]) and ('Encrypt' == $v[1]) and (isset($sarr[($k + 1)]) and ('objref' == $sarr[($k + 1)][0]))) {
                    $xref['trailer']['encrypt'] = $sarr[($k + 1)][1];
                } elseif (('/' == $v[0]) and ('ID' == $v[1]) and (isset($sarr[($k + 1)]))) {
                    $xref['trailer']['id'] = [];
                    $xref['trailer']['id'][0] = $sarr[($k + 1)][1][0][1];
                    $xref['trailer']['id'][1] = $sarr[($k + 1)][1][1][1];
                }
            }
        }

        // decode data
        if ($valid_crs and isset($xrefcrs[1][3][0])) {
            // number of bytes in a row
            $rowlen = ($columns + 1);
            // convert the stream into an array of integers
            $sdata = unpack('C*', $xrefcrs[1][3][0]);
            // split the rows
            $sdata = array_chunk($sdata, $rowlen);
Bug: It seems that $sdata can also be of type false; however, parameter $input of array_chunk() only seems to accept an array. Consider adding a type check. (Ignorable by annotation.)

If this is a false positive, you can ignore this issue in your code via the ignore-type annotation:

            $sdata = array_chunk(/** @scrutinizer ignore-type */ $sdata, $rowlen);
            // initialize decoded array
            $ddata = [];
            // initialize first row with zeros
            $prev_row = array_fill(0, $rowlen, 0);
            // for each row apply PNG unpredictor
            foreach ($sdata as $k => $row) {
                // initialize new row
                $ddata[$k] = [];
                // get PNG predictor value
                $predictor = (10 + $row[0]);
                // for each byte on the row
                for ($i = 1; $i <= $columns; ++$i) {
                    // new index
                    $j = ($i - 1);
                    $row_up = $prev_row[$j];
                    if (1 == $i) {
                        $row_left = 0;
                        $row_upleft = 0;
                    } else {
                        $row_left = $row[($i - 1)];
                        $row_upleft = $prev_row[($j - 1)];
                    }
                    switch ($predictor) {
                        case 10:  // PNG prediction (on encoding, PNG None on all rows)
                            $ddata[$k][$j] = $row[$i];
                            break;

                        case 11:  // PNG prediction (on encoding, PNG Sub on all rows)
                            $ddata[$k][$j] = (($row[$i] + $row_left) & 0xff);
                            break;

                        case 12:  // PNG prediction (on encoding, PNG Up on all rows)
                            $ddata[$k][$j] = (($row[$i] + $row_up) & 0xff);
                            break;

                        case 13:  // PNG prediction (on encoding, PNG Average on all rows)
                            // integer division, so that the operand of & stays an int
                            $ddata[$k][$j] = (($row[$i] + intdiv($row_left + $row_up, 2)) & 0xff);
                            break;

                        case 14:  // PNG prediction (on encoding, PNG Paeth on all rows)
                            // initial estimate
                            $p = ($row_left + $row_up - $row_upleft);
                            // distances
                            $pa = abs($p - $row_left);
                            $pb = abs($p - $row_up);
                            $pc = abs($p - $row_upleft);
                            $pmin = min($pa, $pb, $pc);
                            // return minimum distance
                            switch ($pmin) {
                                case $pa:
                                    $ddata[$k][$j] = (($row[$i] + $row_left) & 0xff);
                                    break;

                                case $pb:
                                    $ddata[$k][$j] = (($row[$i] + $row_up) & 0xff);
                                    break;

                                case $pc:
                                    $ddata[$k][$j] = (($row[$i] + $row_upleft) & 0xff);
                                    break;
                            }
                            break;

                        default:  // PNG prediction (on encoding, PNG optimum)
                            throw new Exception('Unknown PNG predictor');
                    }
                }
                $prev_row = $ddata[$k];
            } // end for each row
            // complete decoding
            $sdata = [];
            // for every row
            foreach ($ddata as $k => $row) {
                // initialize new row
                $sdata[$k] = [0, 0, 0];
                if (0 == $wb[0]) {
                    // default type field
                    $sdata[$k][0] = 1;
                }
                $i = 0; // count bytes in the row
                // for every column
                for ($c = 0; $c < 3; ++$c) {
                    // for every byte on the column
                    for ($b = 0; $b < $wb[$c]; ++$b) {
                        if (isset($row[$i])) {
                            $sdata[$k][$c] += ($row[$i] << (($wb[$c] - 1 - $b) * 8));
                        }
                        ++$i;
                    }
                }
            }
            $ddata = [];
Unused Code: The assignment to $ddata is dead and can be removed.
            // fill xref
            if (isset($index_first)) {
                $obj_num = $index_first;
            } else {
                $obj_num = 0;
            }
            foreach ($sdata as $k => $row) {
                switch ($row[0]) {
                    case 0:  // (f) linked list of free objects
                            break;

                    case 1:  // (n) objects that are in use but are not compressed
                            // create unique object index: [object number]_[generation number]
                            $index = $obj_num.'_'.$row[2];
                            // check if object already exists
                            if (!isset($xref['xref'][$index])) {
                                // store object offset position
                                $xref['xref'][$index] = $row[1];
                            }
                            break;

                    case 2:  // compressed objects
                            // $row[1] = object number of the object stream in which this object is stored
                            // $row[2] = index of this object within the object stream
                            $index = $row[1].'_0_'.$row[2];
                            $xref['xref'][$index] = -1;
                            break;

                    default:  // null objects
                            break;
                }
                ++$obj_num;
            }
        } // end decoding data
        if (isset($prevxref)) {
            // get previous xref
            $xref = $this->getXrefData($pdfData, $prevxref, $xref);
        }

        return $xref;
    }

    /**
     * Get content of indirect object.
     *
     * @param string $pdfData  PDF data
     * @param string $obj_ref  Object number and generation number separated by underscore character
     * @param int    $offset   Object offset
     * @param bool   $decoding If true decode streams
     *
     * @return array containing object data
     *
     * @throws Exception if invalid object reference found
     */
    public function getIndirectObject($pdfData, $obj_ref, $offset = 0, $decoding = true)
    {
        $obj = explode('_', $obj_ref);
        if ((false === $obj) or (2 != \count($obj))) {
            throw new Exception('Invalid object reference for $obj.');
        }
        $objref = $obj[0].' '.$obj[1].' obj';
        // ignore leading zeros
        $offset += strspn($pdfData, '0', $offset);
        if (strpos($pdfData, $objref, $offset) != $offset) {
            // an indirect reference to an undefined object shall be considered a reference to the null object
            return ['null', 'null', $offset];
        }
        // starting position of object content
        $offset += \strlen($objref);
        // get array of object content
        $objdata = [];
        $i = 0; // object main index
        do {
            $oldoffset = $offset;
            // get element
            $element = $this->getRawObject($pdfData, $offset);
            $offset = $element[2];
            // decode stream using stream's dictionary information
            if ($decoding and ('stream' == $element[0]) and (isset($objdata[($i - 1)][0])) and ('<<' == $objdata[($i - 1)][0])) {
                $element[3] = $this->decodeStream($pdfData, $objdata[($i - 1)][1], $element[1]);
            }
            $objdata[$i] = $element;
            ++$i;
        } while (('endobj' != $element[0]) and ($offset != $oldoffset));

        // remove closing delimiter
        array_pop($objdata);

        // return raw object content
        return $objdata;
    }

    /**
     * Get the content of object, resolving indirect object reference if necessary.
     *
     * @param string $pdfData PDF data
     * @param string $obj     Object value
     *
     * @return array containing object data
     */
    public function getObjectVal($pdfData, $obj)
    {
        if ('objref' == $obj[0]) {
            // reference to indirect object
            if (isset($this->objects[$obj[1]])) {
                // this object has been already parsed
                return $this->objects[$obj[1]];
            } elseif (isset($this->xref[$obj[1]])) {
Bug Best Practice: The property xref does not exist on Smalot\PdfParser\RawData\RawDataParser. Did you maybe forget to declare it?
                // parse new object
                $this->objects[$obj[1]] = $this->getIndirectObject($pdfData, $obj[1], $this->xref[$obj[1]], false);
Bug Best Practice: The property objects does not exist. Although not strictly required by PHP, it is generally a best practice to declare properties explicitly.

                return $this->objects[$obj[1]];
            }
        }

        return $obj;
Bug Best Practice: The expression return $obj returns the type string which is incompatible with the documented return type array.
    }

    /**
     * Get object type, raw value and offset to next object
     *
     * @param string $pdfData PDF data
     * @param int    $offset  Object offset
     *
     * @return array containing object type, raw value and offset to next object
     */
    public function getRawObject($pdfData, $offset = 0)
    {
        $objtype = ''; // object type to be returned
        $objval = ''; // object value to be returned

        /*
         * skip initial white space chars:
         *      \x00 null (NUL)
         *      \x09 horizontal tab (HT)
         *      \x0A line feed (LF)
         *      \x0C form feed (FF)
         *      \x0D carriage return (CR)
         *      \x20 space (SP)
         */
        $offset += strspn($pdfData, "\x00\x09\x0a\x0c\x0d\x20", $offset);

        // get first char
        $char = $pdfData[$offset];
        // get object type
        switch ($char) {
            case '%':  // \x25 PERCENT SIGN
                    // skip comment and search for next token
                    $next = strcspn($pdfData, "\r\n", $offset);
                    if ($next > 0) {
                        $offset += $next;

                        return $this->getRawObject($pdfData, $offset);
                    }
                    break;

            case '/':  // \x2F SOLIDUS
                    // name object
                    $objtype = $char;
                    ++$offset;
                    $pregResult = preg_match(
                        '/^([^\x00\x09\x0a\x0c\x0d\x20\s\x28\x29\x3c\x3e\x5b\x5d\x7b\x7d\x2f\x25]+)/',
                        substr($pdfData, $offset, 256),
                        $matches
                    );
                    if (1 == $pregResult) {
                        $objval = $matches[1]; // unescaped value
                        $offset += \strlen($objval);
                    }
                    break;

            case '(':   // \x28 LEFT PARENTHESIS
            case ')':  // \x29 RIGHT PARENTHESIS
                    // literal string object
                    $objtype = $char;
                    ++$offset;
                    $strpos = $offset;
                    if ('(' == $char) {
                        $open_bracket = 1;
                        while ($open_bracket > 0) {
                            if (!isset($pdfData[$strpos])) {
                                break;
                            }
                            $ch = $pdfData[$strpos];
                            switch ($ch) {
                                case '\\':  // REVERSE SOLIDUS (5Ch) (Backslash)
                                        // skip next character
                                        ++$strpos;
                                        break;

                                case '(':  // LEFT PARENTHESIS (28h)
                                        ++$open_bracket;
                                        break;

                                case ')':  // RIGHT PARENTHESIS (29h)
                                        --$open_bracket;
                                        break;
                            }
                            ++$strpos;
                        }
                        $objval = substr($pdfData, $offset, ($strpos - $offset - 1));
                        $offset = $strpos;
                    }
                    break;

            case '[':   // \x5B LEFT SQUARE BRACKET
            case ']':  // \x5D RIGHT SQUARE BRACKET
                    // array object
                    $objtype = $char;
                    ++$offset;
                    if ('[' == $char) {
                        // get array content
                        $objval = [];
                        do {
                            // get element
                            $element = $this->getRawObject($pdfData, $offset);
                            $offset = $element[2];
                            $objval[] = $element;
                        } while (']' != $element[0]);
                        // remove closing delimiter
                        array_pop($objval);
                    }
                    break;

            case '<':  // \x3C LESS-THAN SIGN
633
            case '>':  // \x3E GREATER-THAN SIGN
634
                    if (isset($pdfData[($offset + 1)]) and ($pdfData[($offset + 1)] == $char)) {
635
                        // dictionary object
636
                        $objtype = $char.$char;
637
                        $offset += 2;
638
                        if ('<' == $char) {
639
                            // get array content
640
                            $objval = [];
641
                            do {
642
                                // get element
643
                                $element = $this->getRawObject($pdfData, $offset);
644
                                $offset = $element[2];
645
                                $objval[] = $element;
646
                            } while ('>>' != $element[0]);
647
                            // remove closing delimiter
648
                            array_pop($objval);
649
                        }
                    } else {
                        // hexadecimal string object
                        $objtype = $char;
                        ++$offset;
                        $pregResult = preg_match(
                            '/^([0-9A-Fa-f\x09\x0a\x0c\x0d\x20]+)>/iU',
                            substr($pdfData, $offset),
                            $matches
                        );
                        if (('<' == $char) && 1 == $pregResult) {
                            // remove white space characters; strtr() with an empty
                            // replacement string is a no-op, so use str_replace()
                            $objval = str_replace(["\x09", "\x0a", "\x0c", "\x0d", "\x20"], '', $matches[1]);
                            $offset += \strlen($matches[0]);
                        } elseif (false !== ($endpos = strpos($pdfData, '>', $offset))) {
                            $offset = $endpos + 1;
                        }
                    }
                    break;

            default:
                    if ('endobj' == substr($pdfData, $offset, 6)) {
                        // end of indirect object
                        $objtype = 'endobj';
                        $offset += 6;
                    } elseif ('null' == substr($pdfData, $offset, 4)) {
                        // null object
                        $objtype = 'null';
                        $offset += 4;
                        $objval = 'null';
                    } elseif ('true' == substr($pdfData, $offset, 4)) {
                        // boolean true object
                        $objtype = 'boolean';
                        $offset += 4;
                        $objval = 'true';
                    } elseif ('false' == substr($pdfData, $offset, 5)) {
                        // boolean false object
                        $objtype = 'boolean';
                        $offset += 5;
                        $objval = 'false';
                    } elseif ('stream' == substr($pdfData, $offset, 6)) {
                        // start stream object
                        $objtype = 'stream';
                        $offset += 6;
                        if (1 == preg_match('/^([\r]?[\n])/isU', substr($pdfData, $offset), $matches)) {
                            $offset += \strlen($matches[0]);
                            $pregResult = preg_match(
                                '/(endstream)[\x09\x0a\x0c\x0d\x20]/isU',
                                substr($pdfData, $offset),
                                $matches,
                                PREG_OFFSET_CAPTURE
                            );
                            if (1 == $pregResult) {
                                $objval = substr($pdfData, $offset, $matches[0][1]);
                                $offset += $matches[1][1];
                            }
                        }
                    } elseif ('endstream' == substr($pdfData, $offset, 9)) {
                        // end stream object
                        $objtype = 'endstream';
                        $offset += 9;
                    } elseif (1 == preg_match('/^([0-9]+)[\s]+([0-9]+)[\s]+R/iU', substr($pdfData, $offset, 33), $matches)) {
                        // indirect object reference
                        $objtype = 'objref';
                        $offset += \strlen($matches[0]);
                        $objval = (int) ($matches[1]).'_'.(int) ($matches[2]);
                    } elseif (1 == preg_match('/^([0-9]+)[\s]+([0-9]+)[\s]+obj/iU', substr($pdfData, $offset, 33), $matches)) {
                        // object start
                        $objtype = 'obj';
                        $objval = (int) ($matches[1]).'_'.(int) ($matches[2]);
                        $offset += \strlen($matches[0]);
                    } elseif (($numlen = strspn($pdfData, '+-.0123456789', $offset)) > 0) {
                        // numeric object
                        $objtype = 'numeric';
                        $objval = substr($pdfData, $offset, $numlen);
                        $offset += $numlen;
                    }
                    break;
        }

        return [$objtype, $objval, $offset];
    }

    /**
     * Get Cross-Reference (xref) table and trailer data from PDF document data.
     *
     * @param string $pdfData
     * @param int    $offset  xref offset (if known)
     * @param array  $xref    previous xref array (if any)
     *
     * @return array containing xref and trailer data
     *
     * @throws Exception if it was unable to find startxref
     * @throws Exception if it was unable to find xref
     */
    public function getXrefData($pdfData, $offset = 0, $xref = [])
    {
        $startxrefPreg = preg_match(
            '/[\r\n]startxref[\s]*[\r\n]+([0-9]+)[\s]*[\r\n]+%%EOF/i',
            $pdfData,
            $matches,
            PREG_OFFSET_CAPTURE,
            $offset
        );

        if (0 == $offset) {
            // find last startxref
            $pregResult = preg_match_all(
                '/[\r\n]startxref[\s]*[\r\n]+([0-9]+)[\s]*[\r\n]+%%EOF/i',
                $pdfData, $matches,
                PREG_SET_ORDER,
                $offset
            );
            if (0 == $pregResult) {
                throw new Exception('Unable to find startxref');
            }
            $matches = array_pop($matches);
            $startxref = $matches[1];
        } elseif (strpos($pdfData, 'xref', $offset) == $offset) {
            // Already pointing at the xref table
            $startxref = $offset;
        } elseif (preg_match('/([0-9]+[\s][0-9]+[\s]obj)/i', $pdfData, $matches, PREG_OFFSET_CAPTURE, $offset)) {
            // Cross-Reference Stream object
            $startxref = $offset;
        } elseif ($startxrefPreg) {
            // startxref found
            $startxref = $matches[1][0];
        } else {
            throw new Exception('Unable to find startxref');
        }

        // check xref position
        if (strpos($pdfData, 'xref', $startxref) == $startxref) {
            // Cross-Reference
            $xref = $this->decodeXref($pdfData, $startxref, $xref);
        } else {
            // Cross-Reference Stream
            $xref = $this->decodeXrefStream($pdfData, $startxref, $xref);
        }
        if (empty($xref)) {
            throw new Exception('Unable to find xref');
        }

        return $xref;
    }

    /**
     * Parses PDF data and returns extracted data as array.
     *
     * @param string $data PDF data to parse
     *
     * @return array array of parsed PDF document objects
     *
     * @throws Exception if empty PDF data given
     * @throws Exception if PDF data missing %PDF header
     */
    public function parseData($data)
    {
        if (empty($data)) {
            throw new Exception('Empty PDF data given.');
        }
        // find the pdf header starting position
        if (false === ($trimpos = strpos($data, '%PDF-'))) {
            throw new Exception('Invalid PDF data: missing %PDF header.');
        }

        // get PDF content string
        $pdfData = substr($data, $trimpos);

        // get xref and trailer data
        $xref = $this->getXrefData($pdfData);

        // parse all document objects
        $objects = [];
        foreach ($xref['xref'] as $obj => $offset) {
            if (!isset($objects[$obj]) and ($offset > 0)) {
                // decode objects with positive offset
                $objects[$obj] = $this->getIndirectObject($pdfData, $xref, $obj, $offset, true);
            }
        }

        return [$xref, $objects];
    }
}