Passed
Pull Request — master (#533)
by Frank
03:01 queued 43s

Parser::parseObject()   C

Complexity

Conditions 16
Paths 74

Size

Total Lines 92
Code Lines 56

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 55
CRAP Score 16.0014

Importance

Changes 6
Bugs 2
Features 0

Metric Value
cc 16
eloc 56
c 6
b 2
f 0
nc 74
nop 3
dl 0
loc 92
ccs 55
cts 56
cp 0.9821
crap 16.0014
rs 5.5666
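
For reference, the CRAP score is commonly defined from a method's cyclomatic complexity and its test coverage. Assuming Scrutinizer uses that standard definition (the page itself does not state the formula), plugging in the values from the table above (cc = 16, ccs/cts = 55/56) gives a result consistent with the reported 16.0014:

```latex
\[
\mathrm{CRAP}(m) = \mathrm{comp}(m)^2 \cdot \bigl(1 - \mathrm{cov}(m)\bigr)^3 + \mathrm{comp}(m)
\]
\[
\mathrm{CRAP} = 16^2 \cdot \left(1 - \tfrac{55}{56}\right)^3 + 16
             = \frac{256}{175616} + 16 \approx 16.00146
\]
```

With coverage this close to 100 %, the score is dominated by the complexity term, so reducing the number of conditions is what would lower it.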

How to fix: Long Method, Complexity

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

A commonly applied refactoring for this is Extract Method.
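
A minimal sketch of extracting a commented block into its own method, using the "Split xrefs and contents" and "Extract xrefs" comments from the stream branch of Parser::parseObject() as the starting point. The function name splitObjectStream() and the sample input are assumptions made for illustration and are not part of the library; the two regular expressions are copied from the method itself.

```php
<?php

/**
 * Hypothetical helper extracted from the 'stream' branch of parseObject().
 * Splits an object-stream payload into its offset table and the remaining body.
 *
 * @return array [array of byte offset => object id, body string]
 */
function splitObjectStream(string $content): array
{
    $match = [];

    // Split xrefs and contents (same pattern as in the original method).
    preg_match('/^((\d+\s+\d+\s*)*)(.*)$/s', $content, $match);
    $body = $match[3];

    // Extract xrefs: each entry is "<object id> <byte offset>".
    $xrefs = preg_split(
        '/(\d+\s+\d+\s*)/s',
        $match[1],
        -1,
        \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE
    );

    $table = [];
    foreach ($xrefs as $xref) {
        list($id, $offset) = preg_split('/\s+/', trim($xref));
        $table[(int) $offset] = $id;
    }

    // Order entries by their offset into the body.
    ksort($table);

    return [$table, $body];
}

// Made-up sample payload: two objects at offsets 0 and 8 of the body.
list($table, $body) = splitObjectStream('12 0 13 8 <</A 1>><</B 2>>');
// $table is [0 => '12', 8 => '13'], $body is '<</A 1>><</B 2>>'
```

Because the helper has its own scope, its loop variables can no longer collide with those of the calling method, and the two comments have turned into a name.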

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 * @date    2017-01-03
 *
 * @license LGPLv3
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser;

use Smalot\PdfParser\Element\ElementArray;
use Smalot\PdfParser\Element\ElementBoolean;
use Smalot\PdfParser\Element\ElementDate;
use Smalot\PdfParser\Element\ElementHexa;
use Smalot\PdfParser\Element\ElementName;
use Smalot\PdfParser\Element\ElementNull;
use Smalot\PdfParser\Element\ElementNumeric;
use Smalot\PdfParser\Element\ElementString;
use Smalot\PdfParser\Element\ElementXRef;
use Smalot\PdfParser\RawData\RawDataParser;

/**
 * Class Parser
 */
class Parser
{
    /**
     * @var Config
     */
    private $config;

    /**
     * @var PDFObject[]
     */
    protected $objects = [];

    protected $rawDataParser;

    /**
     * @var array<string, int> an array of xrefs with the number of objects we have seen for it (0-based indexed)
     */
    protected $xrefIndices = [];

    public function __construct($cfg = [], ?Config $config = null)
    {
        $this->config = $config ?: new Config();
        $this->rawDataParser = new RawDataParser($cfg, $this->config);
    }

    public function getConfig(): Config
    {
        return $this->config;
    }

    /**
     * @throws \Exception
     */
    public function parseFile(string $filename): Document
    {
        $content = file_get_contents($filename);
        /*
         * 2018/06/20 @doganoo as multiple times a
         * users have complained that the parseFile()
         * method dies silently, it is an better option
         * to remove the error control operator (@) and
         * let the users know that the method throws an exception
         * by adding @throws tag to PHPDoc.
         *
         * See here for an example: https://github.com/smalot/pdfparser/issues/204
         */
        return $this->parseContent($content);
    }

    /**
     * @param string $content PDF content to parse
     *
     * @throws \Exception if secured PDF file was detected
     * @throws \Exception if no object list was found
     */
    public function parseContent(string $content): Document
    {
        // Create structure from raw data.
        list($xref, $data) = $this->rawDataParser->parseData($content);

        if (isset($xref['trailer']['encrypt'])) {
            throw new \Exception('Secured pdf file are currently not supported.');
        }

        if (empty($data)) {
            throw new \Exception('Object list not found. Possible secured file.');
        }

        // Create destination object.
        $document = new Document();
        $this->objects = [];

        foreach ($data as $id => $structure) {
            $this->parseObject($id, $structure, $document);
            unset($data[$id]);
        }

        $document->setTrailer($this->parseTrailer($xref['trailer'], $document));
        $document->setObjects($this->objects);

        return $document;
    }

    protected function parseTrailer(array $structure, ?Document $document)
    {
        $trailer = [];

        foreach ($structure as $name => $values) {
            $name = ucfirst($name);

            if (is_numeric($values)) {
                $trailer[$name] = new ElementNumeric($values);
            } elseif (\is_array($values)) {
                $value = $this->parseTrailer($values, null);
                $trailer[$name] = new ElementArray($value, null);
            } elseif (false !== strpos($values, '_')) {
                $trailer[$name] = new ElementXRef($values, $document);
            } else {
                $trailer[$name] = $this->parseHeaderElement('(', $values, $document);
            }
        }

        return new Header($trailer, $document);
    }

    protected function parseObject(string $id, array $structure, ?Document $document)
    {
        $header = new Header([], $document);
        $content = '';

        foreach ($structure as $position => $part) {
            if (\is_int($part)) {
                $part = [null, null];
            }
            switch ($part[0]) {
                case '[':
                    $elements = [];

                    foreach ($part[1] as $sub_element) {

Bug: The expression $part[1] of type null is not traversable.

                        $sub_type = $sub_element[0];
                        $sub_value = $sub_element[1];
                        $elements[] = $this->parseHeaderElement($sub_type, $sub_value, $document);
                    }

                    $header = new Header($elements, $document);
                    break;

                case '<<':
                    $header = $this->parseHeader($part[1], $document);

Bug: $part[1] of type null is incompatible with the type array expected by parameter $structure of Smalot\PdfParser\Parser::parseHeader().

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation:

                    $header = $this->parseHeader(/** @scrutinizer ignore-type */ $part[1], $document);

                    break;

                case 'stream':
                    $content = isset($part[3][0]) ? $part[3][0] : $part[1];

                    if ($header->get('Type')->equals('ObjStm')) {
                        $match = [];

                        // Split xrefs and contents.
                        preg_match('/^((\d+\s+\d+\s*)*)(.*)$/s', $content, $match);
                        $content = $match[3];

                        // Extract xrefs.
                        $xrefs = preg_split(
                            '/(\d+\s+\d+\s*)/s',
                            $match[1],
                            -1,
                            \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE
                        );
                        $table = [];

                        foreach ($xrefs as $xref) {
                            list($id, $position) = preg_split("/\s+/", trim($xref));
                            $table[$position] = $id;
                        }

                        ksort($table);

                        $ids = array_values($table);
                        $positions = array_keys($table);

                        foreach ($positions as $index => $position) {

Comprehensibility, Bug: $position is overwriting a variable from outer foreach loop.

                            $xrefId = (string) $ids[$index];
                            if (\array_key_exists($xrefId, $this->xrefIndices)) {
                                $xrefIndex = $this->xrefIndices[$xrefId]++; // This xref was seen before, id becomes 9999_1 or 9999_2 etc.
                            } else {
                                $xrefIndex = $this->xrefIndices[$xrefId] = 0; // This xref was not seen before. id becomes 9999_0
                            }

                            $id = $xrefId.'_'.$xrefIndex;
                            $next_position = isset($positions[$index + 1]) ? $positions[$index + 1] : \strlen($content);
                            $sub_content = substr($content, $position, (int) $next_position - (int) $position);

                            $sub_header = Header::parse($sub_content, $document);

Bug: It seems like $document can also be of type null; however, parameter $document of Smalot\PdfParser\Header::parse() does only seem to accept Smalot\PdfParser\Document, maybe add an additional type check?

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation:

                            $sub_header = Header::parse($sub_content, /** @scrutinizer ignore-type */ $document);

                            $object = PDFObject::factory($document, $sub_header, '', $this->config);

Bug: It seems like $document can also be of type null; however, parameter $document of Smalot\PdfParser\PDFObject::factory() does only seem to accept Smalot\PdfParser\Document, maybe add an additional type check?

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation:

                            $object = PDFObject::factory(/** @scrutinizer ignore-type */ $document, $sub_header, '', $this->config);

                            $this->objects[$id] = $object;
                        }

                        // It is not necessary to store this content.

                        return;
                    }
                    break;

                default:
                    if ('null' != $part) {
                        $element = $this->parseHeaderElement($part[0], $part[1], $document);

                        if ($element) {
                            $header = new Header([$element], $document);
                        }
                    }
                    break;
            }
        }

        if (!isset($this->objects[$id])) {
            $this->objects[$id] = PDFObject::factory($document, $header, $content, $this->config);
        }
    }

    /**
     * @throws \Exception
     */
    protected function parseHeader(array $structure, ?Document $document): Header
    {
        $elements = [];
        $count = \count($structure);

        for ($position = 0; $position < $count; $position += 2) {
            $name = $structure[$position][1];
            $type = $structure[$position + 1][0];
            $value = $structure[$position + 1][1];

            $elements[$name] = $this->parseHeaderElement($type, $value, $document);
        }

        return new Header($elements, $document);
    }

    /**
     * @param string|array $value
     *
     * @return Element|Header|null
     *
     * @throws \Exception
     */
    protected function parseHeaderElement(?string $type, $value, ?Document $document)
    {
        switch ($type) {
            case '<<':
            case '>>':
                $header = $this->parseHeader($value, $document);

Bug: It seems like $value can also be of type string; however, parameter $structure of Smalot\PdfParser\Parser::parseHeader() does only seem to accept array, maybe add an additional type check?

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation:

                $header = $this->parseHeader(/** @scrutinizer ignore-type */ $value, $document);

                PDFObject::factory($document, $header, null, $this->config);

Bug: It seems like $document can also be of type null; however, parameter $document of Smalot\PdfParser\PDFObject::factory() does only seem to accept Smalot\PdfParser\Document, maybe add an additional type check?

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation:

                PDFObject::factory(/** @scrutinizer ignore-type */ $document, $header, null, $this->config);

                return $header;

            case 'numeric':
                return new ElementNumeric($value);

Bug: It seems like $value can also be of type array; however, parameter $value of Smalot\PdfParser\Element...tNumeric::__construct() does only seem to accept string, maybe add an additional type check?

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation:

                return new ElementNumeric(/** @scrutinizer ignore-type */ $value);

            case 'boolean':
                return new ElementBoolean($value);

Bug: It seems like $value can also be of type array; however, parameter $value of Smalot\PdfParser\Element...tBoolean::__construct() does only seem to accept boolean|string, maybe add an additional type check?

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation:

                return new ElementBoolean(/** @scrutinizer ignore-type */ $value);

            case 'null':
                return new ElementNull();

            case '(':
                if ($date = ElementDate::parse('('.$value.')', $document)) {

Bug: Are you sure $value of type array|string can be used in concatenation?

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation:

                if ($date = ElementDate::parse('('./** @scrutinizer ignore-type */ $value.')', $document)) {

                    return $date;
                }

                return ElementString::parse('('.$value.')', $document);

Bug, Best Practice: The expression return Smalot\PdfParser\...value . ')', $document) could also return false which is incompatible with the documented return type Smalot\PdfParser\Element...t\PdfParser\Header|null. Did you maybe forget to handle an error condition?

If the returned type also contains false, it is an indicator that maybe an error condition leading to the specific return statement remains unhandled.

            case '<':
                return $this->parseHeaderElement('(', ElementHexa::decode($value), $document);

Bug: It seems like $value can also be of type array; however, parameter $value of Smalot\PdfParser\Element\ElementHexa::decode() does only seem to accept string, maybe add an additional type check?

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation:

                return $this->parseHeaderElement('(', ElementHexa::decode(/** @scrutinizer ignore-type */ $value), $document);

            case '/':
                return ElementName::parse('/'.$value, $document);

Bug, Best Practice: The expression return Smalot\PdfParser\.../' . $value, $document) could also return false which is incompatible with the documented return type Smalot\PdfParser\Element...t\PdfParser\Header|null. Did you maybe forget to handle an error condition?

If the returned type also contains false, it is an indicator that maybe an error condition leading to the specific return statement remains unhandled.

            case 'ojbref': // old mistake in tcpdf parser
            case 'objref':
                return new ElementXRef($value, $document);

            case '[':
                $values = [];

                if (\is_array($value)) {
                    foreach ($value as $sub_element) {
                        $sub_type = $sub_element[0];
                        $sub_value = $sub_element[1];
                        $values[] = $this->parseHeaderElement($sub_type, $sub_value, $document);
                    }
                }

                return new ElementArray($values, $document);

            case 'endstream':
            case 'obj': //I don't know what it means but got my project fixed.
            case '':
                // Nothing to do with.
                return null;

            default:
                throw new \Exception('Invalid type: "'.$type.'".');
        }
    }
}
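
Several of the issues reported above note that $document is declared as ?Document but is passed to calls that only accept a Document. One possible way to add the suggested type check is a small guard that narrows the type once, before those calls. The sketch below is only an illustration: requireDocument() is a hypothetical helper that does not exist in the library, and the empty Document class merely stands in for Smalot\PdfParser\Document so that the snippet runs on its own.

```php
<?php

// Stand-in for Smalot\PdfParser\Document so the sketch is self-contained.
final class Document
{
}

/**
 * Hypothetical guard: narrows ?Document to Document or fails loudly.
 */
function requireDocument(?Document $document): Document
{
    if (null === $document) {
        throw new \InvalidArgumentException('A Document instance is required here.');
    }

    return $document;
}

// After this call, static analysis can treat $document as non-null.
$document = requireDocument(new Document());
```

Whether throwing is acceptable depends on the caller: parseTrailer() deliberately passes null when it recurses, so a guard like this would only fit call sites where a Document is always expected.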