Passed
Pull Request — master (#560)
by Konrad
04:51 queued 02:37
created

Parser   C

Complexity

Total Complexity 53

Size/Duplication

Total Lines 276
Duplicated Lines 0 %

Test Coverage

Coverage 96.38%

Importance

Changes 21
Bugs 7 Features 2
Metric Value
eloc 134
c 21
b 7
f 2
dl 0
loc 276
ccs 133
cts 138
cp 0.9638
rs 6.96
wmc 53

8 Methods

Rating   Name   Duplication   Size   Complexity  
A parseHeader() 0 14 2
A parseContent() 0 26 4
A parseFile() 0 14 1
A parseTrailer() 0 20 5
C parseObject() 0 85 15
A getConfig() 0 3 1
A __construct() 0 4 2
D parseHeaderElement() 0 62 23

How to fix   Complexity   

Complex Class

Complex classes like Parser often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

While breaking up the class, it is a good idea to analyze how other classes use Parser, and based on these observations, apply Extract Interface, too.

1
<?php
2
3
/**
4
 * @file
5
 *          This file is part of the PdfParser library.
6
 *
7
 * @author  Sébastien MALOT <[email protected]>
8
 *
9
 * @date    2017-01-03
10
 *
11
 * @license LGPLv3
12
 *
13
 * @url     <https://github.com/smalot/pdfparser>
14
 *
15
 *  PdfParser is a pdf library written in PHP, extraction oriented.
16
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
17
 *
18
 *  This program is free software: you can redistribute it and/or modify
19
 *  it under the terms of the GNU Lesser General Public License as published by
20
 *  the Free Software Foundation, either version 3 of the License, or
21
 *  (at your option) any later version.
22
 *
23
 *  This program is distributed in the hope that it will be useful,
24
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
25
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
26
 *  GNU Lesser General Public License for more details.
27
 *
28
 *  You should have received a copy of the GNU Lesser General Public License
29
 *  along with this program.
30
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
31
 */
32
33
namespace Smalot\PdfParser;
34
35
use Smalot\PdfParser\Element\ElementArray;
36
use Smalot\PdfParser\Element\ElementBoolean;
37
use Smalot\PdfParser\Element\ElementDate;
38
use Smalot\PdfParser\Element\ElementHexa;
39
use Smalot\PdfParser\Element\ElementName;
40
use Smalot\PdfParser\Element\ElementNull;
41
use Smalot\PdfParser\Element\ElementNumeric;
42
use Smalot\PdfParser\Element\ElementString;
43
use Smalot\PdfParser\Element\ElementXRef;
44
use Smalot\PdfParser\RawData\RawDataParser;
45
46
/**
47
 * Class Parser
48
 */
49
class Parser
50
{
51
    /**
52
     * @var Config
53
     */
54
    private $config;
55
56
    /**
57
     * @var PDFObject[]
58
     */
59
    protected $objects = [];
60
61
    protected $rawDataParser;
62
63 42
    public function __construct($cfg = [], ?Config $config = null)
64
    {
65 42
        $this->config = $config ?: new Config();
66 42
        $this->rawDataParser = new RawDataParser($cfg, $this->config);
67 42
    }
68
69 1
    public function getConfig(): Config
70
    {
71 1
        return $this->config;
72
    }
73
74
    /**
75
     * @throws \Exception
76
     */
77 40
    public function parseFile(string $filename): Document
78
    {
79 40
        $content = file_get_contents($filename);
80
        /*
81
         * 2018/06/20 @doganoo as multiple times a
82
         * users have complained that the parseFile()
83
         * method dies silently, it is an better option
84
         * to remove the error control operator (@) and
85
         * let the users know that the method throws an exception
86
         * by adding @throws tag to PHPDoc.
87
         *
88
         * See here for an example: https://github.com/smalot/pdfparser/issues/204
89
         */
90 40
        return $this->parseContent($content);
91
    }
92
93
    /**
94
     * @param string $content PDF content to parse
95
     *
96
     * @throws \Exception if secured PDF file was detected
97
     * @throws \Exception if no object list was found
98
     */
99 40
    public function parseContent(string $content): Document
100
    {
101
        // Create structure from raw data.
102 40
        list($xref, $data) = $this->rawDataParser->parseData($content);
103
104 39
        if (isset($xref['trailer']['encrypt'])) {
105
            throw new \Exception('Secured pdf file are currently not supported.');
106
        }
107
108 39
        if (empty($data)) {
109
            throw new \Exception('Object list not found. Possible secured file.');
110
        }
111
112
        // Create destination object.
113 39
        $document = new Document();
114 39
        $this->objects = [];
115
116 39
        foreach ($data as $id => $structure) {
117 39
            $this->parseObject($id, $structure, $document);
118 39
            unset($data[$id]);
119
        }
120
121 39
        $document->setTrailer($this->parseTrailer($xref['trailer'], $document));
122 39
        $document->setObjects($this->objects);
123
124 39
        return $document;
125
    }
126
127 39
    protected function parseTrailer(array $structure, ?Document $document)
128
    {
129 39
        $trailer = [];
130
131 39
        foreach ($structure as $name => $values) {
132 39
            $name = ucfirst($name);
133
134 39
            if (is_numeric($values)) {
135 39
                $trailer[$name] = new ElementNumeric($values);
136 39
            } elseif (\is_array($values)) {
137 33
                $value = $this->parseTrailer($values, null);
138 33
                $trailer[$name] = new ElementArray($value, null);
139 39
            } elseif (false !== strpos($values, '_')) {
140 39
                $trailer[$name] = new ElementXRef($values, $document);
141
            } else {
142 33
                $trailer[$name] = $this->parseHeaderElement('(', $values, $document);
143
            }
144
        }
145
146 39
        return new Header($trailer, $document);
147
    }
148
149 40
    protected function parseObject(string $id, array $structure, ?Document $document)
150
    {
151 40
        $header = new Header([], $document);
152 40
        $content = '';
153
154 40
        foreach ($structure as $position => $part) {
155 40
            if (\is_int($part)) {
156
                $part = [null, null];
157
            }
158 40
            switch ($part[0]) {
159 40
                case '[':
160 14
                    $elements = [];
161
162 14
                    foreach ($part[1] as $sub_element) {
0 ignored issues
show
Bug introduced by
The expression $part[1] of type null is not traversable.
Loading history...
163 14
                        $sub_type = $sub_element[0];
164 14
                        $sub_value = $sub_element[1];
165 14
                        $elements[] = $this->parseHeaderElement($sub_type, $sub_value, $document);
166
                    }
167
168 14
                    $header = new Header($elements, $document);
169 14
                    break;
170
171 40
                case '<<':
172 40
                    $header = $this->parseHeader($part[1], $document);
0 ignored issues
show
Bug introduced by
$part[1] of type null is incompatible with the type array expected by parameter $structure of Smalot\PdfParser\Parser::parseHeader(). ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type  annotation

172
                    $header = $this->parseHeader(/** @scrutinizer ignore-type */ $part[1], $document);
Loading history...
173 40
                    break;
174
175 40
                case 'stream':
176 40
                    $content = isset($part[3][0]) ? $part[3][0] : $part[1];
177
178 40
                    if ($header->get('Type')->equals('ObjStm')) {
179 11
                        $match = [];
180
181
                        // Split xrefs and contents.
182 11
                        preg_match('/^((\d+\s+\d+\s*)*)(.*)$/s', $content, $match);
183 11
                        $content = $match[3];
184
185
                        // Extract xrefs.
186 11
                        $xrefs = preg_split(
187 11
                            '/(\d+\s+\d+\s*)/s',
188 11
                            $match[1],
189 11
                            -1,
190 11
                            \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE
191
                        );
192 11
                        $table = [];
193
194 11
                        foreach ($xrefs as $xref) {
195 11
                            list($id, $position) = preg_split("/\s+/", trim($xref));
196 11
                            $table[$position] = $id;
197
                        }
198
199 11
                        ksort($table);
200
201 11
                        $ids = array_values($table);
202 11
                        $positions = array_keys($table);
203
204 11
                        foreach ($positions as $index => $position) {
0 ignored issues
show
Comprehensibility Bug introduced by
$position is overwriting a variable from outer foreach loop.
Loading history...
205 11
                            $id = $ids[$index].'_0';
206 11
                            $next_position = isset($positions[$index + 1]) ? $positions[$index + 1] : \strlen($content);
207 11
                            $sub_content = substr($content, $position, (int) $next_position - (int) $position);
208
209 11
                            $sub_header = Header::parse($sub_content, $document);
0 ignored issues
show
Bug introduced by
It seems like $document can also be of type null; however, parameter $document of Smalot\PdfParser\Header::parse() does only seem to accept Smalot\PdfParser\Document, maybe add an additional type check? ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type  annotation

209
                            $sub_header = Header::parse($sub_content, /** @scrutinizer ignore-type */ $document);
Loading history...
210 11
                            $object = PDFObject::factory($document, $sub_header, '', $this->config);
0 ignored issues
show
Bug introduced by
It seems like $document can also be of type null; however, parameter $document of Smalot\PdfParser\PDFObject::factory() does only seem to accept Smalot\PdfParser\Document, maybe add an additional type check? ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type  annotation

210
                            $object = PDFObject::factory(/** @scrutinizer ignore-type */ $document, $sub_header, '', $this->config);
Loading history...
211 11
                            $this->objects[$id] = $object;
212
                        }
213
214
                        // It is not necessary to store this content.
215
216 11
                        return;
217
                    }
218 39
                    break;
219
220
                default:
221 39
                    if ('null' != $part) {
222 39
                        $element = $this->parseHeaderElement($part[0], $part[1], $document);
223
224 39
                        if ($element) {
225 21
                            $header = new Header([$element], $document);
226
                        }
227
                    }
228 39
                    break;
229
            }
230
        }
231
232 39
        if (!isset($this->objects[$id])) {
233 39
            $this->objects[$id] = PDFObject::factory($document, $header, $content, $this->config);
234
        }
235 39
    }
236
237
    /**
238
     * @throws \Exception
239
     */
240 40
    protected function parseHeader(array $structure, ?Document $document): Header
241
    {
242 40
        $elements = [];
243 40
        $count = \count($structure);
244
245 40
        for ($position = 0; $position < $count; $position += 2) {
246 40
            $name = $structure[$position][1];
247 40
            $type = $structure[$position + 1][0];
248 40
            $value = $structure[$position + 1][1];
249
250 40
            $elements[$name] = $this->parseHeaderElement($type, $value, $document);
251
        }
252
253 40
        return new Header($elements, $document);
254
    }
255
256
    /**
257
     * @param string|array $value
258
     *
259
     * @return Element|Header|null
260
     *
261
     * @throws \Exception
262
     */
263 40
    protected function parseHeaderElement(?string $type, $value, ?Document $document)
264
    {
265 40
        $valueIsEmpty = null == $value || '' == $value || false == $value;
266 40
        if (('<<' === $type || '>>' === $type) && $valueIsEmpty) {
267 6
            $value = [];
268
        }
269
270 40
        switch ($type) {
271 40
            case '<<':
272 40
            case '>>':
273 39
                $header = $this->parseHeader($value, $document);
0 ignored issues
show
Bug introduced by
It seems like $value can also be of type string; however, parameter $structure of Smalot\PdfParser\Parser::parseHeader() does only seem to accept array, maybe add an additional type check? ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type  annotation

273
                $header = $this->parseHeader(/** @scrutinizer ignore-type */ $value, $document);
Loading history...
274 39
                PDFObject::factory($document, $header, null, $this->config);
0 ignored issues
show
Bug introduced by
It seems like $document can also be of type null; however, parameter $document of Smalot\PdfParser\PDFObject::factory() does only seem to accept Smalot\PdfParser\Document, maybe add an additional type check? ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type  annotation

274
                PDFObject::factory(/** @scrutinizer ignore-type */ $document, $header, null, $this->config);
Loading history...
275
276 39
                return $header;
277
278 40
            case 'numeric':
279 39
                return new ElementNumeric($value);
0 ignored issues
show
Bug introduced by
It seems like $value can also be of type array and array; however, parameter $value of Smalot\PdfParser\Element...tNumeric::__construct() does only seem to accept string, maybe add an additional type check? ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type  annotation

279
                return new ElementNumeric(/** @scrutinizer ignore-type */ $value);
Loading history...
280
281 40
            case 'boolean':
282 13
                return new ElementBoolean($value);
0 ignored issues
show
Bug introduced by
It seems like $value can also be of type array and array; however, parameter $value of Smalot\PdfParser\Element...tBoolean::__construct() does only seem to accept boolean|string, maybe add an additional type check? ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type  annotation

282
                return new ElementBoolean(/** @scrutinizer ignore-type */ $value);
Loading history...
283
284 40
            case 'null':
285 3
                return new ElementNull();
286
287 40
            case '(':
288 39
                if ($date = ElementDate::parse('('.$value.')', $document)) {
0 ignored issues
show
Bug introduced by
Are you sure $value of type array|string can be used in concatenation? ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type  annotation

288
                if ($date = ElementDate::parse('('./** @scrutinizer ignore-type */ $value.')', $document)) {
Loading history...
289 32
                    return $date;
290
                }
291
292 39
                return ElementString::parse('('.$value.')', $document);
0 ignored issues
show
Bug Best Practice introduced by
The expression return Smalot\PdfParser\...value . ')', $document) could also return false which is incompatible with the documented return type Smalot\PdfParser\Element...t\PdfParser\Header|null. Did you maybe forget to handle an error condition?

If the returned type also contains false, it is an indicator that maybe an error condition leading to the specific return statement remains unhandled.

Loading history...
293
294 40
            case '<':
295 16
                return $this->parseHeaderElement('(', ElementHexa::decode($value), $document);
0 ignored issues
show
Bug introduced by
It seems like $value can also be of type array and array; however, parameter $value of Smalot\PdfParser\Element\ElementHexa::decode() does only seem to accept string, maybe add an additional type check? ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-type  annotation

295
                return $this->parseHeaderElement('(', ElementHexa::decode(/** @scrutinizer ignore-type */ $value), $document);
Loading history...
296
297 40
            case '/':
298 40
                return ElementName::parse('/'.$value, $document);
0 ignored issues
show
Bug Best Practice introduced by
The expression return Smalot\PdfParser\.../' . $value, $document) could also return false which is incompatible with the documented return type Smalot\PdfParser\Element...t\PdfParser\Header|null. Did you maybe forget to handle an error condition?

If the returned type also contains false, it is an indicator that maybe an error condition leading to the specific return statement remains unhandled.

Loading history...
299
300 39
            case 'ojbref': // old mistake in tcpdf parser
301 39
            case 'objref':
302 39
                return new ElementXRef($value, $document);
303
304 39
            case '[':
305 39
                $values = [];
306
307 39
                if (\is_array($value)) {
308 39
                    foreach ($value as $sub_element) {
309 39
                        $sub_type = $sub_element[0];
310 39
                        $sub_value = $sub_element[1];
311 39
                        $values[] = $this->parseHeaderElement($sub_type, $sub_value, $document);
312
                    }
313
                }
314
315 39
                return new ElementArray($values, $document);
316
317 39
            case 'endstream':
318 1
            case 'obj': // I don't know what it means but got my project fixed.
319
            case '':
320
                // Nothing to do with.
321 39
                return null;
322
323
            default:
324
                throw new \Exception('Invalid type: "'.$type.'".');
325
        }
326
    }
327
}
328