Passed
Pull Request — master (#544)
by Konrad
07:43 queued 05:08
created

Page   F

Complexity

Total Complexity 136

Size/Duplication

Total Lines 899
Duplicated Lines 0 %

Test Coverage

Coverage 82.8%

Importance

Changes 14
Bugs 3 Features 2
Metric Value
eloc 377
c 14
b 3
f 2
dl 0
loc 899
ccs 313
cts 378
cp 0.828
rs 2
wmc 136

16 Methods

Rating   Name   Duplication   Size   Complexity  
B getFonts() 0 37 9
A getFont() 0 22 4
A getXObject() 0 9 2
B getXObjects() 0 31 7
A createPDFObjectForFpdf() 0 8 1
A createPageForFpdf() 0 8 1
A isFpdf() 0 9 4
B getText() 0 48 10
A getPDFObjectForFpdf() 0 6 1
A getPageNumber() 0 11 3
D getDataCommands() 0 137 18
C getTextArray() 0 54 12
B extractRawData() 0 36 9
D getTextXY() 0 51 18
D getDataTm() 0 214 18
D extractDecodedRawData() 0 64 19

How to fix   Complexity   

Complex Class

Complex classes like Page often do a lot of different things. To break such a class down, we need to identify a cohesive component within it. A common way to find such a component is to look for fields and methods that share the same prefix or suffix.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

While breaking up the class, it is a good idea to analyze how other classes use Page, and based on these observations, apply Extract Interface, too.
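
As a rough illustration of Extract Class, the sketch below moves the font lookup (the getFonts()/getFont() pair, which share the Font suffix) into a small class of its own. `FontTable` and its methods are hypothetical names that do not exist in PdfParser, and fonts are represented as plain strings so the example stays self-contained.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical extracted component: owns the lookup that Page::getFont() performs today.
 */
final class FontTable
{
    /** @var array<int|string, string> fonts keyed by resource name and by cleaned id */
    private array $fonts = [];

    /** @param array<int|string, string> $fonts fonts keyed by resource name */
    public function __construct(array $fonts)
    {
        foreach ($fonts as $id => $font) {
            $id = (string) $id;
            $this->fonts[$id] = $font;

            // Also index by the cleaned id, mirroring what Page::getFonts() does.
            $clean = self::cleanId($id);
            if ('' !== $clean) {
                $this->fonts[$clean] = $font;
            }
        }
    }

    public function get(string $id): ?string
    {
        // Exact resource name first, cleaned id as a fallback.
        return $this->fonts[$id] ?? $this->fonts[self::cleanId($id)] ?? null;
    }

    private static function cleanId(string $id): string
    {
        // Same pattern as in Page: keep digits, '.', '-' and '_' only.
        return (string) preg_replace('/[^0-9\.\-_]/', '', $id);
    }
}
```

With such a class in place, Page could delegate to it and keep getFont() as a thin wrapper, which is also a natural point to introduce an interface for callers.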

1
<?php
2
3
/**
4
 * @file
5
 *          This file is part of the PdfParser library.
6
 *
7
 * @author  Sébastien MALOT <[email protected]>
8
 * @date    2017-01-03
9
 *
10
 * @license LGPLv3
11
 * @url     <https://github.com/smalot/pdfparser>
12
 *
13
 *  PdfParser is a pdf library written in PHP, extraction oriented.
14
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
15
 *
16
 *  This program is free software: you can redistribute it and/or modify
17
 *  it under the terms of the GNU Lesser General Public License as published by
18
 *  the Free Software Foundation, either version 3 of the License, or
19
 *  (at your option) any later version.
20
 *
21
 *  This program is distributed in the hope that it will be useful,
22
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
23
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
24
 *  GNU Lesser General Public License for more details.
25
 *
26
 *  You should have received a copy of the GNU Lesser General Public License
27
 *  along with this program.
28
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
29
 */
30
31
namespace Smalot\PdfParser;
32
33
use Smalot\PdfParser\Element\ElementArray;
34
use Smalot\PdfParser\Element\ElementMissing;
35
use Smalot\PdfParser\Element\ElementNull;
36
use Smalot\PdfParser\Element\ElementXRef;
37
38
class Page extends PDFObject
39
{
40
    /**
41
     * @var Font[]
42
     */
43
    protected $fonts = null;
44
45
    /**
46
     * @var PDFObject[]
47
     */
48
    protected $xobjects = null;
49
50
    /**
51
     * @var array
52
     */
53
    protected $dataTm = null;
54
55
    /**
56
     * @return Font[]
57
     */
58 29
    public function getFonts()
59
    {
60 29
        if (null !== $this->fonts) {
61 24
            return $this->fonts;
62
        }
63
64 29
        $resources = $this->get('Resources');
65
66 29
        if (method_exists($resources, 'has') && $resources->has('Font')) {
67 25
            if ($resources->get('Font') instanceof ElementMissing) {
Bug: The method get() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation.)

If this is a false-positive, you can also ignore this issue in your code via the ignore-call  annotation

67
            if ($resources->/** @scrutinizer ignore-call */ get('Font') instanceof ElementMissing) {

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.
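
An alternative to the ignore-call annotation is to narrow the type before the call, so the analyzer can see that the method exists. The sketch below is only an illustration under stated assumptions: `Element` and `Header` are simplified stand-ins rather than the real PdfParser classes, and `fontEntry()` is a hypothetical helper.

```php
<?php

declare(strict_types=1);

// Simplified stand-ins for illustration only; not the real PdfParser classes.
class Element
{
}

class Header extends Element
{
    /** @var array<string, mixed> */
    private array $elements;

    /** @param array<string, mixed> $elements */
    public function __construct(array $elements = [])
    {
        $this->elements = $elements;
    }

    public function has(string $name): bool
    {
        return \array_key_exists($name, $this->elements);
    }

    /** @return mixed */
    public function get(string $name)
    {
        return $this->elements[$name] ?? null;
    }
}

/**
 * Hypothetical helper: returns the 'Font' entry, or null when $resources
 * is not a type that declares has() and get().
 *
 * @return mixed
 */
function fontEntry(Element $resources)
{
    // instanceof narrows Element to Header, so get() is known to exist here.
    if ($resources instanceof Header && $resources->has('Font')) {
        return $resources->get('Font');
    }

    return null;
}
```

Compared with method_exists(), an instanceof check gives static analysis a concrete type to work with, at the cost of tying the code to one class or interface.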

68 1
                return [];
69
            }
70
71 24
            if ($resources->get('Font') instanceof Header) {
72 17
                $fonts = $resources->get('Font')->getElements();
73
            } else {
74 11
                $fonts = $resources->get('Font')->getHeader()->getElements();
75
            }
76
77 24
            $table = [];
78
79 24
            foreach ($fonts as $id => $font) {
80 24
                if ($font instanceof Font) {
81 24
                    $table[$id] = $font;
82
83
                    // Also store under the cleaned id (only digits, '.', '-' and '_' are kept)
84 24
                    $id = preg_replace('/[^0-9\.\-_]/', '', $id);
85 24
                    if ('' != $id) {
86 23
                        $table[$id] = $font;
87
                    }
88
                }
89
            }
90
91 24
            return $this->fonts = $table;
92
        }
93
94 7
        return [];
95
    }
96
97 26
    public function getFont(string $id): ?Font
98
    {
99 26
        $fonts = $this->getFonts();
100
101 26
        if (isset($fonts[$id])) {
102 23
            return $fonts[$id];
103
        }
104
105
        // According to the PDF specs (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, page 238)
106
        // "The font resource name presented to the Tf operator is arbitrary, as are the names for all kinds of resources"
107
        // Instead, we search for the unfiltered name first and then do this cleaning as a fallback, so all tests still pass.
108
109 4
        if (isset($fonts[$id])) {
110
            return $fonts[$id];
111
        } else {
112 4
            $id = preg_replace('/[^0-9\.\-_]/', '', $id);
113 4
            if (isset($fonts[$id])) {
114 1
                return $fonts[$id];
115
            }
116
        }
117
118 3
        return null;
119
    }
120
121
    /**
122
     * Support for XObject
123
     *
124
     * @return PDFObject[]
125
     */
126 6
    public function getXObjects()
127
    {
128 6
        if (null !== $this->xobjects) {
129 6
            return $this->xobjects;
130
        }
131
132 6
        $resources = $this->get('Resources');
133
134 6
        if (method_exists($resources, 'has') && $resources->has('XObject')) {
135 6
            if ($resources->get('XObject') instanceof Header) {
136 6
                $xobjects = $resources->get('XObject')->getElements();
137
            } else {
138
                $xobjects = $resources->get('XObject')->getHeader()->getElements();
139
            }
140
141 6
            $table = [];
142
143 6
            foreach ($xobjects as $id => $xobject) {
144 6
                $table[$id] = $xobject;
145
146
                // Also store under the cleaned id (only digits, '.', '-' and '_' are kept)
147 6
                $id = preg_replace('/[^0-9\.\-_]/', '', $id);
148 6
                if ('' != $id) {
149 6
                    $table[$id] = $xobject;
150
                }
151
            }
152
153 6
            return $this->xobjects = $table;
154
        }
155
156
        return [];
157
    }
158
159 5
    public function getXObject(string $id): ?PDFObject
160
    {
161 5
        $xobjects = $this->getXObjects();
162
163 5
        if (isset($xobjects[$id])) {
164 5
            return $xobjects[$id];
165
        }
166
167
        return null;
168
        /*$id = preg_replace('/[^0-9\.\-_]/', '', $id);
169
170
        if (isset($xobjects[$id])) {
171
            return $xobjects[$id];
172
        } else {
173
            return null;
174
        }*/
175
    }
176
177 16
    public function getText(self $page = null): string
178
    {
179 16
        if ($contents = $this->get('Contents')) {
180 16
            if ($contents instanceof ElementMissing) {
181
                return '';
182 16
            } elseif ($contents instanceof ElementNull) {
183
                return '';
184 16
            } elseif ($contents instanceof PDFObject) {
Issue: $contents is never a sub-type of Smalot\PdfParser\PDFObject.
185 13
                $elements = $contents->getHeader()->getElements();
186
187 13
                if (is_numeric(key($elements))) {
188
                    $new_content = '';
189
190
                    foreach ($elements as $element) {
191
                        if ($element instanceof ElementXRef) {
192
                            $new_content .= $element->getObject()->getContent();
193
                        } else {
194
                            $new_content .= $element->getContent();
195
                        }
196
                    }
197
198
                    $header = new Header([], $this->document);
199 13
                    $contents = new PDFObject($this->document, $header, $new_content, $this->config);
200
                }
201 4
            } elseif ($contents instanceof ElementArray) {
202
                // Create a virtual global content.
203 4
                $new_content = '';
204
205 4
                foreach ($contents->getContent() as $content) {
206 4
                    $new_content .= $content->getContent()."\n";
207
                }
208
209 4
                $header = new Header([], $this->document);
210 4
                $contents = new PDFObject($this->document, $header, $new_content, $this->config);
211
            }
212
213
            /*
214
             * Elements referencing each other on the same page can cause endless loops during text parsing.
215
             * To combat this we keep a recursionStack containing already parsed elements on the page.
216
             * The stack is only emptied here after getting text from a page.
217
             */
218 16
            $contentsText = $contents->getText($this);
Bug: The method getText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation.)

If this is a false-positive, you can also ignore this issue in your code via the ignore-call  annotation

218
            /** @scrutinizer ignore-call */ 
219
            $contentsText = $contents->getText($this);

219 16
            PDFObject::$recursionStack = [];
220
221 16
            return $contentsText;
222
        }
223
224
        return '';
225
    }
226
227
    /**
228
     * Return true if the current page is a (setasign\Fpdi\Fpdi) FPDI/FPDF document
229
     *
230
     * The metadata 'Producer' should have the value of "FPDF" . FPDF_VERSION if the
231
     * pdf file was generated by FPDF/FPDI.
232
     *
233
     * @return bool true if the current page is an FPDI/FPDF document
234
     */
235 11
    public function isFpdf(): bool
236
    {
237 11
        if (\array_key_exists('Producer', $this->document->getDetails()) &&
238 11
            \is_string($this->document->getDetails()['Producer']) &&
239 11
            0 === strncmp($this->document->getDetails()['Producer'], 'FPDF', 4)) {
240 2
            return true;
241
        }
242
243 10
        return false;
244
    }
245
246
    /**
247
     * Return the page number of the PDF document of the page object
248
     *
249
     * @return int the page number
250
     */
251 2
    public function getPageNumber(): int
252
    {
253 2
        $pages = $this->document->getPages();
254 2
        $numOfPages = \count($pages);
255 2
        for ($pageNum = 0; $pageNum < $numOfPages; ++$pageNum) {
256 2
            if ($pages[$pageNum] === $this) {
257 2
                break;
258
            }
259
        }
260
261 2
        return $pageNum;
262
    }
263
264
    /**
265
     * Return the Object of the page if the document is a FPDF/FPDI document
266
     *
267
     * If the document was generated by FPDF/FPDI it returns the
268
     * PDFObject of the given page
269
     *
270
     * @return PDFObject The PDFObject for the page
271
     */
272 1
    public function getPDFObjectForFpdf(): PDFObject
273
    {
274 1
        $pageNum = $this->getPageNumber();
275 1
        $xObjects = $this->getXObjects();
276
277 1
        return $xObjects[$pageNum];
278
    }
279
280
    /**
281
     * Return a new PDFObject of the document created with FPDF/FPDI
282
     *
283
     * For a document generated by FPDF/FPDI, it generates a
284
     * new PDFObject for that document
285
     *
286
     * @return PDFObject The PDFObject
287
     */
288 1
    public function createPDFObjectForFpdf(): PDFObject
289
    {
290 1
        $pdfObject = $this->getPDFObjectForFpdf();
291 1
        $new_content = $pdfObject->getContent();
292 1
        $header = $pdfObject->getHeader();
293 1
        $config = $pdfObject->config;
294
295 1
        return new PDFObject($pdfObject->document, $header, $new_content, $config);
296
    }
297
298
    /**
299
     * Return page if document is a FPDF/FPDI document
300
     *
301
     * @return Page The page
302
     */
303 1
    public function createPageForFpdf(): self
304
    {
305 1
        $pdfObject = $this->getPDFObjectForFpdf();
306 1
        $new_content = $pdfObject->getContent();
307 1
        $header = $pdfObject->getHeader();
308 1
        $config = $pdfObject->config;
309
310 1
        return new self($pdfObject->document, $header, $new_content, $config);
311
    }
312
313 6
    public function getTextArray(self $page = null): array
314
    {
315 6
        if ($this->isFpdf()) {
316 1
            $pdfObject = $this->getPDFObjectForFpdf();
317 1
            $newPdfObject = $this->createPDFObjectForFpdf();
318
319 1
            return $newPdfObject->getTextArray($pdfObject);
320
        } else {
321 5
            if ($contents = $this->get('Contents')) {
322 5
                if ($contents instanceof ElementMissing) {
323
                    return [];
324 5
                } elseif ($contents instanceof ElementNull) {
325
                    return [];
326 5
                } elseif ($contents instanceof PDFObject) {
Issue: $contents is never a sub-type of Smalot\PdfParser\PDFObject.
327 5
                    $elements = $contents->getHeader()->getElements();
328
329 5
                    if (is_numeric(key($elements))) {
330
                        $new_content = '';
331
332
                        /** @var PDFObject $element */
333
                        foreach ($elements as $element) {
334
                            if ($element instanceof ElementXRef) {
335
                                $new_content .= $element->getObject()->getContent();
336
                            } else {
337
                                $new_content .= $element->getContent();
338
                            }
339
                        }
340
341
                        $header = new Header([], $this->document);
342
                        $contents = new PDFObject($this->document, $header, $new_content, $this->config);
343
                    } else {
344
                        try {
345 5
                            $contents->getTextArray($this);
346 1
                        } catch (\Throwable $e) {
347 5
                            return $contents->getTextArray();
348
                        }
349
                    }
350 1
                } elseif ($contents instanceof ElementArray) {
351
                    // Create a virtual global content.
352 1
                    $new_content = '';
353
354
                    /** @var PDFObject $content */
355 1
                    foreach ($contents->getContent() as $content) {
356 1
                        $new_content .= $content->getContent()."\n";
357
                    }
358
359 1
                    $header = new Header([], $this->document);
360 1
                    $contents = new PDFObject($this->document, $header, $new_content, $this->config);
361
                }
362
363 4
                return $contents->getTextArray($this);
Bug: The method getTextArray() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation.)

If this is a false-positive, you can also ignore this issue in your code via the ignore-call  annotation

363
                return $contents->/** @scrutinizer ignore-call */ getTextArray($this);

364
            }
365
366
            return [];
367
        }
368
    }
369
370
    /**
371
     * Gets all the text data with its internal representation of the page.
372
     *
373
     * Returns an array with the data and the internal representation
374
     */
375 10
    public function extractRawData(): array
376
    {
377
        /*
378
         * Now you can get the complete content of the object with the text on it
379
         */
380 10
        $extractedData = [];
381 10
        $content = $this->get('Contents');
382 10
        $values = $content->getContent();
383 10
        if (isset($values) && \is_array($values)) {
384 1
            $text = '';
385 1
            foreach ($values as $section) {
386 1
                $text .= $section->getContent();
387
            }
388 1
            $sectionsText = $this->getSectionsText($text);
389 1
            foreach ($sectionsText as $sectionText) {
390 1
                $commandsText = $this->getCommandsText($sectionText);
391 1
                foreach ($commandsText as $command) {
392 1
                    $extractedData[] = $command;
393
                }
394
            }
395
        } else {
396 10
            if ($this->isFpdf()) {
397 1
                $content = $this->getPDFObjectForFpdf();
398
            }
399 10
            $sectionsText = $content->getSectionsText($content->getContent());
Bug: The method getSectionsText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation.)

If this is a false-positive, you can also ignore this issue in your code via the ignore-call  annotation

399
            /** @scrutinizer ignore-call */ 
400
            $sectionsText = $content->getSectionsText($content->getContent());

400 10
            foreach ($sectionsText as $sectionText) {
401 10
                $extractedData[] = ['t' => '', 'o' => 'BT', 'c' => ''];
402
403 10
                $commandsText = $content->getCommandsText($sectionText);
Bug: The method getCommandsText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation.)

If this is a false-positive, you can also ignore this issue in your code via the ignore-call  annotation

403
                /** @scrutinizer ignore-call */ 
404
                $commandsText = $content->getCommandsText($sectionText);

404 10
                foreach ($commandsText as $command) {
405 10
                    $extractedData[] = $command;
406
                }
407
            }
408
        }
409
410 10
        return $extractedData;
411
    }
412
413
    /**
414
     * Gets all the decoded text data with its internal representation from a page.
415
     *
416
     * @param array $extractedRawData the extracted data returned by extractRawData or
417
     *                                null if extractRawData should be called
418
     *
419
     * @return array An array with the data and the internal representation
420
     */
421 9
    public function extractDecodedRawData(array $extractedRawData = null): array
422
    {
423 9
        if (!isset($extractedRawData) || !$extractedRawData) {
Best practice: The expression $extractedRawData of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.
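
A minimal sketch of the explicit form follows. `needsExtraction()` is a hypothetical helper introduced only for illustration; it shows that empty() covers both the null case and the empty-array case that `!isset($x) || !$x` tests for.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: true when the caller passed nothing usable
 * and the data therefore has to be extracted again.
 *
 * @param array<int, mixed>|null $data
 */
function needsExtraction(?array $data): bool
{
    // empty() is true for both null and [], so one call replaces
    // the combined !isset($data) || !$data condition.
    return empty($data);
}
```

The same condition could also be spelled `null === $data || [] === $data` where an even stricter reading is preferred.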

424 9
            $extractedRawData = $this->extractRawData();
425
        }
426 9
        $currentFont = null; /** @var Font $currentFont */
427 9
        $clippedFont = null;
428 9
        $fpdfPage = null;
429 9
        if ($this->isFpdf()) {
430 1
            $fpdfPage = $this->createPageForFpdf();
431
        }
432 9
        foreach ($extractedRawData as &$command) {
433 9
            if ('Tj' == $command['o'] || 'TJ' == $command['o']) {
434 9
                $data = $command['c'];
435 9
                if (!\is_array($data)) {
436 7
                    $tmpText = '';
437 7
                    if (isset($currentFont)) {
438 7
                        $tmpText = $currentFont->decodeOctal($data);
439
                        //$tmpText = $currentFont->decodeHexadecimal($tmpText, false);
440
                    }
441 7
                    $tmpText = str_replace(
442 7
                            ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
443 7
                            ['\\', '(', ')', "\n", "\r", "\t", ' '],
444
                            $tmpText
445
                    );
446 7
                    $tmpText = utf8_encode($tmpText);
447 7
                    if (isset($currentFont)) {
448 7
                        $tmpText = $currentFont->decodeContent($tmpText);
449
                    }
450 7
                    $command['c'] = $tmpText;
451 7
                    continue;
452
                }
453 9
                $numText = \count($data);
454 9
                for ($i = 0; $i < $numText; ++$i) {
455 9
                    if (0 != ($i % 2)) {
456 7
                        continue;
457
                    }
458 9
                    $tmpText = $data[$i]['c'];
459 9
                    $decodedText = isset($currentFont) ? $currentFont->decodeOctal($tmpText) : $tmpText;
460 9
                    $decodedText = str_replace(
461 9
                            ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
462 9
                            ['\\', '(', ')', "\n", "\r", "\t", ' '],
463
                            $decodedText
464
                    );
465 9
                    $decodedText = utf8_encode($decodedText);
466 9
                    if (isset($currentFont)) {
467 7
                        $decodedText = $currentFont->decodeContent($decodedText);
468
                    }
469 9
                    $command['c'][$i]['c'] = $decodedText;
470 9
                    continue;
471
                }
472 9
            } elseif ('Tf' == $command['o'] || 'TF' == $command['o']) {
473 9
                $fontId = explode(' ', $command['c'])[0];
474
                // If document is a FPDI/FPDF the $page has the correct font
475 9
                $currentFont = isset($fpdfPage) ? $fpdfPage->getFont($fontId) : $this->getFont($fontId);
476 9
                continue;
477 9
            } elseif ('Q' == $command['o']) {
478 6
                $currentFont = $clippedFont;
479 9
            } elseif ('q' == $command['o']) {
480 6
                $clippedFont = $currentFont;
481
            }
482
        }
483
484 9
        return $extractedRawData;
485
    }
486
487
    /**
488
     * Gets just the Text commands that are involved in text positions and
489
     * Text Matrix (Tm)
490
     *
491
     * It extracts only the PDF commands that are involved with text positions and
492
     * the Text Matrix (Tm). These are: BT, ET, TL, Td, TD, Tm, T*, Tj, ', ", and TJ
493
     *
494
     * @param array $extractedDecodedRawData The data extracted by extractDecodedRawData.
495
     *                                       If it is null, the method extractDecodedRawData is called.
496
     *
497
     * @return array An array with the text command of the page
498
     */
499 7
    public function getDataCommands(array $extractedDecodedRawData = null): array
500
    {
501 7
        if (!isset($extractedDecodedRawData) || !$extractedDecodedRawData) {
Best practice: The expression $extractedDecodedRawData of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

502 7
            $extractedDecodedRawData = $this->extractDecodedRawData();
503
        }
504 7
        $extractedData = [];
505 7
        foreach ($extractedDecodedRawData as $command) {
506 7
            switch ($command['o']) {
507
                /*
508
                 * BT
509
                 * Begin a text object, initializing the Tm and Tlm to the identity matrix
510
                 */
511 7
                case 'BT':
512 7
                    $extractedData[] = $command;
513 7
                    break;
514
515
                /*
516
                 * ET
517
                 * End a text object, discarding the text matrix
518
                 */
519 7
                case 'ET':
520
                    $extractedData[] = $command;
521
                    break;
522
523
                /*
524
                 * leading TL
525
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
526
                 * Initial value: 0
527
                 */
528 7
                case 'TL':
529 5
                    $extractedData[] = $command;
530 5
                    break;
531
532
                /*
533
                 * tx ty Td
534
                 * Move to the start of the next line, offset from the start of the
535
                 * current line by tx, ty.
536
                 */
537 7
                case 'Td':
538 7
                    $extractedData[] = $command;
539 7
                    break;
540
541
                /*
542
                 * tx ty TD
543
                 * Move to the start of the next line, offset from the start of the
544
                 * current line by tx, ty. As a side effect, this operator sets the leading
545
                 * parameter in the text state. This operator has the same effect as the
546
                 * code:
547
                 * -ty TL
548
                 * tx ty Td
549
                 */
550 7
                case 'TD':
551
                    $extractedData[] = $command;
552
                    break;
553
554
                /*
555
                 * a b c d e f Tm
556
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
557
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
558
                 * [1 0 0 1 0 0]
559
                 */
560 7
                case 'Tm':
561 5
                    $extractedData[] = $command;
562 5
                    break;
563
564
                /*
565
                 * T*
566
                 * Move to the start of the next line. This operator has the same effect
567
                 * as the code:
568
                 * 0 -Tl Td
569
                 * Where Tl is the current leading parameter in the text state.
570
                 */
571 7
                case 'T*':
572 5
                    $extractedData[] = $command;
573 5
                    break;
574
575
                /*
576
                 * string Tj
577
                 * Show a Text String
578
                 */
579 7
                case 'Tj':
580 6
                    $extractedData[] = $command;
581 6
                    break;
582
583
                /*
584
                 * string '
585
                 * Move to the next line and show a text string. This operator has the
586
                 * same effect as the code:
587
                 * T*
588
                 * string Tj
589
                 */
590 7
                case "'":
591
                    $extractedData[] = $command;
592
                    break;
593
594
                /*
595
                 * aw ac string "
596
                 * Move to the next line and show a text string, using aw as the word
597
                 * spacing and ac as the character spacing. This operator has the same
598
                 * effect as the code:
599
                 * aw Tw
600
                 * ac Tc
601
                 * string '
602
                 * Tw set the word spacing, Tw, to wordSpace.
603
                 * Tc Set the character spacing, Tc, to charsSpace.
604
                 */
605 7
                case '"':
606
                    $extractedData[] = $command;
607
                    break;
608
609 7
                case 'Tf':
610 7
                case 'TF':
611 7
                    if ($this->config->getDataTmFontInfoHasToBeIncluded()) {
612 1
                        $extractedData[] = $command;
613
                    }
614 7
                    break;
615
616
                /*
617
                 * array TJ
618
                 * Show one or more text strings, allowing individual glyph positioning.
619
                 * Each element of array can be a string or a number. If the element is
620
                 * a string, this operator shows the string. If it is a number, the
621
                 * operator adjusts the text position by that amount; that is, it translates
622
                 * the text matrix, Tm. This amount is subtracted from the current
623
                 * horizontal or vertical coordinate, depending on the writing mode.
624
                 * In the default coordinate system, a positive adjustment has the effect
625
                 * of moving the next glyph painted either to the left or down by the given
626
                 * amount.
627
                 */
628 7
                case 'TJ':
629 7
                    $extractedData[] = $command;
630 7
                    break;
631
                default:
632
            }
633
        }
634
635 7
        return $extractedData;
636
    }
637
638
    /**
639
     * Gets the Text Matrix of the text in the page
640
     *
641
     * Return an array where every item is an array where the first item is the
642
     * Text Matrix (Tm) and the second is a string with the text data.  The Text matrix
643
     * is an array of 6 numbers. The last 2 numbers are the coordinates X and Y of the
644
     * text. The first 4 numbers describe the scaling, rotation and skew of the text.
645
     *
646
     * @param array $dataCommands the data extracted by getDataCommands
647
     *                            if null getDataCommands is called
648
     *
649
     * @return array an array with the data of the page including the Tm information
650
     *               of any text in the page
651
     */
652 6
    public function getDataTm(array $dataCommands = null): array
653
    {
654 6
        if (!isset($dataCommands) || !$dataCommands) {
Best practice: The expression $dataCommands of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

655 6
            $dataCommands = $this->getDataCommands();
656
        }
657
658
        /*
659
         * At the beginning of a text object Tm is the identity matrix
660
         */
661 6
        $defaultTm = ['1', '0', '0', '1', '0', '0'];
662
663
        /*
664
         *  Set the text leading used by T*, ' and " operators
665
         */
666 6
        $defaultTl = 0;
667
668
        /*
669
         *  Set default values for font data
670
         */
671 6
        $defaultFontId = -1;
672 6
        $defaultFontSize = 0;
673
674
        /*
675
         * Indices of the X and Y coordinates within the text matrix (Tm)
676
         */
677 6
        $x = 4;
678 6
        $y = 5;
679 6
        $Tx = 0;
680 6
        $Ty = 0;
681
682 6
        $Tm = $defaultTm;
683 6
        $Tl = $defaultTl;
684 6
        $fontId = $defaultFontId;
685 6
        $fontSize = $defaultFontSize;
686
687 6
        $extractedTexts = $this->getTextArray();
688 6
        $extractedData = [];
689 6
        foreach ($dataCommands as $command) {
690 6
            $currentText = $extractedTexts[\count($extractedData)];
691 6
            switch ($command['o']) {
692
                /*
693
                 * BT
694
                 * Begin a text object, initializing Tm and Tlm to the identity matrix
695
                 */
696 6
                case 'BT':
697 6
                    $Tm = $defaultTm;
698 6
                    $Tl = $defaultTl; //review this.
699 6
                    $Tx = 0;
700 6
                    $Ty = 0;
701 6
                    $fontId = $defaultFontId;
702 6
                    $fontSize = $defaultFontSize;
703 6
                    break;
704
705
                /*
706
                 * ET
707
                 * End a text object, discarding the text matrix
708
                 */
709 6
                case 'ET':
710
                    $Tm = $defaultTm;
711
                    $Tl = $defaultTl;  //review this
712
                    $Tx = 0;
713
                    $Ty = 0;
714
                    $fontId = $defaultFontId;
715
                    $fontSize = $defaultFontSize;
716
                    break;
717
718
                /*
719
                 * leading TL
720
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
721
                 * Initial value: 0
722
                 */
723 6
                case 'TL':
724 4
                    $Tl = (float) $command['c'];
725 4
                    break;
726
727
                /*
728
                 * tx ty Td
729
                 * Move to the start of the next line, offset form the start of the
730
                 * current line by tx, ty.
731
                 */
732 6
                case 'Td':
733 6
                    $coord = explode(' ', $command['c']);
734 6
                    $Tx += (float) $coord[0];
735 6
                    $Ty += (float) $coord[1];
736 6
                    $Tm[$x] = (string) $Tx;
737 6
                    $Tm[$y] = (string) $Ty;
738 6
                    break;
739
740
                /*
741
                 * tx ty TD
742
                 * Move to the start of the next line, offset form the start of the
743
                 * current line by tx, ty. As a side effect, this operator set the leading
744
                 * parameter in the text state. This operator has the same effect as the
745
                 * code:
746
                 * -ty TL
747
                 * tx ty Td
748
                 */
749 6
                case 'TD':
750
                    $coord = explode(' ', $command['c']);
751
                    // TD is equivalent to "-ty TL" followed by "tx ty Td" (see comment above).
                    $Tl = -(float) $coord[1];
752
                    $Tx += (float) $coord[0];
753
                    $Ty += (float) $coord[1];
754
                    $Tm[$x] = (string) $Tx;
755
                    $Tm[$y] = (string) $Ty;
756
                    break;
757
758
                /*
759
                 * a b c d e f Tm
760
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
761
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
762
                 * [1 0 0 1 0 0]
763
                 */
764 6
                case 'Tm':
765 4
                    $Tm = explode(' ', $command['c']);
766 4
                    $Tx = (float) $Tm[$x];
767 4
                    $Ty = (float) $Tm[$y];
768 4
                    break;
769
770
                /*
771
                 * T*
772
                 * Move to the start of the next line. This operator has the same effect
773
                 * as the code:
774
                 * 0 -Tl Td
775
                 * Where Tl is the current leading parameter in the text state.
776
                 */
777 6
                case 'T*':
778 4
                    $Ty -= $Tl;
779 4
                    $Tm[$y] = (string) $Ty;
780 4
                    break;
781
782
                /*
783
                 * string Tj
784
                 * Show a Text String
785
                 */
786 6
                case 'Tj':
787 5
                    $data = [$Tm, $currentText];
788 5
                    if ($this->config->getDataTmFontInfoHasToBeIncluded()) {
789 1
                        $data[] = $fontId;
790 1
                        $data[] = $fontSize;
791
                    }
792 5
                    $extractedData[] = $data;
793 5
                    break;
794
795
                /*
796
                 * string '
797
                 * Move to the next line and show a text string. This operator has the
798
                 * same effect as the code:
799
                 * T*
800
                 * string Tj
801
                 */
802 6
                case "'":
803
                    $Ty -= $Tl;
804
                    $Tm[$y] = (string) $Ty;
805
                    $extractedData[] = [$Tm, $currentText];
806
                    break;
807
808
                /*
809
                 * aw ac string "
810
                 * Move to the next line and show a text string, using aw as the word
811
                 * spacing and ac as the character spacing. This operator has the same
812
                 * effect as the code:
813
                 * aw Tw
814
                 * ac Tc
815
                 * string '
816
                 * Tw sets the word spacing, Tw, to aw.
817
                 * Tc sets the character spacing, Tc, to ac.
818
                 */
819 6
                case '"':
820
                    $data = explode(' ', $currentText);
821
                    $Ty -= $Tl;
822
                    $Tm[$y] = (string) $Ty;
823
                    $extractedData[] = [$Tm, $data[2]]; //Verify
824
                    break;
825
826 6
                case 'Tf':
827
                    /*
828
                     * From PDF 1.0 specification, page 106:
829
                     *     fontname size Tf Set font and size
830
                     *     Sets the text font and text size in the graphics state. There is no default value for
831
                     *     either fontname or size; they must be selected using Tf before drawing any text.
832
                     *     fontname is a resource name. size is a number expressed in text space units.
833
                     *
834
                     * Source: https://ia902503.us.archive.org/10/items/pdfy-0vt8s-egqFwDl7L2/PDF%20Reference%201.0.pdf
835
                     * Introduced with https://github.com/smalot/pdfparser/pull/516
836
                     */
837 1
                    list($fontId, $fontSize) = explode(' ', $command['c'], 2);
838 1
                    break;
839
840
                /*
841
                 * array TJ
842
                 * Show one or more text strings, allowing individual glyph positioning.
843
                 * Each element of the array can be a string or a number. If the element is
844
                 * a string, this operator shows the string. If it is a number, the
845
                 * operator adjusts the text position by that amount; that is, it translates
846
                 * the text matrix, Tm. This amount is subtracted from the current
847
                 * horizontal or vertical coordinate, depending on the writing mode.
848
                 * In the default coordinate system, a positive adjustment has the effect
849
                 * of moving the next glyph painted either to the left or down by the given
850
                 * amount.
851
                 */
852 6
                case 'TJ':
853 6
                    $data = [$Tm, $currentText];
854 6
                    if ($this->config->getDataTmFontInfoHasToBeIncluded()) {
855 1
                        $data[] = $fontId;
856 1
                        $data[] = $fontSize;
857
                    }
858 6
                    $extractedData[] = $data;
859 6
                    break;
860
                default:
861
            }
862
        }
863 6
        $this->dataTm = $extractedData;
864
865 6
        return $extractedData;
866
    }
867
868
    /**
869
     * Gets the text data located around the given coordinates (X,Y)
870
     *
871
     * If a text is near the given coordinates (X,Y), judged by its Tm info,
872
     * the text is returned. The extractedData returned by getDataTm can be used to find
873
     * the coordinates of a given text, using its Tm info.
874
     *
875
     * @param float|null $x      The X value of the coordinate to search for. If null,
876
     *                           only the Y value is considered (same row)
877
     * @param float|null $y      The Y value of the coordinate to search for. If null,
878
     *                           only the X value is considered (same column)
879
     * @param float      $xError The tolerance, in either direction, within which an X value counts as "near"
880
     * @param float      $yError The tolerance, in either direction, within which a Y value counts as "near"
881
     *
882
     * @return array An array of the texts that are near the given coordinates. If no text
883
     *               is "near" the x,y coordinates, an empty array is returned. If both the x
884
     *               and y coordinates are null, an empty array is returned as well.
885
     */
886 2
    public function getTextXY(?float $x = null, ?float $y = null, float $xError = 0, float $yError = 0): array
887
    {
888 2
        if (!isset($this->dataTm) || !$this->dataTm) {
Issue (Bug, Best Practice):
The expression $this->dataTm of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

889 1
            $this->getDataTm();
890
        }
891
892 2
        if (null !== $x) {
893 2
            $x = (float) $x;
894
        }
895
896 2
        if (null !== $y) {
897 2
            $y = (float) $y;
898
        }
899
900 2
        if (null === $x && null === $y) {
901
            return [];
902
        }
903
904 2
        $xError = (float) $xError;
905 2
        $yError = (float) $yError;
906
907 2
        $extractedData = [];
908 2
        foreach ($this->dataTm as $item) {
909 2
            $tm = $item[0];
910 2
            $xTm = (float) $tm[4];
911 2
            $yTm = (float) $tm[5];
912 2
            $text = $item[1];
913 2
            if (null === $y) {
914
                if (($xTm >= ($x - $xError)) &&
915
                    ($xTm <= ($x + $xError))) {
916
                    $extractedData[] = [$tm, $text];
917
                    continue;
918
                }
919
            }
920 2
            if (null === $x) {
921
                if (($yTm >= ($y - $yError)) &&
922
                    ($yTm <= ($y + $yError))) {
923
                    $extractedData[] = [$tm, $text];
924
                    continue;
925
                }
926
            }
927 2
            if (($xTm >= ($x - $xError)) &&
928 2
                ($xTm <= ($x + $xError)) &&
929 2
                ($yTm >= ($y - $yError)) &&
930 2
                ($yTm <= ($y + $yError))) {
931 2
                $extractedData[] = [$tm, $text];
932 2
                continue;
933
            }
934
        }
935
936 2
        return $extractedData;
937
    }
938
}
939
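
The position tracking described in the operator comments of getDataTm(), together with the tolerance check used by getTextXY(), can be sketched outside the class as follows. This is a minimal illustration, not the library's implementation: trackTextPositions() and filterNear() are hypothetical helper names, the command shape (['o' => operator, 'c' => operands]) is assumed to match what getDataTm() consumes, and the sketch assumes an unscaled, unrotated text matrix so that only the translation (x, y) needs to be tracked.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of PdfParser): replays the positioning
 * operators and records the (x, y) translation at each Tj/TJ operator.
 */
function trackTextPositions(array $commands): array
{
    $x = 0.0;
    $y = 0.0;
    $leading = 0.0;
    $positions = [];

    foreach ($commands as $command) {
        $operands = '' === $command['c'] ? [] : explode(' ', $command['c']);
        switch ($command['o']) {
            case 'BT': // a text object starts from the identity matrix
                $x = 0.0;
                $y = 0.0;
                break;
            case 'TL': // set the leading used by T*
                $leading = (float) $operands[0];
                break;
            case 'Td': // offset from the start of the current line
                $x += (float) $operands[0];
                $y += (float) $operands[1];
                break;
            case 'TD': // same as "-ty TL" followed by "tx ty Td"
                $leading = -(float) $operands[1];
                $x += (float) $operands[0];
                $y += (float) $operands[1];
                break;
            case 'Tm': // operands e and f carry the translation
                $x = (float) $operands[4];
                $y = (float) $operands[5];
                break;
            case 'T*': // same as "0 -Tl Td"
                $y -= $leading;
                break;
            case 'Tj':
            case 'TJ':
                $positions[] = [$x, $y];
                break;
        }
    }

    return $positions;
}

/**
 * Hypothetical helper: keeps the positions lying within the given
 * tolerances of ($x, $y), mirroring the "near" test of getTextXY().
 */
function filterNear(array $positions, float $x, float $y, float $xError = 0.0, float $yError = 0.0): array
{
    return array_values(array_filter(
        $positions,
        static function (array $p) use ($x, $y, $xError, $yError): bool {
            return abs($p[0] - $x) <= $xError && abs($p[1] - $y) <= $yError;
        }
    ));
}

// Usage: three lines of text, the last two placed via TD and T*.
$positions = trackTextPositions([
    ['o' => 'BT', 'c' => ''],
    ['o' => 'Td', 'c' => '10 700'],
    ['o' => 'Tj', 'c' => '(first line)'],
    ['o' => 'TD', 'c' => '0 -14'],
    ['o' => 'Tj', 'c' => '(second line)'],
    ['o' => 'T*', 'c' => ''],
    ['o' => 'Tj', 'c' => '(third line)'],
]);
$nearSecondLine = filterNear($positions, 10, 690, 0, 5);
```

Keeping the leading as a positive number and subtracting it in T* follows the "0 -Tl Td" equivalence, which is why TD stores the negated ty. Unlike getDataTm(), the sketch does not reset the leading on BT; that choice is an assumption made for illustration.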