Passed
Pull Request — master (#544)
by Konrad
07:43 queued 05:08
created

Page::getTextArray()   C

Complexity

Conditions 12
Paths 9

Size

Total Lines 54
Code Lines 34

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 21
CRAP Score 16.8344

Importance

Changes 3
Bugs 1 Features 1
Metric   Value
cc       12
eloc     34
c        3
b        1
f        1
nc       9
nop      1
dl       0
loc      54
ccs      21
cts      31
cp       0.6774
crap     16.8344
rs       6.9666
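
The CRAP score in the table can be approximated from two of the other metrics. The sketch below assumes the commonly cited formula, cc² × (1 − coverage)³ + cc, where cc is the cyclomatic complexity and cp is the covered fraction. Whether Scrutinizer computes the score in exactly this way is an assumption, and the function name crapScore is hypothetical.

```php
<?php

/**
 * Hypothetical helper: CRAP score from cyclomatic complexity and coverage.
 * Assumes the commonly cited formula cc^2 * (1 - coverage)^3 + cc.
 */
function crapScore(int $complexity, float $coverage): float
{
    return $complexity ** 2 * (1.0 - $coverage) ** 3 + $complexity;
}

// Values taken from the metrics table: cc = 12, cp = 0.6774.
$score = crapScore(12, 0.6774);
printf("%.2f\n", $score);
```

With the values from the table, the result lands close to the reported 16.8344. Under this formula, full coverage would reduce the score to the complexity itself (12), so both more tests and fewer branches lower it.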

How to fix: Long Method, Complexity

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include Extract Method.
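
As a minimal sketch of the Extract Method refactoring, the comment "Create a virtual global content" that appears in the listing becomes the name of a small function. The Stream class and the function createVirtualGlobalContent are hypothetical stand-ins written for this illustration; they are not part of PdfParser.

```php
<?php

// Hypothetical stand-in for a content stream; not the PdfParser API.
final class Stream
{
    /** @var string */
    private $content;

    public function __construct(string $content)
    {
        $this->content = $content;
    }

    public function getContent(): string
    {
        return $this->content;
    }
}

/**
 * Extracted from a longer method, where this loop sat under the comment
 * "Create a virtual global content". The comment became the name.
 *
 * @param Stream[] $streams
 */
function createVirtualGlobalContent(array $streams): string
{
    $merged = '';
    foreach ($streams as $stream) {
        $merged .= $stream->getContent()."\n";
    }

    return $merged;
}

echo createVirtualGlobalContent([new Stream('BT'), new Stream('ET')]);
```

Once the loop has a name, the calling method reads as a sequence of steps, and the same helper can serve every place that currently repeats the loop.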

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 * @date    2017-01-03
 *
 * @license LGPLv3
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser;

use Smalot\PdfParser\Element\ElementArray;
use Smalot\PdfParser\Element\ElementMissing;
use Smalot\PdfParser\Element\ElementNull;
use Smalot\PdfParser\Element\ElementXRef;

class Page extends PDFObject
{
    /**
     * @var Font[]
     */
    protected $fonts = null;

    /**
     * @var PDFObject[]
     */
    protected $xobjects = null;

    /**
     * @var array
     */
    protected $dataTm = null;

    /**
     * @return Font[]
     */
    public function getFonts()
    {
        if (null !== $this->fonts) {
            return $this->fonts;
        }

        $resources = $this->get('Resources');

        if (method_exists($resources, 'has') && $resources->has('Font')) {
            if ($resources->get('Font') instanceof ElementMissing) {

Issue (Bug): The method get() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

            if ($resources->/** @scrutinizer ignore-call */ get('Font') instanceof ElementMissing) {

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces. This is most likely a typographical error, or the method has been renamed.

                return [];
            }

            if ($resources->get('Font') instanceof Header) {
                $fonts = $resources->get('Font')->getElements();
            } else {
                $fonts = $resources->get('Font')->getHeader()->getElements();
            }

            $table = [];

            foreach ($fonts as $id => $font) {
                if ($font instanceof Font) {
                    $table[$id] = $font;

                    // Store too on cleaned id value (only numeric)
                    $id = preg_replace('/[^0-9\.\-_]/', '', $id);
                    if ('' != $id) {
                        $table[$id] = $font;
                    }
                }
            }

            return $this->fonts = $table;
        }

        return [];
    }
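
The "method get() does not exist" finding arises because the analyzer only sees the base Element type, while the code guards the call with method_exists(). One alternative to the ignore-call annotation is to narrow the type with instanceof, so that the methods are known statically. The following is a sketch under that assumption; BaseElement, Dictionary and fontEntry are hypothetical names, not PdfParser classes or functions.

```php
<?php

// Hypothetical stand-ins; these are not the PdfParser classes.
class BaseElement
{
}

class Dictionary extends BaseElement
{
    /** @var array<string, mixed> */
    private $entries;

    public function __construct(array $entries)
    {
        $this->entries = $entries;
    }

    public function has(string $name): bool
    {
        return array_key_exists($name, $this->entries);
    }

    public function get(string $name)
    {
        return $this->entries[$name] ?? null;
    }
}

function fontEntry(BaseElement $resources)
{
    // instanceof narrows the type, so has() and get() are known to exist here.
    if ($resources instanceof Dictionary && $resources->has('Font')) {
        return $resources->get('Font');
    }

    return null;
}
```

Whether the real class hierarchy allows such a check in place of method_exists() depends on which types get('Resources') can return, which the listing does not show.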

    public function getFont(string $id): ?Font
    {
        $fonts = $this->getFonts();

        if (isset($fonts[$id])) {
            return $fonts[$id];
        }

        // According to the PDF specs (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, page 238)
        // "The font resource name presented to the Tf operator is arbitrary, as are the names for all kinds of resources"
        // Instead, we search for the unfiltered name first and then do this cleaning as a fallback, so all tests still pass.

        if (isset($fonts[$id])) {
            return $fonts[$id];
        } else {
            $id = preg_replace('/[^0-9\.\-_]/', '', $id);
            if (isset($fonts[$id])) {
                return $fonts[$id];
            }
        }

        return null;
    }
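
The fallback in getFont() strips every character except digits, dots, hyphens and underscores from the id before looking it up again. The sketch below shows what that pattern does to a few made-up resource names; the function name cleanResourceId is hypothetical.

```php
<?php

// Same pattern as in the listing: keep only digits, '.', '-' and '_'.
function cleanResourceId(string $id): string
{
    return preg_replace('/[^0-9\.\-_]/', '', $id);
}

foreach (['F1', 'TT2', 'R10', 'Helvetica'] as $id) {
    printf("%-10s => '%s'\n", $id, cleanResourceId($id));
}
```

Two different names such as F1 and T1 clean to the same id, which is one reason to try the unfiltered name first, as the comment in the listing describes.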

    /**
     * Support for XObject
     *
     * @return PDFObject[]
     */
    public function getXObjects()
    {
        if (null !== $this->xobjects) {
            return $this->xobjects;
        }

        $resources = $this->get('Resources');

        if (method_exists($resources, 'has') && $resources->has('XObject')) {
            if ($resources->get('XObject') instanceof Header) {
                $xobjects = $resources->get('XObject')->getElements();
            } else {
                $xobjects = $resources->get('XObject')->getHeader()->getElements();
            }

            $table = [];

            foreach ($xobjects as $id => $xobject) {
                $table[$id] = $xobject;

                // Store too on cleaned id value (only numeric)
                $id = preg_replace('/[^0-9\.\-_]/', '', $id);
                if ('' != $id) {
                    $table[$id] = $xobject;
                }
            }

            return $this->xobjects = $table;
        }

        return [];
    }

    public function getXObject(string $id): ?PDFObject
    {
        $xobjects = $this->getXObjects();

        if (isset($xobjects[$id])) {
            return $xobjects[$id];
        }

        return null;
        /*$id = preg_replace('/[^0-9\.\-_]/', '', $id);

        if (isset($xobjects[$id])) {
            return $xobjects[$id];
        } else {
            return null;
        }*/
    }

    public function getText(self $page = null): string
    {
        if ($contents = $this->get('Contents')) {
            if ($contents instanceof ElementMissing) {
                return '';
            } elseif ($contents instanceof ElementNull) {
                return '';
            } elseif ($contents instanceof PDFObject) {

Issue: $contents is never a sub-type of Smalot\PdfParser\PDFObject.

                $elements = $contents->getHeader()->getElements();

                if (is_numeric(key($elements))) {
                    $new_content = '';

                    foreach ($elements as $element) {
                        if ($element instanceof ElementXRef) {
                            $new_content .= $element->getObject()->getContent();
                        } else {
                            $new_content .= $element->getContent();
                        }
                    }

                    $header = new Header([], $this->document);
                    $contents = new PDFObject($this->document, $header, $new_content, $this->config);
                }
            } elseif ($contents instanceof ElementArray) {
                // Create a virtual global content.
                $new_content = '';

                foreach ($contents->getContent() as $content) {
                    $new_content .= $content->getContent()."\n";
                }

                $header = new Header([], $this->document);
                $contents = new PDFObject($this->document, $header, $new_content, $this->config);
            }

            /*
             * Elements referencing each other on the same page can cause endless loops during text parsing.
             * To combat this we keep a recursionStack containing already parsed elements on the page.
             * The stack is only emptied here after getting text from a page.
             */
            $contentsText = $contents->getText($this);

Issue (Bug): The method getText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

            /** @scrutinizer ignore-call */
            $contentsText = $contents->getText($this);

            PDFObject::$recursionStack = [];

            return $contentsText;
        }

        return '';
    }

    /**
     * Return true if the current page is a (setasign\Fpdi\Fpdi) FPDI/FPDF document
     *
     * The metadata 'Producer' should have the value of "FPDF" . FPDF_VERSION if the
     * pdf file was generated by FPDF/FPDI.
     *
     * @return bool true if the current page is a FPDI/FPDF document
     */
    public function isFpdf(): bool
    {
        if (\array_key_exists('Producer', $this->document->getDetails()) &&
            \is_string($this->document->getDetails()['Producer']) &&
            0 === strncmp($this->document->getDetails()['Producer'], 'FPDF', 4)) {
            return true;
        }

        return false;
    }
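
The check in isFpdf() is a case-sensitive prefix test on the 'Producer' entry. It can be sketched as a free function that takes the details array directly, which also avoids calling getDetails() three times. The function name producerIsFpdf and the sample values are hypothetical.

```php
<?php

// Hypothetical free function mirroring the three checks in isFpdf().
function producerIsFpdf(array $details): bool
{
    return \array_key_exists('Producer', $details)
        && \is_string($details['Producer'])
        && 0 === strncmp($details['Producer'], 'FPDF', 4);
}

var_dump(producerIsFpdf(['Producer' => 'FPDF 1.82']));
var_dump(producerIsFpdf(['Producer' => 'Other tool']));
```

Because strncmp() compares bytes, a producer string in lowercase ("fpdf ...") would not match.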

    /**
     * Return the page number of the PDF document of the page object
     *
     * @return int the page number
     */
    public function getPageNumber(): int
    {
        $pages = $this->document->getPages();
        $numOfPages = \count($pages);
        for ($pageNum = 0; $pageNum < $numOfPages; ++$pageNum) {
            if ($pages[$pageNum] === $this) {
                break;
            }
        }

        return $pageNum;
    }
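
If the page is not found, the loop in getPageNumber() ends with $pageNum equal to the number of pages, so a caller cannot tell a miss from a real index without comparing against the page count. A minimal sketch of an alternative that makes the miss explicit is shown below; findPageIndex is a hypothetical helper, not part of the library, and returning null is a design assumption.

```php
<?php

/**
 * Hypothetical helper: zero-based index of $page in $pages, or null if absent.
 */
function findPageIndex(array $pages, object $page): ?int
{
    // Strict mode compares objects by identity (===), like the loop in the listing.
    $index = array_search($page, array_values($pages), true);

    return false === $index ? null : $index;
}

$first = new stdClass();
$second = new stdClass();
var_dump(findPageIndex([$first, $second], $second));
```

Changing the return type to a nullable int would affect callers such as getPDFObjectForFpdf(), so this is an option to weigh rather than a drop-in replacement.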

    /**
     * Return the Object of the page if the document is a FPDF/FPDI document
     *
     * If the document was generated by FPDF/FPDI it returns the
     * PDFObject of the given page
     *
     * @return PDFObject The PDFObject for the page
     */
    public function getPDFObjectForFpdf(): PDFObject
    {
        $pageNum = $this->getPageNumber();
        $xObjects = $this->getXObjects();

        return $xObjects[$pageNum];
    }

    /**
     * Return a new PDFObject of the document created with FPDF/FPDI
     *
     * For a document generated by FPDF/FPDI, it generates a
     * new PDFObject for that document
     *
     * @return PDFObject The PDFObject
     */
    public function createPDFObjectForFpdf(): PDFObject
    {
        $pdfObject = $this->getPDFObjectForFpdf();
        $new_content = $pdfObject->getContent();
        $header = $pdfObject->getHeader();
        $config = $pdfObject->config;

        return new PDFObject($pdfObject->document, $header, $new_content, $config);
    }

    /**
     * Return page if document is a FPDF/FPDI document
     *
     * @return Page The page
     */
    public function createPageForFpdf(): self
    {
        $pdfObject = $this->getPDFObjectForFpdf();
        $new_content = $pdfObject->getContent();
        $header = $pdfObject->getHeader();
        $config = $pdfObject->config;

        return new self($pdfObject->document, $header, $new_content, $config);
    }

    public function getTextArray(self $page = null): array
    {
        if ($this->isFpdf()) {
            $pdfObject = $this->getPDFObjectForFpdf();
            $newPdfObject = $this->createPDFObjectForFpdf();

            return $newPdfObject->getTextArray($pdfObject);
        } else {
            if ($contents = $this->get('Contents')) {
                if ($contents instanceof ElementMissing) {
                    return [];
                } elseif ($contents instanceof ElementNull) {
                    return [];
                } elseif ($contents instanceof PDFObject) {

Issue: $contents is never a sub-type of Smalot\PdfParser\PDFObject.

                    $elements = $contents->getHeader()->getElements();

                    if (is_numeric(key($elements))) {
                        $new_content = '';

                        /** @var PDFObject $element */
                        foreach ($elements as $element) {
                            if ($element instanceof ElementXRef) {
                                $new_content .= $element->getObject()->getContent();
                            } else {
                                $new_content .= $element->getContent();
                            }
                        }

                        $header = new Header([], $this->document);
                        $contents = new PDFObject($this->document, $header, $new_content, $this->config);
                    } else {
                        try {
                            $contents->getTextArray($this);
                        } catch (\Throwable $e) {
                            return $contents->getTextArray();
                        }
                    }
                } elseif ($contents instanceof ElementArray) {
                    // Create a virtual global content.
                    $new_content = '';

                    /** @var PDFObject $content */
                    foreach ($contents->getContent() as $content) {
                        $new_content .= $content->getContent()."\n";
                    }

                    $header = new Header([], $this->document);
                    $contents = new PDFObject($this->document, $header, $new_content, $this->config);
                }

                return $contents->getTextArray($this);

Issue (Bug): The method getTextArray() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

                return $contents->/** @scrutinizer ignore-call */ getTextArray($this);

            }

            return [];
        }
    }

    /**
     * Gets all the text data with its internal representation of the page.
     *
     * Returns an array with the data and the internal representation
     */
    public function extractRawData(): array
    {
        /*
         * Now you can get the complete content of the object with the text on it
         */
        $extractedData = [];
        $content = $this->get('Contents');
        $values = $content->getContent();
        if (isset($values) && \is_array($values)) {
            $text = '';
            foreach ($values as $section) {
                $text .= $section->getContent();
            }
            $sectionsText = $this->getSectionsText($text);
            foreach ($sectionsText as $sectionText) {
                $commandsText = $this->getCommandsText($sectionText);
                foreach ($commandsText as $command) {
                    $extractedData[] = $command;
                }
            }
        } else {
            if ($this->isFpdf()) {
                $content = $this->getPDFObjectForFpdf();
            }
            $sectionsText = $content->getSectionsText($content->getContent());

Issue (Bug): The method getSectionsText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

            /** @scrutinizer ignore-call */
            $sectionsText = $content->getSectionsText($content->getContent());

            foreach ($sectionsText as $sectionText) {
                $extractedData[] = ['t' => '', 'o' => 'BT', 'c' => ''];

                $commandsText = $content->getCommandsText($sectionText);

Issue (Bug): The method getCommandsText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

                /** @scrutinizer ignore-call */
                $commandsText = $content->getCommandsText($sectionText);

                foreach ($commandsText as $command) {
                    $extractedData[] = $command;
                }
            }
        }

        return $extractedData;
    }

    /**
     * Gets all the decoded text data with its internal representation from a page.
     *
     * @param array $extractedRawData the extracted data returned by extractRawData or
     *                                null if extractRawData should be called
     *
     * @return array An array with the data and the internal representation
     */
    public function extractDecodedRawData(array $extractedRawData = null): array
    {
        if (!isset($extractedRawData) || !$extractedRawData) {

Issue (Bug, Best Practice): The expression $extractedRawData of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent. Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

            $extractedRawData = $this->extractRawData();
        }
        $currentFont = null; /** @var Font $currentFont */
        $clippedFont = null;
        $fpdfPage = null;
        if ($this->isFpdf()) {
            $fpdfPage = $this->createPageForFpdf();
        }
        foreach ($extractedRawData as &$command) {
            if ('Tj' == $command['o'] || 'TJ' == $command['o']) {
                $data = $command['c'];
                if (!\is_array($data)) {
                    $tmpText = '';
                    if (isset($currentFont)) {
                        $tmpText = $currentFont->decodeOctal($data);
                        //$tmpText = $currentFont->decodeHexadecimal($tmpText, false);
                    }
                    $tmpText = str_replace(
                            ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
                            ['\\', '(', ')', "\n", "\r", "\t", ' '],
                            $tmpText
                    );
                    $tmpText = utf8_encode($tmpText);
                    if (isset($currentFont)) {
                        $tmpText = $currentFont->decodeContent($tmpText);
                    }
                    $command['c'] = $tmpText;
                    continue;
                }
                $numText = \count($data);
                for ($i = 0; $i < $numText; ++$i) {
                    if (0 != ($i % 2)) {
                        continue;
                    }
                    $tmpText = $data[$i]['c'];
                    $decodedText = isset($currentFont) ? $currentFont->decodeOctal($tmpText) : $tmpText;
                    $decodedText = str_replace(
                            ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
                            ['\\', '(', ')', "\n", "\r", "\t", ' '],
                            $decodedText
                    );
                    $decodedText = utf8_encode($decodedText);
                    if (isset($currentFont)) {
                        $decodedText = $currentFont->decodeContent($decodedText);
                    }
                    $command['c'][$i]['c'] = $decodedText;
                    continue;
                }
            } elseif ('Tf' == $command['o'] || 'TF' == $command['o']) {
                $fontId = explode(' ', $command['c'])[0];
                // If document is a FPDI/FPDF the $page has the correct font
                $currentFont = isset($fpdfPage) ? $fpdfPage->getFont($fontId) : $this->getFont($fontId);
                continue;
            } elseif ('Q' == $command['o']) {
                $currentFont = $clippedFont;
            } elseif ('q' == $command['o']) {
                $clippedFont = $currentFont;
            }
        }

        return $extractedRawData;
    }
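
When str_replace() is given arrays, it applies each search/replace pair in turn over the whole string, so text produced by an earlier pair can be matched by a later one. For an escaped backslash followed by the letter n, the call in the listing therefore ends up with a newline. The sketch below uses strtr(), which scans the input once and does not revisit replaced text. The function name unescapePdfString is hypothetical, and whether this edge case occurs in the PDFs the library handles is not something the listing shows.

```php
<?php

// Hypothetical helper: unescape a PDF literal string in a single pass.
function unescapePdfString(string $text): string
{
    return strtr($text, [
        '\\\\' => '\\',
        '\(' => '(',
        '\)' => ')',
        '\n' => "\n",
        '\r' => "\r",
        '\t' => "\t",
        '\ ' => ' ',
    ]);
}

// An escaped backslash followed by the letter "n": backslash, backslash, n.
$input = '\\\\n';

// Sequential replacement, as in the listing: the result is a single newline.
$sequential = str_replace(['\\\\', '\n'], ['\\', "\n"], $input);

// Single-pass replacement: the result is a backslash followed by "n".
$singlePass = unescapePdfString($input);

var_dump($sequential === "\n", $singlePass === '\\n');
```

The search and replacement values are the same as in the listing; only the replacement strategy differs.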

    /**
     * Gets just the Text commands that are involved in text positions and
     * Text Matrix (Tm)
     *
     * It extracts just the PDF commands that are involved with text positions, and
     * the Text Matrix (Tm). These are: BT, ET, TL, Td, TD, Tm, T*, Tj, ', ", and TJ
     *
     * @param array $extractedDecodedRawData The data extracted by extractDecodedRawData.
     *                                       If it is null, the method extractDecodedRawData is called.
     *
     * @return array An array with the text command of the page
     */
    public function getDataCommands(array $extractedDecodedRawData = null): array
    {
        if (!isset($extractedDecodedRawData) || !$extractedDecodedRawData) {

Issue (Bug, Best Practice): The expression $extractedDecodedRawData of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

            $extractedDecodedRawData = $this->extractDecodedRawData();
        }
        $extractedData = [];
        foreach ($extractedDecodedRawData as $command) {
            switch ($command['o']) {
                /*
                 * BT
                 * Begin a text object, initializing the Tm and Tlm to identity matrix
                 */
                case 'BT':
                    $extractedData[] = $command;
                    break;

                /*
                 * ET
                 * End a text object, discarding the text matrix
                 */
                case 'ET':
                    $extractedData[] = $command;
                    break;

                /*
                 * leading TL
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
                 * Initial value: 0
                 */
                case 'TL':
                    $extractedData[] = $command;
                    break;

                /*
                 * tx ty Td
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty.
                 */
                case 'Td':
                    $extractedData[] = $command;
                    break;

                /*
                 * tx ty TD
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty. As a side effect, this operator sets the leading
                 * parameter in the text state. This operator has the same effect as the
                 * code:
                 * -ty TL
                 * tx ty Td
                 */
                case 'TD':
                    $extractedData[] = $command;
                    break;

                /*
                 * a b c d e f Tm
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
                 * [1 0 0 1 0 0]
                 */
                case 'Tm':
                    $extractedData[] = $command;
                    break;

                /*
                 * T*
                 * Move to the start of the next line. This operator has the same effect
                 * as the code:
                 * 0 -Tl Td
                 * Where Tl is the current leading parameter in the text state.
                 */
                case 'T*':
                    $extractedData[] = $command;
                    break;

                /*
                 * string Tj
                 * Show a Text String
                 */
                case 'Tj':
                    $extractedData[] = $command;
                    break;

                /*
                 * string '
                 * Move to the next line and show a text string. This operator has the
                 * same effect as the code:
                 * T*
                 * string Tj
                 */
                case "'":
                    $extractedData[] = $command;
                    break;

                /*
                 * aw ac string "
                 * Move to the next line and show a text string, using aw as the word
                 * spacing and ac as the character spacing. This operator has the same
                 * effect as the code:
                 * aw Tw
                 * ac Tc
                 * string '
                 * Tw sets the word spacing, Tw, to wordSpace.
                 * Tc sets the character spacing, Tc, to charsSpace.
                 */
                case '"':
                    $extractedData[] = $command;
                    break;

                case 'Tf':
                case 'TF':
                    if ($this->config->getDataTmFontInfoHasToBeIncluded()) {
                        $extractedData[] = $command;
                    }
                    break;

                /*
                 * array TJ
                 * Show one or more text strings, allowing individual glyph positioning.
                 * Each element of the array can be a string or a number. If the element is
                 * a string, this operator shows the string. If it is a number, the
                 * operator adjusts the text position by that amount; that is, it translates
                 * the text matrix, Tm. This amount is subtracted from the current
                 * horizontal or vertical coordinate, depending on the writing mode.
                 * In the default coordinate system, a positive adjustment has the effect
                 * of moving the next glyph painted either to the left or down by the given
                 * amount.
                 */
                case 'TJ':
                    $extractedData[] = $command;
                    break;
                default:
            }
        }

        return $extractedData;
    }
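
Most branches of the switch in getDataCommands() do the same thing: they append the command. As a sketch of how the branching could be reduced, the operators can be kept in a list and checked with in_array(). The function name filterTextCommands is hypothetical, and the $includeFontInfo parameter stands in for the config call in the listing.

```php
<?php

/**
 * Hypothetical helper: keep only the text-positioning commands.
 *
 * @param array[] $commands        each command has an 'o' (operator) key
 * @param bool    $includeFontInfo stands in for the config flag in the listing
 */
function filterTextCommands(array $commands, bool $includeFontInfo): array
{
    $operators = ['BT', 'ET', 'TL', 'Td', 'TD', 'Tm', 'T*', 'Tj', "'", '"', 'TJ'];
    if ($includeFontInfo) {
        $operators[] = 'Tf';
        $operators[] = 'TF';
    }

    $kept = [];
    foreach ($commands as $command) {
        // Strict comparison; the switch in the listing compares loosely.
        if (in_array($command['o'], $operators, true)) {
            $kept[] = $command;
        }
    }

    return $kept;
}

$commands = [['o' => 'BT'], ['o' => 'Tf'], ['o' => 're'], ['o' => 'TJ']];
print_r(filterTextCommands($commands, false));
```

The per-operator comments from the switch would need a new home, for example next to the operator list, so this trades inline documentation for fewer branches.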
637
638
    /**
639
     * Gets the Text Matrix of the text in the page
640
     *
641
     * Return an array where every item is an array where the first item is the
642
     * Text Matrix (Tm) and the second is a string with the text data.  The Text matrix
643
     * is an array of 6 numbers. The last 2 numbers are the coordinates X and Y of the
644
     * text. The first 4 numbers has to be with Scalation, Rotation and Skew of the text.
645
     *
646
     * @param array $dataCommands the data extracted by getDataCommands
647
     *                            if null getDataCommands is called
648
     *
649
     * @return array an array with the data of the page including the Tm information
650
     *               of any text in the page
651
     */
652 6
    public function getDataTm(array $dataCommands = null): array
653
    {
654 6
        if (!isset($dataCommands) || !$dataCommands) {
Bug (Best Practice): The expression $dataCommands of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(...) or !empty(...) instead.
            $dataCommands = $this->getDataCommands();
        }

        /*
         * At the beginning of a text object Tm is the identity matrix
         */
        $defaultTm = ['1', '0', '0', '1', '0', '0'];

        /*
         * Default text leading, used by the T*, ' and " operators
         */
        $defaultTl = 0;

        /*
         * Set default values for font data
         */
        $defaultFontId = -1;
        $defaultFontSize = 0;

        /*
         * Indexes of the X and Y coordinates in the text matrix (Tm)
         */
        $x = 4;
        $y = 5;
        $Tx = 0;
        $Ty = 0;

        $Tm = $defaultTm;
        $Tl = $defaultTl;
        $fontId = $defaultFontId;
        $fontSize = $defaultFontSize;

        $extractedTexts = $this->getTextArray();
        $extractedData = [];
        foreach ($dataCommands as $command) {
            $currentText = $extractedTexts[\count($extractedData)];
            switch ($command['o']) {
                /*
                 * BT
                 * Begin a text object, initializing Tm and Tlm to the identity matrix
                 */
                case 'BT':
                    $Tm = $defaultTm;
                    $Tl = $defaultTl; //review this.
                    $Tx = 0;
                    $Ty = 0;
                    $fontId = $defaultFontId;
                    $fontSize = $defaultFontSize;
                    break;

                /*
                 * ET
                 * End a text object, discarding the text matrix
                 */
                case 'ET':
                    $Tm = $defaultTm;
                    $Tl = $defaultTl;  //review this
                    $Tx = 0;
                    $Ty = 0;
                    $fontId = $defaultFontId;
                    $fontSize = $defaultFontSize;
                    break;

                /*
                 * leading TL
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
                 * Initial value: 0
                 */
                case 'TL':
                    $Tl = (float) $command['c'];
                    break;

                /*
                 * tx ty Td
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty.
                 */
                case 'Td':
                    $coord = explode(' ', $command['c']);
                    $Tx += (float) $coord[0];
                    $Ty += (float) $coord[1];
                    $Tm[$x] = (string) $Tx;
                    $Tm[$y] = (string) $Ty;
                    break;

                /*
                 * tx ty TD
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty. As a side effect, this operator sets the leading
                 * parameter in the text state. This operator has the same effect as the
                 * code:
                 * -ty TL
                 * tx ty Td
                 */
                case 'TD':
                    $coord = explode(' ', $command['c']);
                    // The leading becomes -ty; the offset itself is applied exactly as in Td.
                    $Tl = -((float) $coord[1]);
                    $Tx += (float) $coord[0];
                    $Ty += (float) $coord[1];
                    $Tm[$x] = (string) $Tx;
                    $Tm[$y] = (string) $Ty;
                    break;

                /*
                 * a b c d e f Tm
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
                 * [1 0 0 1 0 0]
                 */
                case 'Tm':
                    $Tm = explode(' ', $command['c']);
                    $Tx = (float) $Tm[$x];
                    $Ty = (float) $Tm[$y];
                    break;

                /*
                 * T*
                 * Move to the start of the next line. This operator has the same effect
                 * as the code:
                 * 0 -Tl Td
                 * Where Tl is the current leading parameter in the text state.
                 */
                case 'T*':
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    break;

                /*
                 * string Tj
                 * Show a text string
                 */
                case 'Tj':
                    $data = [$Tm, $currentText];
                    if ($this->config->getDataTmFontInfoHasToBeIncluded()) {
                        $data[] = $fontId;
                        $data[] = $fontSize;
                    }
                    $extractedData[] = $data;
                    break;

                /*
                 * string '
                 * Move to the next line and show a text string. This operator has the
                 * same effect as the code:
                 * T*
                 * string Tj
                 */
                case "'":
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    $extractedData[] = [$Tm, $currentText];
                    break;

                /*
                 * aw ac string "
                 * Move to the next line and show a text string, using aw as the word
                 * spacing and ac as the character spacing. This operator has the same
                 * effect as the code:
                 * aw Tw
                 * ac Tc
                 * string '
                 * Tw sets the word spacing, Tw, to aw.
                 * Tc sets the character spacing, Tc, to ac.
                 */
                case '"':
                    $data = explode(' ', $currentText);
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    $extractedData[] = [$Tm, $data[2]]; //Verify
                    break;

                case 'Tf':
                    /*
                     * From PDF 1.0 specification, page 106:
                     *     fontname size Tf Set font and size
                     *     Sets the text font and text size in the graphics state. There is no default value for
                     *     either fontname or size; they must be selected using Tf before drawing any text.
                     *     fontname is a resource name. size is a number expressed in text space units.
                     *
                     * Source: https://ia902503.us.archive.org/10/items/pdfy-0vt8s-egqFwDl7L2/PDF%20Reference%201.0.pdf
                     * Introduced with https://github.com/smalot/pdfparser/pull/516
                     */
                    list($fontId, $fontSize) = explode(' ', $command['c'], 2);
                    break;

                /*
                 * array TJ
                 * Show one or more text strings, allowing individual glyph positioning.
                 * Each element of the array can be a string or a number. If the element is
                 * a string, this operator shows the string. If it is a number, the
                 * operator adjusts the text position by that amount; that is, it translates
                 * the text matrix, Tm. This amount is subtracted from the current
                 * horizontal or vertical coordinate, depending on the writing mode.
                 * In the default coordinate system, a positive adjustment has the effect
                 * of moving the next glyph painted either to the left or down by the given
                 * amount.
                 */
                case 'TJ':
                    $data = [$Tm, $currentText];
                    if ($this->config->getDataTmFontInfoHasToBeIncluded()) {
                        $data[] = $fontId;
                        $data[] = $fontSize;
                    }
                    $extractedData[] = $data;
                    break;
                default:
            }
        }
        $this->dataTm = $extractedData;

        return $extractedData;
    }

    /**
     * Gets the text data located around the given coordinates (X,Y).
     *
     * If a text is near the given coordinates (X,Y), judged by its Tm info,
     * the text is returned. The data returned by getDataTm can be used to find
     * the coordinates of a given text, using its Tm info.
     *
     * @param float $x      The X value of the coordinate to search for. If null,
     *                      just the Y value is considered (same row)
     * @param float $y      The Y value of the coordinate to search for. If null,
     *                      just the X value is considered (same column)
     * @param float $xError The tolerance, plus or minus, within which an X is considered "near"
     * @param float $yError The tolerance, plus or minus, within which a Y is considered "near"
     *
     * @return array An array of the texts that are near the given coordinates. If no text
     *               is "near" the x,y coordinate, an empty array is returned. If both the x
     *               and y coordinates are null, an empty array is returned as well.
     */
    public function getTextXY(float $x = null, float $y = null, float $xError = 0, float $yError = 0): array
    {
        if (!isset($this->dataTm) || !$this->dataTm) {
Bug (Best Practice): The expression $this->dataTm of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.
            $this->getDataTm();
        }

        if (null !== $x) {
            $x = (float) $x;
        }

        if (null !== $y) {
            $y = (float) $y;
        }

        if (null === $x && null === $y) {
            return [];
        }

        $xError = (float) $xError;
        $yError = (float) $yError;

        $extractedData = [];
        foreach ($this->dataTm as $item) {
            $tm = $item[0];
            $xTm = (float) $tm[4];
            $yTm = (float) $tm[5];
            $text = $item[1];
            if (null === $y) {
                if (($xTm >= ($x - $xError)) &&
                    ($xTm <= ($x + $xError))) {
                    $extractedData[] = [$tm, $text];
                    continue;
                }
            }
            if (null === $x) {
                if (($yTm >= ($y - $yError)) &&
                    ($yTm <= ($y + $yError))) {
                    $extractedData[] = [$tm, $text];
                    continue;
                }
            }
            if (($xTm >= ($x - $xError)) &&
                ($xTm <= ($x + $xError)) &&
                ($yTm >= ($y - $yError)) &&
                ($yTm <= ($y + $yError))) {
                $extractedData[] = [$tm, $text];
                continue;
            }
        }

        return $extractedData;
    }
}
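
The position bookkeeping that getDataTm performs for the TL, Td, TD and T* operators can be sketched in isolation. The following is a minimal, standalone illustration and not part of PdfParser: the function name `trackTextPosition` is hypothetical, it assumes commands shaped like the `['o' => operator, 'c' => operands]` entries used above, and it tracks only the translation part of the text matrix. It follows the operator definitions quoted in the comments of getDataTm, in particular TD as `-ty TL` followed by `tx ty Td`.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not a PdfParser API): returns the [x, y] text
 * position after each command, handling only TL, Td, TD and T*.
 *
 * @param array $commands list of ['o' => operator, 'c' => operand string]
 *
 * @return array list of [float $x, float $y] pairs, one per command
 */
function trackTextPosition(array $commands): array
{
    $tx = 0.0;      // current X translation
    $ty = 0.0;      // current Y translation
    $leading = 0.0; // text leading, initial value 0
    $positions = [];

    foreach ($commands as $command) {
        $operands = explode(' ', $command['c']);
        switch ($command['o']) {
            case 'TL':
                // leading TL: set the text leading
                $leading = (float) $operands[0];
                break;
            case 'Td':
                // tx ty Td: offset from the start of the current line
                $tx += (float) $operands[0];
                $ty += (float) $operands[1];
                break;
            case 'TD':
                // tx ty TD: same as "-ty TL" followed by "tx ty Td"
                $leading = -((float) $operands[1]);
                $tx += (float) $operands[0];
                $ty += (float) $operands[1];
                break;
            case 'T*':
                // T*: move down by the current leading
                $ty -= $leading;
                break;
        }
        $positions[] = [$tx, $ty];
    }

    return $positions;
}

$positions = trackTextPosition([
    ['o' => 'Td', 'c' => '72 700'],
    ['o' => 'TD', 'c' => '0 -14'],
    ['o' => 'T*', 'c' => ''],
]);
// $positions is [[72.0, 700.0], [72.0, 686.0], [72.0, 672.0]]
```

In this sketch the TD command moves the line down by 14 units and stores a leading of 14, so the following T* moves down by the same distance again.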