Test Failed
Pull Request — master (#455)
by
unknown
02:30
created

Page::extractDecodedRawData()   D

Complexity
  Conditions: 19
  Paths: 44

Size
  Total Lines: 64
  Code Lines: 47

Duplication
  Lines: 0
  Ratio: 0 %

Code Coverage
  Tests: 16
  CRAP Score: 19.495

Importance
  Changes: 7
  Bugs: 1
  Features: 0

Metric  Value
cc      19
eloc    47
c       7
b       1
f       0
nc      44
nop     1
dl      0
loc     64
ccs     16
cts     18
cp      0.8889
crap    19.495
rs      4.5166
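The CRAP score in the table can be reproduced from the other metrics. The sketch below uses the usual CRAP definition, complexity squared times the uncovered fraction cubed, plus complexity. It assumes that cc is the complexity and that ccs/cts (16 of 18) is the covered fraction behind cp. The function name crapScore is made up for this illustration.

```php
<?php

/**
 * CRAP score for one method: comp^2 * (1 - coverage)^3 + comp.
 *
 * $coverage is a fraction between 0.0 and 1.0.
 */
function crapScore(int $complexity, float $coverage): float
{
    return $complexity ** 2 * (1.0 - $coverage) ** 3 + $complexity;
}

// Values from the metric table: cc = 19, ccs = 16 covered out of cts = 18.
echo round(crapScore(19, 16 / 18), 3), PHP_EOL; // prints 19.495
```

With full coverage the second term vanishes and the score equals the complexity, so for this method the score is driven almost entirely by the 19 conditions rather than by the two uncovered statements.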

How to fix   Long Method    Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
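In this file, getFonts() and getXObjects() both carry the comment "Store too on cleaned id value (only numeric)" above the same three lines. A minimal sketch of turning that comment into a method name follows. The names cleanResourceId and indexByRawAndCleanedId are hypothetical and not part of PdfParser.

```php
<?php

/**
 * Hypothetical helper named after the comment it replaces: strips every
 * character except digits, dots, hyphens and underscores.
 */
function cleanResourceId(string $id): string
{
    return preg_replace('/[^0-9\.\-_]/', '', $id) ?? '';
}

/**
 * Indexes each resource under its original name and, when the cleaned
 * name is not empty, under the cleaned name as well.
 */
function indexByRawAndCleanedId(array $resources): array
{
    $table = [];

    foreach ($resources as $id => $resource) {
        $table[$id] = $resource;

        $cleaned = cleanResourceId((string) $id);
        if ('' !== $cleaned) {
            $table[$cleaned] = $resource;
        }
    }

    return $table;
}

// 'F1' is stored under both 'F1' and '1'.
print_r(array_keys(indexByRawAndCleanedId(['F1' => 'Helvetica'])));
```

Both loops in the listing could then call the same helper, and the comment would no longer be needed.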

Commonly applied refactorings include:
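One refactoring of this kind is Extract Method. As a sketch only: extractDecodedRawData() contains the same escape-sequence replacement twice, once for Tj strings and once for TJ array elements, and that step could move into a helper. The function name unescapePdfString is an assumption made for this example; the replacement tables are copied from the listing.

```php
<?php

/**
 * Hypothetical helper: resolves the backslash escapes that the listing
 * replaces inline before decoding Tj/TJ operands.
 */
function unescapePdfString(string $text): string
{
    return str_replace(
        ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
        ['\\', '(', ')', "\n", "\r", "\t", ' '],
        $text
    );
}

echo unescapePdfString('Total \(net\)'), PHP_EOL; // prints: Total (net)
```

Each call site would shrink to one line, which reduces the method's length. The branch count stays the same, so this alone would not change the complexity figures.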

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 * @date    2017-01-03
 *
 * @license LGPLv3
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser;

use Smalot\PdfParser\Element\ElementArray;
use Smalot\PdfParser\Element\ElementMissing;
use Smalot\PdfParser\Element\ElementNull;
use Smalot\PdfParser\Element\ElementXRef;

class Page extends PDFObject
{
    /**
     * @var Font[]
     */
    protected $fonts = null;

    /**
     * @var PDFObject[]
     */
    protected $xobjects = null;

    /**
     * @var array
     */
    protected $dataTm = null;

    /**
     * @return Font[]
     */
    public function getFonts()
    {
        if (null !== $this->fonts) {
            return $this->fonts;
        }

        $resources = $this->get('Resources');

        if (method_exists($resources, 'has') && $resources->has('Font')) {
            if ($resources->get('Font') instanceof ElementMissing) {
Bug: The method get() does not exist on Smalot\PdfParser\Element. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

            if ($resources->/** @scrutinizer ignore-call */ get('Font') instanceof ElementMissing) {

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.
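As an alternative to the ignore-call annotation, the guard can name the type explicitly, which static analyzers can generally follow. The sketch below uses stand-in classes (Element and Dictionary are defined here for illustration and are not the PdfParser classes) and a hypothetical function fontEntry.

```php
<?php

// Stand-in types for illustration only; they are not the PdfParser classes.
class Element
{
}

class Dictionary extends Element
{
    private array $entries;

    public function __construct(array $entries)
    {
        $this->entries = $entries;
    }

    public function has(string $name): bool
    {
        return array_key_exists($name, $this->entries);
    }

    public function get(string $name)
    {
        return $this->entries[$name] ?? null;
    }
}

function fontEntry(Element $resources)
{
    // instanceof narrows $resources to Dictionary, so has() and get()
    // are known to exist inside this branch.
    if ($resources instanceof Dictionary && $resources->has('Font')) {
        return $resources->get('Font');
    }

    return null;
}

var_dump(fontEntry(new Dictionary(['Font' => 'F1']))); // string(2) "F1"
var_dump(fontEntry(new Element()));                    // NULL
```

Whether this fits PdfParser depends on which of its classes actually declares has() and get(), which this page does not show.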
                return [];
            }

            if ($resources->get('Font') instanceof Header) {
                $fonts = $resources->get('Font')->getElements();
            } else {
                $fonts = $resources->get('Font')->getHeader()->getElements();
            }

            $table = [];

            foreach ($fonts as $id => $font) {
                if ($font instanceof Font) {
                    $table[$id] = $font;

                    // Store too on cleaned id value (only numeric)
                    $id = preg_replace('/[^0-9\.\-_]/', '', $id);
                    if ('' != $id) {
                        $table[$id] = $font;
                    }
                }
            }

            return $this->fonts = $table;
        }

        return [];
    }

    public function getFont(string $id): ?Font
    {
        $fonts = $this->getFonts();

        if (isset($fonts[$id])) {
            return $fonts[$id];
        }

        // According to the PDF specs (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, page 238)
        // "The font resource name presented to the Tf operator is arbitrary, as are the names for all kinds of resources"
        // Instead, we search for the unfiltered name first and then do this cleaning as a fallback, so all tests still pass.

        if (isset($fonts[$id])) {
            return $fonts[$id];
        } else {
            $id = preg_replace('/[^0-9\.\-_]/', '', $id);
            if (isset($fonts[$id])) {
                return $fonts[$id];
            }
        }

        return null;
    }

    /**
     * Support for XObject
     *
     * @return PDFObject[]
     */
    public function getXObjects()
    {
        if (null !== $this->xobjects) {
            return $this->xobjects;
        }

        $resources = $this->get('Resources');

        if (method_exists($resources, 'has') && $resources->has('XObject')) {
            if ($resources->get('XObject') instanceof Header) {
                $xobjects = $resources->get('XObject')->getElements();
            } else {
                $xobjects = $resources->get('XObject')->getHeader()->getElements();
            }

            $table = [];

            foreach ($xobjects as $id => $xobject) {
                $table[$id] = $xobject;

                // Store too on cleaned id value (only numeric)
                $id = preg_replace('/[^0-9\.\-_]/', '', $id);
                if ('' != $id) {
                    $table[$id] = $xobject;
                }
            }

            return $this->xobjects = $table;
        }

        return [];
    }

    public function getXObject(string $id): ?PDFObject
    {
        $xobjects = $this->getXObjects();

        if (isset($xobjects[$id])) {
            return $xobjects[$id];
        }

        return null;
        /*$id = preg_replace('/[^0-9\.\-_]/', '', $id);

        if (isset($xobjects[$id])) {
            return $xobjects[$id];
        } else {
            return null;
        }*/
    }

    public function getText(self $page = null): string
    {
        if ($contents = $this->get('Contents')) {
            if ($contents instanceof ElementMissing) {
                return '';
            } elseif ($contents instanceof ElementNull) {
                return '';
            } elseif ($contents instanceof PDFObject) {
$contents is never a sub-type of Smalot\PdfParser\PDFObject.
                $elements = $contents->getHeader()->getElements();

                if (is_numeric(key($elements))) {
                    $new_content = '';

                    foreach ($elements as $element) {
                        if ($element instanceof ElementXRef) {
                            $new_content .= $element->getObject()->getContent();
                        } else {
                            $new_content .= $element->getContent();
                        }
                    }

                    $header = new Header([], $this->document);
                    $contents = new PDFObject($this->document, $header, $new_content, $this->config);
                }
            } elseif ($contents instanceof ElementArray) {
                // Create a virtual global content.
                $new_content = '';

                foreach ($contents->getContent() as $content) {
                    $new_content .= $content->getContent()."\n";
                }

                $header = new Header([], $this->document);
                $contents = new PDFObject($this->document, $header, $new_content, $this->config);
            }

            /*
             * Elements referencing each other on the same page can cause endless loops during text parsing.
             * To combat this we keep a recursionStack containing already parsed elements on the page.
             * The stack is only emptied here after getting text from a page.
             */
            $contentsText = $contents->getText($this);
Bug: The method getText() does not exist on Smalot\PdfParser\Element. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

            /** @scrutinizer ignore-call */
            $contentsText = $contents->getText($this);
            PDFObject::$recursionStack = [];

            return $contentsText;
        }

        return '';
    }

    /**
     * Return true if the current page is a (setasign\Fpdi\Fpdi) FPDI/FPDF document
     *
     * The metadata 'Producer' should have the value of "FPDF" . FPDF_VERSION if the
     * pdf file was generated by FPDF/FPDI.
     *
     * @return bool true if the current page is a FPDI/FPDF document
     */
    public function isFpdf(): bool
    {
        if (\array_key_exists('Producer', $this->document->getDetails()) &&
            \is_string($this->document->getDetails()['Producer']) &&
            str_starts_with($this->document->getDetails()['Producer'], 'FPDF')) {
            return true;
        }

        return false;
    }

    /**
     * Return the page number of the PDF document of the page object
     *
     * @return int the page number
     */
    public function getPageNumber(): int
    {
        $pages = $this->document->getPages();
        $numOfPages = \count($pages);
        for ($pageNum = 0; $pageNum < $numOfPages; ++$pageNum) {
            if ($pages[$pageNum] === $this) {
                break;
            }
        }

        return $pageNum;
    }

    /**
     * Return the xObject if the document is from fpdf
     *
     * @return object The xObject for the page
     */
    public function getXObjectForFpdf(): object
    {
        $pageNum = $this->getPageNumber();
        $xObjects = $this->getXObjects();

        return $xObjects[$pageNum];
    }

    /**
     * Return a PDFObject if document is from fpdf
     *
     * @return object The xObject for the page
     */
    public function getPDFObjectForFpdf(): object
    {
        $xObject = $this->getXObjectForFpdf();
        $new_content = $xObject->getContent();
        $header = $xObject->getHeader();
        $config = $xObject->config;

        return new PDFObject($xObject->document, $header, $new_content, $config);
    }

    /**
     * Return page if document is from fpdf
     *
     * @return object The page
     */
    public function getPageForFpdf(): object
    {
        $xObject = $this->getXObjectForFpdf();
        $new_content = $xObject->getContent();
        $header = $xObject->getHeader();
        $config = $xObject->config;

        return new self($xObject->document, $header, $new_content, $config);
    }

    public function getTextArray(self $page = null): array
    {
        if ($this->isFpdf()) {
            $xObject = $this->getXObjectForFpdf();
            $pdfObject = $this->getPDFObjectForFpdf();

            return $pdfObject->getTextArray($xObject);
        }
        if ($contents = $this->get('Contents')) {
            if ($contents instanceof ElementMissing) {
                return [];
            } elseif ($contents instanceof ElementNull) {
                return [];
            } elseif ($contents instanceof PDFObject) {
$contents is never a sub-type of Smalot\PdfParser\PDFObject.
                $elements = $contents->getHeader()->getElements();

                if (is_numeric(key($elements))) {
                    $new_content = '';

                    /** @var PDFObject $element */
                    foreach ($elements as $element) {
                        if ($element instanceof ElementXRef) {
                            $new_content .= $element->getObject()->getContent();
                        } else {
                            $new_content .= $element->getContent();
                        }
                    }

                    $header = new Header([], $this->document);
                    $contents = new PDFObject($this->document, $header, $new_content, $this->config);
                } else {
                    try {
                        $contents->getTextArray($this);
                    } catch (\Throwable $e) {
                        return $contents->getTextArray();
                    }
                }
            } elseif ($contents instanceof ElementArray) {
                // Create a virtual global content.
                $new_content = '';

                /** @var PDFObject $content */
                foreach ($contents->getContent() as $content) {
                    $new_content .= $content->getContent()."\n";
                }

                $header = new Header([], $this->document);
                $contents = new PDFObject($this->document, $header, $new_content, $this->config);
            }

            return $contents->getTextArray($this);
Bug: The method getTextArray() does not exist on Smalot\PdfParser\Element. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

            return $contents->/** @scrutinizer ignore-call */ getTextArray($this);
        }

        return [];
    }

    /**
     * Gets all the text data with its internal representation of the page.
     *
     * Returns an array with the data and the internal representation
     */
    public function extractRawData(): array
    {
        /*
         * Now you can get the complete content of the object with the text on it
         */
        $extractedData = [];
        $content = $this->get('Contents');
        $values = $content->getContent();
        if (isset($values) && \is_array($values)) {
            $text = '';
            foreach ($values as $section) {
                $text .= $section->getContent();
            }
            $sectionsText = $this->getSectionsText($text);
            foreach ($sectionsText as $sectionText) {
                $commandsText = $this->getCommandsText($sectionText);
                foreach ($commandsText as $command) {
                    $extractedData[] = $command;
                }
            }
        } else {
            if ($this->isFpdf()) {
                $content = $this->getXObjectForFpdf();
            }
            $sectionsText = $content->getSectionsText($content->getContent());
Bug: The method getSectionsText() does not exist on Smalot\PdfParser\Element. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

            /** @scrutinizer ignore-call */
            $sectionsText = $content->getSectionsText($content->getContent());
            foreach ($sectionsText as $sectionText) {
                $extractedData[] = ['t' => '', 'o' => 'BT', 'c' => ''];

                $commandsText = $content->getCommandsText($sectionText);
Bug: The method getCommandsText() does not exist on Smalot\PdfParser\Element. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

                /** @scrutinizer ignore-call */
                $commandsText = $content->getCommandsText($sectionText);
                foreach ($commandsText as $command) {
                    $extractedData[] = $command;
                }
            }
        }

        return $extractedData;
    }

    /**
     * Gets all the decoded text data with its internal representation from a page.
     *
     * @param array $extractedRawData the extracted data returned by extractRawData or
     *                                null if extractRawData should be called
     *
     * @return array An array with the data and the internal representation
     */
    public function extractDecodedRawData(array $extractedRawData = null): array
    {
        if (!isset($extractedRawData) || !$extractedRawData) {
Bug (Best Practice): The expression $extractedRawData of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.
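A minimal sketch of the explicit form the check suggests. The function resolveRawData and its callable parameter are hypothetical stand-ins for "use the argument, or fall back to extracting it"; they are not part of PdfParser.

```php
<?php

/**
 * Returns the given data, or the result of $extract when the data is
 * null or an empty array.
 */
function resolveRawData(?array $extractedRawData, callable $extract): array
{
    // empty() is true for both null and [], so it covers the same cases
    // as `!isset($x) || !$x` without the implicit array-to-bool cast.
    if (empty($extractedRawData)) {
        return $extract();
    }

    return $extractedRawData;
}

$fallback = function (): array {
    return [['o' => 'BT', 'c' => '']];
};

var_dump(count(resolveRawData(null, $fallback)));                          // int(1)
var_dump(count(resolveRawData([], $fallback)));                            // int(1)
var_dump(resolveRawData([['o' => 'ET', 'c' => '']], $fallback)[0]['o']);   // string(2) "ET"
```

The sketch also declares the parameter as ?array, which states the null default explicitly instead of relying on the implicit nullability of `array $x = null`.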
            $extractedRawData = $this->extractRawData();
        }
        $currentFont = null; /** @var Font $currentFont */
        $clippedFont = null;
        $fpdfPage = null;
        if ($this->isFpdf()) {
            $fpdfPage = $this->getPageForFpdf();
        }
        foreach ($extractedRawData as &$command) {
            if ('Tj' == $command['o'] || 'TJ' == $command['o']) {
                $data = $command['c'];
                if (!\is_array($data)) {
                    $tmpText = '';
                    if (isset($currentFont)) {
                        $tmpText = $currentFont->decodeOctal($data);
                        //$tmpText = $currentFont->decodeHexadecimal($tmpText, false);
                    }
                    $tmpText = str_replace(
                            ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
                            ['\\', '(', ')', "\n", "\r", "\t", ' '],
                            $tmpText
                    );
                    $tmpText = utf8_encode($tmpText);
                    if (isset($currentFont)) {
                        $tmpText = $currentFont->decodeContent($tmpText);
                    }
                    $command['c'] = $tmpText;
                    continue;
                }
                $numText = \count($data);
                for ($i = 0; $i < $numText; ++$i) {
                    if (0 != ($i % 2)) {
                        continue;
                    }
                    $tmpText = $data[$i]['c'];
                    $decodedText = isset($currentFont) ? $currentFont->decodeOctal($tmpText) : $tmpText;
                    $decodedText = str_replace(
                            ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
                            ['\\', '(', ')', "\n", "\r", "\t", ' '],
                            $decodedText
                    );
                    $decodedText = utf8_encode($decodedText);
                    if (isset($currentFont)) {
                        $decodedText = $currentFont->decodeContent($decodedText);
                    }
                    $command['c'][$i]['c'] = $decodedText;
                    continue;
                }
            } elseif ('Tf' == $command['o'] || 'TF' == $command['o']) {
                $fontId = explode(' ', $command['c'])[0];
                // If document is a FPDI/FPDF the $page has the correct font
                $currentFont = isset($fpdfPage) ? $fpdfPage->getFont($fontId) : $this->getFont($fontId);
                continue;
            } elseif ('Q' == $command['o']) {
                $currentFont = $clippedFont;
            } elseif ('q' == $command['o']) {
                $clippedFont = $currentFont;
            }
        }

        return $extractedRawData;
    }

    /**
     * Gets just the Text commands that are involved in text positions and
     * Text Matrix (Tm)
     *
     * It extracts just the PDF commands that are involved with text positions, and
     * the Text Matrix (Tm). These are: BT, ET, TL, Td, TD, Tm, T*, Tj, ', ", and TJ
     *
     * @param array $extractedDecodedRawData The data extracted by extractDecodedRawData.
     *                                       If it is null, the method extractDecodedRawData is called.
     *
     * @return array An array with the text command of the page
     */
    public function getDataCommands(array $extractedDecodedRawData = null): array
    {
        if (!isset($extractedDecodedRawData) || !$extractedDecodedRawData) {
Bug (Best Practice): The expression $extractedDecodedRawData of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.
            $extractedDecodedRawData = $this->extractDecodedRawData();
        }
        $extractedData = [];
        foreach ($extractedDecodedRawData as $command) {
            switch ($command['o']) {
                /*
                 * BT
                 * Begin a text object, initializing the Tm and Tlm to identity matrix
                 */
                case 'BT':
                    $extractedData[] = $command;
                    break;

                /*
                 * ET
                 * End a text object, discarding the text matrix
                 */
                case 'ET':
                    $extractedData[] = $command;
                    break;

                /*
                 * leading TL
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
                 * Initial value: 0
                 */
                case 'TL':
                    $extractedData[] = $command;
                    break;

                /*
                 * tx ty Td
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty.
                 */
                case 'Td':
                    $extractedData[] = $command;
                    break;

                /*
                 * tx ty TD
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty. As a side effect, this operator sets the leading
                 * parameter in the text state. This operator has the same effect as the
                 * code:
                 * -ty TL
                 * tx ty Td
                 */
                case 'TD':
                    $extractedData[] = $command;
                    break;

                /*
                 * a b c d e f Tm
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
                 * [1 0 0 1 0 0]
                 */
                case 'Tm':
                    $extractedData[] = $command;
                    break;

                /*
                 * T*
                 * Move to the start of the next line. This operator has the same effect
                 * as the code:
                 * 0 -Tl Td
                 * Where Tl is the current leading parameter in the text state.
                 */
                case 'T*':
                    $extractedData[] = $command;
                    break;

                /*
                 * string Tj
                 * Show a Text String
                 */
                case 'Tj':
                    $extractedData[] = $command;
                    break;

                /*
                 * string '
                 * Move to the next line and show a text string. This operator has the
                 * same effect as the code:
                 * T*
                 * string Tj
                 */
                case "'":
                    $extractedData[] = $command;
                    break;

                /*
                 * aw ac string "
                 * Move to the next line and show a text string, using aw as the word
                 * spacing and ac as the character spacing. This operator has the same
                 * effect as the code:
                 * aw Tw
                 * ac Tc
                 * string '
                 * Tw sets the word spacing, Tw, to wordSpace.
                 * Tc sets the character spacing, Tc, to charsSpace.
                 */
                case '"':
                    $extractedData[] = $command;
                    break;

                /*
                 * array TJ
                 * Show one or more text strings, allowing individual glyph positioning.
                 * Each element of array can be a string or a number. If the element is
                 * a string, this operator shows the string. If it is a number, the
                 * operator adjusts the text position by that amount; that is, it translates
                 * the text matrix, Tm. This amount is subtracted from the current
                 * horizontal or vertical coordinate, depending on the writing mode.
                 * In the default coordinate system, a positive adjustment has the effect
                 * of moving the next glyph painted either to the left or down by the given
                 * amount.
                 */
                case 'TJ':
                    $extractedData[] = $command;
                    break;
                default:
            }
        }

        return $extractedData;
    }

    /**
     * Gets the Text Matrix of the text in the page
     *
     * Returns an array where every item is an array whose first item is the
     * Text Matrix (Tm) and whose second item is a string with the text data. The Text Matrix
     * is an array of 6 numbers. The last 2 numbers are the X and Y coordinates of the
     * text. The first 4 numbers describe the scaling, rotation and skew of the text.
     *
     * @param array $dataCommands the data extracted by getDataCommands
     *                            if null getDataCommands is called
     *
     * @return array an array with the data of the page including the Tm information
     *               of any text in the page
     */
    public function getDataTm(array $dataCommands = null): array
    {
        if (!isset($dataCommands) || !$dataCommands) {
Bug (Best Practice): The expression $dataCommands of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.
641
            $dataCommands = $this->getDataCommands();
642
        }

        /*
         * At the beginning of a text object, Tm is the identity matrix
         */
        $defaultTm = ['1', '0', '0', '1', '0', '0'];

        /*
         * Default text leading, used by the T*, ' and " operators
         */
        $defaultTl = 0;

        /*
         * Indexes of the X and Y coordinates in the matrix (Tm)
         */
        $x = 4;
        $y = 5;
        $Tx = 0;
        $Ty = 0;

        $Tm = $defaultTm;
        $Tl = $defaultTl;

        $extractedTexts = $this->getTextArray();
        $extractedData = [];
        foreach ($dataCommands as $command) {
            $currentText = $extractedTexts[\count($extractedData)];
            switch ($command['o']) {
                /*
                 * BT
                 * Begin a text object, initializing Tm and Tlm to the identity matrix
                 */
                case 'BT':
                    $Tm = $defaultTm;
                    $Tl = $defaultTl; //review this.
                    $Tx = 0;
                    $Ty = 0;
                    break;

                /*
                 * ET
                 * End a text object, discarding the text matrix
                 */
                case 'ET':
                    $Tm = $defaultTm;
                    $Tl = $defaultTl;  //review this
                    $Tx = 0;
                    $Ty = 0;
                    break;

                /*
                 * leading TL
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
                 * Initial value: 0
                 */
                case 'TL':
                    $Tl = (float) $command['c'];
                    break;

                /*
                 * tx ty Td
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty.
                 */
                case 'Td':
                    $coord = explode(' ', $command['c']);
                    $Tx += (float) $coord[0];
                    $Ty += (float) $coord[1];
                    $Tm[$x] = (string) $Tx;
                    $Tm[$y] = (string) $Ty;
                    break;

                /*
                 * tx ty TD
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty. As a side effect, this operator sets the leading
                 * parameter in the text state. This operator has the same effect as the
                 * code:
                 * -ty TL
                 * tx ty Td
                 */
                case 'TD':
                    $coord = explode(' ', $command['c']);
                    // Per the equivalence above: the leading becomes -ty, and the
                    // position moves by (tx, ty) exactly as Td does.
                    $Tl = -((float) $coord[1]);
                    $Tx += (float) $coord[0];
                    $Ty += (float) $coord[1];
                    $Tm[$x] = (string) $Tx;
                    $Tm[$y] = (string) $Ty;
                    break;

                /*
                 * a b c d e f Tm
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
                 * [1 0 0 1 0 0]
                 */
                case 'Tm':
                    $Tm = explode(' ', $command['c']);
                    $Tx = (float) $Tm[$x];
                    $Ty = (float) $Tm[$y];
                    break;

                /*
                 * T*
                 * Move to the start of the next line. This operator has the same effect
                 * as the code:
                 * 0 -Tl Td
                 * Where Tl is the current leading parameter in the text state.
                 */
                case 'T*':
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    break;

                /*
                 * string Tj
                 * Show a text string
                 */
                case 'Tj':
                    $extractedData[] = [$Tm, $currentText];
                    break;

                /*
                 * string '
                 * Move to the next line and show a text string. This operator has the
                 * same effect as the code:
                 * T*
                 * string Tj
                 */
                case "'":
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    $extractedData[] = [$Tm, $currentText];
                    break;

                /*
                 * aw ac string "
                 * Move to the next line and show a text string, using aw as the word
                 * spacing and ac as the character spacing. This operator has the same
                 * effect as the code:
                 * aw Tw
                 * ac Tc
                 * string '
                 * Tw sets the word spacing, Tw, to wordSpace.
                 * Tc sets the character spacing, Tc, to charSpace.
                 */
                case '"':
                    $data = explode(' ', $currentText);
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    $extractedData[] = [$Tm, $data[2]]; //Verify
                    break;

                /*
                 * array TJ
                 * Show one or more text strings, allowing individual glyph positioning.
                 * Each element of the array can be a string or a number. If the element is
                 * a string, this operator shows the string. If it is a number, the
                 * operator adjusts the text position by that amount; that is, it translates
                 * the text matrix, Tm. This amount is subtracted from the current
                 * horizontal or vertical coordinate, depending on the writing mode.
                 * In the default coordinate system, a positive adjustment has the effect
                 * of moving the next glyph painted either to the left or down by the given
                 * amount.
                 */
                case 'TJ':
                    $extractedData[] = [$Tm, $currentText];
                    break;
                default:
            }
        }
        $this->dataTm = $extractedData;

        return $extractedData;
    }

    /**
     * Gets the text data located around the given coordinates (X,Y)
     *
     * If a text is near the given coordinates (X,Y) (or the Tm info),
     * the text is returned. The extractedData returned by getDataTm can be used to
     * find the coordinates of a given text, using its Tm info.
     *
     * @param float $x      The X value of the coordinate to search for. If null,
     *                      just the Y value is considered (same row)
     * @param float $y      The Y value of the coordinate to search for. If null,
     *                      just the X value is considered (same column)
     * @param float $xError The tolerance, above or below, for an X to be considered "near"
     * @param float $yError The tolerance, above or below, for a Y to be considered "near"
     *
     * @return array An array of the texts that are near the given coordinates. If no text
     *               is "near" the X,Y coordinates, or if both X and Y are null, an empty
     *               array is returned.
     */
    public function getTextXY(float $x = null, float $y = null, float $xError = 0, float $yError = 0): array
    {
        // empty() covers both the unset and the empty-array case explicitly
        if (empty($this->dataTm)) {
            $this->getDataTm();
        }

        if (null !== $x) {
            $x = (float) $x;
        }

        if (null !== $y) {
            $y = (float) $y;
        }

        if (null === $x && null === $y) {
            return [];
        }

        $xError = (float) $xError;
        $yError = (float) $yError;

        $extractedData = [];
        foreach ($this->dataTm as $item) {
            $tm = $item[0];
            $xTm = (float) $tm[4];
            $yTm = (float) $tm[5];
            $text = $item[1];
            if (null === $y) {
                if (($xTm >= ($x - $xError)) &&
                    ($xTm <= ($x + $xError))) {
                    $extractedData[] = [$tm, $text];
                    continue;
                }
            }
            if (null === $x) {
                if (($yTm >= ($y - $yError)) &&
                    ($yTm <= ($y + $yError))) {
                    $extractedData[] = [$tm, $text];
                    continue;
                }
            }
            if (($xTm >= ($x - $xError)) &&
                ($xTm <= ($x + $xError)) &&
                ($yTm >= ($y - $yError)) &&
                ($yTm <= ($y + $yError))) {
                $extractedData[] = [$tm, $text];
                continue;
            }
        }

        return $extractedData;
    }
}