Passed: Pull Request, master (#455), created by unknown, 01:56

Page::getDataTm()   C

Complexity

Conditions 15
Paths 26

Size

Total Lines 178
Code Lines 74

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 53
CRAP Score 20.6769

Importance

Changes 2
Bugs 0 Features 0
Metric Value
cc 15
eloc 74
c 2
b 0
f 0
nc 26
nop 1
dl 0
loc 178
ccs 53
cts 75
cp 0.7067
crap 20.6769
rs 5.3006
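
The reported CRAP score appears consistent with the usual Change Risk Anti-Patterns formula, complexity² × (1 − coverage)³ + complexity. The sketch below recomputes it from the values in the table (cc = 15, cp = 0.7067). The helper name `crapScore` is hypothetical and is not part of Scrutinizer or PdfParser, and the exact rounding used by the tool is an assumption.

```php
<?php
declare(strict_types=1);

/**
 * CRAP score: complexity^2 * (1 - coverage)^3 + complexity.
 * crapScore() is an illustrative helper, not a Scrutinizer or PdfParser API.
 */
function crapScore(int $complexity, float $coverage): float
{
    return $complexity ** 2 * (1.0 - $coverage) ** 3 + $complexity;
}

// Values from the metrics table: cc = 15, cp = 0.7067 (ccs 53 / cts 75).
$score = crapScore(15, 0.7067);
printf("%.4f\n", $score); // lands close to the reported 20.6769
```

With full coverage the cubic term vanishes and the score equals the cyclomatic complexity, which is why both lowering the complexity and raising the coverage reduce the score.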

How to fix: Long Method, Complexity

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
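
As a minimal sketch of that advice, the example below turns an inline comment into the name of a small extracted function. All function names are hypothetical and only loosely modelled on the operator arrays in the listing (`'o'` for the operator, `'c'` for its operand). It is not code from the PdfParser library.

```php
<?php
declare(strict_types=1);

// Before: the inline comment marks a step that could stand on its own.
function textOperandsBefore(array $commands): array
{
    $operands = [];
    foreach ($commands as $command) {
        // is this a text-showing operator?
        if ('Tj' === $command['o'] || 'TJ' === $command['o']) {
            $operands[] = $command['c'];
        }
    }

    return $operands;
}

// After: the comment became the name of a small extracted function.
function isTextShowingOperator(array $command): bool
{
    return 'Tj' === $command['o'] || 'TJ' === $command['o'];
}

function textOperandsAfter(array $commands): array
{
    $operands = [];
    foreach ($commands as $command) {
        if (isTextShowingOperator($command)) {
            $operands[] = $command['c'];
        }
    }

    return $operands;
}
```

Both versions return the same result. The second one reads without the comment and gives the condition a name that can be reused and tested separately.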

Commonly applied refactorings include:

1
<?php
2
3
/**
4
 * @file
5
 *          This file is part of the PdfParser library.
6
 *
7
 * @author  Sébastien MALOT <[email protected]>
8
 * @date    2017-01-03
9
 *
10
 * @license LGPLv3
11
 * @url     <https://github.com/smalot/pdfparser>
12
 *
13
 *  PdfParser is a pdf library written in PHP, extraction oriented.
14
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
15
 *
16
 *  This program is free software: you can redistribute it and/or modify
17
 *  it under the terms of the GNU Lesser General Public License as published by
18
 *  the Free Software Foundation, either version 3 of the License, or
19
 *  (at your option) any later version.
20
 *
21
 *  This program is distributed in the hope that it will be useful,
22
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
23
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
24
 *  GNU Lesser General Public License for more details.
25
 *
26
 *  You should have received a copy of the GNU Lesser General Public License
27
 *  along with this program.
28
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
29
 */
30
31
namespace Smalot\PdfParser;
32
33
use Smalot\PdfParser\Element\ElementArray;
34
use Smalot\PdfParser\Element\ElementMissing;
35
use Smalot\PdfParser\Element\ElementNull;
36
use Smalot\PdfParser\Element\ElementXRef;
37
38
class Page extends PDFObject
39
{
40
    /**
41
     * @var Font[]
42
     */
43
    protected $fonts = null;
44
45
    /**
46
     * @var PDFObject[]
47
     */
48
    protected $xobjects = null;
49
50
    /**
51
     * @var array
52
     */
53
    protected $dataTm = null;
54
55
    /**
56
     * @return Font[]
57
     */
58 24
    public function getFonts()
59
    {
60 24
        if (null !== $this->fonts) {
61 20
            return $this->fonts;
62
        }
63
64 24
        $resources = $this->get('Resources');
65
66 24
        if (method_exists($resources, 'has') && $resources->has('Font')) {
67 21
            if ($resources->get('Font') instanceof ElementMissing) {
Bug: The method get() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

67
            if ($resources->/** @scrutinizer ignore-call */ get('Font') instanceof ElementMissing) {

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.
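
Besides the ignore-call annotation, one possible way to address this kind of warning is to narrow the type with `instanceof` rather than probing with `method_exists()`, so that the analyzer can see the method on the narrowed type. The sketch below uses hypothetical stand-in classes (`BaseElement`, `DictionaryElement`) and a hypothetical helper (`fontEntry`). Which PdfParser class would play the narrowed role is an assumption that the listing does not settle.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-ins: a base type and a sub-type that adds has()/get().
class BaseElement
{
}

class DictionaryElement extends BaseElement
{
    private array $entries;

    public function __construct(array $entries)
    {
        $this->entries = $entries;
    }

    public function has(string $name): bool
    {
        return array_key_exists($name, $this->entries);
    }

    public function get(string $name)
    {
        return $this->entries[$name];
    }
}

function fontEntry(BaseElement $resources)
{
    // instanceof narrows the type, so has() and get() are known to exist here.
    if ($resources instanceof DictionaryElement && $resources->has('Font')) {
        return $resources->get('Font');
    }

    return null;
}
```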
68 1
                return [];
69
            }
70
71 20
            if ($resources->get('Font') instanceof Header) {
72 14
                $fonts = $resources->get('Font')->getElements();
73
            } else {
74 9
                $fonts = $resources->get('Font')->getHeader()->getElements();
75
            }
76
77 20
            $table = [];
78
79 20
            foreach ($fonts as $id => $font) {
80 20
                if ($font instanceof Font) {
81 20
                    $table[$id] = $font;
82
83
                    // Also store under the cleaned id (digits, '.', '-' and '_' only)
84 20
                    $id = preg_replace('/[^0-9\.\-_]/', '', $id);
85 20
                    if ('' != $id) {
86 20
                        $table[$id] = $font;
87
                    }
88
                }
89
            }
90
91 20
            return $this->fonts = $table;
92
        }
93
94 5
        return [];
95
    }
96
97 22
    public function getFont(string $id): ?Font
98
    {
99 22
        $fonts = $this->getFonts();
100
101 22
        if (isset($fonts[$id])) {
102 19
            return $fonts[$id];
103
        }
104
105
        // According to the PDF specs (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, page 238)
106
        // "The font resource name presented to the Tf operator is arbitrary, as are the names for all kinds of resources"
107
        // Instead, we search for the unfiltered name first and then do this cleaning as a fallback, so all tests still pass.
108
109 4
        if (isset($fonts[$id])) {
110
            return $fonts[$id];
111
        } else {
112 4
            $id = preg_replace('/[^0-9\.\-_]/', '', $id);
113 4
            if (isset($fonts[$id])) {
114 1
                return $fonts[$id];
115
            }
116
        }
117
118 3
        return null;
119
    }
120
121
    /**
122
     * Support for XObject
123
     *
124
     * @return PDFObject[]
125
     */
126 5
    public function getXObjects()
127
    {
128 5
        if (null !== $this->xobjects) {
129 4
            return $this->xobjects;
130
        }
131
132 5
        $resources = $this->get('Resources');
133
134 5
        if (method_exists($resources, 'has') && $resources->has('XObject')) {
135 5
            if ($resources->get('XObject') instanceof Header) {
136 5
                $xobjects = $resources->get('XObject')->getElements();
137
            } else {
138
                $xobjects = $resources->get('XObject')->getHeader()->getElements();
139
            }
140
141 5
            $table = [];
142
143 5
            foreach ($xobjects as $id => $xobject) {
144 5
                $table[$id] = $xobject;
145
146
                // Also store under the cleaned id (digits, '.', '-' and '_' only)
147 5
                $id = preg_replace('/[^0-9\.\-_]/', '', $id);
148 5
                if ('' != $id) {
149 5
                    $table[$id] = $xobject;
150
                }
151
            }
152
153 5
            return $this->xobjects = $table;
154
        }
155
156
        return [];
157
    }
158
159 4
    public function getXObject(string $id): ?PDFObject
160
    {
161 4
        $xobjects = $this->getXObjects();
162
163 4
        if (isset($xobjects[$id])) {
164 4
            return $xobjects[$id];
165
        }
166
167
        return null;
168
        /*$id = preg_replace('/[^0-9\.\-_]/', '', $id);
169
170
        if (isset($xobjects[$id])) {
171
            return $xobjects[$id];
172
        } else {
173
            return null;
174
        }*/
175
    }
176
177 13
    public function getText(self $page = null): string
178
    {
179 13
        if ($contents = $this->get('Contents')) {
180 13
            if ($contents instanceof ElementMissing) {
181
                return '';
182 13
            } elseif ($contents instanceof ElementNull) {
183
                return '';
184 13
            } elseif ($contents instanceof PDFObject) {
$contents is never a sub-type of Smalot\PdfParser\PDFObject.
185 10
                $elements = $contents->getHeader()->getElements();
186
187 10
                if (is_numeric(key($elements))) {
188
                    $new_content = '';
189
190
                    foreach ($elements as $element) {
191
                        if ($element instanceof ElementXRef) {
192
                            $new_content .= $element->getObject()->getContent();
193
                        } else {
194
                            $new_content .= $element->getContent();
195
                        }
196
                    }
197
198
                    $header = new Header([], $this->document);
199 10
                    $contents = new PDFObject($this->document, $header, $new_content, $this->config);
200
                }
201 3
            } elseif ($contents instanceof ElementArray) {
202
                // Create a virtual global content.
203 3
                $new_content = '';
204
205 3
                foreach ($contents->getContent() as $content) {
206 3
                    $new_content .= $content->getContent()."\n";
207
                }
208
209 3
                $header = new Header([], $this->document);
210 3
                $contents = new PDFObject($this->document, $header, $new_content, $this->config);
211
            }
212
213 13
            return $contents->getText($this);
Bug: The method getText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

213
            return $contents->/** @scrutinizer ignore-call */ getText($this);

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.
214
        }
215
216
        return '';
217
    }
218
219
    /**
220
     * Return True if the current page is a (setasign\Fpdi\Fpdi) FPDI/FPDF document
221
     * 
222
     * @return bool true if the current page is a FPDI/FPDF document
223
     */
224 10
    public function isFpdf(): bool
225
    {
226 10
        if (array_key_exists("Producer", $this->document->getDetails(true)) and 
Unused Code: The call to Smalot\PdfParser\Document::getDetails() has too many arguments starting with true. (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

226
        if (array_key_exists("Producer", $this->document->/** @scrutinizer ignore-call */ getDetails(true)) and 

This check compares calls to functions or methods with their respective definitions. If the call has more arguments than are defined, it raises an issue.

If a function is defined several times with a different number of parameters, the check may pick up the wrong definition and report false positives. One codebase where this has been known to happen is WordPress. Please note the @ignore annotation hint above.
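
The reason such a call goes unnoticed at runtime is that PHP does not raise an error when a user-defined function receives more arguments than it declares. The surplus values are simply not bound to any parameter. The sketch below illustrates this with a hypothetical function `details()` and does not reproduce the real signature of `Document::getDetails()`.

```php
<?php
declare(strict_types=1);

// Hypothetical function declared without parameters.
function details(): array
{
    return [
        'Producer' => 'FPDF 1.8',
        // func_num_args() reports how many arguments the caller actually passed.
        'passedArgs' => func_num_args(),
    ];
}

// No error is raised: the surplus argument is accepted and ignored.
$result = details(true);
```

Because the extra argument has no effect unless the function inspects it explicitly, a static check is often the only place where the mismatch becomes visible.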
227 10
            is_string($this->document->getDetails(true)["Producer"]) and 
228 10
            str_starts_with($this->document->getDetails(true)["Producer"], "FPDF")) {
229 2
                return true;
230
            }
231 9
        return false;
232
    }
233
234
    /**
235
     * Return the page number of the PDF document of the page object
236
     * 
237
     * @return int the page number
238
    */
239 2
    public function getPageNumber(): int 
240
    {
241 2
        $pages = $this->document->getPages();
242 2
        $numOfPages = count($pages);
243 2
        for ($pageNum = 0; $pageNum < $numOfPages; $pageNum++) {
244 2
            if ($pages[$pageNum] === $this) {
245 2
                break;
246
            }
247
        }
248 2
        return $pageNum;
249
    }
250
251 5
    public function getTextArray(self $page = null): array
252
    {
253 5
        if ($this->isFpdf()) {
254
            /** 
255
             * This code is for the (setasign\Fpdi\Fpdi) FPDI-FPDF documents. 
256
             * The page number is important for getting the PDF Commands and Text Matrix 
257
             */
258 1
            $pageNum = $this->getPageNumber();
259 1
            $xObjects = $this->getXObjects();
260
            /** The correct page info is in $xObject[$pageNum] */
261 1
            $xObject = $xObjects[$pageNum];
262 1
            $new_content = $xObject->getContent();
263 1
            $header = $xObject->getHeader();
264 1
            $config = $xObject->config;
265
            /** Now we create the PDFObject object with the correct info */
266 1
            $contents = new PDFObject($xObject->document, $header, $new_content, $config);
267 1
            return $contents->getTextArray($xObject);
268
        }
269 4
        if ($contents = $this->get('Contents')) {
270 4
            if ($contents instanceof ElementMissing) {
271
                return [];
272 4
            } elseif ($contents instanceof ElementNull) {
273
                return [];
274 4
            } elseif ($contents instanceof PDFObject) {
$contents is never a sub-type of Smalot\PdfParser\PDFObject.
275 4
                $elements = $contents->getHeader()->getElements();
276
277 4
                if (is_numeric(key($elements))) {
278
                    $new_content = '';
279
280
                    /** @var PDFObject $element */
281
                    foreach ($elements as $element) {
282
                        if ($element instanceof ElementXRef) {
283
                            $new_content .= $element->getObject()->getContent();
284
                        } else {
285
                            $new_content .= $element->getContent();
286
                        }
287
                    }
288
289
                    $header = new Header([], $this->document);
290
                    $contents = new PDFObject($this->document, $header, $new_content, $this->config);
291
                } else {
292
                    try {
293 4
                        $contents->getTextArray($this);
294 1
                    } catch (\Throwable $e) {
295 4
                        return $contents->getTextArray();
296
                    }
297
                }
298
            } elseif ($contents instanceof ElementArray) {
299
                // Create a virtual global content.
300
                $new_content = '';
301
302
                /** @var PDFObject $content */
303
                foreach ($contents->getContent() as $content) {
304
                    $new_content .= $content->getContent()."\n";
305
                }
306
307
                $header = new Header([], $this->document);
308
                $contents = new PDFObject($this->document, $header, $new_content, $this->config);
309
            }
310
311 3
            return $contents->getTextArray($this);
Bug: The method getTextArray() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

311
            return $contents->/** @scrutinizer ignore-call */ getTextArray($this);

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.
312
        }
313
314
        return [];
315
    }
316
317
    /**
318
     * Gets all the text data with its internal representation of the page.
319
     *
320
     * @return array An array with the data and the internal representation
321
     */
322 9
    public function extractRawData(): array
323
    {
324
        /*
325
         * Now you can get the complete content of the object with the text on it
326
         */
327 9
        $extractedData = [];
328 9
        $content = $this->get('Contents');
329 9
        $values = $content->getContent();
330 9
        if (isset($values) && \is_array($values)) {
331
            $text = '';
332
            foreach ($values as $section) {
333
                $text .= $section->getContent();
334
            }
335
            $sectionsText = $this->getSectionsText($text);
336
            foreach ($sectionsText as $sectionText) {
337
                $commandsText = $this->getCommandsText($sectionText);
338
                foreach ($commandsText as $command) {
339
                    $extractedData[] = $command;
340
                }
341
            }
342
        } else {
343 9
            if ($this->isFpdf()) {
344
                /** 
345
                 * This code is for the (setasign\Fpdi\Fpdi) FPDI-FPDF documents. 
346
                 * The page number is important for getting the PDF Commands and Text Matrix 
347
                 */
348 1
                    $pageNum = $this->getPageNumber();
349 1
                    $xObjects = $this->getXObjects();
350
                    /** The correct page info is in $xObject[$pageNum] */
351 1
                    $content = $xObjects[$pageNum];
352
            }
353 9
            $sectionsText = $content->getSectionsText($content->getContent());
Bug: The method getSectionsText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

353
            /** @scrutinizer ignore-call */
354
            $sectionsText = $content->getSectionsText($content->getContent());

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.
354 9
            foreach ($sectionsText as $sectionText) {
355 9
                $extractedData[] = ['t' => '', 'o' => 'BT', 'c' => ''];
356
357 9
                $commandsText = $content->getCommandsText($sectionText);
Bug: The method getCommandsText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

357
                /** @scrutinizer ignore-call */
358
                $commandsText = $content->getCommandsText($sectionText);

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.
358 9
                foreach ($commandsText as $command) {
359 9
                    $extractedData[] = $command;
360
                }
361
            }
362
        }
363
364 9
        return $extractedData;
365
    }
366
367
    /**
368
     * Gets all the decoded text data with its internal representation from a page.
369
     *
370
     * @param array $extractedRawData the extracted data returned by extractRawData or
371
     *                                null if extractRawData should be called
372
     *
373
     * @return array An array with the data and the internal representation
374
     */
375 8
    public function extractDecodedRawData(array $extractedRawData = null): array
376
    {
377 8
        if (!isset($extractedRawData) || !$extractedRawData) {
Bug (Best Practice): The expression $extractedRawData of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.
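
A minimal sketch of the explicit form this check suggests: `empty()` is true for both `null` and `[]`, so it can stand in for the combined `!isset(...) || !...` test. The helper name `needsExtraction` is hypothetical and only illustrates the comparison.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: should the raw data be extracted again?
function needsExtraction(?array $data): bool
{
    // empty() covers null as well as an array without elements.
    return empty($data);
}

// An empty array is loosely equal to false, but not identical to it.
$looselyEqual = ([] == false);
$identical = ([] === false);
```

Applied to the condition above, that would read `if (empty($extractedRawData))`, which keeps the behaviour for `null` and `[]` and makes the intent visible.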
378 8
            $extractedRawData = $this->extractRawData();
379
        }
380 8
        $currentFont = null; /** @var Font $currentFont */
381 8
        $clippedFont = null;
382 8
        $xObject = null;
383 8
        if ($this->isFpdf()) {
384
            /** This code is for the (setasign\Fpdi\Fpdi) FPDI-FPDF documents. 
385
             * The page number is important for getting the PDF Commands and Text Matrix 
386
             */
387 1
            $pageNum = $this->getPageNumber();
388 1
            $xObjects = $this->getXObjects();
389
            /** The correct font page info is in $xObject[$pageNum] */
390 1
            $xObject = $xObjects[$pageNum];
391
        }
392 8
        foreach ($extractedRawData as &$command) {
393 8
            if ('Tj' == $command['o'] || 'TJ' == $command['o']) {
394 8
                $data = $command['c'];
395 8
                if (!\is_array($data)) {
396 6
                    $tmpText = '';
397 6
                    if (isset($currentFont)) {
398 6
                        $tmpText = $currentFont->decodeOctal($data);
399
                        //$tmpText = $currentFont->decodeHexadecimal($tmpText, false);
400
                    }
401 6
                    $tmpText = str_replace(
402 6
                            ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
403 6
                            ['\\', '(', ')', "\n", "\r", "\t", ' '],
404
                            $tmpText
405
                    );
406 6
                    $tmpText = utf8_encode($tmpText);
407 6
                    if (isset($currentFont)) {
408 6
                        $tmpText = $currentFont->decodeContent($tmpText);
409
                    }
410 6
                    $command['c'] = $tmpText;
411 6
                    continue;
412
                }
413 8
                $numText = \count($data);
414 8
                for ($i = 0; $i < $numText; ++$i) {
415 8
                    if (0 != ($i % 2)) {
416 6
                        continue;
417
                    }
418 8
                    $tmpText = $data[$i]['c'];
419 8
                    $decodedText = isset($currentFont) ? $currentFont->decodeOctal($tmpText) : $tmpText;
420 8
                    $decodedText = str_replace(
421 8
                            ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
422 8
                            ['\\', '(', ')', "\n", "\r", "\t", ' '],
423
                            $decodedText
424
                    );
425 8
                    $decodedText = utf8_encode($decodedText);
426 8
                    if (isset($currentFont)) {
427 6
                        $decodedText = $currentFont->decodeContent($decodedText);
428
                    }
429 8
                    $command['c'][$i]['c'] = $decodedText;
430 8
                    continue;
431
                }
432 8
            } elseif ('Tf' == $command['o'] || 'TF' == $command['o']) {
433 8
                $fontId = explode(' ', $command['c'])[0];
434
                /** If document is a FPDI/FPDF the $xObject has the correct font */
435 8
                $currentFont = isset($xObject) ? $xObject->getFont($fontId) : $this->getFont($fontId);
Bug: The method getFont() does not exist on Smalot\PdfParser\PDFObject. It seems like you code against a sub-type of Smalot\PdfParser\PDFObject such as Smalot\PdfParser\Page. (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

435
                $currentFont = isset($xObject) ? $xObject->/** @scrutinizer ignore-call */ getFont($fontId) : $this->getFont($fontId);
436 8
                continue;
437 8
            } elseif ('Q' == $command['o']) {
438
                $currentFont = $clippedFont;
439 8
            } elseif ('q' == $command['o']) {
440
                $clippedFont = $currentFont;
441
            }
442
        }
443
444 8
        return $extractedRawData;
445
    }
446
447
    /**
448
     * Gets just the Text commands that are involved in text positions and
449
     * Text Matrix (Tm)
450
     *
451
     * It extracts only the PDF commands that are involved with text positions and
452
     * the Text Matrix (Tm). These are: BT, ET, TL, Td, TD, Tm, T*, Tj, ', ", and TJ
453
     *
454
     * @param array $extractedDecodedRawData The data extracted by extractDecodedRawData.
455
     *                                       If it is null, the method extractDecodedRawData is called.
456
     *
457
     * @return array An array with the text commands of the page
458
     */
459 6
    public function getDataCommands(array $extractedDecodedRawData = null): array
460
    {
461 6
        if (!isset($extractedDecodedRawData) || !$extractedDecodedRawData) {
Bug (Best Practice): The expression $extractedDecodedRawData of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.
462 6
            $extractedDecodedRawData = $this->extractDecodedRawData();
463
        }
464 6
        $extractedData = [];
465 6
        foreach ($extractedDecodedRawData as $command) {
466 6
            switch ($command['o']) {
467
                /*
468
                 * BT
469
                 * Begin a text object, initializing the Tm and Tlm to the identity matrix
470
                 */
471 6
                case 'BT':
472 6
                    $extractedData[] = $command;
473 6
                    break;
474
475
                /*
476
                 * ET
477
                 * End a text object, discarding the text matrix
478
                 */
479 6
                case 'ET':
480
                    $extractedData[] = $command;
481
                    break;
482
483
                /*
484
                 * leading TL
485
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
486
                 * Initial value: 0
487
                 */
488 6
                case 'TL':
489 4
                    $extractedData[] = $command;
490 4
                    break;
491
492
                /*
493
                 * tx ty Td
494
                 * Move to the start of the next line, offset from the start of the
495
                 * current line by tx, ty.
496
                 */
497 6
                case 'Td':
498 6
                    $extractedData[] = $command;
499 6
                    break;
500
501
                /*
502
                 * tx ty TD
503
                 * Move to the start of the next line, offset from the start of the
504
                 * current line by tx, ty. As a side effect, this operator sets the leading
505
                 * parameter in the text state. This operator has the same effect as the
506
                 * code:
507
                 * -ty TL
508
                 * tx ty Td
509
                 */
510 6
                case 'TD':
511
                    $extractedData[] = $command;
512
                    break;
513
514
                /*
515
                 * a b c d e f Tm
516
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
517
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
518
                 * [1 0 0 1 0 0]
519
                 */
520 6
                case 'Tm':
521 4
                    $extractedData[] = $command;
522 4
                    break;
523
524
                /*
525
                 * T*
526
                 * Move to the start of the next line. This operator has the same effect
527
                 * as the code:
528
                 * 0 -Tl Td
529
                 * Where Tl is the current leading parameter in the text state.
530
                 */
531 6
                case 'T*':
532 4
                    $extractedData[] = $command;
533 4
                    break;
534
535
                /*
536
                 * string Tj
537
                 * Show a Text String
538
                 */
539 6
                case 'Tj':
540 5
                    $extractedData[] = $command;
541 5
                    break;
542
543
                /*
544
                 * string '
545
                 * Move to the next line and show a text string. This operator has the
546
                 * same effect as the code:
547
                 * T*
548
                 * string Tj
549
                 */
550 6
                case "'":
551
                    $extractedData[] = $command;
552
                    break;
553
554
                /*
555
                 * aw ac string "
556
                 * Move to the next line and show a text string, using aw as the word
557
                 * spacing and ac as the character spacing. This operator has the same
558
                 * effect as the code:
559
                 * aw Tw
560
                 * ac Tc
561
                 * string '
562
                 * Tw sets the word spacing, Tw, to wordSpace.
563
                 * Tc sets the character spacing, Tc, to charsSpace.
564
                 */
565 6
                case '"':
566
                    $extractedData[] = $command;
567
                    break;
568
569
                /*
570
                 * array TJ
571
                 * Show one or more text strings, allowing individual glyph positioning.
572
                 * Each element of array can be a string or a number. If the element is
573
                 * a string, this operator shows the string. If it is a number, the
574
                 * operator adjusts the text position by that amount; that is, it translates
575
                 * the text matrix, Tm. This amount is subtracted from the current
576
                 * horizontal or vertical coordinate, depending on the writing mode.
577
                 * In the default coordinate system, a positive adjustment has the effect
578
                 * of moving the next glyph painted either to the left or down by the given
579
                 * amount.
580
                 */
581 6
                case 'TJ':
582 6
                    $extractedData[] = $command;
583 6
                    break;
584
                default:
585
            }
586
        }
587
588 6
        return $extractedData;
589
    }
590
591
    /**
592
     * Gets the Text Matrix of the text in the page
593
     *
594
     * Return an array where every item is an array where the first item is the
595
     * Text Matrix (Tm) and the second is a string with the text data.  The Text matrix
596
     * is an array of 6 numbers. The last 2 numbers are the coordinates X and Y of the
597
     * text. The first 4 numbers describe the scaling, rotation and skew of the text.
598
     *
599
     * @param array $dataCommands the data extracted by getDataCommands
600
     *                            if null getDataCommands is called
601
     *
602
     * @return array an array with the data of the page including the Tm information
603
     *               of any text in the page
604
     */
605 5
    public function getDataTm(array $dataCommands = null): array
606
    {
607 5
        if (!isset($dataCommands) || !$dataCommands) {
Bug (Best Practice): The expression $dataCommands of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.
608 5
            $dataCommands = $this->getDataCommands();
609
        }
610
611
        /*
         * At the beginning of a text object Tm is the identity matrix
         */
        $defaultTm = ['1', '0', '0', '1', '0', '0'];

        /*
         * Default text leading used by the T*, ' and " operators
         */
        $defaultTl = 0;

        /*
         * Indices of the X and Y coordinates in the matrix (Tm)
         */
        $x = 4;
        $y = 5;
        $Tx = 0;
        $Ty = 0;

        $Tm = $defaultTm;
        $Tl = $defaultTl;

        $extractedTexts = $this->getTextArray();
        $extractedData = [];
        foreach ($dataCommands as $command) {
            $currentText = $extractedTexts[\count($extractedData)];
            switch ($command['o']) {
                /*
                 * BT
                 * Begin a text object, initializing Tm and Tlm to the identity matrix
                 */
                case 'BT':
                    $Tm = $defaultTm;
                    $Tl = $defaultTl; //review this.
                    $Tx = 0;
                    $Ty = 0;
                    break;

                /*
                 * ET
                 * End a text object, discarding the text matrix
                 */
                case 'ET':
                    $Tm = $defaultTm;
                    $Tl = $defaultTl;  //review this
                    $Tx = 0;
                    $Ty = 0;
                    break;

                /*
                 * leading TL
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
                 * Initial value: 0
                 */
                case 'TL':
                    $Tl = (float) $command['c'];
                    break;

                /*
                 * tx ty Td
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty.
                 */
                case 'Td':
                    $coord = explode(' ', $command['c']);
                    $Tx += (float) $coord[0];
                    $Ty += (float) $coord[1];
                    $Tm[$x] = (string) $Tx;
                    $Tm[$y] = (string) $Ty;
                    break;

                /*
                 * tx ty TD
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty. As a side effect, this operator sets the leading
                 * parameter in the text state. This operator has the same effect as the
                 * code:
                 * -ty TL
                 * tx ty Td
                 */
                case 'TD':
                    $coord = explode(' ', $command['c']);
                    // As described above: the leading becomes -ty and the position moves by (tx, ty).
                    $Tl = -((float) $coord[1]);
                    $Tx += (float) $coord[0];
                    $Ty += (float) $coord[1];
                    $Tm[$x] = (string) $Tx;
                    $Tm[$y] = (string) $Ty;
                    break;

                /*
                 * a b c d e f Tm
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
                 * [1 0 0 1 0 0]
                 */
                case 'Tm':
                    $Tm = explode(' ', $command['c']);
                    $Tx = (float) $Tm[$x];
                    $Ty = (float) $Tm[$y];
                    break;

                /*
                 * T*
                 * Move to the start of the next line. This operator has the same effect
                 * as the code:
                 * 0 -Tl Td
                 * where Tl is the current leading parameter in the text state.
                 */
                case 'T*':
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    break;

                /*
                 * string Tj
                 * Show a text string
                 */
                case 'Tj':
                    $extractedData[] = [$Tm, $currentText];
                    break;

                /*
                 * string '
                 * Move to the next line and show a text string. This operator has the
                 * same effect as the code:
                 * T*
                 * string Tj
                 */
                case "'":
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    $extractedData[] = [$Tm, $currentText];
                    break;

                /*
                 * aw ac string "
                 * Move to the next line and show a text string, using aw as the word
                 * spacing and ac as the character spacing. This operator has the same
                 * effect as the code:
                 * aw Tw
                 * ac Tc
                 * string '
                 * Tw sets the word spacing, Tw, to wordSpace.
                 * Tc sets the character spacing, Tc, to charSpace.
                 */
                case '"':
                    $data = explode(' ', $currentText);
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    $extractedData[] = [$Tm, $data[2]]; //Verify
                    break;

                /*
                 * array TJ
                 * Show one or more text strings, allowing individual glyph positioning.
                 * Each element of the array can be a string or a number. If the element is
                 * a string, this operator shows the string. If it is a number, the
                 * operator adjusts the text position by that amount; that is, it translates
                 * the text matrix, Tm. This amount is subtracted from the current
                 * horizontal or vertical coordinate, depending on the writing mode.
                 * In the default coordinate system, a positive adjustment has the effect
                 * of moving the next glyph painted either to the left or down by the given
                 * amount.
                 */
                case 'TJ':
                    $extractedData[] = [$Tm, $currentText];
                    break;
                default:
            }
        }
        $this->dataTm = $extractedData;

        return $extractedData;
    }

    /**
     * Gets the text data located around the given coordinates (X,Y).
     *
     * If the text is near the given coordinates (X,Y) (according to its Tm info),
     * the text is returned. The extracted data returned by getDataTm() can be used
     * to find the coordinates of a given text, using its Tm info.
     *
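     * A minimal usage sketch, assuming $page is a Page instance; the coordinates
     * and tolerances are hypothetical values:
     *
     *     // Text whose Tm origin lies within 5 units of (200, 700) on both axes.
     *     $near = $page->getTextXY(200, 700, 5, 5);
     *     // Same row only: X is null, so only the Y coordinate is compared.
     *     $row = $page->getTextXY(null, 700, 0, 5);
     *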
     * @param float $x      The X value of the coordinate to search for. If null,
     *                      just the Y value is considered (same row).
     * @param float $y      The Y value of the coordinate to search for. If null,
     *                      just the X value is considered (same column).
     * @param float $xError The tolerance, below or above $x, within which an X is "near"
     * @param float $yError The tolerance, below or above $y, within which a Y is "near"
     *
     * @return array An array of the texts that are near the given coordinates. If no text
     *               is "near" the x,y coordinate, an empty array is returned. If both the x
     *               and y coordinates are null, an empty array is returned as well.
     */
    public function getTextXY(float $x = null, float $y = null, float $xError = 0, float $yError = 0): array
    {
        // empty() covers both the unset and the empty-array case explicitly.
        if (empty($this->dataTm)) {
            $this->getDataTm();
        }

        if (null !== $x) {
            $x = (float) $x;
        }

        if (null !== $y) {
            $y = (float) $y;
        }

        if (null === $x && null === $y) {
            return [];
        }

        $xError = (float) $xError;
        $yError = (float) $yError;

        $extractedData = [];
        foreach ($this->dataTm as $item) {
            $tm = $item[0];
            $xTm = (float) $tm[4];
            $yTm = (float) $tm[5];
            $text = $item[1];
            if (null === $y) {
                if (($xTm >= ($x - $xError)) &&
                    ($xTm <= ($x + $xError))) {
                    $extractedData[] = [$tm, $text];
                    continue;
                }
            }
            if (null === $x) {
                if (($yTm >= ($y - $yError)) &&
                    ($yTm <= ($y + $yError))) {
                    $extractedData[] = [$tm, $text];
                    continue;
                }
            }
            if (($xTm >= ($x - $xError)) &&
                ($xTm <= ($x + $xError)) &&
                ($yTm >= ($y - $yError)) &&
                ($yTm <= ($y + $yError))) {
                $extractedData[] = [$tm, $text];
                continue;
            }
        }

        return $extractedData;
    }
}