Passed
Pull Request — master (#457)
by
unknown
02:33
created

Page   F

Complexity

Total Complexity 116

Size/Duplication

Total Lines 750
Duplicated Lines 0 %

Test Coverage

Coverage 75%

Importance

Changes 15
Bugs 3 Features 2
Metric Value
eloc 319
c 15
b 3
f 2
dl 0
loc 750
ccs 237
cts 316
cp 0.75
rs 2
wmc 116

11 Methods

Rating   Name   Duplication   Size   Complexity  
B getFonts() 0 37 9
A getFont() 0 22 4
A getXObject() 0 9 2
B getXObjects() 0 31 7
C getDataCommands() 0 130 15
B getText() 0 43 10
B getTextArray() 0 48 11
C extractDecodedRawData() 0 59 17
B extractRawData() 0 33 8
C getDataTm() 0 178 15
D getTextXY() 0 51 18

How to fix   Complexity   

Complex Class

Complex classes like Page often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common way to find such a component is to look for fields and methods that share a prefix or suffix. In Page, for example, getFonts()/getFont() and getXObjects()/getXObject() form such groups.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

While breaking up the class, it is a good idea to analyze how other classes use Page, and based on these observations, apply Extract Interface, too.
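
As a rough illustration of Extract Class applied to this file, the sketch below pulls the name-keyed lookup that getFonts()/getFont() and getXObjects()/getXObject() have in common into a small class of its own. `ResourceTable` and `find()` are hypothetical names, not part of PdfParser, and plain strings stand in for Font and PDFObject instances.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of PdfParser: holds a name-keyed resource
 * table and resolves a name directly or through its cleaned form.
 */
final class ResourceTable
{
    /** @var array<string, mixed> */
    private $items = [];

    /**
     * @param array<string, mixed> $elements raw resources keyed by their PDF name
     */
    public function __construct(array $elements)
    {
        foreach ($elements as $id => $element) {
            $this->items[(string) $id] = $element;

            // Same cleaning rule as Page: keep digits, dots, dashes, underscores.
            $cleaned = (string) preg_replace('/[^0-9\.\-_]/', '', (string) $id);
            if ('' !== $cleaned) {
                $this->items[$cleaned] = $element;
            }
        }
    }

    /**
     * Looks up the unfiltered name first, then falls back to the cleaned name.
     *
     * @return mixed|null
     */
    public function find(string $id)
    {
        if (isset($this->items[$id])) {
            return $this->items[$id];
        }

        $cleaned = (string) preg_replace('/[^0-9\.\-_]/', '', $id);

        return $this->items[$cleaned] ?? null;
    }
}

// Sample data invented for the sketch.
$fonts = new ResourceTable(['F1' => 'Helvetica', 'TT2' => 'Times']);
```

With something like this in place, Page would only need to build one table per resource type and delegate the lookups to it, which is one possible way to lower the complexity of the two method pairs.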

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 * @date    2017-01-03
 *
 * @license LGPLv3
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser;

use Smalot\PdfParser\Element\ElementArray;
use Smalot\PdfParser\Element\ElementMissing;
use Smalot\PdfParser\Element\ElementNull;
use Smalot\PdfParser\Element\ElementXRef;

class Page extends PDFObject
{
    /**
     * @var Font[]
     */
    protected $fonts = null;

    /**
     * @var PDFObject[]
     */
    protected $xobjects = null;

    /**
     * @var array
     */
    protected $dataTm = null;

    /**
     * @return Font[]
     */
    public function getFonts()
    {
        if (null !== $this->fonts) {
            return $this->fonts;
        }

        $resources = $this->get('Resources');

        if (method_exists($resources, 'has') && $resources->has('Font')) {
            if ($resources->get('Font') instanceof ElementMissing) {

Bug: The method get() does not exist on Smalot\PdfParser\Element. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

            if ($resources->/** @scrutinizer ignore-call */ get('Font') instanceof ElementMissing) {

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.

                return [];
            }

            if ($resources->get('Font') instanceof Header) {
                $fonts = $resources->get('Font')->getElements();
            } else {
                $fonts = $resources->get('Font')->getHeader()->getElements();
            }

            $table = [];

            foreach ($fonts as $id => $font) {
                if ($font instanceof Font) {
                    $table[$id] = $font;

                    // Also store under the cleaned id (digits, dots, dashes and underscores only)
                    $id = preg_replace('/[^0-9\.\-_]/', '', $id);
                    if ('' != $id) {
                        $table[$id] = $font;
                    }
                }
            }

            return $this->fonts = $table;
        }

        return [];
    }

    public function getFont(string $id): ?Font
    {
        $fonts = $this->getFonts();

        if (isset($fonts[$id])) {
            return $fonts[$id];
        }

        // According to the PDF specs (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, page 238)
        // "The font resource name presented to the Tf operator is arbitrary, as are the names for all kinds of resources"
        // Instead, we search for the unfiltered name first and then do this cleaning as a fallback, so all tests still pass.

        if (isset($fonts[$id])) {
            return $fonts[$id];
        } else {
            $id = preg_replace('/[^0-9\.\-_]/', '', $id);
            if (isset($fonts[$id])) {
                return $fonts[$id];
            }
        }

        return null;
    }

    /**
     * Support for XObject
     *
     * @return PDFObject[]
     */
    public function getXObjects()
    {
        if (null !== $this->xobjects) {
            return $this->xobjects;
        }

        $resources = $this->get('Resources');

        if (method_exists($resources, 'has') && $resources->has('XObject')) {
            if ($resources->get('XObject') instanceof Header) {
                $xobjects = $resources->get('XObject')->getElements();
            } else {
                $xobjects = $resources->get('XObject')->getHeader()->getElements();
            }

            $table = [];

            foreach ($xobjects as $id => $xobject) {
                $table[$id] = $xobject;

                // Also store under the cleaned id (digits, dots, dashes and underscores only)
                $id = preg_replace('/[^0-9\.\-_]/', '', $id);
                if ('' != $id) {
                    $table[$id] = $xobject;
                }
            }

            return $this->xobjects = $table;
        }

        return [];
    }

    public function getXObject(string $id): ?PDFObject
    {
        $xobjects = $this->getXObjects();

        if (isset($xobjects[$id])) {
            return $xobjects[$id];
        }

        return null;
        /*$id = preg_replace('/[^0-9\.\-_]/', '', $id);

        if (isset($xobjects[$id])) {
            return $xobjects[$id];
        } else {
            return null;
        }*/
    }

    public function getText(self $page = null): string
    {
        if ($contents = $this->get('Contents')) {
            if ($contents instanceof ElementMissing) {
                return '';
            } elseif ($contents instanceof ElementNull) {
                return '';
            } elseif ($contents instanceof PDFObject) {

Issue: $contents is never a sub-type of Smalot\PdfParser\PDFObject.

                $elements = $contents->getHeader()->getElements();

                if (is_numeric(key($elements))) {
                    $new_content = '';

                    foreach ($elements as $element) {
                        if ($element instanceof ElementXRef) {
                            $new_content .= $element->getObject()->getContent();
                        } else {
                            $new_content .= $element->getContent();
                        }
                    }

                    $header = new Header([], $this->document);
                    $contents = new PDFObject($this->document, $header, $new_content, $this->config);
                }
            } elseif ($contents instanceof ElementArray) {
                // Create a virtual global content.
                $new_content = '';

                foreach ($contents->getContent() as $content) {
                    $new_content .= $content->getContent()."\n";
                }

                $header = new Header([], $this->document);
                $contents = new PDFObject($this->document, $header, $new_content, $this->config);
            }

            $contentsText = $contents->getText($this);

Bug: The method getText() does not exist on Smalot\PdfParser\Element. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

            /** @scrutinizer ignore-call */
            $contentsText = $contents->getText($this);

            PDFObject::$recursionStack = [];

            return $contentsText;
        }

        return '';
    }

    public function getTextArray(self $page = null): array
    {
        if ($contents = $this->get('Contents')) {
            if ($contents instanceof ElementMissing) {
                return [];
            } elseif ($contents instanceof ElementNull) {
                return [];
            } elseif ($contents instanceof PDFObject) {

Issue: $contents is never a sub-type of Smalot\PdfParser\PDFObject.

                $elements = $contents->getHeader()->getElements();

                if (is_numeric(key($elements))) {
                    $new_content = '';

                    /** @var PDFObject $element */
                    foreach ($elements as $element) {
                        if ($element instanceof ElementXRef) {
                            $new_content .= $element->getObject()->getContent();
                        } else {
                            $new_content .= $element->getContent();
                        }
                    }

                    $header = new Header([], $this->document);
                    $contents = new PDFObject($this->document, $header, $new_content, $this->config);
                } else {
                    try {
                        $contents->getTextArray($this);
                    } catch (\Throwable $e) {
                        return $contents->getTextArray();
                    }
                }
            } elseif ($contents instanceof ElementArray) {
                // Create a virtual global content.
                $new_content = '';

                /** @var PDFObject $content */
                foreach ($contents->getContent() as $content) {
                    $new_content .= $content->getContent()."\n";
                }

                $header = new Header([], $this->document);
                $contents = new PDFObject($this->document, $header, $new_content, $this->config);
            }

            return $contents->getTextArray($this);

Bug: The method getTextArray() does not exist on Smalot\PdfParser\Element. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

            return $contents->/** @scrutinizer ignore-call */ getTextArray($this);

        }

        return [];
    }

    /**
     * Gets all the text data with its internal representation of the page.
     *
     * Returns an array with the data and the internal representation
     */
    public function extractRawData(): array
    {
        /*
         * Now you can get the complete content of the object with the text on it
         */
        $extractedData = [];
        $content = $this->get('Contents');
        $values = $content->getContent();
        if (isset($values) && \is_array($values)) {
            $text = '';
            foreach ($values as $section) {
                $text .= $section->getContent();
            }
            $sectionsText = $this->getSectionsText($text);
            foreach ($sectionsText as $sectionText) {
                $commandsText = $this->getCommandsText($sectionText);
                foreach ($commandsText as $command) {
                    $extractedData[] = $command;
                }
            }
        } else {
            $sectionsText = $content->getSectionsText($content->getContent());

Bug: The method getSectionsText() does not exist on Smalot\PdfParser\Element. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

            /** @scrutinizer ignore-call */
            $sectionsText = $content->getSectionsText($content->getContent());

            foreach ($sectionsText as $sectionText) {
                $extractedData[] = ['t' => '', 'o' => 'BT', 'c' => ''];

                $commandsText = $content->getCommandsText($sectionText);

Bug: The method getCommandsText() does not exist on Smalot\PdfParser\Element. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

                /** @scrutinizer ignore-call */
                $commandsText = $content->getCommandsText($sectionText);

                foreach ($commandsText as $command) {
                    $extractedData[] = $command;
                }
            }
        }

        return $extractedData;
    }

    /**
     * Gets all the decoded text data with its internal representation from a page.
     *
     * @param array $extractedRawData the extracted data returned by extractRawData, or
     *                                null if extractRawData should be called
     *
     * @return array An array with the data and the internal representation
     */
    public function extractDecodedRawData(array $extractedRawData = null): array
    {
        if (!isset($extractedRawData) || !$extractedRawData) {

Bug (Best Practice): The expression $extractedRawData of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

            $extractedRawData = $this->extractRawData();
        }
        $currentFont = null; /** @var Font $currentFont */
        $clippedFont = null;
        foreach ($extractedRawData as &$command) {
            if ('Tj' == $command['o'] || 'TJ' == $command['o']) {
                $data = $command['c'];
                if (!\is_array($data)) {
                    $tmpText = '';
                    if (isset($currentFont)) {
                        $tmpText = $currentFont->decodeOctal($data);
                        //$tmpText = $currentFont->decodeHexadecimal($tmpText, false);
                    }
                    $tmpText = str_replace(
                        ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
                        ['\\', '(', ')', "\n", "\r", "\t", ' '],
                        $tmpText
                    );
                    $tmpText = utf8_encode($tmpText);
                    if (isset($currentFont)) {
                        $tmpText = $currentFont->decodeContent($tmpText);
                    }
                    $command['c'] = $tmpText;
                    continue;
                }
                $numText = \count($data);
                for ($i = 0; $i < $numText; ++$i) {
                    if (0 != ($i % 2)) {
                        continue;
                    }
                    $tmpText = $data[$i]['c'];
                    $decodedText = isset($currentFont) ? $currentFont->decodeOctal($tmpText) : $tmpText;
                    $decodedText = str_replace(
                        ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
                        ['\\', '(', ')', "\n", "\r", "\t", ' '],
                        $decodedText
                    );
                    $decodedText = utf8_encode($decodedText);
                    if (isset($currentFont)) {
                        $decodedText = $currentFont->decodeContent($decodedText);
                    }
                    $command['c'][$i]['c'] = $decodedText;
                    continue;
                }
            } elseif ('Tf' == $command['o'] || 'TF' == $command['o']) {
                $fontId = explode(' ', $command['c'])[0];
                $currentFont = $this->getFont($fontId);
                continue;
            } elseif ('Q' == $command['o']) {
                $currentFont = $clippedFont;
            } elseif ('q' == $command['o']) {
                $clippedFont = $currentFont;
            }
        }

        return $extractedRawData;
    }
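
One detail of the escape handling in extractDecodedRawData() may be worth a sketch. `str_replace()` with array arguments applies its pairs one after another, so text produced by an earlier pair can be matched again by a later one: an escaped backslash followed by `n` ends up as a newline. `strtr()` with an array makes a single pass and does not revisit replaced text. The helper name `unescapePdfLiteral` is hypothetical; this is an illustration under those assumptions, not the library's implementation.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: resolves the backslash escapes handled in
 * extractDecodedRawData() in a single pass.
 */
function unescapePdfLiteral(string $text): string
{
    // strtr() tries the longest key first at each position and never
    // rescans text it has already replaced.
    return strtr($text, [
        '\\\\' => '\\',
        '\(' => '(',
        '\)' => ')',
        '\n' => "\n",
        '\r' => "\r",
        '\t' => "\t",
        '\ ' => ' ',
    ]);
}

// Raw characters: a, backslash, backslash, n, b
// (an escaped backslash followed by a plain "n").
$raw = 'a\\\\nb';

// Sequential replacement, as in the listing above.
$sequential = str_replace(
    ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
    ['\\', '(', ')', "\n", "\r", "\t", ' '],
    $raw
);

// Single-pass replacement.
$singlePass = unescapePdfLiteral($raw);
```

Here `$sequential` contains a real newline between `a` and `b`, while `$singlePass` keeps a literal backslash followed by `n`.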

    /**
     * Gets just the Text commands that are involved in text positions and
     * Text Matrix (Tm)
     *
     * It extracts just the PDF commands that are involved with text positions, and
     * the Text Matrix (Tm). These are: BT, ET, TL, Td, TD, Tm, T*, Tj, ', ", and TJ
     *
     * @param array $extractedDecodedRawData The data extracted by extractDecodedRawData.
     *                                       If it is null, the method extractDecodedRawData is called.
     *
     * @return array An array with the text commands of the page
     */
    public function getDataCommands(array $extractedDecodedRawData = null): array
    {
        if (!isset($extractedDecodedRawData) || !$extractedDecodedRawData) {

Bug (Best Practice): The expression $extractedDecodedRawData of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

            $extractedDecodedRawData = $this->extractDecodedRawData();
        }
        $extractedData = [];
        foreach ($extractedDecodedRawData as $command) {
            switch ($command['o']) {
                /*
                 * BT
                 * Begin a text object, initializing the Tm and Tlm to the identity matrix
                 */
                case 'BT':
                    $extractedData[] = $command;
                    break;

                /*
                 * ET
                 * End a text object, discarding the text matrix
                 */
                case 'ET':
                    $extractedData[] = $command;
                    break;

                /*
                 * leading TL
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
                 * Initial value: 0
                 */
                case 'TL':
                    $extractedData[] = $command;
                    break;

                /*
                 * tx ty Td
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty.
                 */
                case 'Td':
                    $extractedData[] = $command;
                    break;

                /*
                 * tx ty TD
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty. As a side effect, this operator sets the leading
                 * parameter in the text state. This operator has the same effect as the
                 * code:
                 * -ty TL
                 * tx ty Td
                 */
                case 'TD':
                    $extractedData[] = $command;
                    break;

                /*
                 * a b c d e f Tm
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
                 * [1 0 0 1 0 0]
                 */
                case 'Tm':
                    $extractedData[] = $command;
                    break;

                /*
                 * T*
                 * Move to the start of the next line. This operator has the same effect
                 * as the code:
                 * 0 Tl Td
                 * Where Tl is the current leading parameter in the text state.
                 */
                case 'T*':
                    $extractedData[] = $command;
                    break;

                /*
                 * string Tj
                 * Show a text string
                 */
                case 'Tj':
                    $extractedData[] = $command;
                    break;

                /*
                 * string '
                 * Move to the next line and show a text string. This operator has the
                 * same effect as the code:
                 * T*
                 * string Tj
                 */
                case "'":
                    $extractedData[] = $command;
                    break;

                /*
                 * aw ac string "
                 * Move to the next line and show a text string, using aw as the word
                 * spacing and ac as the character spacing. This operator has the same
                 * effect as the code:
                 * aw Tw
                 * ac Tc
                 * string '
                 * Tw sets the word spacing, Tw, to wordSpace.
                 * Tc sets the character spacing, Tc, to charsSpace.
                 */
                case '"':
                    $extractedData[] = $command;
                    break;

                /*
                 * array TJ
                 * Show one or more text strings, allowing individual glyph positioning.
                 * Each element of the array can be a string or a number. If the element is
                 * a string, this operator shows the string. If it is a number, the
                 * operator adjusts the text position by that amount; that is, it translates
                 * the text matrix, Tm. This amount is subtracted from the current
                 * horizontal or vertical coordinate, depending on the writing mode.
                 * In the default coordinate system, a positive adjustment has the effect
                 * of moving the next glyph painted either to the left or down by the given
                 * amount.
                 */
                case 'TJ':
                    $extractedData[] = $command;
                    break;
                default:
            }
        }

        return $extractedData;
    }
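
Since every case in the switch of getDataCommands() appends the command unchanged, the same filtering could also be written as a whitelist lookup. The sketch below assumes commands shaped like the arrays built in extractRawData(); `TEXT_OPERATORS` and `filterTextCommands` are hypothetical names, and the sample commands are invented for illustration.

```php
<?php

declare(strict_types=1);

// Operators listed in the getDataCommands() docblock.
const TEXT_OPERATORS = ['BT', 'ET', 'TL', 'Td', 'TD', 'Tm', 'T*', 'Tj', "'", '"', 'TJ'];

/**
 * Hypothetical helper: keeps only the commands whose operator ('o') is a
 * text-positioning or text-showing operator, preserving their order.
 *
 * @param array<int, array<string, mixed>> $commands
 *
 * @return array<int, array<string, mixed>>
 */
function filterTextCommands(array $commands): array
{
    $kept = [];
    foreach ($commands as $command) {
        // Strict comparison avoids PHP's loose type juggling.
        if (in_array($command['o'], TEXT_OPERATORS, true)) {
            $kept[] = $command;
        }
    }

    return $kept;
}

// Invented sample input: Tf (font selection) is not in the whitelist.
$commands = [
    ['t' => '', 'o' => 'BT', 'c' => ''],
    ['t' => '', 'o' => 'Tf', 'c' => 'F1 12'],
    ['t' => '', 'o' => 'Td', 'c' => '10 20'],
    ['t' => '(', 'o' => 'Tj', 'c' => 'Hello'],
];
$filtered = filterTextCommands($commands);
```

The trade-off is that the per-operator comments of the switch would have to live elsewhere, for instance next to the constant.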
524
525
    /**
526
     * Gets the Text Matrix of the text in the page
527
     *
528
     * Return an array where every item is an array where the first item is the
529
     * Text Matrix (Tm) and the second is a string with the text data.  The Text matrix
530
     * is an array of 6 numbers. The last 2 numbers are the coordinates X and Y of the
531
     * text. The first 4 numbers has to be with Scalation, Rotation and Skew of the text.
532
     *
533
     * @param array $dataCommands the data extracted by getDataCommands
534
     *                            if null getDataCommands is called
535
     *
536
     * @return array an array with the data of the page including the Tm information
537
     *               of any text in the page
538
     */
539 4
    public function getDataTm(array $dataCommands = null): array
540
    {
541 4
        if (!isset($dataCommands) || !$dataCommands) {
0 ignored issues
show
Bug Best Practice introduced by
The expression $dataCommands of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

Loading history...
542 4
            $dataCommands = $this->getDataCommands();
543
        }
544
545
        /*
546
         * At the beginning of a text object Tm is the identity matrix
547
         */
548 4
        $defaultTm = ['1', '0', '0', '1', '0', '0'];
549
550
        /*
551
         *  Set the text leading used by T*, ' and " operators
552
         */
553 4
        $defaultTl = 0;
554
555
        /*
556
         * Setting where are the X and Y coordinates in the matrix (Tm)
557
         */
558 4
        $x = 4;
559 4
        $y = 5;
560 4
        $Tx = 0;
561 4
        $Ty = 0;
562
563 4
        $Tm = $defaultTm;
564 4
        $Tl = $defaultTl;
565
566 4
        $extractedTexts = $this->getTextArray();
567 4
        $extractedData = [];
568 4
        foreach ($dataCommands as $command) {
569 4
            $currentText = $extractedTexts[\count($extractedData)];
570 4
            switch ($command['o']) {
571
                /*
572
                 * BT
573
                 * Begin a text object, inicializind the Tm and Tlm to identity matrix
574
                 */
575 4
                case 'BT':
576 4
                    $Tm = $defaultTm;
577 4
                    $Tl = $defaultTl; //review this.
578 4
                    $Tx = 0;
579 4
                    $Ty = 0;
580 4
                    break;
581
582
                /*
583
                 * ET
584
                 * End a text object, discarding the text matrix
585
                 */
586 4
                case 'ET':
587
                    $Tm = $defaultTm;
588
                    $Tl = $defaultTl;  //review this
589
                    $Tx = 0;
590
                    $Ty = 0;
591
                    break;
592
593
                /*
594
                 * leading TL
595
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
596
                 * Initial value: 0
597
                 */
598 4
                case 'TL':
599 2
                    $Tl = (float) $command['c'];
600 2
                    break;
601
602
                /*
603
                 * tx ty Td
604
                 * Move to the start of the next line, offset form the start of the
605
                 * current line by tx, ty.
606
                 */
607 4
                case 'Td':
608 4
                    $coord = explode(' ', $command['c']);
609 4
                    $Tx += (float) $coord[0];
610 4
                    $Ty += (float) $coord[1];
611 4
                    $Tm[$x] = (string) $Tx;
612 4
                    $Tm[$y] = (string) $Ty;
613 4
                    break;
614
615
                /*
616
                 * tx ty TD
617
                 * Move to the start of the next line, offset form the start of the
618
                 * current line by tx, ty. As a side effect, this operator set the leading
619
                 * parameter in the text state. This operator has the same effect as the
620
                 * code:
621
                 * -ty TL
622
                 * tx ty Td
623
                 */
624 4
                case 'TD':
625
                    $coord = explode(' ', $command['c']);
626
                    $Tl = (float) $coord[1];
627
                    $Tx += (float) $coord[0];
628
                    $Ty -= (float) $coord[1];
629
                    $Tm[$x] = (string) $Tx;
630
                    $Tm[$y] = (string) $Ty;
631
                    break;
632
633
                /*
634
                 * a b c d e f Tm
635
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
636
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
637
                 * [1 0 0 1 0 0]
638
                 */
639 4
                case 'Tm':
640 2
                    $Tm = explode(' ', $command['c']);
641 2
                    $Tx = (float) $Tm[$x];
642 2
                    $Ty = (float) $Tm[$y];
643 2
                    break;
644
                /*
                 * T*
                 * Move to the start of the next line. This operator has the same effect
                 * as the code:
                 * 0 -Tl Td
                 * Where Tl is the current leading parameter in the text state.
                 */
                case 'T*':
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    break;

                /*
                 * string Tj
                 * Show a text string.
                 */
                case 'Tj':
                    $extractedData[] = [$Tm, $currentText];
                    break;

                /*
                 * string '
                 * Move to the next line and show a text string. This operator has the
                 * same effect as the code:
                 * T*
                 * string Tj
                 */
                case "'":
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    $extractedData[] = [$Tm, $currentText];
                    break;

                /*
                 * aw ac string "
                 * Move to the next line and show a text string, using aw as the word
                 * spacing and ac as the character spacing. This operator has the same
                 * effect as the code:
                 * aw Tw
                 * ac Tc
                 * string '
                 * Tw sets the word spacing, Tw, to wordSpace.
                 * Tc sets the character spacing, Tc, to charSpace.
                 */
                case '"':
                    $data = explode(' ', $currentText);
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    $extractedData[] = [$Tm, $data[2]]; // Verify
                    break;

                /*
                 * array TJ
                 * Show one or more text strings, allowing individual glyph positioning.
                 * Each element of the array can be a string or a number. If the element is
                 * a string, this operator shows the string. If it is a number, the
                 * operator adjusts the text position by that amount; that is, it translates
                 * the text matrix, Tm. This amount is subtracted from the current
                 * horizontal or vertical coordinate, depending on the writing mode.
                 * In the default coordinate system, a positive adjustment has the effect
                 * of moving the next glyph painted either to the left or down by the given
                 * amount.
                 */
                case 'TJ':
                    $extractedData[] = [$Tm, $currentText];
                    break;
                default:
            }
        }
        $this->dataTm = $extractedData;

        return $extractedData;
    }

    /**
     * Gets the text data located around the given coordinates (X,Y).
     *
     * If the text is near the given coordinates (X,Y) (per its Tm info),
     * the text is returned. The extractedData returned by getDataTm can be used to see
     * where the coordinates of a given text are, using its Tm info.
     *
     * @param float|null $x      The X value of the coordinate to search for. If null,
     *                           just the Y value is considered (same row)
     * @param float|null $y      The Y value of the coordinate to search for. If null,
     *                           just the X value is considered (same column)
     * @param float      $xError The tolerance, above or below, within which an X is "near"
     * @param float      $yError The tolerance, above or below, within which a Y is "near"
     *
     * @return array An array of the text items that are near the given coordinates. If no
     *               text is "near" the x,y coordinate, an empty array is returned. If both
     *               the x and y coordinates are null, an empty array is also returned.
     */
    public function getTextXY(float $x = null, float $y = null, float $xError = 0, float $yError = 0): array
    {
        // empty() makes the "unset or empty array" check explicit.
        if (empty($this->dataTm)) {
            $this->getDataTm();
        }

        if (null !== $x) {
            $x = (float) $x;
        }

        if (null !== $y) {
            $y = (float) $y;
        }

        if (null === $x && null === $y) {
            return [];
        }

        $xError = (float) $xError;
        $yError = (float) $yError;

        $extractedData = [];
        foreach ($this->dataTm as $item) {
            $tm = $item[0];
            $xTm = (float) $tm[4];
            $yTm = (float) $tm[5];
            $text = $item[1];
            if (null === $y) {
                if (($xTm >= ($x - $xError)) &&
                    ($xTm <= ($x + $xError))) {
                    $extractedData[] = [$tm, $text];
                    continue;
                }
            }
            if (null === $x) {
                if (($yTm >= ($y - $yError)) &&
                    ($yTm <= ($y + $yError))) {
                    $extractedData[] = [$tm, $text];
                    continue;
                }
            }
            if (($xTm >= ($x - $xError)) &&
                ($xTm <= ($x + $xError)) &&
                ($yTm >= ($y - $yError)) &&
                ($yTm <= ($y + $yError))) {
                $extractedData[] = [$tm, $text];
                continue;
            }
        }

        return $extractedData;
    }
}
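The matching rule in getTextXY() can be read as "each requested coordinate must lie within its tolerance of the item's Tm position; a null coordinate is ignored". The following is a minimal standalone sketch of that rule, not the library's implementation: the function name `filterNear` and the sample data are hypothetical, and the items are assumed to have the `[[a, b, c, d, e, f], text]` shape that getDataTm() returns.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of PdfParser) mirroring the rule used by
 * Page::getTextXY(): keep items whose Tm position is within tolerance.
 *
 * @param array      $dataTm Items shaped like [[a, b, c, d, e, f], text]
 * @param float|null $x      X to search for, or null to ignore X
 * @param float|null $y      Y to search for, or null to ignore Y
 */
function filterNear(array $dataTm, ?float $x, ?float $y, float $xError = 0.0, float $yError = 0.0): array
{
    // With no coordinate at all there is nothing to match against.
    if (null === $x && null === $y) {
        return [];
    }

    $matches = [];
    foreach ($dataTm as [$tm, $text]) {
        $xTm = (float) $tm[4]; // e: horizontal translation
        $yTm = (float) $tm[5]; // f: vertical translation

        // A null coordinate is treated as "always matches".
        $xOk = null === $x || ($xTm >= $x - $xError && $xTm <= $x + $xError);
        $yOk = null === $y || ($yTm >= $y - $yError && $yTm <= $y + $yError);

        if ($xOk && $yOk) {
            $matches[] = [$tm, $text];
        }
    }

    return $matches;
}

// Made-up sample data, for illustration only.
$sample = [
    [['1', '0', '0', '1', '72', '700'], 'Title'],
    [['1', '0', '0', '1', '72', '686'], 'First line'],
    [['1', '0', '0', '1', '300', '686'], 'Right column'],
];

// Same row: ignore X, accept Y within 1 unit of 686.
$sameRow = filterNear($sample, null, 686.0, 0.0, 1.0);

// Near a point: both coordinates within 5 units.
$nearPoint = filterNear($sample, 70.0, 700.0, 5.0, 5.0);
```

Folding the three branches into two boolean flags keeps one comparison per axis, which may be a useful starting point if the method's complexity rating is to be reduced.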