Passed
Pull Request — master (#457)
by
unknown
02:20
created

Page::getText()   B

Complexity

Conditions 10
Paths 7

Size

Total Lines 42
Code Lines 25

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 16
CRAP Score 14.6656

Importance

Changes 7
Bugs 1 Features 0
Metric Value
cc 10
eloc 25
c 7
b 1
f 0
nc 7
nop 1
dl 0
loc 42
ccs 16
cts 25
cp 0.64
crap 14.6656
rs 7.6666
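
The CRAP score in the table is consistent with the commonly cited formula cc^2 * (1 - coverage)^3 + cc, with coverage taken as ccs / cts. The sketch below recomputes it under that assumption; the function name `crapScore` is illustrative and not part of any tool's API.

```php
<?php

declare(strict_types=1);

/**
 * CRAP score as commonly defined: cc^2 * (1 - coverage)^3 + cc.
 * $coverage is a fraction between 0 and 1, not a percentage.
 */
function crapScore(int $complexity, float $coverage): float
{
    return $complexity ** 2 * (1.0 - $coverage) ** 3 + $complexity;
}

// Values from the metric table: cc = 10, ccs = 16 covered out of cts = 25.
$coverage = 16 / 25; // 0.64, matching "cp"
$score = crapScore(10, $coverage);

printf("%.4f\n", $score); // prints 14.6656
```

With full coverage the second term vanishes and the score falls back to the bare complexity, which is why adding tests lowers the score even when the method itself is unchanged.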

How to fix   Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include Extract Method.
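
Extract Method, the refactoring the advice above describes, can be sketched as follows. The `Chunk` class and the `render*` functions are hypothetical stand-ins, not PdfParser code, and `strtoupper()` is only a placeholder for "the rest of the long method".

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for a content element; not the PdfParser class.
final class Chunk
{
    private string $content;

    public function __construct(string $content)
    {
        $this->content = $content;
    }

    public function getContent(): string
    {
        return $this->content;
    }
}

// Before: a comment marks one step buried inside a longer function.
function renderBefore(array $chunks): string
{
    // Create a virtual global content.
    $merged = '';
    foreach ($chunks as $chunk) {
        $merged .= $chunk->getContent()."\n";
    }

    return strtoupper($merged);
}

// After: the commented step becomes a function named after the comment.
function createVirtualGlobalContent(array $chunks): string
{
    $merged = '';
    foreach ($chunks as $chunk) {
        $merged .= $chunk->getContent()."\n";
    }

    return $merged;
}

function renderAfter(array $chunks): string
{
    return strtoupper(createVirtualGlobalContent($chunks));
}
```

Both versions return the same string; the second simply gives the merging step a name that can be reused and tested on its own.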

1
<?php
2
3
/**
4
 * @file
5
 *          This file is part of the PdfParser library.
6
 *
7
 * @author  Sébastien MALOT <[email protected]>
8
 * @date    2017-01-03
9
 *
10
 * @license LGPLv3
11
 * @url     <https://github.com/smalot/pdfparser>
12
 *
13
 *  PdfParser is a pdf library written in PHP, extraction oriented.
14
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
15
 *
16
 *  This program is free software: you can redistribute it and/or modify
17
 *  it under the terms of the GNU Lesser General Public License as published by
18
 *  the Free Software Foundation, either version 3 of the License, or
19
 *  (at your option) any later version.
20
 *
21
 *  This program is distributed in the hope that it will be useful,
22
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
23
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
24
 *  GNU Lesser General Public License for more details.
25
 *
26
 *  You should have received a copy of the GNU Lesser General Public License
27
 *  along with this program.
28
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
29
 */
30
31
namespace Smalot\PdfParser;
32
33
use Smalot\PdfParser\Element\ElementArray;
34
use Smalot\PdfParser\Element\ElementMissing;
35
use Smalot\PdfParser\Element\ElementNull;
36
use Smalot\PdfParser\Element\ElementXRef;
37
38
class Page extends PDFObject
39
{
40
    /**
41
     * @var Font[]
42
     */
43
    protected $fonts = null;
44
45
    /**
46
     * @var PDFObject[]
47
     */
48
    protected $xobjects = null;
49
50
    /**
51
     * @var array
52
     */
53
    protected $dataTm = null;
54
55
    /**
56
     * @return Font[]
57
     */
58 23
    public function getFonts()
59
    {
60 23
        if (null !== $this->fonts) {
61 19
            return $this->fonts;
62
        }
63
64 23
        $resources = $this->get('Resources');
65
66 23
        if (method_exists($resources, 'has') && $resources->has('Font')) {
67 20
            if ($resources->get('Font') instanceof ElementMissing) {
Bug introduced by
The method get() does not exist on Smalot\PdfParser\Element. ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-call  annotation

67
            if ($resources->/** @scrutinizer ignore-call */ get('Font') instanceof ElementMissing) {

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.
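
One way such a warning can often be avoided, instead of annotating the call, is to narrow the type with `instanceof` before calling the method. The sketch below uses hypothetical classes (`BaseElement`, `Dictionary`) rather than the PdfParser types, and assumes the analyser may not infer anything from a `method_exists()` guard.

```php
<?php

declare(strict_types=1);

// Hypothetical minimal types; these are not the PdfParser classes.
class BaseElement
{
}

final class Dictionary extends BaseElement
{
    private array $entries;

    public function __construct(array $entries)
    {
        $this->entries = $entries;
    }

    public function has(string $name): bool
    {
        return array_key_exists($name, $this->entries);
    }

    public function get(string $name)
    {
        return $this->entries[$name] ?? null;
    }
}

function lookup(BaseElement $resources, string $name)
{
    // instanceof narrows $resources to Dictionary, so has() and get()
    // are known to exist on the narrowed type.
    if ($resources instanceof Dictionary && $resources->has($name)) {
        return $resources->get($name);
    }

    return null;
}
```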

68 1
                return [];
69
            }
70
71 19
            if ($resources->get('Font') instanceof Header) {
72 13
                $fonts = $resources->get('Font')->getElements();
73
            } else {
74 8
                $fonts = $resources->get('Font')->getHeader()->getElements();
75
            }
76
77 19
            $table = [];
78
79 19
            foreach ($fonts as $id => $font) {
80 19
                if ($font instanceof Font) {
81 19
                    $table[$id] = $font;
82
83
                    // Also store the font under its cleaned id (digits, '.', '-' and '_' only).
84 19
                    $id = preg_replace('/[^0-9\.\-_]/', '', $id);
85 19
                    if ('' != $id) {
86 19
                        $table[$id] = $font;
87
                    }
88
                }
89
            }
90
91 19
            return $this->fonts = $table;
92
        }
93
94 5
        return [];
95
    }
96
97 21
    public function getFont(string $id): ?Font
98
    {
99 21
        $fonts = $this->getFonts();
100
101 21
        if (isset($fonts[$id])) {
102 18
            return $fonts[$id];
103
        }
104
105
        // According to the PDF specs (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, page 238)
106
        // "The font resource name presented to the Tf operator is arbitrary, as are the names for all kinds of resources"
107
        // Instead, we search for the unfiltered name first and then do this cleaning as a fallback, so all tests still pass.
108
109 4
        if (isset($fonts[$id])) {
110
            return $fonts[$id];
111
        } else {
112 4
            $id = preg_replace('/[^0-9\.\-_]/', '', $id);
113 4
            if (isset($fonts[$id])) {
114 1
                return $fonts[$id];
115
            }
116
        }
117
118 3
        return null;
119
    }
120
121
    /**
122
     * Support for XObject
123
     *
124
     * @return PDFObject[]
125
     */
126 4
    public function getXObjects()
127
    {
128 4
        if (null !== $this->xobjects) {
129 3
            return $this->xobjects;
130
        }
131
132 4
        $resources = $this->get('Resources');
133
134 4
        if (method_exists($resources, 'has') && $resources->has('XObject')) {
135 4
            if ($resources->get('XObject') instanceof Header) {
136 4
                $xobjects = $resources->get('XObject')->getElements();
137
            } else {
138
                $xobjects = $resources->get('XObject')->getHeader()->getElements();
139
            }
140
141 4
            $table = [];
142
143 4
            foreach ($xobjects as $id => $xobject) {
144 4
                $table[$id] = $xobject;
145
146
                // Also store the XObject under its cleaned id (digits, '.', '-' and '_' only).
147 4
                $id = preg_replace('/[^0-9\.\-_]/', '', $id);
148 4
                if ('' != $id) {
149 4
                    $table[$id] = $xobject;
150
                }
151
            }
152
153 4
            return $this->xobjects = $table;
154
        }
155
156
        return [];
157
    }
158
159 4
    public function getXObject(string $id): ?PDFObject
160
    {
161 4
        $xobjects = $this->getXObjects();
162
163 4
        if (isset($xobjects[$id])) {
164 4
            return $xobjects[$id];
165
        }
166
167
        return null;
168
        /*$id = preg_replace('/[^0-9\.\-_]/', '', $id);
169
170
        if (isset($xobjects[$id])) {
171
            return $xobjects[$id];
172
        } else {
173
            return null;
174
        }*/
175
    }
176
177 13
    public function getText(self $page = null): string
178
    {
179 13
        PDFObject::$recursionStack = [];
180
        
181 13
        if ($contents = $this->get('Contents')) {
182 13
            if ($contents instanceof ElementMissing) {
183
                return '';
184 13
            } elseif ($contents instanceof ElementNull) {
185
                return '';
186 13
            } elseif ($contents instanceof PDFObject) {
introduced by
$contents is never a sub-type of Smalot\PdfParser\PDFObject.
187 10
                $elements = $contents->getHeader()->getElements();
188
189 10
                if (is_numeric(key($elements))) {
190
                    $new_content = '';
191
192
                    foreach ($elements as $element) {
193
                        if ($element instanceof ElementXRef) {
194
                            $new_content .= $element->getObject()->getContent();
195
                        } else {
196
                            $new_content .= $element->getContent();
197
                        }
198
                    }
199
200
                    $header = new Header([], $this->document);
201 10
                    $contents = new PDFObject($this->document, $header, $new_content, $this->config);
202
                }
203 3
            } elseif ($contents instanceof ElementArray) {
204
                // Create a virtual global content.
205 3
                $new_content = '';
206
207 3
                foreach ($contents->getContent() as $content) {
208 3
                    $new_content .= $content->getContent()."\n";
209
                }
210
211 3
                $header = new Header([], $this->document);
212 3
                $contents = new PDFObject($this->document, $header, $new_content, $this->config);
213
            }
214
215 13
            return $contents->getText($this);
Bug introduced by
The method getText() does not exist on Smalot\PdfParser\Element. ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-call  annotation

215
            return $contents->/** @scrutinizer ignore-call */ getText($this);

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.

216
        }
217
218
        return '';
219
    }
220
221 4
    public function getTextArray(self $page = null): array
222
    {
223 4
        if ($contents = $this->get('Contents')) {
224 4
            if ($contents instanceof ElementMissing) {
225
                return [];
226 4
            } elseif ($contents instanceof ElementNull) {
227
                return [];
228 4
            } elseif ($contents instanceof PDFObject) {
introduced by
$contents is never a sub-type of Smalot\PdfParser\PDFObject.
229 4
                $elements = $contents->getHeader()->getElements();
230
231 4
                if (is_numeric(key($elements))) {
232
                    $new_content = '';
233
234
                    /** @var PDFObject $element */
235
                    foreach ($elements as $element) {
236
                        if ($element instanceof ElementXRef) {
237
                            $new_content .= $element->getObject()->getContent();
238
                        } else {
239
                            $new_content .= $element->getContent();
240
                        }
241
                    }
242
243
                    $header = new Header([], $this->document);
244
                    $contents = new PDFObject($this->document, $header, $new_content, $this->config);
245
                } else {
246
                    try {
247 4
                        $contents->getTextArray($this);
248 1
                    } catch (\Throwable $e) {
249 4
                        return $contents->getTextArray();
250
                    }
251
                }
252
            } elseif ($contents instanceof ElementArray) {
253
                // Create a virtual global content.
254
                $new_content = '';
255
256
                /** @var PDFObject $content */
257
                foreach ($contents->getContent() as $content) {
258
                    $new_content .= $content->getContent()."\n";
259
                }
260
261
                $header = new Header([], $this->document);
262
                $contents = new PDFObject($this->document, $header, $new_content, $this->config);
263
            }
264
265 3
            return $contents->getTextArray($this);
Bug introduced by
The method getTextArray() does not exist on Smalot\PdfParser\Element. ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-call  annotation

265
            return $contents->/** @scrutinizer ignore-call */ getTextArray($this);

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.

266
        }
267
268
        return [];
269
    }
270
271
    /**
272
     * Gets all the text data with its internal representation of the page.
273
     *
274
     * Returns an array with the data and the internal representation
275
     */
276 8
    public function extractRawData(): array
277
    {
278
        /*
279
         * Now you can get the complete content of the object with the text on it
280
         */
281 8
        $extractedData = [];
282 8
        $content = $this->get('Contents');
283 8
        $values = $content->getContent();
284 8
        if (isset($values) && \is_array($values)) {
285
            $text = '';
286
            foreach ($values as $section) {
287
                $text .= $section->getContent();
288
            }
289
            $sectionsText = $this->getSectionsText($text);
290
            foreach ($sectionsText as $sectionText) {
291
                $commandsText = $this->getCommandsText($sectionText);
292
                foreach ($commandsText as $command) {
293
                    $extractedData[] = $command;
294
                }
295
            }
296
        } else {
297 8
            $sectionsText = $content->getSectionsText($content->getContent());
Bug introduced by
The method getSectionsText() does not exist on Smalot\PdfParser\Element. ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-call  annotation

297
            /** @scrutinizer ignore-call */ 
298
            $sectionsText = $content->getSectionsText($content->getContent());

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.

298 8
            foreach ($sectionsText as $sectionText) {
299 8
                $extractedData[] = ['t' => '', 'o' => 'BT', 'c' => ''];
300
301 8
                $commandsText = $content->getCommandsText($sectionText);
Bug introduced by
The method getCommandsText() does not exist on Smalot\PdfParser\Element. ( Ignorable by Annotation )

If this is a false-positive, you can also ignore this issue in your code via the ignore-call  annotation

301
                /** @scrutinizer ignore-call */ 
302
                $commandsText = $content->getCommandsText($sectionText);

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.

302 8
                foreach ($commandsText as $command) {
303 8
                    $extractedData[] = $command;
304
                }
305
            }
306
        }
307
308 8
        return $extractedData;
309
    }
310
311
    /**
312
     * Gets all the decoded text data with its internal representation from a page.
313
     *
314
     * @param array $extractedRawData the extracted data returned by extractRawData or
315
     *                                null if extractRawData should be called
316
     *
317
     * @return array An array with the data and the internal representation
318
     */
319 7
    public function extractDecodedRawData(array $extractedRawData = null): array
320
    {
321 7
        if (!isset($extractedRawData) || !$extractedRawData) {
Bug Best Practice introduced by
The expression $extractedRawData of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.
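
A minimal sketch of the explicit form the check suggests, for a nullable array parameter. The names `normalise` and `loadDefaults` are hypothetical and only illustrate the pattern; they are not part of the reviewed class.

```php
<?php

declare(strict_types=1);

// Hypothetical fallback, standing in for a call such as extractRawData().
function loadDefaults(): array
{
    return [['t' => '', 'o' => 'BT', 'c' => '']];
}

function normalise(?array $data = null): array
{
    // empty() is true for both null and [], so it covers the same cases as
    // "!isset($data) || !$data" while stating the intent directly.
    if (empty($data)) {
        $data = loadDefaults();
    }

    return $data;
}
```

Declaring the parameter as `?array` also makes the null default explicit in the signature.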

322 7
            $extractedRawData = $this->extractRawData();
323
        }
324 7
        $currentFont = null; /** @var Font $currentFont */
325 7
        $clippedFont = null;
326 7
        foreach ($extractedRawData as &$command) {
327 7
            if ('Tj' == $command['o'] || 'TJ' == $command['o']) {
328 7
                $data = $command['c'];
329 7
                if (!\is_array($data)) {
330 5
                    $tmpText = '';
331 5
                    if (isset($currentFont)) {
332 5
                        $tmpText = $currentFont->decodeOctal($data);
333
                        //$tmpText = $currentFont->decodeHexadecimal($tmpText, false);
334
                    }
335 5
                    $tmpText = str_replace(
336 5
                            ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
337 5
                            ['\\', '(', ')', "\n", "\r", "\t", ' '],
338
                            $tmpText
339
                    );
340 5
                    $tmpText = utf8_encode($tmpText);
341 5
                    if (isset($currentFont)) {
342 5
                        $tmpText = $currentFont->decodeContent($tmpText);
343
                    }
344 5
                    $command['c'] = $tmpText;
345 5
                    continue;
346
                }
347 7
                $numText = \count($data);
348 7
                for ($i = 0; $i < $numText; ++$i) {
349 7
                    if (0 != ($i % 2)) {
350 5
                        continue;
351
                    }
352 7
                    $tmpText = $data[$i]['c'];
353 7
                    $decodedText = isset($currentFont) ? $currentFont->decodeOctal($tmpText) : $tmpText;
354 7
                    $decodedText = str_replace(
355 7
                            ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
356 7
                            ['\\', '(', ')', "\n", "\r", "\t", ' '],
357
                            $decodedText
358
                    );
359 7
                    $decodedText = utf8_encode($decodedText);
360 7
                    if (isset($currentFont)) {
361 5
                        $decodedText = $currentFont->decodeContent($decodedText);
362
                    }
363 7
                    $command['c'][$i]['c'] = $decodedText;
364 7
                    continue;
365
                }
366 7
            } elseif ('Tf' == $command['o'] || 'TF' == $command['o']) {
367 7
                $fontId = explode(' ', $command['c'])[0];
368 7
                $currentFont = $this->getFont($fontId);
369 7
                continue;
370 7
            } elseif ('Q' == $command['o']) {
371
                $currentFont = $clippedFont;
372 7
            } elseif ('q' == $command['o']) {
373
                $clippedFont = $currentFont;
374
            }
375
        }
376
377 7
        return $extractedRawData;
378
    }
379
380
    /**
381
     * Gets just the Text commands that are involved in text positions and
382
     * Text Matrix (Tm)
383
     *
384
     * It extracts only the PDF commands that are involved with text positions, and
385
     * the Text Matrix (Tm). These are: BT, ET, TL, Td, TD, Tm, T*, Tj, ', ", and TJ
386
     *
387
     * @param array $extractedDecodedRawData The data extracted by extractDecodedRawData.
388
     *                                       If it is null, the method extractDecodedRawData is called.
389
     *
390
     * @return array An array with the text command of the page
391
     */
392 5
    public function getDataCommands(array $extractedDecodedRawData = null): array
393
    {
394 5
        if (!isset($extractedDecodedRawData) || !$extractedDecodedRawData) {
Bug Best Practice introduced by
The expression $extractedDecodedRawData of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

395 5
            $extractedDecodedRawData = $this->extractDecodedRawData();
396
        }
397 5
        $extractedData = [];
398 5
        foreach ($extractedDecodedRawData as $command) {
399 5
            switch ($command['o']) {
400
                /*
401
                 * BT
402
                 * Begin a text object, initializing Tm and Tlm to the identity matrix
403
                 */
404 5
                case 'BT':
405 5
                    $extractedData[] = $command;
406 5
                    break;
407
408
                /*
409
                 * ET
410
                 * End a text object, discarding the text matrix
411
                 */
412 5
                case 'ET':
413
                    $extractedData[] = $command;
414
                    break;
415
416
                /*
417
                 * leading TL
418
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
419
                 * Initial value: 0
420
                 */
421 5
                case 'TL':
422 3
                    $extractedData[] = $command;
423 3
                    break;
424
425
                /*
426
                 * tx ty Td
427
                 * Move to the start of the next line, offset from the start of the
428
                 * current line by tx, ty.
429
                 */
430 5
                case 'Td':
431 5
                    $extractedData[] = $command;
432 5
                    break;
433
434
                /*
435
                 * tx ty TD
436
                 * Move to the start of the next line, offset from the start of the
437
                 * current line by tx, ty. As a side effect, this operator sets the leading
438
                 * parameter in the text state. This operator has the same effect as the
439
                 * code:
440
                 * -ty TL
441
                 * tx ty Td
442
                 */
443 5
                case 'TD':
444
                    $extractedData[] = $command;
445
                    break;
446
447
                /*
448
                 * a b c d e f Tm
449
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
450
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
451
                 * [1 0 0 1 0 0]
452
                 */
453 5
                case 'Tm':
454 3
                    $extractedData[] = $command;
455 3
                    break;
456
457
                /*
458
                 * T*
459
                 * Move to the start of the next line. This operator has the same effect
460
                 * as the code:
461
                 * 0 Tl Td
462
                 * Where Tl is the current leading parameter in the text state.
463
                 */
464 5
                case 'T*':
465 3
                    $extractedData[] = $command;
466 3
                    break;
467
468
                /*
469
                 * string Tj
470
                 * Show a Text String
471
                 */
472 5
                case 'Tj':
473 4
                    $extractedData[] = $command;
474 4
                    break;
475
476
                /*
477
                 * string '
478
                 * Move to the next line and show a text string. This operator has the
479
                 * same effect as the code:
480
                 * T*
481
                 * string Tj
482
                 */
483 5
                case "'":
484
                    $extractedData[] = $command;
485
                    break;
486
487
                /*
488
                 * aw ac string "
489
                 * Move to the next line and show a text string, using aw as the word
490
                 * spacing and ac as the character spacing. This operator has the same
491
                 * effect as the code:
492
                 * aw Tw
493
                 * ac Tc
494
                 * string '
495
                 * Tw sets the word spacing, Tw, to wordSpace.
496
                 * Tc sets the character spacing, Tc, to charSpace.
497
                 */
498 5
                case '"':
499
                    $extractedData[] = $command;
500
                    break;
501
502
                /*
503
                 * array TJ
504
                 * Show one or more text strings, allowing individual glyph positioning.
504
                 * Each element of the array can be a string or a number. If the element is
505
                 * a string, this operator shows the string. If it is a number, the
506
                 * operator adjusts the text position by that amount; that is, it translates
507
                 * the text matrix, Tm. This amount is subtracted from the current
508
                 * horizontal or vertical coordinate, depending on the writing mode.
509
                 * In the default coordinate system, a positive adjustment has the effect
511
                 * of moving the next glyph painted either to the left or down by the given
512
                 * amount.
513
                 */
514 5
                case 'TJ':
515 5
                    $extractedData[] = $command;
516 5
                    break;
517
                default:
518
            }
519
        }
520
521 5
        return $extractedData;
522
    }
523
524
    /**
525
     * Gets the Text Matrix of the text in the page
526
     *
527
     * Returns an array in which every item is itself an array: the first element is the
528
     * Text Matrix (Tm) and the second is a string with the text data. The text matrix
529
     * is an array of 6 numbers. The last 2 numbers are the X and Y coordinates of the
530
     * text. The first 4 numbers describe the scaling, rotation and skew of the text.
531
     *
532
     * @param array $dataCommands the data extracted by getDataCommands
533
     *                            if null getDataCommands is called
534
     *
535
     * @return array an array with the data of the page including the Tm information
536
     *               of any text in the page
537
     */
538 4
    public function getDataTm(array $dataCommands = null): array
539
    {
540 4
        if (!isset($dataCommands) || !$dataCommands) {
Bug Best Practice introduced by
The expression $dataCommands of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

541 4
            $dataCommands = $this->getDataCommands();
542
        }
543
544
        /*
545
         * At the beginning of a text object Tm is the identity matrix
546
         */
547 4
        $defaultTm = ['1', '0', '0', '1', '0', '0'];
548
549
        /*
550
         *  Set the text leading used by T*, ' and " operators
551
         */
552 4
        $defaultTl = 0;
553
554
        /*
555
         * Setting where are the X and Y coordinates in the matrix (Tm)
556
         */
557 4
        $x = 4;
558 4
        $y = 5;
559 4
        $Tx = 0;
560 4
        $Ty = 0;
561
562 4
        $Tm = $defaultTm;
563 4
        $Tl = $defaultTl;
564
565 4
        $extractedTexts = $this->getTextArray();
566 4
        $extractedData = [];
567 4
        foreach ($dataCommands as $command) {
568 4
            $currentText = $extractedTexts[\count($extractedData)];
569 4
            switch ($command['o']) {
570
                /*
571
                 * BT
572
                 * Begin a text object, initializing Tm and Tlm to the identity matrix
573
                 */
574 4
                case 'BT':
575 4
                    $Tm = $defaultTm;
576 4
                    $Tl = $defaultTl; //review this.
577 4
                    $Tx = 0;
578 4
                    $Ty = 0;
579 4
                    break;
580
581
                /*
582
                 * ET
583
                 * End a text object, discarding the text matrix
584
                 */
585 4
                case 'ET':
586
                    $Tm = $defaultTm;
587
                    $Tl = $defaultTl;  //review this
588
                    $Tx = 0;
589
                    $Ty = 0;
590
                    break;
591
592
                /*
593
                 * leading TL
594
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
595
                 * Initial value: 0
596
                 */
597 4
                case 'TL':
598 2
                    $Tl = (float) $command['c'];
599 2
                    break;
600
601
                /*
602
                 * tx ty Td
603
                 * Move to the start of the next line, offset form the start of the
604
                 * current line by tx, ty.
605
                 */
606 4
                case 'Td':
607 4
                    $coord = explode(' ', $command['c']);
608 4
                    $Tx += (float) $coord[0];
609 4
                    $Ty += (float) $coord[1];
610 4
                    $Tm[$x] = (string) $Tx;
611 4
                    $Tm[$y] = (string) $Ty;
612 4
                    break;
613
614
                /*
615
                 * tx ty TD
616
                 * Move to the start of the next line, offset from the start of the
617
                 * current line by tx, ty. As a side effect, this operator sets the leading
618
                 * parameter in the text state. This operator has the same effect as the
619
                 * code:
620
                 * -ty TL
621
                 * tx ty Td
622
                 */
623 4
                case 'TD':
624
                    $coord = explode(' ', $command['c']);
625
                    $Tl = (float) $coord[1];
626
                    $Tx += (float) $coord[0];
627
                    $Ty -= (float) $coord[1];
628
                    $Tm[$x] = (string) $Tx;
629
                    $Tm[$y] = (string) $Ty;
630
                    break;
631
632
                /*
633
                 * a b c d e f Tm
634
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
635
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
636
                 * [1 0 0 1 0 0]
637
                 */
638 4
                case 'Tm':
639 2
                    $Tm = explode(' ', $command['c']);
640 2
                    $Tx = (float) $Tm[$x];
641 2
                    $Ty = (float) $Tm[$y];
642 2
                    break;
643
644
                /*
645
                 * T*
646
                 * Move to the start of the next line. This operator has the same effect
647
                 * as the code:
648
                 * 0 -Tl Td
649
                 * Where Tl is the current leading parameter in the text state.
650
                 */
651 4
                case 'T*':
652 2
                    $Ty -= $Tl;
653 2
                    $Tm[$y] = (string) $Ty;
654 2
                    break;
655
656
                /*
                 * string Tj
                 * Show a text string.
                 */
                case 'Tj':
                    $extractedData[] = [$Tm, $currentText];
                    break;

                /*
                 * string '
                 * Move to the next line and show a text string. This operator has the
                 * same effect as the code:
                 * T*
                 * string Tj
                 */
                case "'":
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    $extractedData[] = [$Tm, $currentText];
                    break;

                /*
                 * aw ac string "
                 * Move to the next line and show a text string, using aw as the word
                 * spacing and ac as the character spacing. This operator has the same
                 * effect as the code:
                 * aw Tw
                 * ac Tc
                 * string '
                 * Tw sets the word spacing, Tw, to wordSpace.
                 * Tc sets the character spacing, Tc, to charSpace.
                 */
                case '"':
                    $data = explode(' ', $currentText);
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    $extractedData[] = [$Tm, $data[2]]; // Verify
                    break;

                /*
                 * array TJ
                 * Show one or more text strings, allowing individual glyph positioning.
                 * Each element of the array can be a string or a number. If the element is
                 * a string, this operator shows the string. If it is a number, the
                 * operator adjusts the text position by that amount; that is, it translates
                 * the text matrix, Tm. This amount is subtracted from the current
                 * horizontal or vertical coordinate, depending on the writing mode.
                 * In the default coordinate system, a positive adjustment has the effect
                 * of moving the next glyph painted either to the left or down by the given
                 * amount.
                 */
                case 'TJ':
                    $extractedData[] = [$Tm, $currentText];
                    break;
                default:
            }
        }
        $this->dataTm = $extractedData;

        return $extractedData;
    }

    /**
     * Gets the text data located around the given coordinates (X,Y).
     *
     * If the text is near the given coordinates (X,Y) (or the Tm info),
     * the text is returned. The extractedData returned by getDataTm can be used to see
     * where a given text is located, using its Tm info.
     *
     * @param float|null $x The X value of the coordinate to search for. If null,
     *                      just the Y value is considered (same row)
     * @param float|null $y The Y value of the coordinate to search for. If null,
     *                      just the X value is considered (same column)
     * @param float $xError The tolerance, below or above $x, within which an X is considered "near"
     * @param float $yError The tolerance, below or above $y, within which a Y is considered "near"
     *
     * @return array An array of the text items that are near the given coordinates. If no text
     *               is "near" the x,y coordinate, an empty array is returned. If both the x
     *               and y coordinates are null, an empty array is returned as well.
     */
    public function getTextXY(?float $x = null, ?float $y = null, float $xError = 0, float $yError = 0): array
    {
        if (!isset($this->dataTm) || !$this->dataTm) {
Bug Best Practice introduced by
The expression $this->dataTm of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(...) or !empty(...) instead.
            $this->getDataTm();
        }

        if (null !== $x) {
            $x = (float) $x;
        }

        if (null !== $y) {
            $y = (float) $y;
        }

        if (null === $x && null === $y) {
            return [];
        }

        $xError = (float) $xError;
        $yError = (float) $yError;

        $extractedData = [];
        foreach ($this->dataTm as $item) {
            $tm = $item[0];
            $xTm = (float) $tm[4];
            $yTm = (float) $tm[5];
            $text = $item[1];
            if (null === $y) {
                if (($xTm >= ($x - $xError)) &&
                    ($xTm <= ($x + $xError))) {
                    $extractedData[] = [$tm, $text];
                    continue;
                }
            }
            if (null === $x) {
                if (($yTm >= ($y - $yError)) &&
                    ($yTm <= ($y + $yError))) {
                    $extractedData[] = [$tm, $text];
                    continue;
                }
            }
            if (($xTm >= ($x - $xError)) &&
                ($xTm <= ($x + $xError)) &&
                ($yTm >= ($y - $yError)) &&
                ($yTm <= ($y + $yError))) {
                $extractedData[] = [$tm, $text];
                continue;
            }
        }

        return $extractedData;
    }
}
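
The tolerance matching described in the getTextXY docblock can be tried in isolation. What follows is a minimal standalone sketch, not the library's code: `filterByCoordinates` is a hypothetical helper name, and the `$dataTm` sample is made-up data shaped like the `[$Tm, $text]` pairs that getDataTm collects, with the X and Y positions at indices 4 and 5 of the matrix as in the listing above.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of PdfParser): keeps the [$Tm, $text] pairs whose
 * position lies within the given tolerance of ($x, $y). A null coordinate is ignored,
 * so passing only $y selects a row and passing only $x selects a column.
 */
function filterByCoordinates(array $dataTm, ?float $x, ?float $y, float $xError = 0.0, float $yError = 0.0): array
{
    // With no coordinate at all there is nothing to match against.
    if (null === $x && null === $y) {
        return [];
    }

    $matches = [];
    foreach ($dataTm as [$tm, $text]) {
        $xTm = (float) $tm[4];
        $yTm = (float) $tm[5];

        // Each axis passes if it is unconstrained or inside [value - error, value + error].
        $xOk = null === $x || ($xTm >= $x - $xError && $xTm <= $x + $xError);
        $yOk = null === $y || ($yTm >= $y - $yError && $yTm <= $y + $yError);

        if ($xOk && $yOk) {
            $matches[] = [$tm, $text];
        }
    }

    return $matches;
}

// Made-up sample data in the shape collected by getDataTm().
$dataTm = [
    [['1', '0', '0', '1', '72', '700'], 'Title'],
    [['1', '0', '0', '1', '72', '680'], 'First line'],
    [['1', '0', '0', '1', '300', '680'], 'Right column'],
];

// Everything on the row at Y = 680, with a tolerance of 1 unit.
$sameRow = filterByCoordinates($dataTm, null, 680.0, 0.0, 1.0);

// Exact match on both coordinates.
$exact = filterByCoordinates($dataTm, 72.0, 700.0);
```

Folding the null checks into one boolean per axis keeps a single comparison path, whereas the method in the listing handles the row-only, column-only and combined cases in three separate branches.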