Passed
Push — feature/switch-to-phpunit ( f5d302...e3df65 )
by Konrad
03:58
created

Page::getTextXY()   F

Complexity

Conditions 20
Paths 392

Size

Total Lines 55
Code Lines 39

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 27
CRAP Score 29.7052

Importance

Changes 2
Bugs 0 Features 0
Metric Value
eloc 39
dl 0
loc 55
rs 0.9333
c 2
b 0
f 0
ccs 27
cts 38
cp 0.7105
cc 20
nc 392
nop 4
crap 29.7052
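
The CRAP score in the table can be reproduced from the other metrics, assuming the usual Change Risk Anti-Patterns formula, CRAP = cc^2 * (1 - cp)^3 + cc, where cc is the cyclomatic complexity and cp is the covered fraction (here 0.7105, which matches ccs / cts = 27 / 38). The function name crapScore is made up for this sketch; it is not part of PdfParser or of the analysis tool.

```php
<?php

/**
 * CRAP score of a single method (hypothetical helper, for illustration only).
 *
 * @param int   $complexity cyclomatic complexity (cc in the table)
 * @param float $coverage   covered fraction between 0 and 1 (cp in the table)
 */
function crapScore(int $complexity, float $coverage): float
{
    // Complexity squared, weighted by the cube of the uncovered fraction,
    // plus the complexity itself.
    return $complexity ** 2 * (1.0 - $coverage) ** 3 + $complexity;
}

// Values taken from the metric table for Page::getTextXY().
printf("%.4f\n", crapScore(20, 0.7105)); // prints 29.7052
```

Under this formula, full coverage would still leave a score of 20, so for this method reducing the complexity matters at least as much as adding tests.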

How to fix

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
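
As a rough illustration of that advice applied to Page::getTextXY(), the filtering loop could be pulled out into two small functions. This is only a sketch under stated assumptions: the names isWithin and filterTextNear are hypothetical, the input is assumed to be a plain array in the shape getDataTm() returns (pairs of text matrix and text), and the case where both coordinates are null is left to the caller.

```php
<?php

/**
 * Hypothetical helper: true when $value lies within $tolerance of $target.
 * A null $target means "no constraint on this axis".
 */
function isWithin(?float $target, float $value, float $tolerance): bool
{
    return null === $target
        || ($value >= $target - $tolerance && $value <= $target + $tolerance);
}

/**
 * Hypothetical extraction of the loop in getTextXY(): keeps the items whose
 * text matrix position (indexes 4 and 5) is near the requested coordinates.
 *
 * @param array $dataTm list of [textMatrix, text] pairs
 */
function filterTextNear(array $dataTm, ?float $x, ?float $y, float $xError = 0.0, float $yError = 0.0): array
{
    $matches = [];
    foreach ($dataTm as [$tm, $text]) {
        // One condition replaces the three separate branches of the original loop.
        if (isWithin($x, (float) $tm[4], $xError) && isWithin($y, (float) $tm[5], $yError)) {
            $matches[] = [$tm, $text];
        }
    }

    return $matches;
}

// Two items on the same row (Y = 20) at X = 10 and X = 50.
$dataTm = [
    [['1', '0', '0', '1', '10', '20'], 'A'],
    [['1', '0', '0', '1', '50', '20'], 'B'],
];
$sameRow = filterTextNear($dataTm, null, 20.0); // both items
$nearTen = filterTextNear($dataTm, 12.0, null, 2.0); // only 'A'
```

With helpers like these, the loop body in getTextXY() would shrink to a single call, which should remove a good part of the conditions counted above.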

Commonly applied refactorings include Extract Method.

1
<?php
2
3
/**
4
 * @file
5
 *          This file is part of the PdfParser library.
6
 *
7
 * @author  Sébastien MALOT <[email protected]>
8
 * @date    2017-01-03
9
 *
10
 * @license LGPLv3
11
 * @url     <https://github.com/smalot/pdfparser>
12
 *
13
 *  PdfParser is a pdf library written in PHP, extraction oriented.
14
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
15
 *
16
 *  This program is free software: you can redistribute it and/or modify
17
 *  it under the terms of the GNU Lesser General Public License as published by
18
 *  the Free Software Foundation, either version 3 of the License, or
19
 *  (at your option) any later version.
20
 *
21
 *  This program is distributed in the hope that it will be useful,
22
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
23
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
24
 *  GNU Lesser General Public License for more details.
25
 *
26
 *  You should have received a copy of the GNU Lesser General Public License
27
 *  along with this program.
28
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
29
 */
30
31
namespace Smalot\PdfParser;
32
33
use Smalot\PdfParser\Element\ElementArray;
34
use Smalot\PdfParser\Element\ElementMissing;
35
use Smalot\PdfParser\Element\ElementNull;
36
use Smalot\PdfParser\Element\ElementXRef;
37
38
class Page extends PDFObject
39
{
40
    /**
41
     * @var Font[]
42
     */
43
    protected $fonts = null;
44
45
    /**
46
     * @var PDFObject[]
47
     */
48
    protected $xobjects = null;
49
50
    /**
51
     * @var array
52
     */
53
    protected $dataTm = null;
54
55
    /**
56
     * @return Font[]
57
     */
58 9
    public function getFonts()
59
    {
60 9
        if (null !== $this->fonts) {
61 8
            return $this->fonts;
62
        }
63
64 9
        $resources = $this->get('Resources');
65
66 9
        if (method_exists($resources, 'has') && $resources->has('Font')) {
67 9
            if ($resources->get('Font') instanceof Header) {
Bug: The method get() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

67
            if ($resources->/** @scrutinizer ignore-call */ get('Font') instanceof Header) {

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.

68 4
                $fonts = $resources->get('Font')->getElements();
69
            } else {
70 7
                $fonts = $resources->get('Font')->getHeader()->getElements();
71
            }
72
73 9
            $table = [];
74
75 9
            foreach ($fonts as $id => $font) {
76 9
                if ($font instanceof Font) {
77 9
                    $table[$id] = $font;
78
79
                    // Store too on cleaned id value (only numeric)
80 9
                    $id = preg_replace('/[^0-9\.\-_]/', '', $id);
81 9
                    if ('' != $id) {
82 9
                        $table[$id] = $font;
83
                    }
84
                }
85
            }
86
87 9
            return $this->fonts = $table;
88
        } else {
89 1
            return [];
90
        }
91
    }
92
93
    /**
94
     * @param string $id
95
     *
96
     * @return Font
97
     */
98 8
    public function getFont($id)
99
    {
100 8
        $fonts = $this->getFonts();
101
102 8
        if (isset($fonts[$id])) {
103 8
            return $fonts[$id];
104
        } else {
105 2
            $id = preg_replace('/[^0-9\.\-_]/', '', $id);
106
107 2
            if (isset($fonts[$id])) {
108 1
                return $fonts[$id];
109
            } else {
110 1
                return null;
111
            }
112
        }
113
    }
114
115
    /**
116
     * Support for XObject
117
     *
118
     * @return PDFObject[]
119
     */
120
    public function getXObjects()
121
    {
122
        if (null !== $this->xobjects) {
123
            return $this->xobjects;
124
        }
125
126
        $resources = $this->get('Resources');
127
128
        if (method_exists($resources, 'has') && $resources->has('XObject')) {
129
            if ($resources->get('XObject') instanceof Header) {
130
                $xobjects = $resources->get('XObject')->getElements();
131
            } else {
132
                $xobjects = $resources->get('XObject')->getHeader()->getElements();
133
            }
134
135
            $table = [];
136
137
            foreach ($xobjects as $id => $xobject) {
138
                $table[$id] = $xobject;
139
140
                // Store too on cleaned id value (only numeric)
141
                $id = preg_replace('/[^0-9\.\-_]/', '', $id);
142
                if ('' != $id) {
143
                    $table[$id] = $xobject;
144
                }
145
            }
146
147
            return $this->xobjects = $table;
148
        } else {
149
            return [];
150
        }
151
    }
152
153
    /**
154
     * @param string $id
155
     *
156
     * @return PDFObject
157
     */
158
    public function getXObject($id)
159
    {
160
        $xobjects = $this->getXObjects();
161
162
        if (isset($xobjects[$id])) {
163
            return $xobjects[$id];
164
        } else {
165
            return null;
166
            /*$id = preg_replace('/[^0-9\.\-_]/', '', $id);
167
168
            if (isset($xobjects[$id])) {
169
                return $xobjects[$id];
170
            } else {
171
                return null;
172
            }*/
173
        }
174
    }
175
176
    /**
177
     * @param Page $page
178
     *
179
     * @return string
180
     */
181 3
    public function getText(self $page = null)
182
    {
183 3
        if ($contents = $this->get('Contents')) {
184 3
            if ($contents instanceof ElementMissing) {
185
                return '';
186 3
            } elseif ($contents instanceof ElementNull) {
187
                return '';
188 3
            } elseif ($contents instanceof PDFObject) {
Issue: $contents is never a sub-type of Smalot\PdfParser\PDFObject.
189 3
                $elements = $contents->getHeader()->getElements();
190
191 3
                if (is_numeric(key($elements))) {
192
                    $new_content = '';
193
194
                    foreach ($elements as $element) {
195
                        if ($element instanceof ElementXRef) {
196
                            $new_content .= $element->getObject()->getContent();
197
                        } else {
198
                            $new_content .= $element->getContent();
199
                        }
200
                    }
201
202
                    $header = new Header([], $this->document);
203 3
                    $contents = new PDFObject($this->document, $header, $new_content);
204
                }
205 1
            } elseif ($contents instanceof ElementArray) {
206
                // Create a virtual global content.
207 1
                $new_content = '';
208
209 1
                foreach ($contents->getContent() as $content) {
210 1
                    $new_content .= $content->getContent()."\n";
211
                }
212
213 1
                $header = new Header([], $this->document);
214 1
                $contents = new PDFObject($this->document, $header, $new_content);
215
            }
216
217 3
            return $contents->getText($this);
Bug: The method getText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

217
            return $contents->/** @scrutinizer ignore-call */ getText($this);

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.

218
        }
219
220
        return '';
221
    }
222
223
    /**
224
     * @param Page $page
225
     *
226
     * @return array
227
     */
228
    public function getTextArray(self $page = null)
229
    {
230
        if ($contents = $this->get('Contents')) {
231
            if ($contents instanceof ElementMissing) {
232
                return [];
233
            } elseif ($contents instanceof ElementNull) {
234
                return [];
235
            } elseif ($contents instanceof PDFObject) {
Issue: $contents is never a sub-type of Smalot\PdfParser\PDFObject.
236
                $elements = $contents->getHeader()->getElements();
237
238
                if (is_numeric(key($elements))) {
239
                    $new_content = '';
240
241
                    /** @var PDFObject $element */
242
                    foreach ($elements as $element) {
243
                        if ($element instanceof ElementXRef) {
244
                            $new_content .= $element->getObject()->getContent();
245
                        } else {
246
                            $new_content .= $element->getContent();
247
                        }
248
                    }
249
250
                    $header = new Header([], $this->document);
251
                    $contents = new PDFObject($this->document, $header, $new_content);
252
                }
253
            } elseif ($contents instanceof ElementArray) {
254
                // Create a virtual global content.
255
                $new_content = '';
256
257
                /** @var PDFObject $content */
258
                foreach ($contents->getContent() as $content) {
259
                    $new_content .= $content->getContent()."\n";
260
                }
261
262
                $header = new Header([], $this->document);
263
                $contents = new PDFObject($this->document, $header, $new_content);
264
            }
265
266
            return $contents->getTextArray($this);
Bug: The method getTextArray() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

266
            return $contents->/** @scrutinizer ignore-call */ getTextArray($this);

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.

267
        }
268
269
        return [];
270
    }
271
272
    /**
273
     * Gets all the text data with its internal representation of the page.
274
     *
275
     * @return array An array with the data and the internal representation
276
     */
277 5
    public function extractRawData()
278
    {
279
        /*
280
         * Now you can get the complete content of the object with the text on it
281
         */
282 5
        $extractedData = [];
283 5
        $content = $this->get('Contents');
284 5
        $values = $content->getContent();
285 5
        if (isset($values) and \is_array($values)) {
286
            $text = '';
287
            foreach ($values as $section) {
288
                $text .= $section->getContent();
289
            }
290
            $sectionsText = $this->getSectionsText($text);
291
            foreach ($sectionsText as $sectionText) {
292
                $commandsText = $this->getCommandsText($sectionText);
293
                foreach ($commandsText as $command) {
294
                    $extractedData[] = $command;
295
                }
296
            }
297
        } else {
298 5
            $sectionsText = $content->getSectionsText($content->getContent());
Bug: The method getSectionsText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

298
            /** @scrutinizer ignore-call */ 
299
            $sectionsText = $content->getSectionsText($content->getContent());

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.

299 5
            foreach ($sectionsText as $sectionText) {
300 5
                $commandsText = $content->getCommandsText($sectionText);
Bug: The method getCommandsText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

300
                /** @scrutinizer ignore-call */ 
301
                $commandsText = $content->getCommandsText($sectionText);

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.

301 5
                foreach ($commandsText as $command) {
302 5
                    $extractedData[] = $command;
303
                }
304
            }
305
        }
306
307 5
        return $extractedData;
308
    }
309
310
    /**
311
     * Gets all the decoded text data with its internal representation from a page.
312
     *
313
     * @param array $extractedRawData the extracted data returned by extractRawData or
314
     *                                null if extractRawData should be called
315
     *
316
     * @return array An array with the data and the internal representation
317
     */
318 4
    public function extractDecodedRawData($extractedRawData = null)
319
    {
320 4
        if (!isset($extractedRawData) or !$extractedRawData) {
321 4
            $extractedRawData = $this->extractRawData();
322
        }
323 4
        $unicode = true;
324 4
        $currentFont = null;
325 4
        foreach ($extractedRawData as &$command) {
326 4
            if ('Tj' == $command['o'] or 'TJ' == $command['o']) {
327 4
                $data = $command['c'];
328 4
                if (!\is_array($data)) {
329 4
                    if (isset($currentFont)) {
330 4
                        $tmpText = $currentFont->decodeOctal($data);
331
                        //$tmpText = $currentFont->decodeHexadecimal($tmpText, false);
332
                    }
333 4
                    $tmpText = str_replace(
334 4
                            ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
335 4
                            ['\\', '(', ')', "\n", "\r", "\t", ' '],
336 4
                            $tmpText
Comprehensibility, Best Practice: The variable $tmpText seems to be defined later in this foreach loop on line 330. Are you sure it is defined here?
337
                    );
338 4
                    $tmpText = utf8_encode($tmpText);
339 4
                    if (isset($currentFont)) {
340 4
                        $tmpText = $currentFont->decodeContent($tmpText, $unicode);
341
                    }
342 4
                    $command['c'] = $tmpText;
343 4
                    continue;
344
                }
345 4
                $numText = \count($data);
346 4
                for ($i = 0; $i < $numText; ++$i) {
347 4
                    if (0 != ($i % 2)) {
348 4
                        continue;
349
                    }
350 4
                    $tmpText = $data[$i]['c'];
351 4
                    if (isset($currentFont)) {
352 4
                        $decodedText = $currentFont->decodeOctal($tmpText);
353
                        //$tmpText = $currentFont->decodeHexadecimal($tmpText, false);
354
                    }
355 4
                    $decodedText = str_replace(
356 4
                            ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
357 4
                            ['\\', '(', ')', "\n", "\r", "\t", ' '],
358 4
                            $decodedText
Comprehensibility, Best Practice: The variable $decodedText does not seem to be defined for all execution paths leading up to this point.
359
                    );
360 4
                    $decodedText = utf8_encode($decodedText);
361 4
                    if (isset($currentFont)) {
362 4
                        $decodedText = $currentFont->decodeContent($decodedText, $unicode);
363
                    }
364 4
                    $command['c'][$i]['c'] = $decodedText;
365 4
                    continue;
366
                }
367 4
            } elseif ('Tf' == $command['o'] or 'TF' == $command['o']) {
368 4
                $fontId = explode(' ', $command['c'])[0];
369 4
                $currentFont = $this->getFont($fontId);
370 4
                continue;
371
            }
372
        }
373
374 4
        return $extractedRawData;
375
    }
376
377
    /**
378
     * Gets just the Text commands that are involved in text positions and
379
     * Text Matrix (Tm)
380
     *
381
     * It extracts just the PDF commands that are involved with text positions, and
382
     * the Text Matrix (Tm). These are: BT, ET, TL, Td, TD, Tm, T*, Tj, ', ", and TJ
383
     *
384
     * @param array $extractedDecodedRawData The data extracted by extractDecodedRawData
385
     *
386
     * @return array An array with the text command of the page
387
     */
388 3
    public function getDataCommands($extractedDecodedRawData = null)
389
    {
390 3
        if (!isset($extractedDecodedRawData) or !$extractedDecodedRawData) {
391 3
            $extractedDecodedRawData = $this->extractDecodedRawData();
392
        }
393 3
        $extractedData = [];
394 3
        foreach ($extractedDecodedRawData as $command) {
395 3
            switch ($command['o']) {
396
                /*
397
                 * BT
398
                 * Begin a text object, initializing the Tm and Tlm to the identity matrix
399
                 */
400 3
                case 'BT':
401
                    $extractedData[] = $command;
402
                    break;
403
404
                /*
405
                 * ET
406
                 * End a text object, discarding the text matrix
407
                 */
408 3
                case 'ET':
409
                    $extractedData[] = $command;
410
                    break;
411
412
                /*
413
                 * leading TL
414
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
415
                 * Initial value: 0
416
                 */
417 3
                case 'TL':
418 3
                    $extractedData[] = $command;
419 3
                    break;
420
421
                /*
422
                 * tx ty Td
423
                 * Move to the start of the next line, offset from the start of the
424
                 * current line by tx, ty.
425
                 */
426 3
                case 'Td':
427 3
                    $extractedData[] = $command;
428 3
                    break;
429
430
                /*
431
                 * tx ty TD
432
                 * Move to the start of the next line, offset from the start of the
433
                 * current line by tx, ty. As a side effect, this operator sets the leading
434
                 * parameter in the text state. This operator has the same effect as the
435
                 * code:
436
                 * -ty TL
437
                 * tx ty Td
438
                 */
439 3
                case 'TD':
440
                    $extractedData[] = $command;
441
                    break;
442
443
                /*
444
                 * a b c d e f Tm
445
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
446
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
447
                 * [1 0 0 1 0 0]
448
                 */
449 3
                case 'Tm':
450 3
                    $extractedData[] = $command;
451 3
                    break;
452
453
                /*
454
                 * T*
455
                 * Move to the start of the next line. This operator has the same effect
456
                 * as the code:
457
                 * 0 Tl Td
458
                 * Where Tl is the current leading parameter in the text state.
459
                 */
460 3
                case 'T*':
461 3
                    $extractedData[] = $command;
462 3
                    break;
463
464
                /*
465
                 * string Tj
466
                 * Show a Text String
467
                 */
468 3
                case 'Tj':
469 3
                    $extractedData[] = $command;
470 3
                    break;
471
472
                /*
473
                 * string '
474
                 * Move to the next line and show a text string. This operator has the
475
                 * same effect as the code:
476
                 * T*
477
                 * string Tj
478
                 */
479 3
                case "'":
480
                    $extractedData[] = $command;
481
                    break;
482
483
                /*
484
                 * aw ac string "
485
                 * Move to the next line and show a text string, using aw as the word
486
                 * spacing and ac as the character spacing. This operator has the same
487
                 * effect as the code:
488
                 * aw Tw
489
                 * ac Tc
490
                 * string '
491
                 * Tw set the word spacing, Tw, to wordSpace.
492
                 * Tc Set the character spacing, Tc, to charsSpace.
493
                 */
494 3
                case '"':
495
                    $extractedData[] = $command;
496
                    break;
497
498
                /*
499
                 * array TJ
500
                 * Show one or more text strings, allowing individual glyph positioning.
501
                 * Each element of the array can be a string or a number. If the element is
502
                 * a string, this operator shows the string. If it is a number, the
503
                 * operator adjusts the text position by that amount; that is, it translates
504
                 * the text matrix, Tm. This amount is subtracted from the current
505
                 * horizontal or vertical coordinate, depending on the writing mode.
506
                 * In the default coordinate system, a positive adjustment has the effect
507
                 * of moving the next glyph painted either to the left or down by the given
508
                 * amount.
509
                 */
510 3
                case 'TJ':
511 3
                    $extractedData[] = $command;
512 3
                    break;
513 3
                default:
514
            }
515
        }
516
517 3
        return $extractedData;
518
    }
519
520
    /**
521
     * Gets the Text Matrix of the text in the page
522
     *
523
     * Returns an array in which every item is an array whose first item is the
524
     * Text Matrix (Tm) and whose second item is a string with the text data. The text matrix
525
     * is an array of 6 numbers. The last 2 numbers are the X and Y coordinates of the
526
     * text. The first 4 numbers describe the scaling, rotation and skew of the text.
527
     *
528
     * @param array $dataCommands the data extracted by getDataCommands
529
     *                            if null getDataCommands is called
530
     *
531
     * @return array an array with the data of the page including the Tm information
532
     *               of any text in the page
533
     */
534 2
    public function getDataTm($dataCommands = null)
535
    {
536 2
        if (!isset($dataCommands) or !$dataCommands) {
537 2
            $dataCommands = $this->getDataCommands();
538
        }
539
540
        /*
541
         * At the beginning of a text object Tm is the identity matrix
542
         */
543 2
        $defaultTm = ['1', '0', '0', '1', '0', '0'];
544
545
        /*
546
         *  Set the text leading used by T*, ' and " operators
547
         */
548 2
        $defaultTl = 0;
549
550
        /*
551
         * Indexes of the X and Y coordinates within the text matrix (Tm)
552
         */
553 2
        $x = 4;
554 2
        $y = 5;
555 2
        $Tx = 0;
556 2
        $Ty = 0;
557
558 2
        $Tm = $defaultTm;
559 2
        $Tl = $defaultTl;
560
561 2
        $extractedData = [];
562 2
        foreach ($dataCommands as $command) {
563 2
            switch ($command['o']) {
564
                /*
565
                 * BT
566
                 * Begin a text object, initializing the Tm and Tlm to the identity matrix
567
                 */
568 2
                case 'BT':
569
                    $Tm = $defaultTm;
570
                    $Tl = $defaultTl; //review this.
571
                    $Tx = 0;
572
                    $Ty = 0;
573
                    break;
574
575
                /*
576
                 * ET
577
                 * End a text object, discarding the text matrix
578
                 */
579 2
                case 'ET':
580
                    $Tm = $defaultTm;
581
                    $Tl = $defaultTl;  //review this
582
                    $Tx = 0;
583
                    $Ty = 0;
584
                    break;
585
586
                /*
587
                 * leading TL
588
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
589
                 * Initial value: 0
590
                 */
591 2
                case 'TL':
592 2
                    $Tl = (float) $command['c'];
593 2
                    break;
594
595
                /*
596
                 * tx ty Td
597
                 * Move to the start of the next line, offset from the start of the
598
                 * current line by tx, ty.
599
                 */
600 2
                case 'Td':
601 2
                    $coord = explode(' ', $command['c']);
602 2
                    $Tx += (float) $coord[0];
603 2
                    $Ty += (float) $coord[1];
604 2
                    $Tm[$x] = (string) $Tx;
605 2
                    $Tm[$y] = (string) $Ty;
606 2
                    break;
607
608
                /*
609
                 * tx ty TD
610
                 * Move to the start of the next line, offset from the start of the
611
                 * current line by tx, ty. As a side effect, this operator sets the leading
612
                 * parameter in the text state. This operator has the same effect as the
613
                 * code:
614
                 * -ty TL
615
                 * tx ty Td
616
                 */
617 2
                case 'TD':
618
                    $coord = explode(' ', $command['c']);
619
                    $Tl = (float) $coord[1];
620
                    $Tx += (float) $coord[0];
621
                    $Ty -= (float) $coord[1];
622
                    $Tm[$x] = (string) $Tx;
623
                    $Tm[$y] = (string) $Ty;
624
                    break;
625
626
                /*
627
                 * a b c d e f Tm
628
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
629
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
630
                 * [1 0 0 1 0 0]
631
                 */
632 2
                case 'Tm':
633 2
                    $Tm = explode(' ', $command['c']);
634 2
                    $Tx = (float) $Tm[$x];
635 2
                    $Ty = (float) $Tm[$y];
636 2
                    break;
637
638
                /*
639
                 * T*
640
                 * Move to the start of the next line. This operator has the same effect
641
                 * as the code:
642
                 * 0 Tl Td
643
                 * Where Tl is the current leading parameter in the text state.
644
                 */
645 2
                case 'T*':
646 2
                    $Ty -= $Tl;
647 2
                    $Tm[$y] = (string) $Ty;
648 2
                    break;
649
650
                /*
651
                 * string Tj
652
                 * Show a Text String
653
                 */
654 2
                case 'Tj':
655 2
                    $extractedData[] = [$Tm, $command['c']];
656 2
                    break;
657
658
                /*
659
                 * string '
660
                 * Move to the next line and show a text string. This operator has the
661
                 * same effect as the code:
662
                 * T*
663
                 * string Tj
664
                 */
665 2
                case "'":
666
                    $Ty -= Tl;
Bug: The constant Smalot\PdfParser\Tl was not found. Maybe you did not declare it correctly or list all dependencies? (The bare Tl is read as a constant here; the local variable $Tl assigned earlier in this method is the likely intent.)
667
                    $Tm[$y] = (string) $Ty;
668
                    $extractedData[] = [$Tm, $command['c']];
669
                    break;
670
671
                /*
672
                 * aw ac string "
673
                 * Move to the next line and show a text string, using aw as the word
674
                 * spacing and ac as the character spacing. This operator has the same
675
                 * effect as the code:
676
                 * aw Tw
677
                 * ac Tc
678
                 * string '
679
                 * Tw set the word spacing, Tw, to wordSpace.
680
                 * Tc Set the character spacing, Tc, to charsSpace.
681
                 */
682 2
                case '"':
683
                    $data = explode(' ', $command['c']);
684
                    $Ty -= $Tl;
685
                    $Tm[$y] = (string) $Ty;
686
                    $extractedData[] = [$Tm, $data[2]]; //Verify
687
                    break;
688
689
                /*
690
                 * array TJ
691
                 * Show one or more text strings, allowing individual glyph positioning.
692
                 * Each element of the array can be a string or a number. If the element is
693
                 * a string, this operator shows the string. If it is a number, the
694
                 * operator adjusts the text position by that amount; that is, it translates
695
                 * the text matrix, Tm. This amount is subtracted from the current
696
                 * horizontal or vertical coordinate, depending on the writing mode.
697
                 * In the default coordinate system, a positive adjustment has the effect
698
                 * of moving the next glyph painted either to the left or down by the given
699
                 * amount.
700
                 */
701 2
                case 'TJ':
702 2
                    $text = [];
703 2
                    $data = $command['c'];
704 2
                    $numText = \count($data);
705 2
                    for ($i = 0; $i < $numText; ++$i) {
706 2
                        if ('n' == $data[$i]['t']) {
707 2
                            continue;
708
                        }
709 2
                        $tmpText = $data[$i]['c'];
710 2
                        $text[] = $tmpText;
711
                    }
712 2
                    $tjText = ''.implode('', $text);
713 2
                    $extractedData[] = [$Tm, $tjText];
714 2
                    break;
715 2
                default:
716
            }
717
        }
718 2
        $this->dataTm = $extractedData;
719
720 2
        return $extractedData;
721
    }
722
723
    /**
724
     * Gets text data that is near the given coordinates (X,Y)
725
     *
726
     * If the text is near the given coordinates (X,Y) (or the Tm info),
727
     * the text is returned. The extractedData returned by getDataTm can be used to see
728
     * where the coordinates of a given text are, using the Tm info for it.
729
     *
730
     * @param float $x      The X value of the coordinate to search for. If null,
731
     *                      just the Y value is considered (same row)
732
     * @param float $y      The Y value of the coordinate to search for. If null,
733
     *                      just the X value is considered (same column)
734
     * @param float $xError The tolerance, above or below, within which an X counts as "near"
735
     * @param float $yError The tolerance, above or below, within which a Y counts as "near"
736
     *
737
     * @return array An array of the texts that are near the given coordinates. If no text is
738
     *               "near" the x,y coordinate, an empty array is returned. If both the x
739
     *               and y coordinates are null, null is returned.
740
     */
741 1
    public function getTextXY($x, $y, $xError = 0, $yError = 0)
742
    {
743 1
        if (!isset($this->dataTm) or !$this->dataTm) {
744 1
            $this->getDataTm();
745
        }
746 1
        if (isset($x)) {
747 1
            $x = (float) $x;
748
        }
749 1
        if (isset($y)) {
750 1
            $y = (float) $y;
751
        }
752 1
        if (!isset($x) and !isset($y)) {
753
            return null;
754
        }
755
756 1
        if (!isset($xError)) {
757
            $xError = 0;
758
        } else {
759 1
            $xError = (float) $xError;
760
        }
761 1
        if (!isset($yError)) {
762
            $yError = 0;
763
        } else {
764 1
            $yError = (float) $yError;
765
        }
766 1
        $extractedData = [];
767 1
        foreach ($this->dataTm as $item) {
768 1
            $tm = $item[0];
769 1
            $xTm = (float) $tm[4];
770 1
            $yTm = (float) $tm[5];
771 1
            $text = $item[1];
772 1
            if (!isset($y)) {
773
                if (($xTm >= ($x - $xError)) and
774
                    ($xTm <= ($x + $xError))) {
775
                    $extractedData[] = [$tm, $text];
776
                    continue;
777
                }
778
            }
779 1
            if (!isset($x)) {
780
                if (($yTm >= ($y - $yError)) and
781
                    ($yTm <= ($y + $yError))) {
782
                    $extractedData[] = [$tm, $text];
783
                    continue;
784
                }
785
            }
786 1
            if (($xTm >= ($x - $xError)) and
787 1
                ($xTm <= ($x + $xError)) and
788 1
                ($yTm >= ($y - $yError)) and
789 1
                ($yTm <= ($y + $yError))) {
790 1
                $extractedData[] = [$tm, $text];
791 1
                continue;
792
            }
793
        }
794
795 1
        return $extractedData;
796
    }
797
}
798