Test Failed
Pull Request — master (#455)
by
unknown
02:08
created

Page::getTextArray()   C

Complexity

Conditions 12
Paths 9

Size

Total Lines 65
Code Lines 38

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 20
CRAP Score 16.3023

Importance

Changes 4
Bugs 2 Features 1

Metric  Value
cc      12
eloc    38
c       4
b       2
f       1
nc      9
nop     1
dl      0
loc     65
ccs     20
cts     29
cp      0.6897
crap    16.3023
rs      6.9666

How to fix: Long Method, Complexity

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
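
As a rough illustration of the advice above, the sketch below extracts a commented step into its own function and reuses the comment as the function name. The `Chunk` class and every function name here are hypothetical stand-ins chosen for this example; they are not part of the PdfParser API.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for any object exposing getContent(); not a PdfParser class.
final class Chunk
{
    private string $content;

    public function __construct(string $content)
    {
        $this->content = $content;
    }

    public function getContent(): string
    {
        return $this->content;
    }
}

// Before: a comment marks a step buried inside a longer function body.
function describeBefore(array $chunks): string
{
    // Create a virtual global content.
    $newContent = '';
    foreach ($chunks as $chunk) {
        $newContent .= $chunk->getContent()."\n";
    }

    return sprintf('%d bytes', strlen($newContent));
}

// After: the commented step is its own function, named after the comment.
function createVirtualGlobalContent(array $chunks): string
{
    $newContent = '';
    foreach ($chunks as $chunk) {
        $newContent .= $chunk->getContent()."\n";
    }

    return $newContent;
}

function describeAfter(array $chunks): string
{
    return sprintf('%d bytes', strlen(createVirtualGlobalContent($chunks)));
}

$chunks = [new Chunk('BT'), new Chunk('ET')];
```

Both variants return the same result for the same input; the second only moves the loop behind a name, so the caller reads as a single sentence and the helper can be reused elsewhere.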

Commonly applied refactorings include:

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 * @date    2017-01-03
 *
 * @license LGPLv3
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser;

use Smalot\PdfParser\Element\ElementArray;
use Smalot\PdfParser\Element\ElementMissing;
use Smalot\PdfParser\Element\ElementNull;
use Smalot\PdfParser\Element\ElementXRef;

class Page extends PDFObject
{
    /**
     * @var Font[]
     */
    protected $fonts = null;

    /**
     * @var PDFObject[]
     */
    protected $xobjects = null;

    /**
     * @var array
     */
    protected $dataTm = null;

    /**
     * @return Font[]
     */
    public function getFonts()
    {
        if (null !== $this->fonts) {
            return $this->fonts;
        }

        $resources = $this->get('Resources');

        if (method_exists($resources, 'has') && $resources->has('Font')) {
            if ($resources->get('Font') instanceof ElementMissing) {

Bug: The method get() does not exist on Smalot\PdfParser\Element. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation (line 67):

            if ($resources->/** @scrutinizer ignore-call */ get('Font') instanceof ElementMissing) {

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.

                return [];
            }

            if ($resources->get('Font') instanceof Header) {
                $fonts = $resources->get('Font')->getElements();
            } else {
                $fonts = $resources->get('Font')->getHeader()->getElements();
            }

            $table = [];

            foreach ($fonts as $id => $font) {
                if ($font instanceof Font) {
                    $table[$id] = $font;

                    // Also store under the cleaned id value (numeric only)
                    $id = preg_replace('/[^0-9\.\-_]/', '', $id);
                    if ('' != $id) {
                        $table[$id] = $font;
                    }
                }
            }

            return $this->fonts = $table;
        }

        return [];
    }

    public function getFont(string $id): ?Font
    {
        $fonts = $this->getFonts();

        if (isset($fonts[$id])) {
            return $fonts[$id];
        }

        // According to the PDF specs (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, page 238)
        // "The font resource name presented to the Tf operator is arbitrary, as are the names for all kinds of resources"
        // Instead, we search for the unfiltered name first and then do this cleaning as a fallback, so all tests still pass.

        if (isset($fonts[$id])) {
            return $fonts[$id];
        } else {
            $id = preg_replace('/[^0-9\.\-_]/', '', $id);
            if (isset($fonts[$id])) {
                return $fonts[$id];
            }
        }

        return null;
    }

    /**
     * Support for XObject
     *
     * @return PDFObject[]
     */
    public function getXObjects()
    {
        if (null !== $this->xobjects) {
            return $this->xobjects;
        }

        $resources = $this->get('Resources');

        if (method_exists($resources, 'has') && $resources->has('XObject')) {
            if ($resources->get('XObject') instanceof Header) {
                $xobjects = $resources->get('XObject')->getElements();
            } else {
                $xobjects = $resources->get('XObject')->getHeader()->getElements();
            }

            $table = [];

            foreach ($xobjects as $id => $xobject) {
                $table[$id] = $xobject;

                // Also store under the cleaned id value (numeric only)
                $id = preg_replace('/[^0-9\.\-_]/', '', $id);
                if ('' != $id) {
                    $table[$id] = $xobject;
                }
            }

            return $this->xobjects = $table;
        }

        return [];
    }

    public function getXObject(string $id): ?PDFObject
    {
        $xobjects = $this->getXObjects();

        if (isset($xobjects[$id])) {
            return $xobjects[$id];
        }

        return null;
        /*$id = preg_replace('/[^0-9\.\-_]/', '', $id);

        if (isset($xobjects[$id])) {
            return $xobjects[$id];
        } else {
            return null;
        }*/
    }

    public function getText(self $page = null): string
    {
        if ($contents = $this->get('Contents')) {
            if ($contents instanceof ElementMissing) {
                return '';
            } elseif ($contents instanceof ElementNull) {
                return '';
            } elseif ($contents instanceof PDFObject) {

Issue: $contents is never a sub-type of Smalot\PdfParser\PDFObject.

                $elements = $contents->getHeader()->getElements();

                if (is_numeric(key($elements))) {
                    $new_content = '';

                    foreach ($elements as $element) {
                        if ($element instanceof ElementXRef) {
                            $new_content .= $element->getObject()->getContent();
                        } else {
                            $new_content .= $element->getContent();
                        }
                    }

                    $header = new Header([], $this->document);
                    $contents = new PDFObject($this->document, $header, $new_content, $this->config);
                }
            } elseif ($contents instanceof ElementArray) {
                // Create a virtual global content.
                $new_content = '';

                foreach ($contents->getContent() as $content) {
                    $new_content .= $content->getContent()."\n";
                }

                $header = new Header([], $this->document);
                $contents = new PDFObject($this->document, $header, $new_content, $this->config);
            }

            /*
             * Elements referencing each other on the same page can cause endless loops during text parsing.
             * To combat this we keep a recursionStack containing already parsed elements on the page.
             * The stack is only emptied here after getting text from a page.
             */
            $contentsText = $contents->getText($this);

Bug: The method getText() does not exist on Smalot\PdfParser\Element. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation (lines 218-219):

            /** @scrutinizer ignore-call */
            $contentsText = $contents->getText($this);

            PDFObject::$recursionStack = [];

            return $contentsText;
        }

        return '';
    }

    /**
     * Return true if the current page is a (setasign\Fpdi\Fpdi) FPDI/FPDF document
     *
     * @return bool true if the current page is a FPDI/FPDF document
     */
    public function isFpdf(): bool
    {
        if (\array_key_exists('Producer', $this->document->getDetails()) &&
            \is_string($this->document->getDetails()['Producer']) &&
            str_starts_with($this->document->getDetails()['Producer'], 'FPDF')) {
            return true;
        }

        return false;
    }

    /**
     * Return the page number of the PDF document of the page object
     *
     * @return int the page number
     */
    public function getPageNumber(): int
    {
        $pages = $this->document->getPages();
        $numOfPages = \count($pages);
        for ($pageNum = 0; $pageNum < $numOfPages; ++$pageNum) {
            if ($pages[$pageNum] === $this) {
                break;
            }
        }

        return $pageNum;
    }

    public function getTextArray(self $page = null): array
    {
        if ($this->isFpdf()) {
            /**
             * This code is for the (setasign\Fpdi\Fpdi) FPDI-FPDF documents.
             * The page number is important for getting the PDF Commands and Text Matrix
             */
            $pageNum = $this->getPageNumber();
            $xObjects = $this->getXObjects();
            /** The correct page info is in $xObjects[$pageNum] */
            $xObject = $xObjects[$pageNum];
            $new_content = $xObject->getContent();
            $header = $xObject->getHeader();
            $config = $xObject->config;
            /** Now we create the PDFObject object with the correct info */
            $contents = new PDFObject($xObject->document, $header, $new_content, $config);

            return $contents->getTextArray($xObject);
        }
        if ($contents = $this->get('Contents')) {
            if ($contents instanceof ElementMissing) {
                return [];
            } elseif ($contents instanceof ElementNull) {
                return [];
            } elseif ($contents instanceof PDFObject) {

Issue: $contents is never a sub-type of Smalot\PdfParser\PDFObject.

                $elements = $contents->getHeader()->getElements();

                if (is_numeric(key($elements))) {
                    $new_content = '';

                    /** @var PDFObject $element */
                    foreach ($elements as $element) {
                        if ($element instanceof ElementXRef) {
                            $new_content .= $element->getObject()->getContent();
                        } else {
                            $new_content .= $element->getContent();
                        }
                    }

                    $header = new Header([], $this->document);
                    $contents = new PDFObject($this->document, $header, $new_content, $this->config);
                } else {
                    try {
                        $contents->getTextArray($this);
                    } catch (\Throwable $e) {
                        return $contents->getTextArray();
                    }
                }
            } elseif ($contents instanceof ElementArray) {
                // Create a virtual global content.
                $new_content = '';

                /** @var PDFObject $content */
                foreach ($contents->getContent() as $content) {
                    $new_content .= $content->getContent()."\n";
                }

                $header = new Header([], $this->document);
                $contents = new PDFObject($this->document, $header, $new_content, $this->config);
            }

            return $contents->getTextArray($this);

Bug: The method getTextArray() does not exist on Smalot\PdfParser\Element. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation (line 322):

            return $contents->/** @scrutinizer ignore-call */ getTextArray($this);

        }

        return [];
    }

    /**
     * Gets all the text data with its internal representation of the page.
     *
     * Returns an array with the data and the internal representation
     */
    public function extractRawData(): array
    {
        /*
         * Now you can get the complete content of the object with the text on it
         */
        $extractedData = [];
        $content = $this->get('Contents');
        $values = $content->getContent();
        if (isset($values) && \is_array($values)) {
            $text = '';
            foreach ($values as $section) {
                $text .= $section->getContent();
            }
            $sectionsText = $this->getSectionsText($text);
            foreach ($sectionsText as $sectionText) {
                $commandsText = $this->getCommandsText($sectionText);
                foreach ($commandsText as $command) {
                    $extractedData[] = $command;
                }
            }
        } else {
            if ($this->isFpdf()) {
                /*
                 * This code is for the (setasign\Fpdi\Fpdi) FPDI-FPDF documents.
                 * The page number is important for getting the PDF Commands and Text Matrix
                 */
                $pageNum = $this->getPageNumber();
                $xObjects = $this->getXObjects();
                // The correct page info is in $xObjects[$pageNum]
                $content = $xObjects[$pageNum];
            }
            $sectionsText = $content->getSectionsText($content->getContent());

Bug: The method getSectionsText() does not exist on Smalot\PdfParser\Element. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation (lines 364-365):

            /** @scrutinizer ignore-call */
            $sectionsText = $content->getSectionsText($content->getContent());

            foreach ($sectionsText as $sectionText) {
                $extractedData[] = ['t' => '', 'o' => 'BT', 'c' => ''];

                $commandsText = $content->getCommandsText($sectionText);

Bug: The method getCommandsText() does not exist on Smalot\PdfParser\Element. (Ignorable by Annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation (lines 368-369):

                /** @scrutinizer ignore-call */
                $commandsText = $content->getCommandsText($sectionText);

                foreach ($commandsText as $command) {
                    $extractedData[] = $command;
                }
            }
        }

        return $extractedData;
    }

    /**
     * Gets all the decoded text data with its internal representation from a page.
     *
     * @param array $extractedRawData the extracted data returned by extractRawData or
     *                                null if extractRawData should be called
     *
     * @return array An array with the data and the internal representation
     */
    public function extractDecodedRawData(array $extractedRawData = null): array
    {
        if (!isset($extractedRawData) || !$extractedRawData) {

Bug Best Practice: The expression $extractedRawData of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(...) or !empty(...) instead.

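
The difference this check describes can be seen in a minimal sketch like the one below. The helper name `needsExtraction` is an assumption made for this illustration only; it does not exist in PdfParser.

```php
<?php

declare(strict_types=1);

// Hypothetical helper, named only for this illustration.
function needsExtraction(?array $data): bool
{
    // empty() is true for both null and [], and states the intent explicitly.
    return empty($data);
}

$empty = [];

// Loose comparison: an empty array is equal to false...
$looselyFalse = ($empty == false);

// ...but it is not identical to false.
$identicalToFalse = ($empty === false);
```

With `empty()`, the null case and the empty-array case are covered by one expression, so a guard such as `!isset($x) || !$x` could plausibly be reduced to `empty($x)` without changing which inputs trigger it.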
            $extractedRawData = $this->extractRawData();
        }
        $currentFont = null; /** @var Font $currentFont */
        $clippedFont = null;
        $xObject = null;

Unused Code: The assignment to $xObject is dead and can be removed.

        $page = null;
        if ($this->isFpdf()) {
            /*
             * This code is for the (setasign\Fpdi\Fpdi) FPDI-FPDF documents.
             * The page number is important for getting the PDF Commands and Text Matrix
             */
            $pageNum = $this->getPageNumber();
            $xObjects = $this->getXObjects();
            // The correct font page info is in $xObjects[$pageNum]
            $xObject = $xObjects[$pageNum];
            // For using instead of $xObject
            $new_content = $xObject->getContent();
            $header = $xObject->getHeader();
            $config = $xObject->config;
            // Now we create the Page object with the correct info
            $page = new self($xObject->document, $header, $new_content, $config);
        }
        foreach ($extractedRawData as &$command) {
            if ('Tj' == $command['o'] || 'TJ' == $command['o']) {
                $data = $command['c'];
                if (!\is_array($data)) {
                    $tmpText = '';
                    if (isset($currentFont)) {
                        $tmpText = $currentFont->decodeOctal($data);
                        //$tmpText = $currentFont->decodeHexadecimal($tmpText, false);
                    }
                    $tmpText = str_replace(
                            ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
                            ['\\', '(', ')', "\n", "\r", "\t", ' '],
                            $tmpText
                    );
                    $tmpText = utf8_encode($tmpText);
                    if (isset($currentFont)) {
                        $tmpText = $currentFont->decodeContent($tmpText);
                    }
                    $command['c'] = $tmpText;
                    continue;
                }
                $numText = \count($data);
                for ($i = 0; $i < $numText; ++$i) {
                    if (0 != ($i % 2)) {
                        continue;
                    }
                    $tmpText = $data[$i]['c'];
                    $decodedText = isset($currentFont) ? $currentFont->decodeOctal($tmpText) : $tmpText;
                    $decodedText = str_replace(
                            ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
                            ['\\', '(', ')', "\n", "\r", "\t", ' '],
                            $decodedText
                    );
                    $decodedText = utf8_encode($decodedText);
                    if (isset($currentFont)) {
                        $decodedText = $currentFont->decodeContent($decodedText);
                    }
                    $command['c'][$i]['c'] = $decodedText;
                    continue;
                }
            } elseif ('Tf' == $command['o'] || 'TF' == $command['o']) {
                $fontId = explode(' ', $command['c'])[0];
                /** If the document is a FPDI/FPDF, $page has the correct font */
                if (isset($page)) {
                    $currentFont = $page->getFont($fontId);
                } else {
                    $currentFont = $this->getFont($fontId);
                }
                continue;
            } elseif ('Q' == $command['o']) {
                $currentFont = $clippedFont;
            } elseif ('q' == $command['o']) {
                $clippedFont = $currentFont;
            }
        }

        return $extractedRawData;
    }

    /**
     * Gets just the Text commands that are involved in text positions and
     * Text Matrix (Tm)
     *
     * It extracts only the PDF commands that are involved with text positions, and
     * the Text Matrix (Tm). These are: BT, ET, TL, Td, TD, Tm, T*, Tj, ', ", and TJ
     *
     * @param array $extractedDecodedRawData The data extracted by extractDecodedRawData.
     *                                       If it is null, the method extractDecodedRawData is called.
     *
     * @return array An array with the text command of the page
     */
    public function getDataCommands(array $extractedDecodedRawData = null): array
    {
        if (!isset($extractedDecodedRawData) || !$extractedDecodedRawData) {

Bug Best Practice: The expression $extractedDecodedRawData of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

            $extractedDecodedRawData = $this->extractDecodedRawData();
        }
        $extractedData = [];
        foreach ($extractedDecodedRawData as $command) {
            switch ($command['o']) {
                /*
                 * BT
                 * Begin a text object, initializing the Tm and Tlm to the identity matrix
                 */
                case 'BT':
                    $extractedData[] = $command;
                    break;

                /*
                 * ET
                 * End a text object, discarding the text matrix
                 */
                case 'ET':
                    $extractedData[] = $command;
                    break;

                /*
                 * leading TL
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
                 * Initial value: 0
                 */
                case 'TL':
                    $extractedData[] = $command;
                    break;

                /*
                 * tx ty Td
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty.
                 */
                case 'Td':
                    $extractedData[] = $command;
                    break;

                /*
                 * tx ty TD
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty. As a side effect, this operator sets the leading
                 * parameter in the text state. This operator has the same effect as the
                 * code:
                 * -ty TL
                 * tx ty Td
                 */
                case 'TD':
                    $extractedData[] = $command;
                    break;

                /*
                 * a b c d e f Tm
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
                 * [1 0 0 1 0 0]
                 */
                case 'Tm':
                    $extractedData[] = $command;
                    break;

                /*
                 * T*
                 * Move to the start of the next line. This operator has the same effect
                 * as the code:
                 * 0 -Tl Td
                 * Where Tl is the current leading parameter in the text state.
                 */
                case 'T*':
                    $extractedData[] = $command;
                    break;

                /*
                 * string Tj
                 * Show a text string
                 */
                case 'Tj':
                    $extractedData[] = $command;
                    break;

                /*
                 * string '
                 * Move to the next line and show a text string. This operator has the
                 * same effect as the code:
                 * T*
                 * string Tj
                 */
                case "'":
                    $extractedData[] = $command;
                    break;

                /*
                 * aw ac string "
                 * Move to the next line and show a text string, using aw as the word
                 * spacing and ac as the character spacing. This operator has the same
                 * effect as the code:
                 * aw Tw
                 * ac Tc
                 * string '
                 * Tw sets the word spacing, Tw, to wordSpace.
                 * Tc sets the character spacing, Tc, to charSpace.
                 */
                case '"':
                    $extractedData[] = $command;
                    break;

                /*
                 * array TJ
                 * Show one or more text strings, allowing individual glyph positioning.
                 * Each element of array can be a string or a number. If the element is
                 * a string, this operator shows the string. If it is a number, the
                 * operator adjusts the text position by that amount; that is, it translates
                 * the text matrix, Tm. This amount is subtracted from the current
                 * horizontal or vertical coordinate, depending on the writing mode.
                 * In the default coordinate system, a positive adjustment has the effect
                 * of moving the next glyph painted either to the left or down by the given
                 * amount.
                 */
                case 'TJ':
                    $extractedData[] = $command;
                    break;
                default:
            }
        }

        return $extractedData;
    }

    /**
     * Gets the Text Matrix of the text in the page
     *
     * Returns an array where every item is an array whose first item is the
     * Text Matrix (Tm) and whose second is a string with the text data. The Text Matrix
     * is an array of 6 numbers. The last 2 numbers are the X and Y coordinates of the
     * text. The first 4 numbers encode the scaling, rotation, and skew of the text.
     *
     * @param array $dataCommands the data extracted by getDataCommands
     *                            if null getDataCommands is called
     *
     * @return array an array with the data of the page including the Tm information
     *               of any text in the page
     */
    public function getDataTm(array $dataCommands = null): array
    {
        if (!isset($dataCommands) || !$dataCommands) {

Bug Best Practice: The expression $dataCommands of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

631
            $dataCommands = $this->getDataCommands();
        }

        /*
         * At the beginning of a text object, Tm is the identity matrix
         */
        $defaultTm = ['1', '0', '0', '1', '0', '0'];

        /*
         * Set the text leading used by the T*, ' and " operators
         */
        $defaultTl = 0;

        /*
         * Indices of the X and Y coordinates in the text matrix (Tm)
         */
        $x = 4;
        $y = 5;
        $Tx = 0;
        $Ty = 0;

        $Tm = $defaultTm;
        $Tl = $defaultTl;

        $extractedTexts = $this->getTextArray();
        $extractedData = [];
        foreach ($dataCommands as $command) {
            $currentText = $extractedTexts[\count($extractedData)];
            switch ($command['o']) {
                /*
                 * BT
                 * Begin a text object, initializing Tm and Tlm to the identity matrix
                 */
                case 'BT':
                    $Tm = $defaultTm;
                    $Tl = $defaultTl; //review this.
                    $Tx = 0;
                    $Ty = 0;
                    break;

                /*
                 * ET
                 * End a text object, discarding the text matrix
                 */
                case 'ET':
                    $Tm = $defaultTm;
                    $Tl = $defaultTl;  //review this
                    $Tx = 0;
                    $Ty = 0;
                    break;

                /*
                 * leading TL
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
                 * Initial value: 0
                 */
                case 'TL':
                    $Tl = (float) $command['c'];
                    break;

                /*
                 * tx ty Td
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty.
                 */
                case 'Td':
                    $coord = explode(' ', $command['c']);
                    $Tx += (float) $coord[0];
                    $Ty += (float) $coord[1];
                    $Tm[$x] = (string) $Tx;
                    $Tm[$y] = (string) $Ty;
                    break;

                /*
                 * tx ty TD
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty. As a side effect, this operator sets the leading
                 * parameter in the text state. This operator has the same effect as the
                 * code:
                 * -ty TL
                 * tx ty Td
                 */
                case 'TD':
                    $coord = explode(' ', $command['c']);
                    // Equivalent to "-ty TL" followed by "tx ty Td"
                    $Tl = -((float) $coord[1]);
                    $Tx += (float) $coord[0];
                    $Ty += (float) $coord[1];
                    $Tm[$x] = (string) $Tx;
                    $Tm[$y] = (string) $Ty;
                    break;

                /*
                 * a b c d e f Tm
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
                 * [1 0 0 1 0 0]
                 */
                case 'Tm':
                    $Tm = explode(' ', $command['c']);
                    $Tx = (float) $Tm[$x];
                    $Ty = (float) $Tm[$y];
                    break;

                /*
                 * T*
                 * Move to the start of the next line. This operator has the same effect
                 * as the code:
                 * 0 -Tl Td
                 * Where Tl is the current leading parameter in the text state.
                 */
                case 'T*':
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    break;

                /*
                 * string Tj
                 * Show a text string
                 */
                case 'Tj':
                    $extractedData[] = [$Tm, $currentText];
                    break;

                /*
                 * string '
                 * Move to the next line and show a text string. This operator has the
                 * same effect as the code:
                 * T*
                 * string Tj
                 */
                case "'":
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    $extractedData[] = [$Tm, $currentText];
                    break;

                /*
                 * aw ac string "
                 * Move to the next line and show a text string, using aw as the word
                 * spacing and ac as the character spacing. This operator has the same
                 * effect as the code:
                 * aw Tw
                 * ac Tc
                 * string '
                 * Tw sets the word spacing, Tw, to wordSpace.
                 * Tc sets the character spacing, Tc, to charSpace.
                 */
                case '"':
                    $data = explode(' ', $currentText);
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    $extractedData[] = [$Tm, $data[2]]; //Verify
                    break;

                /*
                 * array TJ
                 * Show one or more text strings, allowing individual glyph positioning.
                 * Each element of array can be a string or a number. If the element is
                 * a string, this operator shows the string. If it is a number, the
                 * operator adjusts the text position by that amount; that is, it translates
                 * the text matrix, Tm. This amount is subtracted from the current
                 * horizontal or vertical coordinate, depending on the writing mode.
                 * In the default coordinate system, a positive adjustment has the effect
                 * of moving the next glyph painted either to the left or down by the given
                 * amount.
                 */
                case 'TJ':
                    $extractedData[] = [$Tm, $currentText];
                    break;
                default:
            }
        }
        $this->dataTm = $extractedData;

        return $extractedData;
    }

    /**
     * Gets the text data located around the given coordinates (X,Y).
     *
     * If a text item is near the given coordinates (X,Y), as given by its Tm info,
     * the text is returned. The extractedData returned by getDataTm can be used to
     * find the coordinates of a given text, using its Tm info.
     *
     * @param float $x      The X value of the coordinate to search for. If null,
     *                      just the Y value is considered (same row)
     * @param float $y      The Y value of the coordinate to search for. If null,
     *                      just the X value is considered (same column)
     * @param float $xError The tolerance, above or below, within which an X is considered "near"
     * @param float $yError The tolerance, above or below, within which a Y is considered "near"
     *
     * @return array An array of the text items that are near the given coordinates. If no
     *               text is "near" the x,y coordinate, an empty array is returned. If both
     *               x and y are null, an empty array is also returned.
     */
    public function getTextXY(float $x = null, float $y = null, float $xError = 0, float $yError = 0): array
    {
        if (!isset($this->dataTm) || !$this->dataTm) {
The expression $this->dataTm of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

            $this->getDataTm();
        }

        if (null !== $x) {
            $x = (float) $x;
        }

        if (null !== $y) {
            $y = (float) $y;
        }

        if (null === $x && null === $y) {
            return [];
        }

        $xError = (float) $xError;
        $yError = (float) $yError;

        $extractedData = [];
        foreach ($this->dataTm as $item) {
            $tm = $item[0];
            $xTm = (float) $tm[4];
            $yTm = (float) $tm[5];
            $text = $item[1];
            if (null === $y) {
                if (($xTm >= ($x - $xError)) &&
                    ($xTm <= ($x + $xError))) {
                    $extractedData[] = [$tm, $text];
                    continue;
                }
            }
            if (null === $x) {
                if (($yTm >= ($y - $yError)) &&
                    ($yTm <= ($y + $yError))) {
                    $extractedData[] = [$tm, $text];
                    continue;
                }
            }
            if (($xTm >= ($x - $xError)) &&
                ($xTm <= ($x + $xError)) &&
                ($yTm >= ($y - $yError)) &&
                ($yTm <= ($y + $yError))) {
                $extractedData[] = [$tm, $text];
                continue;
            }
        }

        return $extractedData;
    }
}
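
The "near" test in getTextXY can be tried in isolation. The following is a rough stand-alone sketch, not the library's code: the function name `filterNear` and the sample data are hypothetical, and the input is assumed to have the shape returned by getDataTm (a list of `[Tm, text]` pairs with X at index 4 and Y at index 5).

```php
<?php

/**
 * Hypothetical stand-alone version of the "near" test used by getTextXY.
 * $dataTm is assumed to be a list of [Tm, text] pairs, as built by getDataTm.
 */
function filterNear(array $dataTm, ?float $x, ?float $y, float $xError = 0.0, float $yError = 0.0): array
{
    if (null === $x && null === $y) {
        return [];
    }

    $matches = [];
    foreach ($dataTm as [$tm, $text]) {
        // A null coordinate means "do not constrain this axis".
        $xOk = null === $x || abs((float) $tm[4] - $x) <= $xError;
        $yOk = null === $y || abs((float) $tm[5] - $y) <= $yError;
        if ($xOk && $yOk) {
            $matches[] = [$tm, $text];
        }
    }

    return $matches;
}

// Made-up sample data, not taken from a real PDF.
$sample = [
    [['1', '0', '0', '1', '100', '700'], 'Title'],
    [['1', '0', '0', '1', '100', '680'], 'First line'],
    [['1', '0', '0', '1', '300', '680'], 'Right column'],
];

// Same row (Y within 1 unit of 680), any X.
print_r(array_column(filterNear($sample, null, 680.0, 0.0, 1.0), 1));
// Both axes constrained, 2 units of tolerance each.
print_r(array_column(filterNear($sample, 100.0, 700.0, 2.0, 2.0), 1));
```

Writing each axis as one boolean folds the three branches of the original loop into a single condition, which may be one way to reduce the method's path count.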