Test Failed
Pull Request — master (#455)
by unknown, 07:09, created

Page::extractDecodedRawData()   D

Complexity

Conditions 19
Paths 44

Size

Total Lines 71
Code Lines 49

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 20
CRAP Score 20.6722

Importance

Changes 4
Bugs 1 Features 0
Metric Value
cc 19
eloc 49
c 4
b 1
f 0
nc 44
nop 1
dl 0
loc 71
ccs 20
cts 24
cp 0.8333
crap 20.6722
rs 4.5166

How to fix: Long Method, Complexity

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include:
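Extract Method is the usual first step. In the listing that follows, `extractDecodedRawData()` repeats the same unescape-then-decode steps for plain `Tj` strings and for each `TJ` array element. A minimal sketch of pulling such a duplicated block into a named helper is shown here; `decodePdfString` and its `$decodeWithFont` callable are hypothetical names chosen for illustration, not part of PdfParser, and the sketch leaves out the octal and UTF-8 steps of the real method.

```php
<?php

/**
 * Hypothetical helper: unescape a PDF literal string and, when a decoder
 * is supplied, pass the result through it.
 *
 * @param string        $raw            raw operand of a Tj/TJ command
 * @param callable|null $decodeWithFont stand-in for the font decoding step
 */
function decodePdfString(string $raw, ?callable $decodeWithFont = null): string
{
    // Same replacement table as the two str_replace() calls in the listing.
    $text = str_replace(
        ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
        ['\\', '(', ')', "\n", "\r", "\t", ' '],
        $raw
    );

    // Without a decoder the unescaped text is returned unchanged.
    return null === $decodeWithFont ? $text : $decodeWithFont($text);
}

// Each former copy of the block becomes a single call.
echo decodePdfString('Hello \(world\)'), "\n";   // prints "Hello (world)"
echo decodePdfString('abc', 'strtoupper'), "\n"; // prints "ABC"
```

Passing the decoder as a callable keeps the sketch self-contained; in the real class the helper would more likely be a private method that reads the current font directly.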

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 * @date    2017-01-03
 *
 * @license LGPLv3
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser;

use Smalot\PdfParser\Element\ElementArray;
use Smalot\PdfParser\Element\ElementMissing;
use Smalot\PdfParser\Element\ElementNull;
use Smalot\PdfParser\Element\ElementXRef;

class Page extends PDFObject
{
    /**
     * @var Font[]
     */
    protected $fonts = null;

    /**
     * @var PDFObject[]
     */
    protected $xobjects = null;

    /**
     * @var array
     */
    protected $dataTm = null;

    /**
     * @return Font[]
     */
    public function getFonts()
    {
        if (null !== $this->fonts) {
            return $this->fonts;
        }

        $resources = $this->get('Resources');

        if (method_exists($resources, 'has') && $resources->has('Font')) {
            if ($resources->get('Font') instanceof ElementMissing) {
Bug: The method get() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

            if ($resources->/** @scrutinizer ignore-call */ get('Font') instanceof ElementMissing) {

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.
                return [];
            }

            if ($resources->get('Font') instanceof Header) {
                $fonts = $resources->get('Font')->getElements();
            } else {
                $fonts = $resources->get('Font')->getHeader()->getElements();
            }

            $table = [];

            foreach ($fonts as $id => $font) {
                if ($font instanceof Font) {
                    $table[$id] = $font;

                    // Store too on cleaned id value (only numeric)
                    $id = preg_replace('/[^0-9\.\-_]/', '', $id);
                    if ('' != $id) {
                        $table[$id] = $font;
                    }
                }
            }

            return $this->fonts = $table;
        }

        return [];
    }

    public function getFont(string $id): ?Font
    {
        $fonts = $this->getFonts();

        if (isset($fonts[$id])) {
            return $fonts[$id];
        }

        // According to the PDF specs (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, page 238)
        // "The font resource name presented to the Tf operator is arbitrary, as are the names for all kinds of resources"
        // Instead, we search for the unfiltered name first and then do this cleaning as a fallback, so all tests still pass.

        if (isset($fonts[$id])) {
            return $fonts[$id];
        } else {
            $id = preg_replace('/[^0-9\.\-_]/', '', $id);
            if (isset($fonts[$id])) {
                return $fonts[$id];
            }
        }

        return null;
    }

    /**
     * Support for XObject
     *
     * @return PDFObject[]
     */
    public function getXObjects()
    {
        if (null !== $this->xobjects) {
            return $this->xobjects;
        }

        $resources = $this->get('Resources');

        if (method_exists($resources, 'has') && $resources->has('XObject')) {
            if ($resources->get('XObject') instanceof Header) {
                $xobjects = $resources->get('XObject')->getElements();
            } else {
                $xobjects = $resources->get('XObject')->getHeader()->getElements();
            }

            $table = [];

            foreach ($xobjects as $id => $xobject) {
                $table[$id] = $xobject;

                // Store too on cleaned id value (only numeric)
                $id = preg_replace('/[^0-9\.\-_]/', '', $id);
                if ('' != $id) {
                    $table[$id] = $xobject;
                }
            }

            return $this->xobjects = $table;
        }

        return [];
    }

    public function getXObject(string $id): ?PDFObject
    {
        $xobjects = $this->getXObjects();

        if (isset($xobjects[$id])) {
            return $xobjects[$id];
        }

        return null;
        /*$id = preg_replace('/[^0-9\.\-_]/', '', $id);

        if (isset($xobjects[$id])) {
            return $xobjects[$id];
        } else {
            return null;
        }*/
    }

    public function getText(self $page = null): string
    {
        if ($contents = $this->get('Contents')) {
            if ($contents instanceof ElementMissing) {
                return '';
            } elseif ($contents instanceof ElementNull) {
                return '';
            } elseif ($contents instanceof PDFObject) {
Issue: $contents is never a sub-type of Smalot\PdfParser\PDFObject.
                $elements = $contents->getHeader()->getElements();

                if (is_numeric(key($elements))) {
                    $new_content = '';

                    foreach ($elements as $element) {
                        if ($element instanceof ElementXRef) {
                            $new_content .= $element->getObject()->getContent();
                        } else {
                            $new_content .= $element->getContent();
                        }
                    }

                    $header = new Header([], $this->document);
                    $contents = new PDFObject($this->document, $header, $new_content, $this->config);
                }
            } elseif ($contents instanceof ElementArray) {
                // Create a virtual global content.
                $new_content = '';

                foreach ($contents->getContent() as $content) {
                    $new_content .= $content->getContent()."\n";
                }

                $header = new Header([], $this->document);
                $contents = new PDFObject($this->document, $header, $new_content, $this->config);
            }

            /*
             * Elements referencing each other on the same page can cause endless loops during text parsing.
             * To combat this we keep a recursionStack containing already parsed elements on the page.
             * The stack is only emptied here after getting text from a page.
             */
            $contentsText = $contents->getText($this);
Bug: The method getText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

            /** @scrutinizer ignore-call */
            $contentsText = $contents->getText($this);
            PDFObject::$recursionStack = [];

            return $contentsText;
        }

        return '';
    }

    /**
     * Return true if the current page is a (setasign\Fpdi\Fpdi) FPDI/FPDF document
     *
     * @return bool true if the current page is a FPDI/FPDF document
     */
    public function isFpdf(): bool
    {
        if (array_key_exists("Producer", $this->document->getDetails()) and
            is_string($this->document->getDetails()["Producer"]) and
            str_starts_with($this->document->getDetails()["Producer"], "FPDF")) {
            return true;
        }
        return false;
    }

    /**
     * Return the page number of the PDF document of the page object
     *
     * @return int the page number
     */
    public function getPageNumber(): int
    {
        $pages = $this->document->getPages();
        $numOfPages = count($pages);
        for ($pageNum = 0; $pageNum < $numOfPages; $pageNum++) {
            if ($pages[$pageNum] === $this) {
                break;
            }
        }
        return $pageNum;
    }

    public function getTextArray(self $page = null): array
    {
        if ($this->isFpdf()) {
            /**
             * This code is for the (setasign\Fpdi\Fpdi) FPDI-FPDF documents.
             * The page number is important for getting the PDF Commands and Text Matrix
             */
            $pageNum = $this->getPageNumber();
            $xObjects = $this->getXObjects();
            /** The correct page info is in $xObject[$pageNum] */
            $xObject = $xObjects[$pageNum];
            $new_content = $xObject->getContent();
            $header = $xObject->getHeader();
            $config = $xObject->config;
            /** Now we create the PDFObject object with the correct info */
            $contents = new PDFObject($xObject->document, $header, $new_content, $config);
            return $contents->getTextArray($xObject);
        }
        if ($contents = $this->get('Contents')) {
            if ($contents instanceof ElementMissing) {
                return [];
            } elseif ($contents instanceof ElementNull) {
                return [];
            } elseif ($contents instanceof PDFObject) {
Issue: $contents is never a sub-type of Smalot\PdfParser\PDFObject.
                $elements = $contents->getHeader()->getElements();

                if (is_numeric(key($elements))) {
                    $new_content = '';

                    /** @var PDFObject $element */
                    foreach ($elements as $element) {
                        if ($element instanceof ElementXRef) {
                            $new_content .= $element->getObject()->getContent();
                        } else {
                            $new_content .= $element->getContent();
                        }
                    }

                    $header = new Header([], $this->document);
                    $contents = new PDFObject($this->document, $header, $new_content, $this->config);
                } else {
                    try {
                        $contents->getTextArray($this);
                    } catch (\Throwable $e) {
                        return $contents->getTextArray();
                    }
                }
            } elseif ($contents instanceof ElementArray) {
                // Create a virtual global content.
                $new_content = '';

                /** @var PDFObject $content */
                foreach ($contents->getContent() as $content) {
                    $new_content .= $content->getContent()."\n";
                }

                $header = new Header([], $this->document);
                $contents = new PDFObject($this->document, $header, $new_content, $this->config);
            }

            return $contents->getTextArray($this);
Bug: The method getTextArray() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

            return $contents->/** @scrutinizer ignore-call */ getTextArray($this);
        }

        return [];
    }

    /**
     * Gets all the text data with its internal representation of the page.
     *
     * Returns an array with the data and the internal representation
     */
    public function extractRawData(): array
    {
        /*
         * Now you can get the complete content of the object with the text on it
         */
        $extractedData = [];
        $content = $this->get('Contents');
        $values = $content->getContent();
        if (isset($values) && \is_array($values)) {
            $text = '';
            foreach ($values as $section) {
                $text .= $section->getContent();
            }
            $sectionsText = $this->getSectionsText($text);
            foreach ($sectionsText as $sectionText) {
                $commandsText = $this->getCommandsText($sectionText);
                foreach ($commandsText as $command) {
                    $extractedData[] = $command;
                }
            }
        } else {
            if ($this->isFpdf()) {
                /*
                 * This code is for the (setasign\Fpdi\Fpdi) FPDI-FPDF documents.
                 * The page number is important for getting the PDF Commands and Text Matrix
                 */
                $pageNum = $this->getPageNumber();
                $xObjects = $this->getXObjects();
                // The correct page info is in $xObject[$pageNum]
                $content = $xObjects[$pageNum];
            }
            $sectionsText = $content->getSectionsText($content->getContent());
Bug: The method getSectionsText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

            /** @scrutinizer ignore-call */
            $sectionsText = $content->getSectionsText($content->getContent());
            foreach ($sectionsText as $sectionText) {
                $extractedData[] = ['t' => '', 'o' => 'BT', 'c' => ''];

                $commandsText = $content->getCommandsText($sectionText);
Bug: The method getCommandsText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

                /** @scrutinizer ignore-call */
                $commandsText = $content->getCommandsText($sectionText);
                foreach ($commandsText as $command) {
                    $extractedData[] = $command;
                }
            }
        }

        return $extractedData;
    }

    /**
     * Gets all the decoded text data with its internal representation from a page.
     *
     * @param array $extractedRawData the extracted data returned by extractRawData or
     *                                null if extractRawData should be called
     *
     * @return array An array with the data and the internal representation
     */
    public function extractDecodedRawData(array $extractedRawData = null): array
    {
        if (!isset($extractedRawData) || !$extractedRawData) {
Bug / Best Practice: The expression $extractedRawData of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.
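
The two forms can be compared side by side in a small sketch. The function names `isMissingImplicit` and `isMissingExplicit` are hypothetical and exist only for this illustration; for a nullable array parameter both are expected to agree, so the change is about readability rather than behaviour.

```php
<?php

// Hypothetical helper mirroring the flagged condition.
function isMissingImplicit(?array $rows): bool
{
    // Implicit array-to-bool conversion: null and [] are both falsy.
    return !isset($rows) || !$rows;
}

// Hypothetical helper using the suggested explicit form.
function isMissingExplicit(?array $rows): bool
{
    // empty() covers null and [] in one visible check.
    return empty($rows);
}

// Both variants treat null and [] as "missing" and any non-empty array as present.
foreach ([null, [], ['Tj'], [0]] as $candidate) {
    var_dump(isMissingImplicit($candidate) === isMissingExplicit($candidate));
}
```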

            $extractedRawData = $this->extractRawData();
        }
        $currentFont = null; /** @var Font $currentFont */
        $clippedFont = null;
        $xObject = null;
        if ($this->isFpdf()) {
            /*
             * This code is for the (setasign\Fpdi\Fpdi) FPDI-FPDF documents.
             * The page number is important for getting the PDF Commands and Text Matrix
             */
            $pageNum = $this->getPageNumber();
            $xObjects = $this->getXObjects();
            /** The correct font page info is in $xObject[$pageNum] */
            $xObject = $xObjects[$pageNum];
        }
        foreach ($extractedRawData as &$command) {
            if ('Tj' == $command['o'] || 'TJ' == $command['o']) {
                $data = $command['c'];
                if (!\is_array($data)) {
                    $tmpText = '';
                    if (isset($currentFont)) {
                        $tmpText = $currentFont->decodeOctal($data);
                        //$tmpText = $currentFont->decodeHexadecimal($tmpText, false);
                    }
                    $tmpText = str_replace(
                            ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
                            ['\\', '(', ')', "\n", "\r", "\t", ' '],
                            $tmpText
                    );
                    $tmpText = utf8_encode($tmpText);
                    if (isset($currentFont)) {
                        $tmpText = $currentFont->decodeContent($tmpText);
                    }
                    $command['c'] = $tmpText;
                    continue;
                }
                $numText = \count($data);
                for ($i = 0; $i < $numText; ++$i) {
                    if (0 != ($i % 2)) {
                        continue;
                    }
                    $tmpText = $data[$i]['c'];
                    $decodedText = isset($currentFont) ? $currentFont->decodeOctal($tmpText) : $tmpText;
                    $decodedText = str_replace(
                            ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
                            ['\\', '(', ')', "\n", "\r", "\t", ' '],
                            $decodedText
                    );
                    $decodedText = utf8_encode($decodedText);
                    if (isset($currentFont)) {
                        $decodedText = $currentFont->decodeContent($decodedText);
                    }
                    $command['c'][$i]['c'] = $decodedText;
                    continue;
                }
            } elseif ('Tf' == $command['o'] || 'TF' == $command['o']) {
                $fontId = explode(' ', $command['c'])[0];
                /** If document is a FPDI/FPDF the $xObject has the correct font */
                $currentFont = isset($xObject) ? $xObject->getFont($fontId) : $this->getFont($fontId);
Bug: The method getFont() does not exist on Smalot\PdfParser\PDFObject. It seems like you code against a sub-type of Smalot\PdfParser\PDFObject such as Smalot\PdfParser\Page. (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

                $currentFont = isset($xObject) ? $xObject->/** @scrutinizer ignore-call */ getFont($fontId) : $this->getFont($fontId);
                continue;
            } elseif ('Q' == $command['o']) {
                $currentFont = $clippedFont;
            } elseif ('q' == $command['o']) {
                $clippedFont = $currentFont;
            }
        }

        return $extractedRawData;
    }

    /**
     * Gets just the Text commands that are involved in text positions and
     * Text Matrix (Tm)
     *
     * It extracts just the PDF commands that are involved with text positions, and
     * the Text Matrix (Tm). These are: BT, ET, TL, Td, TD, Tm, T*, Tj, ', ", and TJ
     *
     * @param array $extractedDecodedRawData The data extracted by extractDecodedRawData.
     *                                       If it is null, the method extractDecodedRawData is called.
     *
     * @return array An array with the text command of the page
     */
    public function getDataCommands(array $extractedDecodedRawData = null): array
    {
        if (!isset($extractedDecodedRawData) || !$extractedDecodedRawData) {
Bug / Best Practice: The expression $extractedDecodedRawData of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.
            $extractedDecodedRawData = $this->extractDecodedRawData();
        }
        $extractedData = [];
        foreach ($extractedDecodedRawData as $command) {
            switch ($command['o']) {
                /*
                 * BT
                 * Begin a text object, initializing the Tm and Tlm to identity matrix
                 */
                case 'BT':
                    $extractedData[] = $command;
                    break;

                /*
                 * ET
                 * End a text object, discarding the text matrix
                 */
                case 'ET':
                    $extractedData[] = $command;
                    break;

                /*
                 * leading TL
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
                 * Initial value: 0
                 */
                case 'TL':
                    $extractedData[] = $command;
                    break;

                /*
                 * tx ty Td
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty.
                 */
                case 'Td':
                    $extractedData[] = $command;
                    break;

                /*
                 * tx ty TD
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty. As a side effect, this operator sets the leading
                 * parameter in the text state. This operator has the same effect as the
                 * code:
                 * -ty TL
                 * tx ty Td
                 */
                case 'TD':
                    $extractedData[] = $command;
                    break;

                /*
                 * a b c d e f Tm
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
                 * [1 0 0 1 0 0]
                 */
                case 'Tm':
                    $extractedData[] = $command;
                    break;

                /*
                 * T*
                 * Move to the start of the next line. This operator has the same effect
                 * as the code:
                 * 0 Tl Td
                 * Where Tl is the current leading parameter in the text state.
                 */
                case 'T*':
                    $extractedData[] = $command;
                    break;

                /*
                 * string Tj
                 * Show a Text String
                 */
                case 'Tj':
                    $extractedData[] = $command;
                    break;

                /*
                 * string '
                 * Move to the next line and show a text string. This operator has the
                 * same effect as the code:
                 * T*
                 * string Tj
                 */
                case "'":
                    $extractedData[] = $command;
                    break;

                /*
                 * aw ac string "
                 * Move to the next line and show a text string, using aw as the word
                 * spacing and ac as the character spacing. This operator has the same
                 * effect as the code:
                 * aw Tw
                 * ac Tc
                 * string '
                 * Tw sets the word spacing, Tw, to wordSpace.
                 * Tc sets the character spacing, Tc, to charsSpace.
                 */
                case '"':
                    $extractedData[] = $command;
                    break;

                /*
                 * array TJ
                 * Show one or more text strings, allowing individual glyph positioning.
                 * Each element of array can be a string or a number. If the element is
                 * a string, this operator shows the string. If it is a number, the
                 * operator adjusts the text position by that amount; that is, it translates
                 * the text matrix, Tm. This amount is subtracted from the current
                 * horizontal or vertical coordinate, depending on the writing mode.
                 * In the default coordinate system, a positive adjustment has the effect
                 * of moving the next glyph painted either to the left or down by the given
                 * amount.
                 */
                case 'TJ':
                    $extractedData[] = $command;
                    break;
                default:
            }
        }

        return $extractedData;
    }

    /**
     * Gets the Text Matrix of the text in the page
     *
     * Returns an array in which every item is an array whose first item is the
     * Text Matrix (Tm) and whose second is a string with the text data. The Text Matrix
     * is an array of 6 numbers. The last 2 numbers are the X and Y coordinates of the
     * text. The first 4 numbers encode the scaling, rotation and skew of the text.
     *
     * @param array $dataCommands the data extracted by getDataCommands
     *                            if null getDataCommands is called
     *
     * @return array an array with the data of the page including the Tm information
     *               of any text in the page
     */
    public function getDataTm(array $dataCommands = null): array
    {
        if (!isset($dataCommands) || !$dataCommands) {
Bug / Best Practice: The expression $dataCommands of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.
617
            $dataCommands = $this->getDataCommands();
        }

        /*
         * At the beginning of a text object Tm is the identity matrix
         */
        $defaultTm = ['1', '0', '0', '1', '0', '0'];

        /*
         * Set the text leading used by the T*, ' and " operators
         */
        $defaultTl = 0;

        /*
         * Positions of the X and Y coordinates in the matrix (Tm)
         */
        $x = 4;
        $y = 5;
        $Tx = 0;
        $Ty = 0;

        $Tm = $defaultTm;
        $Tl = $defaultTl;

        $extractedTexts = $this->getTextArray();
        $extractedData = [];
        foreach ($dataCommands as $command) {
            $currentText = $extractedTexts[\count($extractedData)];
            switch ($command['o']) {
                /*
                 * BT
                 * Begin a text object, initializing Tm and Tlm to the identity matrix
                 */
                case 'BT':
                    $Tm = $defaultTm;
                    $Tl = $defaultTl; //review this.
                    $Tx = 0;
                    $Ty = 0;
                    break;

                /*
                 * ET
                 * End a text object, discarding the text matrix
                 */
                case 'ET':
                    $Tm = $defaultTm;
                    $Tl = $defaultTl;  //review this
                    $Tx = 0;
                    $Ty = 0;
                    break;

                /*
                 * leading TL
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
                 * Initial value: 0
                 */
                case 'TL':
                    $Tl = (float) $command['c'];
                    break;

                /*
                 * tx ty Td
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty.
                 */
                case 'Td':
                    $coord = explode(' ', $command['c']);
                    $Tx += (float) $coord[0];
                    $Ty += (float) $coord[1];
                    $Tm[$x] = (string) $Tx;
                    $Tm[$y] = (string) $Ty;
                    break;

                /*
                 * tx ty TD
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty. As a side effect, this operator sets the leading
                 * parameter in the text state. This operator has the same effect as the
                 * code:
                 * -ty TL
                 * tx ty Td
                 */
                case 'TD':
                    $coord = explode(' ', $command['c']);
                    // Per the description above: leading becomes -ty, then Td is applied.
                    $Tl = -((float) $coord[1]);
                    $Tx += (float) $coord[0];
                    $Ty += (float) $coord[1];
                    $Tm[$x] = (string) $Tx;
                    $Tm[$y] = (string) $Ty;
                    break;

                /*
                 * a b c d e f Tm
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
                 * [1 0 0 1 0 0]
                 */
                case 'Tm':
                    $Tm = explode(' ', $command['c']);
                    $Tx = (float) $Tm[$x];
                    $Ty = (float) $Tm[$y];
                    break;

                /*
                 * T*
                 * Move to the start of the next line. This operator has the same effect
                 * as the code:
                 * 0 -Tl Td
                 * where Tl is the current leading parameter in the text state.
                 */
                case 'T*':
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    break;

                /*
                 * string Tj
                 * Show a text string
                 */
                case 'Tj':
                    $extractedData[] = [$Tm, $currentText];
                    break;

                /*
                 * string '
                 * Move to the next line and show a text string. This operator has the
                 * same effect as the code:
                 * T*
                 * string Tj
                 */
                case "'":
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    $extractedData[] = [$Tm, $currentText];
                    break;

                /*
                 * aw ac string "
                 * Move to the next line and show a text string, using aw as the word
                 * spacing and ac as the character spacing. This operator has the same
                 * effect as the code:
                 * aw Tw
                 * ac Tc
                 * string '
                 * Tw sets the word spacing, Tw, to wordSpace.
                 * Tc sets the character spacing, Tc, to charSpace.
                 */
                case '"':
                    $data = explode(' ', $currentText);
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    $extractedData[] = [$Tm, $data[2]]; //Verify
                    break;

                /*
                 * array TJ
                 * Show one or more text strings, allowing individual glyph positioning.
                 * Each element of array can be a string or a number. If the element is
                 * a string, this operator shows the string. If it is a number, the
                 * operator adjusts the text position by that amount; that is, it translates
                 * the text matrix, Tm. This amount is subtracted from the current
                 * horizontal or vertical coordinate, depending on the writing mode.
                 * In the default coordinate system, a positive adjustment has the effect
                 * of moving the next glyph painted either to the left or down by the given
                 * amount.
                 */
                case 'TJ':
                    $extractedData[] = [$Tm, $currentText];
                    break;
                default:
            }
        }
        $this->dataTm = $extractedData;

        return $extractedData;
    }

    /**
     * Gets text data that are around the given coordinates (X,Y)
     *
     * If the text is near the given coordinates (X,Y) (or the Tm info),
     * the text is returned. The extractedData returned by getDataTm can be used
     * to find the coordinates of a given text, using its Tm info.
     *
     * @param float $x      The X value of the coordinate to search for. If null,
     *                      just the Y value is considered (same row)
     * @param float $y      The Y value of the coordinate to search for. If null,
     *                      just the X value is considered (same column)
     * @param float $xError The value less or more to consider an X to be "near"
     * @param float $yError The value less or more to consider a Y to be "near"
     *
     * @return array An array of text that are near the given coordinates. If no text
     *               is "near" the x,y coordinate, an empty array is returned. If both
     *               x and y coordinates are null, an empty array is returned as well.
     */
    public function getTextXY(float $x = null, float $y = null, float $xError = 0, float $yError = 0): array
    {
        if (!isset($this->dataTm) || !$this->dataTm) {
Bug Best Practice
The expression $this->dataTm of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

            $this->getDataTm();
        }

        if (null !== $x) {
            $x = (float) $x;
        }

        if (null !== $y) {
            $y = (float) $y;
        }

        if (null === $x && null === $y) {
            return [];
        }

        $xError = (float) $xError;
        $yError = (float) $yError;

        $extractedData = [];
        foreach ($this->dataTm as $item) {
            $tm = $item[0];
            $xTm = (float) $tm[4];
            $yTm = (float) $tm[5];
            $text = $item[1];
            if (null === $y) {
                if (($xTm >= ($x - $xError)) &&
                    ($xTm <= ($x + $xError))) {
                    $extractedData[] = [$tm, $text];
                    continue;
                }
            }
            if (null === $x) {
                if (($yTm >= ($y - $yError)) &&
                    ($yTm <= ($y + $yError))) {
                    $extractedData[] = [$tm, $text];
                    continue;
                }
            }
            if (($xTm >= ($x - $xError)) &&
                ($xTm <= ($x + $xError)) &&
                ($yTm >= ($y - $yError)) &&
                ($yTm <= ($y + $yError))) {
                $extractedData[] = [$tm, $text];
                continue;
            }
        }

        return $extractedData;
    }
}
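
The operator descriptions in the `getDataTm()` comments can be traced on a small hand-written command list. The sketch below is a simplified, standalone illustration and not the library's implementation: the function name `trackTextPosition` is hypothetical, it follows the operator semantics as the comments state them (`TD` as `-ty TL` followed by `tx ty Td`), and it tracks only the translation part of the matrix, assuming no scaling or rotation.

```php
<?php

/**
 * Hypothetical, simplified tracker for the text-positioning operators.
 * Each command is ['o' => operator, 'c' => operand string], the same
 * shape that getDataTm() reads. Only X/Y translation is tracked.
 */
function trackTextPosition(array $commands): array
{
    $tx = 0.0;      // X of the current line start
    $ty = 0.0;      // Y of the current line start
    $leading = 0.0; // text leading (Tl)
    $positions = [];

    foreach ($commands as $command) {
        $operands = preg_split('/\s+/', trim($command['c']), -1, PREG_SPLIT_NO_EMPTY);
        switch ($command['o']) {
            case 'BT':
                // A new text object starts at the origin.
                $tx = 0.0;
                $ty = 0.0;
                break;
            case 'TL':
                $leading = (float) $operands[0];
                break;
            case 'Td':
                $tx += (float) $operands[0];
                $ty += (float) $operands[1];
                break;
            case 'TD':
                // Same effect as "-ty TL" followed by "tx ty Td".
                $leading = -((float) $operands[1]);
                $tx += (float) $operands[0];
                $ty += (float) $operands[1];
                break;
            case 'Tm':
                // Operands are a b c d e f; e and f are the translation.
                $tx = (float) $operands[4];
                $ty = (float) $operands[5];
                break;
            case 'T*':
                // Move down by the current leading.
                $ty -= $leading;
                break;
            case 'Tj':
                // Record where this string is shown.
                $positions[] = [$tx, $ty];
                break;
        }
    }

    return $positions;
}

$positions = trackTextPosition([
    ['o' => 'BT', 'c' => ''],
    ['o' => 'Td', 'c' => '72 700'],
    ['o' => 'Tj', 'c' => '(first line)'],
    ['o' => 'TD', 'c' => '0 -14'],
    ['o' => 'Tj', 'c' => '(second line)'],
    ['o' => 'T*', 'c' => ''],
    ['o' => 'Tj', 'c' => '(third line)'],
]);

print_r($positions); // [72, 700], [72, 686], [72, 672]
```

Each line lands 14 units below the previous one: `TD` both moves the position and sets the leading that the following `T*` reuses.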
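
The matching rule in `getTextXY()` can also be written as one tolerance test per axis. The following is a hedged sketch on hand-built data rather than a drop-in replacement: `filterNear` is a hypothetical name, and the sample `[Tm, text]` pairs are invented for illustration. It uses `abs()` instead of the paired `>=` / `<=` comparisons, which is the same interval test up to floating-point rounding at the boundaries.

```php
<?php

/**
 * Hypothetical helper: keeps the [Tm, text] pairs whose coordinates fall
 * within the given tolerances. A null $x or $y disables that axis.
 */
function filterNear(array $dataTm, ?float $x, ?float $y, float $xError = 0.0, float $yError = 0.0): array
{
    // With no coordinate at all there is nothing to search for.
    if (null === $x && null === $y) {
        return [];
    }

    $matches = [];
    foreach ($dataTm as [$tm, $text]) {
        // Positions 4 and 5 of the text matrix hold X and Y.
        $xOk = null === $x || abs((float) $tm[4] - $x) <= $xError;
        $yOk = null === $y || abs((float) $tm[5] - $y) <= $yError;
        if ($xOk && $yOk) {
            $matches[] = [$tm, $text];
        }
    }

    return $matches;
}

// Invented sample data in the shape returned by getDataTm().
$dataTm = [
    [['1', '0', '0', '1', '72', '700'], 'Title'],
    [['1', '0', '0', '1', '72', '686'], 'Subtitle'],
    [['1', '0', '0', '1', '300', '700'], 'Date'],
];

// Same row as Y = 700, any X: Title and Date.
print_r(array_column(filterNear($dataTm, null, 700.0), 1));
// X = 72 exactly, Y within 5 units of 690: Subtitle.
print_r(array_column(filterNear($dataTm, 72.0, 690.0, 0.0, 5.0), 1));
```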