Passed
Pull Request — master (#455)
by
unknown
01:56
created

Page   F

Complexity

Total Complexity 127

Size/Duplication

Total Lines 816
Duplicated Lines 0 %

Test Coverage

Coverage 77.1%

Importance

Changes 17
Bugs 4 Features 2
Metric Value
eloc 346
c 17
b 4
f 2
dl 0
loc 816
ccs 266
cts 345
cp 0.771
rs 2
wmc 127

13 Methods

Rating  Name                     Duplication  Size  Complexity
B       getFonts()               0            37    9
A       getFont()                0            22    4
B       getText()                0            40    10
A       getXObject()             0            9     2
B       getXObjects()            0            31    7
C       getDataCommands()        0            130   15
A       isFpdf()                 0            8     4
C       getTextArray()           0            64    12
A       getPageNumber()          0            10    3
D       extractDecodedRawData()  0            70    19
B       extractRawData()         0            43    9
C       getDataTm()              0            178   15
D       getTextXY()              0            51    18

How to fix   Complexity   

Complex Class

Complex classes like Page often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

While breaking up the class, it is a good idea to analyze how other classes use Page, and based on these observations, apply Extract Interface, too.
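As a rough illustration of Extract Class, the sketch below pulls the resource-name lookup that Page repeats for fonts and XObjects into a small class of its own. The class name `ResourceTable` and its `add()` and `find()` methods are hypothetical and are not part of PdfParser; only the cleaning pattern is taken from the listing that follows. It is a sketch of the idea, not a proposed patch.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical extracted class (not part of PdfParser): owns the
 * "store under raw ID and cleaned ID" lookup that Page implements
 * inline for both fonts and XObjects.
 */
final class ResourceTable
{
    /** @var array<int|string, mixed> */
    private array $items = [];

    /** Same cleaning rule as in Page: keep digits, '.', '-' and '_'. */
    private static function clean(string $id): string
    {
        return (string) preg_replace('/[^0-9\.\-_]/', '', $id);
    }

    /** Register an item under its raw ID and, if non-empty, its cleaned ID. */
    public function add(string $id, $item): void
    {
        $this->items[$id] = $item;

        $cleaned = self::clean($id);
        if ('' !== $cleaned) {
            $this->items[$cleaned] = $item;
        }
    }

    /** Look up the raw ID first, then fall back to the cleaned ID. */
    public function find(string $id)
    {
        if (isset($this->items[$id])) {
            return $this->items[$id];
        }

        return $this->items[self::clean($id)] ?? null;
    }
}

// Usage: 'R1' is unknown as a raw ID but resolves through the cleaned ID '1'.
$table = new ResourceTable();
$table->add('F1', 'Helvetica');
var_dump($table->find('F1'));
var_dump($table->find('R1'));
var_dump($table->find('F2'));
```

With a class like this, getFonts() and getXObjects() would each only build a table, and getFont() and getXObject() would delegate to `find()`, which removes the duplicated cleaning logic from Page.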

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 * @date    2017-01-03
 *
 * @license LGPLv3
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser;

use Smalot\PdfParser\Element\ElementArray;
use Smalot\PdfParser\Element\ElementMissing;
use Smalot\PdfParser\Element\ElementNull;
use Smalot\PdfParser\Element\ElementXRef;

class Page extends PDFObject
{
    /**
     * @var Font[]
     */
    protected $fonts = null;

    /**
     * @var PDFObject[]
     */
    protected $xobjects = null;

    /**
     * @var array
     */
    protected $dataTm = null;

    /**
     * @return Font[]
     */
    public function getFonts()
    {
        if (null !== $this->fonts) {
            return $this->fonts;
        }

        $resources = $this->get('Resources');

        if (method_exists($resources, 'has') && $resources->has('Font')) {
            if ($resources->get('Font') instanceof ElementMissing) {

Bug: The method get() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation)

If this is a false positive, you can ignore this issue in your code via the ignore-call annotation:

            if ($resources->/** @scrutinizer ignore-call */ get('Font') instanceof ElementMissing) {

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.

                return [];
            }

            if ($resources->get('Font') instanceof Header) {
                $fonts = $resources->get('Font')->getElements();
            } else {
                $fonts = $resources->get('Font')->getHeader()->getElements();
            }

            $table = [];

            foreach ($fonts as $id => $font) {
                if ($font instanceof Font) {
                    $table[$id] = $font;

                    // Also store under the cleaned id (digits, '.', '-' and '_' only)
                    $id = preg_replace('/[^0-9\.\-_]/', '', $id);
                    if ('' != $id) {
                        $table[$id] = $font;
                    }
                }
            }

            return $this->fonts = $table;
        }

        return [];
    }

    public function getFont(string $id): ?Font
    {
        $fonts = $this->getFonts();

        if (isset($fonts[$id])) {
            return $fonts[$id];
        }

        // According to the PDF specs (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, page 238)
        // "The font resource name presented to the Tf operator is arbitrary, as are the names for all kinds of resources"
        // Instead, we search for the unfiltered name first and then do this cleaning as a fallback, so all tests still pass.

        if (isset($fonts[$id])) {
            return $fonts[$id];
        } else {
            $id = preg_replace('/[^0-9\.\-_]/', '', $id);
            if (isset($fonts[$id])) {
                return $fonts[$id];
            }
        }

        return null;
    }

    /**
     * Support for XObject
     *
     * @return PDFObject[]
     */
    public function getXObjects()
    {
        if (null !== $this->xobjects) {
            return $this->xobjects;
        }

        $resources = $this->get('Resources');

        if (method_exists($resources, 'has') && $resources->has('XObject')) {
            if ($resources->get('XObject') instanceof Header) {
                $xobjects = $resources->get('XObject')->getElements();
            } else {
                $xobjects = $resources->get('XObject')->getHeader()->getElements();
            }

            $table = [];

            foreach ($xobjects as $id => $xobject) {
                $table[$id] = $xobject;

                // Also store under the cleaned id (digits, '.', '-' and '_' only)
                $id = preg_replace('/[^0-9\.\-_]/', '', $id);
                if ('' != $id) {
                    $table[$id] = $xobject;
                }
            }

            return $this->xobjects = $table;
        }

        return [];
    }

    public function getXObject(string $id): ?PDFObject
    {
        $xobjects = $this->getXObjects();

        if (isset($xobjects[$id])) {
            return $xobjects[$id];
        }

        return null;
        /*$id = preg_replace('/[^0-9\.\-_]/', '', $id);

        if (isset($xobjects[$id])) {
            return $xobjects[$id];
        } else {
            return null;
        }*/
    }

    public function getText(self $page = null): string
    {
        if ($contents = $this->get('Contents')) {
            if ($contents instanceof ElementMissing) {
                return '';
            } elseif ($contents instanceof ElementNull) {
                return '';
            } elseif ($contents instanceof PDFObject) {

Issue: $contents is never a sub-type of Smalot\PdfParser\PDFObject.

                $elements = $contents->getHeader()->getElements();

                if (is_numeric(key($elements))) {
                    $new_content = '';

                    foreach ($elements as $element) {
                        if ($element instanceof ElementXRef) {
                            $new_content .= $element->getObject()->getContent();
                        } else {
                            $new_content .= $element->getContent();
                        }
                    }

                    $header = new Header([], $this->document);
                    $contents = new PDFObject($this->document, $header, $new_content, $this->config);
                }
            } elseif ($contents instanceof ElementArray) {
                // Create a virtual global content.
                $new_content = '';

                foreach ($contents->getContent() as $content) {
                    $new_content .= $content->getContent()."\n";
                }

                $header = new Header([], $this->document);
                $contents = new PDFObject($this->document, $header, $new_content, $this->config);
            }

            return $contents->getText($this);

Bug: The method getText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation)

If this is a false positive, you can ignore this issue in your code via the ignore-call annotation:

            return $contents->/** @scrutinizer ignore-call */ getText($this);

        }

        return '';
    }

    /**
     * Return true if the current page is a (setasign\Fpdi\Fpdi) FPDI/FPDF document
     *
     * @return bool true if the current page is a FPDI/FPDF document
     */
    public function isFpdf(): bool
    {
        if (array_key_exists("Producer", $this->document->getDetails(true)) and

Unused Code: The call to Smalot\PdfParser\Document::getDetails() has too many arguments starting with true. (Ignorable by annotation)

If this is a false positive, you can ignore this issue in your code via the ignore-call annotation:

        if (array_key_exists("Producer", $this->document->/** @scrutinizer ignore-call */ getDetails(true)) and

This check compares calls to functions or methods with their respective definitions. If the call has more arguments than are defined, it raises an issue.

If a function is defined several times with a different number of parameters, the check may pick up the wrong definition and report false positives. One codebase where this has been known to happen is WordPress. Please note the @ignore annotation hint above.

            is_string($this->document->getDetails(true)["Producer"]) and
            str_starts_with($this->document->getDetails(true)["Producer"], "FPDF")) {
            return true;
        }
        return false;
    }

    /**
     * Return the page number of the PDF document of the page object
     *
     * @return int the page number
     */
    public function getPageNumber(): int
    {
        $pages = $this->document->getPages();
        $numOfPages = count($pages);
        for ($pageNum = 0; $pageNum < $numOfPages; $pageNum++) {
            if ($pages[$pageNum] === $this) {
                break;
            }
        }
        return $pageNum;
    }

    public function getTextArray(self $page = null): array
    {
        if ($this->isFpdf()) {
            /**
             * This code is for the (setasign\Fpdi\Fpdi) FPDI-FPDF documents.
             * The page number is important for getting the PDF Commands and Text Matrix
             */
            $pageNum = $this->getPageNumber();
            $xObjects = $this->getXObjects();
            /** The correct page info is in $xObjects[$pageNum] */
            $xObject = $xObjects[$pageNum];
            $new_content = $xObject->getContent();
            $header = $xObject->getHeader();
            $config = $xObject->config;
            /** Now we create the PDFObject object with the correct info */
            $contents = new PDFObject($xObject->document, $header, $new_content, $config);
            return $contents->getTextArray($xObject);
        }
        if ($contents = $this->get('Contents')) {
            if ($contents instanceof ElementMissing) {
                return [];
            } elseif ($contents instanceof ElementNull) {
                return [];
            } elseif ($contents instanceof PDFObject) {

Issue: $contents is never a sub-type of Smalot\PdfParser\PDFObject.

                $elements = $contents->getHeader()->getElements();

                if (is_numeric(key($elements))) {
                    $new_content = '';

                    /** @var PDFObject $element */
                    foreach ($elements as $element) {
                        if ($element instanceof ElementXRef) {
                            $new_content .= $element->getObject()->getContent();
                        } else {
                            $new_content .= $element->getContent();
                        }
                    }

                    $header = new Header([], $this->document);
                    $contents = new PDFObject($this->document, $header, $new_content, $this->config);
                } else {
                    try {
                        $contents->getTextArray($this);
                    } catch (\Throwable $e) {
                        return $contents->getTextArray();
                    }
                }
            } elseif ($contents instanceof ElementArray) {
                // Create a virtual global content.
                $new_content = '';

                /** @var PDFObject $content */
                foreach ($contents->getContent() as $content) {
                    $new_content .= $content->getContent()."\n";
                }

                $header = new Header([], $this->document);
                $contents = new PDFObject($this->document, $header, $new_content, $this->config);
            }

            return $contents->getTextArray($this);

Bug: The method getTextArray() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation)

If this is a false positive, you can ignore this issue in your code via the ignore-call annotation:

            return $contents->/** @scrutinizer ignore-call */ getTextArray($this);

        }

        return [];
    }

    /**
     * Gets all the text data with its internal representation of the page.
     *
     * @return array An array with the data and the internal representation
     */
    public function extractRawData(): array
    {
        /*
         * Now you can get the complete content of the object with the text on it
         */
        $extractedData = [];
        $content = $this->get('Contents');
        $values = $content->getContent();
        if (isset($values) && \is_array($values)) {
            $text = '';
            foreach ($values as $section) {
                $text .= $section->getContent();
            }
            $sectionsText = $this->getSectionsText($text);
            foreach ($sectionsText as $sectionText) {
                $commandsText = $this->getCommandsText($sectionText);
                foreach ($commandsText as $command) {
                    $extractedData[] = $command;
                }
            }
        } else {
            if ($this->isFpdf()) {
                /**
                 * This code is for the (setasign\Fpdi\Fpdi) FPDI-FPDF documents.
                 * The page number is important for getting the PDF Commands and Text Matrix
                 */
                $pageNum = $this->getPageNumber();
                $xObjects = $this->getXObjects();
                /** The correct page info is in $xObjects[$pageNum] */
                $content = $xObjects[$pageNum];
            }
            $sectionsText = $content->getSectionsText($content->getContent());

Bug: The method getSectionsText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation)

If this is a false positive, you can ignore this issue in your code via the ignore-call annotation:

            /** @scrutinizer ignore-call */
            $sectionsText = $content->getSectionsText($content->getContent());

            foreach ($sectionsText as $sectionText) {
                $extractedData[] = ['t' => '', 'o' => 'BT', 'c' => ''];

                $commandsText = $content->getCommandsText($sectionText);

Bug: The method getCommandsText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation)

If this is a false positive, you can ignore this issue in your code via the ignore-call annotation:

                /** @scrutinizer ignore-call */
                $commandsText = $content->getCommandsText($sectionText);

                foreach ($commandsText as $command) {
                    $extractedData[] = $command;
                }
            }
        }

        return $extractedData;
    }

    /**
     * Gets all the decoded text data with its internal representation from a page.
     *
     * @param array $extractedRawData the extracted data returned by extractRawData or
     *                                null if extractRawData should be called
     *
     * @return array An array with the data and the internal representation
     */
    public function extractDecodedRawData(array $extractedRawData = null): array
    {
        if (!isset($extractedRawData) || !$extractedRawData) {

Bug Best Practice: The expression $extractedRawData of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

            $extractedRawData = $this->extractRawData();
        }
        $currentFont = null; /** @var Font $currentFont */
        $clippedFont = null;
        $xObject = null;
        if ($this->isFpdf()) {
            /**
             * This code is for the (setasign\Fpdi\Fpdi) FPDI-FPDF documents.
             * The page number is important for getting the PDF Commands and Text Matrix
             */
            $pageNum = $this->getPageNumber();
            $xObjects = $this->getXObjects();
            /** The correct font page info is in $xObjects[$pageNum] */
            $xObject = $xObjects[$pageNum];
        }
        foreach ($extractedRawData as &$command) {
            if ('Tj' == $command['o'] || 'TJ' == $command['o']) {
                $data = $command['c'];
                if (!\is_array($data)) {
                    $tmpText = '';
                    if (isset($currentFont)) {
                        $tmpText = $currentFont->decodeOctal($data);
                        //$tmpText = $currentFont->decodeHexadecimal($tmpText, false);
                    }
                    $tmpText = str_replace(
                        ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
                        ['\\', '(', ')', "\n", "\r", "\t", ' '],
                        $tmpText
                    );
                    $tmpText = utf8_encode($tmpText);
                    if (isset($currentFont)) {
                        $tmpText = $currentFont->decodeContent($tmpText);
                    }
                    $command['c'] = $tmpText;
                    continue;
                }
                $numText = \count($data);
                for ($i = 0; $i < $numText; ++$i) {
                    if (0 != ($i % 2)) {
                        continue;
                    }
                    $tmpText = $data[$i]['c'];
                    $decodedText = isset($currentFont) ? $currentFont->decodeOctal($tmpText) : $tmpText;
                    $decodedText = str_replace(
                        ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
                        ['\\', '(', ')', "\n", "\r", "\t", ' '],
                        $decodedText
                    );
                    $decodedText = utf8_encode($decodedText);
                    if (isset($currentFont)) {
                        $decodedText = $currentFont->decodeContent($decodedText);
                    }
                    $command['c'][$i]['c'] = $decodedText;
                    continue;
                }
            } elseif ('Tf' == $command['o'] || 'TF' == $command['o']) {
                $fontId = explode(' ', $command['c'])[0];
                /** If document is a FPDI/FPDF the $xObject has the correct font */
                $currentFont = isset($xObject) ? $xObject->getFont($fontId) : $this->getFont($fontId);

Bug: The method getFont() does not exist on Smalot\PdfParser\PDFObject. It seems like you code against a sub-type of Smalot\PdfParser\PDFObject such as Smalot\PdfParser\Page. (Ignorable by annotation)

If this is a false positive, you can ignore this issue in your code via the ignore-call annotation:

                $currentFont = isset($xObject) ? $xObject->/** @scrutinizer ignore-call */ getFont($fontId) : $this->getFont($fontId);

                continue;
            } elseif ('Q' == $command['o']) {
                $currentFont = $clippedFont;
            } elseif ('q' == $command['o']) {
                $clippedFont = $currentFont;
            }
        }

        return $extractedRawData;
    }

    /**
     * Gets just the Text commands that are involved in text positions and
     * Text Matrix (Tm)
     *
     * It extracts just the PDF commands that are involved with text positions, and
     * the Text Matrix (Tm). These are: BT, ET, TL, Td, TD, Tm, T*, Tj, ', ", and TJ
     *
     * @param array $extractedDecodedRawData The data extracted by extractDecodedRawData.
     *                                       If it is null, the method extractDecodedRawData is called.
     *
     * @return array An array with the text command of the page
     */
    public function getDataCommands(array $extractedDecodedRawData = null): array
    {
        if (!isset($extractedDecodedRawData) || !$extractedDecodedRawData) {

Bug Best Practice: The expression $extractedDecodedRawData of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

            $extractedDecodedRawData = $this->extractDecodedRawData();
        }
        $extractedData = [];
        foreach ($extractedDecodedRawData as $command) {
            switch ($command['o']) {
                /*
                 * BT
                 * Begin a text object, initializing the Tm and Tlm to identity matrix
                 */
                case 'BT':
                    $extractedData[] = $command;
                    break;

                /*
                 * ET
                 * End a text object, discarding the text matrix
                 */
                case 'ET':
                    $extractedData[] = $command;
                    break;

                /*
                 * leading TL
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
                 * Initial value: 0
                 */
                case 'TL':
                    $extractedData[] = $command;
                    break;

                /*
                 * tx ty Td
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty.
                 */
                case 'Td':
                    $extractedData[] = $command;
                    break;

                /*
                 * tx ty TD
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty. As a side effect, this operator sets the leading
                 * parameter in the text state. This operator has the same effect as the
                 * code:
                 * -ty TL
                 * tx ty Td
                 */
                case 'TD':
                    $extractedData[] = $command;
                    break;

                /*
                 * a b c d e f Tm
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
                 * [1 0 0 1 0 0]
                 */
                case 'Tm':
                    $extractedData[] = $command;
                    break;

                /*
                 * T*
                 * Move to the start of the next line. This operator has the same effect
                 * as the code:
                 * 0 -Tl Td
                 * Where Tl is the current leading parameter in the text state.
                 */
                case 'T*':
                    $extractedData[] = $command;
                    break;

                /*
                 * string Tj
                 * Show a text string
                 */
                case 'Tj':
                    $extractedData[] = $command;
                    break;

                /*
                 * string '
                 * Move to the next line and show a text string. This operator has the
                 * same effect as the code:
                 * T*
                 * string Tj
                 */
                case "'":
                    $extractedData[] = $command;
                    break;

                /*
                 * aw ac string "
                 * Move to the next line and show a text string, using aw as the word
                 * spacing and ac as the character spacing. This operator has the same
                 * effect as the code:
                 * aw Tw
                 * ac Tc
                 * string '
                 * Tw sets the word spacing, Tw, to wordSpace.
                 * Tc sets the character spacing, Tc, to charsSpace.
                 */
                case '"':
                    $extractedData[] = $command;
                    break;

                /*
                 * array TJ
                 * Show one or more text strings, allowing individual glyph positioning.
                 * Each element of array can be a string or a number. If the element is
                 * a string, this operator shows the string. If it is a number, the
                 * operator adjusts the text position by that amount; that is, it translates
                 * the text matrix, Tm. This amount is subtracted from the current
                 * horizontal or vertical coordinate, depending on the writing mode.
                 * In the default coordinate system, a positive adjustment has the effect
                 * of moving the next glyph painted either to the left or down by the given
                 * amount.
                 */
                case 'TJ':
                    $extractedData[] = $command;
                    break;
                default:
            }
        }

        return $extractedData;
    }

591
    /**
592
     * Gets the Text Matrix of the text in the page
593
     *
594
     * Return an array where every item is an array where the first item is the
595
     * Text Matrix (Tm) and the second is a string with the text data.  The Text matrix
596
     * is an array of 6 numbers. The last 2 numbers are the coordinates X and Y of the
597
     * text. The first 4 numbers has to be with Scalation, Rotation and Skew of the text.
598
     *
599
     * @param array $dataCommands the data extracted by getDataCommands
600
     *                            if null getDataCommands is called
601
     *
602
     * @return array an array with the data of the page including the Tm information
603
     *               of any text in the page
604
     */
    public function getDataTm(array $dataCommands = null): array
    {
        if (!isset($dataCommands) || !$dataCommands) {
Bug Best Practice
The expression $dataCommands of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

            $dataCommands = $this->getDataCommands();
        }

        /*
         * At the beginning of a text object, Tm is the identity matrix
         */
        $defaultTm = ['1', '0', '0', '1', '0', '0'];

        /*
         * Default text leading, used by the T*, ' and " operators
         */
        $defaultTl = 0;

        /*
         * Indices of the X and Y coordinates within the text matrix (Tm)
         */
        $x = 4;
        $y = 5;
        $Tx = 0;
        $Ty = 0;

        $Tm = $defaultTm;
        $Tl = $defaultTl;

        $extractedTexts = $this->getTextArray();
        $extractedData = [];
        foreach ($dataCommands as $command) {
            $currentText = $extractedTexts[\count($extractedData)];
            switch ($command['o']) {
                /*
                 * BT
                 * Begin a text object, initializing Tm and Tlm to the identity matrix
                 */
                case 'BT':
                    $Tm = $defaultTm;
                    $Tl = $defaultTl; //review this.
                    $Tx = 0;
                    $Ty = 0;
                    break;

                /*
                 * ET
                 * End a text object, discarding the text matrix
                 */
                case 'ET':
                    $Tm = $defaultTm;
                    $Tl = $defaultTl;  //review this
                    $Tx = 0;
                    $Ty = 0;
                    break;

                /*
                 * leading TL
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
                 * Initial value: 0
                 */
                case 'TL':
                    $Tl = (float) $command['c'];
                    break;

                /*
                 * tx ty Td
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty.
                 */
                case 'Td':
                    $coord = explode(' ', $command['c']);
                    $Tx += (float) $coord[0];
                    $Ty += (float) $coord[1];
                    $Tm[$x] = (string) $Tx;
                    $Tm[$y] = (string) $Ty;
                    break;

                /*
                 * tx ty TD
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty. As a side effect, this operator sets the leading
                 * parameter in the text state. This operator has the same effect as the
                 * code:
                 * -ty TL
                 * tx ty Td
                 */
                case 'TD':
                    $coord = explode(' ', $command['c']);
                    // -ty TL: the leading is the negated vertical offset
                    $Tl = -((float) $coord[1]);
                    // tx ty Td: same translation as the Td case
                    $Tx += (float) $coord[0];
                    $Ty += (float) $coord[1];
                    $Tm[$x] = (string) $Tx;
                    $Tm[$y] = (string) $Ty;
                    break;

                /*
                 * a b c d e f Tm
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
                 * [1 0 0 1 0 0]
                 */
                case 'Tm':
                    $Tm = explode(' ', $command['c']);
                    $Tx = (float) $Tm[$x];
                    $Ty = (float) $Tm[$y];
                    break;

                /*
                 * T*
                 * Move to the start of the next line. This operator has the same effect
                 * as the code:
                 * 0 -Tl Td
                 * where Tl is the current leading parameter in the text state.
                 */
                case 'T*':
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    break;

                /*
                 * string Tj
                 * Show a text string
                 */
                case 'Tj':
                    $extractedData[] = [$Tm, $currentText];
                    break;

                /*
                 * string '
                 * Move to the next line and show a text string. This operator has the
                 * same effect as the code:
                 * T*
                 * string Tj
                 */
                case "'":
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    $extractedData[] = [$Tm, $currentText];
                    break;

                /*
                 * aw ac string "
                 * Move to the next line and show a text string, using aw as the word
                 * spacing and ac as the character spacing. This operator has the same
                 * effect as the code:
                 * aw Tw
                 * ac Tc
                 * string '
                 * Tw sets the word spacing, Tw, to wordSpace.
                 * Tc sets the character spacing, Tc, to charSpace.
                 */
                case '"':
                    $data = explode(' ', $currentText);
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    $extractedData[] = [$Tm, $data[2]]; //Verify
                    break;

                /*
                 * array TJ
                 * Show one or more text strings, allowing individual glyph positioning.
                 * Each element of array can be a string or a number. If the element is
                 * a string, this operator shows the string. If it is a number, the
                 * operator adjusts the text position by that amount; that is, it translates
                 * the text matrix, Tm. This amount is subtracted from the current
                 * horizontal or vertical coordinate, depending on the writing mode.
                 * In the default coordinate system, a positive adjustment has the effect
                 * of moving the next glyph painted either to the left or down by the given
                 * amount.
                 */
                case 'TJ':
                    $extractedData[] = [$Tm, $currentText];
                    break;
                default:
            }
        }
        $this->dataTm = $extractedData;

        return $extractedData;
    }
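
The position bookkeeping in getDataTm can be hard to follow inside a 15-case switch, so here is a minimal standalone sketch of the same idea. It is not part of PdfParser: the function name `trackTextPositions` and the sample command list are hypothetical, and the input shape (`'o'` for the operator, `'c'` for its operands) is only assumed to match what getDataCommands returns. The sketch follows the operator descriptions in the comments above (for TD, `-ty TL` followed by `tx ty Td`), and like the method it ignores scaling and rotation, tracking translation only.

```php
<?php

/**
 * Hypothetical sketch (not part of PdfParser): replays text-positioning
 * operators and records the (x, y) in effect at each text-showing operator.
 *
 * @param array $commands list of ['o' => operator, 'c' => operands string]
 *
 * @return array list of [x, y] float pairs, one per Tj, TJ or ' operator
 */
function trackTextPositions(array $commands): array
{
    $tx = 0.0; // current X translation
    $ty = 0.0; // current Y translation
    $tl = 0.0; // current leading
    $positions = [];

    foreach ($commands as $command) {
        $operands = preg_split('/\s+/', trim($command['c']));
        switch ($command['o']) {
            case 'BT':
            case 'ET':
                // Reset the position only. Assumption: the leading is kept
                // across text objects, which differs from the method above,
                // where the reset is marked "review this".
                $tx = 0.0;
                $ty = 0.0;
                break;
            case 'TL':
                $tl = (float) $operands[0];
                break;
            case 'Td':
                $tx += (float) $operands[0];
                $ty += (float) $operands[1];
                break;
            case 'TD':
                // Same as: -ty TL, then tx ty Td
                $tl = -((float) $operands[1]);
                $tx += (float) $operands[0];
                $ty += (float) $operands[1];
                break;
            case 'Tm':
                // Operands are a b c d e f; e and f are the translation
                $tx = (float) $operands[4];
                $ty = (float) $operands[5];
                break;
            case 'T*':
                $ty -= $tl;
                break;
            case "'":
                // Move to the next line, then show the string
                $ty -= $tl;
                $positions[] = [$tx, $ty];
                break;
            case 'Tj':
            case 'TJ':
                $positions[] = [$tx, $ty];
                break;
        }
    }

    return $positions;
}

// Made-up command list for illustration only.
$commands = [
    ['o' => 'BT', 'c' => ''],
    ['o' => 'Tm', 'c' => '1 0 0 1 72 720'],
    ['o' => 'Tj', 'c' => 'Hello'],
    ['o' => 'TD', 'c' => '0 -14'],
    ['o' => 'Tj', 'c' => 'World'],
    ['o' => 'T*', 'c' => ''],
    ['o' => 'Tj', 'c' => 'Again'],
    ['o' => 'ET', 'c' => ''],
];
$positions = trackTextPositions($commands);
```

With this input the three strings land on successive lines 14 units apart, because TD both moves down by 14 and sets the leading that the later T* reuses.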

    /**
     * Gets text data that are around the given coordinates (X,Y)
     *
     * If the text is near the given coordinates (X,Y) (or the Tm info),
     * the text is returned. The extractedData returned by getDataTm can be used
     * to find the coordinates of a given text, using its Tm info.
     *
     * @param float|null $x  The X value of the coordinate to search for. If null,
     *                       just the Y value is considered (same row)
     * @param float|null $y  The Y value of the coordinate to search for. If null,
     *                       just the X value is considered (same column)
     * @param float $xError  The tolerance, plus or minus, within which an X is "near"
     * @param float $yError  The tolerance, plus or minus, within which a Y is "near"
     *
     * @return array An array of texts that are near the given coordinates. If no text
     *               is "near" the x,y coordinate, or if both x and y are null,
     *               an empty array is returned.
     */
    public function getTextXY(float $x = null, float $y = null, float $xError = 0, float $yError = 0): array
    {
        if (!isset($this->dataTm) || !$this->dataTm) {
Bug Best Practice
The expression $this->dataTm of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

            $this->getDataTm();
        }

        if (null !== $x) {
            $x = (float) $x;
        }

        if (null !== $y) {
            $y = (float) $y;
        }

        if (null === $x && null === $y) {
            return [];
        }

        $xError = (float) $xError;
        $yError = (float) $yError;

        $extractedData = [];
        foreach ($this->dataTm as $item) {
            $tm = $item[0];
            $xTm = (float) $tm[4];
            $yTm = (float) $tm[5];
            $text = $item[1];
            if (null === $y) {
                if (($xTm >= ($x - $xError)) &&
                    ($xTm <= ($x + $xError))) {
                    $extractedData[] = [$tm, $text];
                    continue;
                }
            }
            if (null === $x) {
                if (($yTm >= ($y - $yError)) &&
                    ($yTm <= ($y + $yError))) {
                    $extractedData[] = [$tm, $text];
                    continue;
                }
            }
            if (($xTm >= ($x - $xError)) &&
                ($xTm <= ($x + $xError)) &&
                ($yTm >= ($y - $yError)) &&
                ($yTm <= ($y + $yError))) {
                $extractedData[] = [$tm, $text];
                continue;
            }
        }

        return $extractedData;
    }
}
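
The "near" test in getTextXY amounts to a per-axis window check, where a null coordinate means that axis is ignored. A minimal sketch of that rule as a standalone function may make it easier to see. The name `filterByPosition` and the sample items are hypothetical, not part of PdfParser; the item shape (`[Tm, text]`, with X at index 4 and Y at index 5 of Tm) is taken from the code above. The window is written with `abs()` here instead of the paired `>=` and `<=` comparisons.

```php
<?php

/**
 * Hypothetical helper (not part of PdfParser): keeps the [Tm, text] items
 * whose position lies within the given tolerance of (x, y).
 * A null coordinate means that axis is not checked.
 */
function filterByPosition(
    array $items,
    ?float $x,
    ?float $y,
    float $xError = 0.0,
    float $yError = 0.0
): array {
    // With no coordinate at all there is nothing to match against.
    if (null === $x && null === $y) {
        return [];
    }

    $matches = [];
    foreach ($items as [$tm, $text]) {
        $xOk = null === $x || abs((float) $tm[4] - $x) <= $xError;
        $yOk = null === $y || abs((float) $tm[5] - $y) <= $yError;
        if ($xOk && $yOk) {
            $matches[] = [$tm, $text];
        }
    }

    return $matches;
}

// Made-up items in the shape returned by getDataTm.
$items = [
    [['1', '0', '0', '1', '72', '720'], 'Title'],
    [['1', '0', '0', '1', '72', '706'], 'First line'],
    [['1', '0', '0', '1', '300', '706'], 'Right column'],
];

// Everything on the row at y = 706, allowing 1 unit of vertical slack.
$sameRow = filterByPosition($items, null, 706.0, 0.0, 1.0);

// Everything within 5 units of the point (70, 720).
$nearPoint = filterByPosition($items, 70.0, 720.0, 5.0, 5.0);
```

Computing one boolean per axis removes the need for three separate branches and makes the "null means ignore this axis" rule explicit.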