Test Failed
Pull Request — master (#667)
by
unknown
02:48
created

Page::getDataTm()   D

Complexity

Conditions 19
Paths 4

Size

Total Lines 225
Code Lines 89

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 80
CRAP Score 19.9863

Importance

Changes 2
Bugs 0 Features 0
Metric Value
cc 19
eloc 89
c 2
b 0
f 0
nc 4
nop 1
dl 0
loc 225
ccs 80
cts 93
cp 0.8602
crap 19.9863
rs 4.5166
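
The CRAP score in the table is consistent with the commonly cited Change Risk Anti-Patterns formula, CRAP = cc² × (1 − coverage)³ + cc. The sketch below assumes that formula and feeds it the rounded coverage ratio (cp) from the table. crapScore() is a hypothetical helper written for this illustration, not part of Scrutinizer or PdfParser.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: CRAP score from cyclomatic complexity and coverage.
 *
 * @param int   $complexity cyclomatic complexity (cc in the table)
 * @param float $coverage   covered fraction between 0 and 1 (cp in the table)
 */
function crapScore(int $complexity, float $coverage): float
{
    return $complexity ** 2 * (1 - $coverage) ** 3 + $complexity;
}

// Values reported for Page::getDataTm(): cc = 19, cp = 0.8602.
echo round(crapScore(19, 0.8602), 4), PHP_EOL; // prints 19.9863
```

Under this formula a fully covered method scores exactly its complexity (19 here), so the remaining gap comes from the untested paths.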

How to fix: Long Method, Complexity

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include:
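
Extract Method is one such refactoring. As a minimal sketch, and not the PdfParser implementation, the hypothetical functions below show a commented step being pulled out of a longer function, with the comment becoming the name of the new function.

```php
<?php

declare(strict_types=1);

// Before: a comment marks a step that could live in its own function.
function summarizeBefore(array $words): string
{
    // Keep only non-empty words
    $kept = [];
    foreach ($words as $word) {
        if ('' !== trim($word)) {
            $kept[] = trim($word);
        }
    }

    return count($kept).' words: '.implode(' ', $kept);
}

// After: the comment became the name of the extracted function.
function keepNonEmptyWords(array $words): array
{
    $kept = [];
    foreach ($words as $word) {
        if ('' !== trim($word)) {
            $kept[] = trim($word);
        }
    }

    return $kept;
}

function summarizeAfter(array $words): string
{
    $kept = keepNonEmptyWords($words);

    return count($kept).' words: '.implode(' ', $kept);
}

// Both versions produce the same result.
var_dump(summarizeBefore([' a', '', 'b ']) === summarizeAfter([' a', '', 'b '])); // bool(true)
```

The extracted function can be tested on its own, and the calling function reads as a short sequence of named steps.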

<?php

/**
 * @file
 *          This file is part of the PdfParser library.
 *
 * @author  Sébastien MALOT <[email protected]>
 *
 * @date    2017-01-03
 *
 * @license LGPLv3
 *
 * @url     <https://github.com/smalot/pdfparser>
 *
 *  PdfParser is a pdf library written in PHP, extraction oriented.
 *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Lesser General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Lesser General Public License for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program.
 *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
 */

namespace Smalot\PdfParser;

use Smalot\PdfParser\Element\ElementArray;
use Smalot\PdfParser\Element\ElementMissing;
use Smalot\PdfParser\Element\ElementNull;
use Smalot\PdfParser\Element\ElementXRef;

class Page extends PDFObject
{
    /**
     * @var Font[]
     */
    protected $fonts;

    /**
     * @var PDFObject[]
     */
    protected $xobjects;

    /**
     * @var array
     */
    protected $dataTm;

    public function setFonts($fonts)
    {
        if (empty($this->fonts)) {
            $this->fonts = $fonts;
        }
    }

    /**
     * @return Font[]
     */
    public function getFonts()
    {
        if (null !== $this->fonts) {
            return $this->fonts;
        }

        $resources = $this->get('Resources');

        if (method_exists($resources, 'has') && $resources->has('Font')) {
            if ($resources->get('Font') instanceof ElementMissing) {
Bug: The method get() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

            if ($resources->/** @scrutinizer ignore-call */ get('Font') instanceof ElementMissing) {

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.
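
Besides the annotation, an instanceof guard lets static analysis see that the method exists before it is called. The sketch below uses hypothetical stand-in classes (BaseElement, Dictionary), not the real PdfParser types, and only illustrates the pattern.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-ins for a base type and a subtype that adds get().
class BaseElement
{
}

class Dictionary extends BaseElement
{
    private array $entries;

    public function __construct(array $entries)
    {
        $this->entries = $entries;
    }

    public function get(string $name)
    {
        return $this->entries[$name] ?? null;
    }
}

function fontEntry(BaseElement $resources)
{
    // Narrow the type first, so get() is only called where it is declared.
    if (!$resources instanceof Dictionary) {
        return null;
    }

    return $resources->get('Font');
}

var_dump(fontEntry(new Dictionary(['Font' => 'F1']))); // string(2) "F1"
var_dump(fontEntry(new BaseElement()));                // NULL
```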
                return [];
            }

            if ($resources->get('Font') instanceof Header) {
                $fonts = $resources->get('Font')->getElements();
            } else {
                $fonts = $resources->get('Font')->getHeader()->getElements();
            }

            $table = [];

            foreach ($fonts as $id => $font) {
                if ($font instanceof Font) {
                    $table[$id] = $font;

                    // Also store the font under its cleaned id (digits, dots, dashes and underscores only)
                    $id = preg_replace('/[^0-9\.\-_]/', '', $id);
                    if ('' != $id) {
                        $table[$id] = $font;
                    }
                }
            }

            return $this->fonts = $table;
        }

        return [];
    }

    public function getFont(string $id): ?Font
    {
        $fonts = $this->getFonts();

        if (isset($fonts[$id])) {
            return $fonts[$id];
        }

        // According to the PDF specs (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, page 238)
        // "The font resource name presented to the Tf operator is arbitrary, as are the names for all kinds of resources"
        // Instead, we search for the unfiltered name first and then do this cleaning as a fallback, so all tests still pass.

        if (isset($fonts[$id])) {
            return $fonts[$id];
        } else {
            $id = preg_replace('/[^0-9\.\-_]/', '', $id);
            if (isset($fonts[$id])) {
                return $fonts[$id];
            }
        }

        return null;
    }

    /**
     * Support for XObject
     *
     * @return PDFObject[]
     */
    public function getXObjects()
    {
        if (null !== $this->xobjects) {
            return $this->xobjects;
        }

        $resources = $this->get('Resources');

        if (method_exists($resources, 'has') && $resources->has('XObject')) {
            if ($resources->get('XObject') instanceof Header) {
                $xobjects = $resources->get('XObject')->getElements();
            } else {
                $xobjects = $resources->get('XObject')->getHeader()->getElements();
            }

            $table = [];

            foreach ($xobjects as $id => $xobject) {
                $table[$id] = $xobject;

                // Also store the XObject under its cleaned id (digits, dots, dashes and underscores only)
                $id = preg_replace('/[^0-9\.\-_]/', '', $id);
                if ('' != $id) {
                    $table[$id] = $xobject;
                }
            }

            return $this->xobjects = $table;
        }

        return [];
    }

    public function getXObject(string $id): ?PDFObject
    {
        $xobjects = $this->getXObjects();

        if (isset($xobjects[$id])) {
            return $xobjects[$id];
        }

        return null;
        /*$id = preg_replace('/[^0-9\.\-_]/', '', $id);

        if (isset($xobjects[$id])) {
            return $xobjects[$id];
        } else {
            return null;
        }*/
    }

    public function getText(self $page = null): string
    {
        if ($contents = $this->get('Contents')) {
            if ($contents instanceof ElementMissing) {
                return '';
            } elseif ($contents instanceof ElementNull) {
                return '';
            } elseif ($contents instanceof PDFObject) {
Issue: $contents is never a sub-type of Smalot\PdfParser\PDFObject.
                $elements = $contents->getHeader()->getElements();

                if (is_numeric(key($elements))) {
                    $new_content = '';

                    foreach ($elements as $element) {
                        if ($element instanceof ElementXRef) {
                            $new_content .= $element->getObject()->getContent();
                        } else {
                            $new_content .= $element->getContent();
                        }
                    }

                    $header = new Header([], $this->document);
                    $contents = new PDFObject($this->document, $header, $new_content, $this->config);
                }
            } elseif ($contents instanceof ElementArray) {
                // Create a virtual global content.
                $new_content = '';

                foreach ($contents->getContent() as $content) {
                    $new_content .= $content->getContent()."\n";
                }

                $header = new Header([], $this->document);
                $contents = new PDFObject($this->document, $header, $new_content, $this->config);
            }

            /*
             * Elements referencing each other on the same page can cause endless loops during text parsing.
             * To combat this we keep a recursionStack containing already parsed elements on the page.
             * The stack is only emptied here after getting text from a page.
             */
            $contentsText = $contents->getText($this);
Bug: The method getText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

            /** @scrutinizer ignore-call */
            $contentsText = $contents->getText($this);
            PDFObject::$recursionStack = [];

            return $contentsText;
        }

        return '';
    }

    /**
     * Return true if the current page is a (setasign\Fpdi\Fpdi) FPDI/FPDF document
     *
     * The metadata 'Producer' should have the value of "FPDF" . FPDF_VERSION if the
     * pdf file was generated by FPDF/FPDI.
     *
     * @return bool true if the current page is a FPDI/FPDF document
     */
    public function isFpdf(): bool
    {
        if (\array_key_exists('Producer', $this->document->getDetails())
            && \is_string($this->document->getDetails()['Producer'])
            && 0 === strncmp($this->document->getDetails()['Producer'], 'FPDF', 4)) {
            return true;
        }

        return false;
    }

    /**
     * Return the page number of the PDF document of the page object
     *
     * @return int the page number
     */
    public function getPageNumber(): int
    {
        $pages = $this->document->getPages();
        $numOfPages = \count($pages);
        for ($pageNum = 0; $pageNum < $numOfPages; ++$pageNum) {
            if ($pages[$pageNum] === $this) {
                break;
            }
        }

        return $pageNum;
    }

    /**
     * Return the Object of the page if the document is a FPDF/FPDI document
     *
     * If the document was generated by FPDF/FPDI it returns the
     * PDFObject of the given page
     *
     * @return PDFObject The PDFObject for the page
     */
    public function getPDFObjectForFpdf(): PDFObject
    {
        $pageNum = $this->getPageNumber();
        $xObjects = $this->getXObjects();

        return $xObjects[$pageNum];
    }

    /**
     * Return a new PDFObject of the document created with FPDF/FPDI
     *
     * For a document generated by FPDF/FPDI, it generates a
     * new PDFObject for that document
     *
     * @return PDFObject The PDFObject
     */
    public function createPDFObjectForFpdf(): PDFObject
    {
        $pdfObject = $this->getPDFObjectForFpdf();
        $new_content = $pdfObject->getContent();
        $header = $pdfObject->getHeader();
        $config = $pdfObject->config;

        return new PDFObject($pdfObject->document, $header, $new_content, $config);
    }

    /**
     * Return page if document is a FPDF/FPDI document
     *
     * @return Page The page
     */
    public function createPageForFpdf(): self
    {
        $pdfObject = $this->getPDFObjectForFpdf();
        $new_content = $pdfObject->getContent();
        $header = $pdfObject->getHeader();
        $config = $pdfObject->config;

        return new self($pdfObject->document, $header, $new_content, $config);
    }

    public function getTextArray(self $page = null): array
    {
        if ($this->isFpdf()) {
            $pdfObject = $this->getPDFObjectForFpdf();
            $newPdfObject = $this->createPDFObjectForFpdf();

            return $newPdfObject->getTextArray($pdfObject);
        } else {
            if ($contents = $this->get('Contents')) {
                if ($contents instanceof ElementMissing) {
                    return [];
                } elseif ($contents instanceof ElementNull) {
                    return [];
                } elseif ($contents instanceof PDFObject) {
Issue: $contents is never a sub-type of Smalot\PdfParser\PDFObject.
                    $elements = $contents->getHeader()->getElements();

                    if (is_numeric(key($elements))) {
                        $new_content = '';

                        /** @var PDFObject $element */
                        foreach ($elements as $element) {
                            if ($element instanceof ElementXRef) {
                                $new_content .= $element->getObject()->getContent();
                            } else {
                                $new_content .= $element->getContent();
                            }
                        }

                        $header = new Header([], $this->document);
                        $contents = new PDFObject($this->document, $header, $new_content, $this->config);
                    } else {
                        try {
                            $contents->getTextArray($this);
                        } catch (\Throwable $e) {
                            return $contents->getTextArray();
                        }
                    }
                } elseif ($contents instanceof ElementArray) {
                    // Create a virtual global content.
                    $new_content = '';

                    /** @var PDFObject $content */
                    foreach ($contents->getContent() as $content) {
                        $new_content .= $content->getContent()."\n";
                    }

                    $header = new Header([], $this->document);
                    $contents = new PDFObject($this->document, $header, $new_content, $this->config);
                }

                return $contents->getTextArray($this);
Bug: The method getTextArray() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

                return $contents->/** @scrutinizer ignore-call */ getTextArray($this);
            }

            return [];
        }
    }

    /**
     * Gets all the text data with its internal representation of the page.
     *
     * Returns an array with the data and the internal representation
     */
    public function extractRawData(): array
    {
        /*
         * Now you can get the complete content of the object with the text on it
         */
        $extractedData = [];
        $content = $this->get('Contents');
        $values = $content->getContent();
        if (isset($values) && \is_array($values)) {
            $text = '';
            foreach ($values as $section) {
                $text .= $section->getContent();
            }
            $sectionsText = $this->getSectionsText($text);
            foreach ($sectionsText as $sectionText) {
                $commandsText = $this->getCommandsText($sectionText);
                foreach ($commandsText as $command) {
                    $extractedData[] = $command;
                }
            }
        } else {
            if ($this->isFpdf()) {
                $content = $this->getPDFObjectForFpdf();
            }
            $sectionsText = $content->getSectionsText($content->getContent());
Bug: The method getSectionsText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

            /** @scrutinizer ignore-call */
            $sectionsText = $content->getSectionsText($content->getContent());
            foreach ($sectionsText as $sectionText) {
                $commandsText = $content->getCommandsText($sectionText);
Bug: The method getCommandsText() does not exist on Smalot\PdfParser\Element. (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-call annotation:

                /** @scrutinizer ignore-call */
                $commandsText = $content->getCommandsText($sectionText);
                foreach ($commandsText as $command) {
                    $extractedData[] = $command;
                }
            }
        }

        return $extractedData;
    }

    /**
     * Gets all the decoded text data with its internal representation from a page.
     *
     * @param array $extractedRawData the extracted data returned by extractRawData or
     *                                null if extractRawData should be called
     *
     * @return array An array with the data and the internal representation
     */
    public function extractDecodedRawData(array $extractedRawData = null): array
    {
        if (!isset($extractedRawData) || !$extractedRawData) {
Bug (best practice): The expression $extractedRawData of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.
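
As a small illustration of the difference in intent, the hypothetical helper below applies the explicit empty() form that the check recommends to a nullable array parameter. The function name is an assumption made for this example and does not exist in PdfParser.

```php
<?php

declare(strict_types=1);

// Hypothetical helper: true when the caller passed nothing usable.
function needsDefault(?array $data = null): bool
{
    // empty() covers both null and [] and states the intent explicitly,
    // unlike the implicit array-to-boolean conversion in "!$data".
    return empty($data);
}

var_dump(needsDefault());        // bool(true)
var_dump(needsDefault([]));      // bool(true)
var_dump(needsDefault(['Tj']));  // bool(false)
```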
            $extractedRawData = $this->extractRawData();
        }
        $currentFont = null; /** @var Font $currentFont */
        $clippedFont = null;
        $fpdfPage = null;
        if ($this->isFpdf()) {
            $fpdfPage = $this->createPageForFpdf();
        }
        foreach ($extractedRawData as &$command) {
            if ('Tj' == $command['o'] || 'TJ' == $command['o']) {
                $data = $command['c'];
                if (!\is_array($data)) {
                    $tmpText = '';
                    if (isset($currentFont)) {
                        $tmpText = $currentFont->decodeOctal($data);
                        // $tmpText = $currentFont->decodeHexadecimal($tmpText, false);
                    }
                    $tmpText = str_replace(
                        ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
                        ['\\', '(', ')', "\n", "\r", "\t", ' '],
                        $tmpText
                    );
                    $tmpText = mb_convert_encoding($tmpText, 'UTF-8', 'ISO-8859-1');
                    if (isset($currentFont)) {
                        $tmpText = $currentFont->decodeContent($tmpText);
Bug: It seems like $tmpText can also be of type array; however, parameter $text of Smalot\PdfParser\Font::decodeContent() only seems to accept string. Maybe add an additional type check? (Ignorable by annotation.)

If this is a false positive, you can also ignore this issue in your code via the ignore-type annotation:

                        $tmpText = $currentFont->decodeContent(/** @scrutinizer ignore-type */ $tmpText);
                    }
                    $command['c'] = $tmpText;
                    continue;
                }
                $numText = \count($data);
                for ($i = 0; $i < $numText; ++$i) {
                    if (0 != ($i % 2)) {
                        continue;
                    }
                    $tmpText = $data[$i]['c'];
                    $decodedText = isset($currentFont) ? $currentFont->decodeOctal($tmpText) : $tmpText;
                    $decodedText = str_replace(
                        ['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
                        ['\\', '(', ')', "\n", "\r", "\t", ' '],
                        $decodedText
                    );

                    $decodedText = mb_convert_encoding($decodedText, 'UTF-8', 'ISO-8859-1');

                    if (isset($currentFont)) {
                        $decodedText = $currentFont->decodeContent($decodedText);
                    }
                    $command['c'][$i]['c'] = $decodedText;
                    continue;
                }
            } elseif ('Tf' == $command['o'] || 'TF' == $command['o']) {
                $fontId = explode(' ', $command['c'])[0];
                // If document is a FPDI/FPDF the $page has the correct font
                $currentFont = isset($fpdfPage) ? $fpdfPage->getFont($fontId) : $this->getFont($fontId);
                continue;
            } elseif ('Q' == $command['o']) {
                $currentFont = $clippedFont;
            } elseif ('q' == $command['o']) {
                $clippedFont = $currentFont;
            }
        }

        return $extractedRawData;
    }

    /**
     * Gets just the Text commands that are involved in text positions and
     * Text Matrix (Tm)
     *
     * It extracts just the PDF commands that are involved with text positions, and
     * the Text Matrix (Tm). These are: BT, ET, TL, Td, TD, Tm, T*, Tj, ', ", and TJ
     *
     * @param array $extractedDecodedRawData The data extracted by extractDecodedRawData.
     *                                       If it is null, the method extractDecodedRawData is called.
     *
     * @return array An array with the text command of the page
     */
    public function getDataCommands(array $extractedDecodedRawData = null): array
    {
        if (!isset($extractedDecodedRawData) || !$extractedDecodedRawData) {
Bug (best practice): The expression $extractedDecodedRawData of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.
            $extractedDecodedRawData = $this->extractDecodedRawData();
        }
        $extractedData = [];
        foreach ($extractedDecodedRawData as $command) {
            switch ($command['o']) {
                /*
                 * BT
                 * Begin a text object, initializing the Tm and Tlm to the identity matrix
                 */
                case 'BT':
                    $extractedData[] = $command;
                    break;

                /*
                 * ET
                 * End a text object, discarding the text matrix
                 */
                case 'ET':
                    $extractedData[] = $command;
                    break;

                /*
                 * leading TL
                 * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
                 * Initial value: 0
                 */
                case 'TL':
                    $extractedData[] = $command;
                    break;

                /*
                 * tx ty Td
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty.
                 */
                case 'Td':
                    $extractedData[] = $command;
                    break;

                /*
                 * tx ty TD
                 * Move to the start of the next line, offset from the start of the
                 * current line by tx, ty. As a side effect, this operator sets the leading
                 * parameter in the text state. This operator has the same effect as the
                 * code:
                 * -ty TL
                 * tx ty Td
                 */
                case 'TD':
                    $extractedData[] = $command;
                    break;

                /*
                 * a b c d e f Tm
                 * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
                 * all numbers, and the initial value for Tm and Tlm is the identity matrix
                 * [1 0 0 1 0 0]
                 */
                case 'Tm':
                    $extractedData[] = $command;
                    break;
573
                    /*
574
                     * T*
575 7
                     * Move to the start of the next line. This operator has the same effect
576 5
                     * as the code:
577 5
                     * 0 Tl Td
578
                     * Where Tl is the current leading parameter in the text state.
579
                     */
580
                case 'T*':
581
                    $extractedData[] = $command;
582
                    break;
583 7
584 6
                    /*
585 6
                     * string Tj
586
                     * Show a Text String
587
                     */
588
                case 'Tj':
589
                    $extractedData[] = $command;
590
                    break;
591
592
                    /*
593
                     * string '
594 7
                     * Move to the next line and show a text string. This operator has the
595
                     * same effect as the code:
596
                     * T*
597
                     * string Tj
598
                     */
599
                case "'":
600
                    $extractedData[] = $command;
601
                    break;
602
603
                    /*
604
                     * aw ac string "
605
                     * Move to the next lkine and show a text string, using aw as the word
606
                     * spacing and ac as the character spacing. This operator has the same
607
                     * effect as the code:
608
                     * aw Tw
609 7
                     * ac Tc
610
                     * string '
611
                     * Tw set the word spacing, Tw, to wordSpace.
612
                     * Tc Set the character spacing, Tc, to charsSpace.
613 7
                     */
614 7
                case '"':
615 7
                    $extractedData[] = $command;
616 7
                    break;
617
618
                case 'Tf':
619
                case 'TF':
620
                    $extractedData[] = $command;
621
                    break;
622
623
                    /*
624
                     * array TJ
625
                     * Show one or more text strings allow individual glyph positioning.
626
                     * Each lement of array con be a string or a number. If the element is
627
                     * a string, this operator shows the string. If it is a number, the
628
                     * operator adjust the text position by that amount; that is, it translates
629
                     * the text matrix, Tm. This amount is substracted form the current
630 7
                     * horizontal or vertical coordinate, depending on the writing mode.
631 7
                     * in the default coordinate system, a positive adjustment has the effect
632 7
                     * of moving the next glyph painted either to the left or down by the given
633
                     * amount.
634
                     */
635
                case 'TJ':
636
                    $extractedData[] = $command;
637 7
                    break;
638
                default:
639
            }
640
        }
641
642
        return $extractedData;
643
    }

    /**
     * Gets the Text Matrix of the text in the page
     *
     * Returns an array in which every item is itself an array: the first item is the
     * Text Matrix (Tm) and the second is a string with the text data. The Text Matrix
     * is an array of 6 numbers. The last 2 numbers are the X and Y coordinates of the
     * text. The first 4 numbers encode the scaling, rotation and skew of the text.
     *
     * @param array $dataCommands the data extracted by getDataCommands
     *                            if null getDataCommands is called
     *
     * @return array an array with the data of the page including the Tm information
     *               of any text in the page
     */
    public function getDataTm(array $dataCommands = null): array
    {
        if (!isset($dataCommands) || !$dataCommands) {
Bug, Best Practice:
The expression $dataCommands of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

            $dataCommands = $this->getDataCommands();
        }

        /*
         * At the beginning of a text object Tm is the identity matrix
         */
        $defaultTm = ['1', '0', '0', '1', '0', '0'];

        /*
         *  Set the text leading used by T*, ' and " operators
         */
        $defaultTl = 0;

        /*
         *  Set default values for font data
         */
        $defaultFontId = -1;
        $defaultFontSize = 1;

        /*
         * Indexes of horizontal/vertical scaling and X,Y-coordinates in the matrix (Tm)
         */
        $hSc = 0; // horizontal scaling
        /**
         * index of vertical scaling in the array that encodes the text matrix.
         * for more information: https://github.com/smalot/pdfparser/pull/559#discussion_r1053415500
         */
        $vSc = 3;
        $x = 4;
        $y = 5;

        /*
         * x,y-coordinates of text space origin in user units
         *
         * These will be assigned the value of the currently printed string
         */
        $Tx = 0;
        $Ty = 0;

        $Tm = $defaultTm;
        $Tl = $defaultTl;
        $fontId = $defaultFontId;
        $fontSize = $defaultFontSize; // reflects fontSize set by Tf or Tfs

        $extractedTexts = $this->getTextArray();
        $extractedData = [];
        foreach ($dataCommands as $command) {
            // If we've used up all the texts from getTextArray(), exit
            // so we aren't accessing non-existent array indices
            // Fixes 'undefined array key' errors in Issues #575, #576
            if (\count($extractedTexts) <= \count($extractedData)) {
                break;
            }
            $currentText = $extractedTexts[\count($extractedData)];
            switch ($command['o']) {
                /*
                 * BT
                 * Begin a text object, initializing the Tm and Tlm to identity matrix
                 */
                case 'BT':
                    $Tm = $defaultTm;
                    $Tl = $defaultTl;
                    $Tx = 0;
                    $Ty = 0;
                    break;

                    /*
                     * ET
                     * End a text object
                     */
                case 'ET':
                    break;

                    /*
                     * text leading TL
                     * Set the text leading, Tl, to leading. Tl is used by the T*, ' and " operators.
                     * Initial value: 0
                     */
                case 'TL':
                    // scaled text leading
                    $Tl = (float) $command['c'] * (float) $Tm[$vSc];
                    break;

                    /*
                     * tx ty Td
                     * Move to the start of the next line, offset from the start of the
                     * current line by tx, ty.
                     */
                case 'Td':
                    $coord = explode(' ', $command['c']);
                    $Tx += (float) $coord[0] * (float) $Tm[$hSc];
                    $Ty += (float) $coord[1] * (float) $Tm[$vSc];
                    $Tm[$x] = (string) $Tx;
                    $Tm[$y] = (string) $Ty;
                    break;

                    /*
                     * tx ty TD
                     * Move to the start of the next line, offset from the start of the
                     * current line by tx, ty. As a side effect, this operator sets the leading
                     * parameter in the text state. This operator has the same effect as the
                     * code:
                     * -ty TL
                     * tx ty Td
                     */
                case 'TD':
                    $coord = explode(' ', $command['c']);
                    $Tl = -((float) $coord[1] * (float) $Tm[$vSc]);
                    $Tx += (float) $coord[0] * (float) $Tm[$hSc];
                    $Ty += (float) $coord[1] * (float) $Tm[$vSc];
                    $Tm[$x] = (string) $Tx;
                    $Tm[$y] = (string) $Ty;
                    break;

                    /*
                     * a b c d e f Tm
                     * Set the text matrix, Tm, and the text line matrix, Tlm. The operands are
                     * all numbers, and the initial value for Tm and Tlm is the identity matrix
                     * [1 0 0 1 0 0]
                     */
                case 'Tm':
                    $Tm = explode(' ', $command['c']);
                    $Tx = (float) $Tm[$x];
                    $Ty = (float) $Tm[$y];
                    break;

                    /*
                     * T*
                     * Move to the start of the next line. This operator has the same effect
                     * as the code:
                     * 0 -Tl Td
                     * where Tl is the current leading parameter in the text state.
                     */
                case 'T*':
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    break;

                    /*
                     * string Tj
                     * Show a text string
                     */
                case 'Tj':
                    $data = [$Tm, $currentText];
                    if ($this->config->getDataTmFontInfoHasToBeIncluded()) {
                        $data[] = $fontId;
                        $data[] = $fontSize;
                    }
                    $extractedData[] = $data;
                    break;

                    /*
                     * string '
                     * Move to the next line and show a text string. This operator has the
                     * same effect as the code:
                     * T*
                     * string Tj
                     */
                case "'":
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    $extractedData[] = [$Tm, $currentText];
                    break;

                    /*
                     * aw ac string "
                     * Move to the next line and show a text string, using aw as the word
                     * spacing and ac as the character spacing. This operator has the same
                     * effect as the code:
                     * aw Tw
                     * ac Tc
                     * string '
                     * Tw sets the word spacing, Tw, to wordSpace.
                     * Tc sets the character spacing, Tc, to charSpace.
                     */
                case '"':
                    $data = explode(' ', $currentText);
                    $Ty -= $Tl;
                    $Tm[$y] = (string) $Ty;
                    $extractedData[] = [$Tm, $data[2]]; // Verify
                    break;

                case 'Tf':
                    /*
                     * From PDF 1.0 specification, page 106:
                     *     fontname size Tf Set font and size
                     *     Sets the text font and text size in the graphics state. There is no default value for
                     *     either fontname or size; they must be selected using Tf before drawing any text.
                     *     fontname is a resource name. size is a number expressed in text space units.
                     *
                     * Source: https://ia902503.us.archive.org/10/items/pdfy-0vt8s-egqFwDl7L2/PDF%20Reference%201.0.pdf
                     * Introduced with https://github.com/smalot/pdfparser/pull/516
                     */
                    list($fontId, $fontSize) = explode(' ', $command['c'], 2);
                    break;

                    /*
                     * array TJ
                     * Show one or more text strings, allowing individual glyph positioning.
                     * Each element of the array can be a string or a number. If the element is
                     * a string, this operator shows the string. If it is a number, the
                     * operator adjusts the text position by that amount; that is, it translates
                     * the text matrix, Tm. This amount is subtracted from the current
                     * horizontal or vertical coordinate, depending on the writing mode.
                     * In the default coordinate system, a positive adjustment has the effect
                     * of moving the next glyph painted either to the left or down by the given
                     * amount.
                     */
                case 'TJ':
                    $data = [$Tm, $currentText];
                    if ($this->config->getDataTmFontInfoHasToBeIncluded()) {
                        $data[] = $fontId;
                        $data[] = $fontSize;
                    }
                    $extractedData[] = $data;
                    break;
                default:
            }
        }
        $this->dataTm = $extractedData;

        return $extractedData;
    }

    /**
     * Gets text data that are around the given coordinates (X,Y)
     *
     * If the text is near the given coordinates (X,Y) (or the Tm info),
     * the text is returned. The extractedData returned by getDataTm can be used to find
     * the coordinates of a given text, using its Tm info.
     *
     * @param float|null $x      The X value of the coordinate to search for. If null,
     *                           just the Y value is considered (same row)
     * @param float|null $y      The Y value of the coordinate to search for. If null,
     *                           just the X value is considered (same column)
     * @param float      $xError The value less or more to consider an X to be "near"
     * @param float      $yError The value less or more to consider a Y to be "near"
     *
     * @return array An array of text that are near the given coordinates. If no text
     *               is "near" the x,y coordinate, an empty array is returned. If both x
     *               and y coordinates are null, an empty array is returned as well.
     */
    public function getTextXY(?float $x = null, ?float $y = null, float $xError = 0, float $yError = 0): array
    {
        if (!isset($this->dataTm) || !$this->dataTm) {
Bug, Best Practice:
The expression $this->dataTm of type array is implicitly converted to a boolean; are you sure this is intended? If so, consider using empty($expr) instead to make it clear that you intend to check for an array without elements.

This check marks implicit conversions of arrays to boolean values in a comparison. While in PHP an empty array is considered to be equal (but not identical) to false, this is not always apparent.

Consider making the comparison explicit by using empty(..) or ! empty(...) instead.

            $this->getDataTm();
        }

        if (null !== $x) {
            $x = (float) $x;
        }

        if (null !== $y) {
            $y = (float) $y;
        }

        if (null === $x && null === $y) {
            return [];
        }

        $xError = (float) $xError;
        $yError = (float) $yError;

        $extractedData = [];
        foreach ($this->dataTm as $item) {
            $tm = $item[0];
            $xTm = (float) $tm[4];
            $yTm = (float) $tm[5];
            $text = $item[1];
            if (null === $y) {
                if (($xTm >= ($x - $xError))
                    && ($xTm <= ($x + $xError))) {
                    $extractedData[] = [$tm, $text];
                    continue;
                }
            }
            if (null === $x) {
                if (($yTm >= ($y - $yError))
                    && ($yTm <= ($y + $yError))) {
                    $extractedData[] = [$tm, $text];
                    continue;
                }
            }
            if (($xTm >= ($x - $xError))
                && ($xTm <= ($x + $xError))
                && ($yTm >= ($y - $yError))
                && ($yTm <= ($y + $yError))) {
                $extractedData[] = [$tm, $text];
                continue;
            }
        }

        return $extractedData;
    }
}
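
The tolerance matching performed by getTextXY() can be illustrated outside the class. The sketch below is a minimal, standalone approximation: `filterNear()` and the sample `$dataTm` rows are hypothetical and not part of PdfParser. It only assumes the `[Tm, text]` item layout that getDataTm() returns, with X at index 4 and Y at index 5 of the matrix.

```php
<?php

/**
 * Hypothetical helper (not part of PdfParser) that mirrors the
 * "near" test of getTextXY(): a null coordinate is ignored, and a
 * non-null coordinate must lie within +/- its error margin.
 */
function filterNear(array $dataTm, ?float $x, ?float $y, float $xError = 0.0, float $yError = 0.0): array
{
    // Same early exit as getTextXY(): nothing to search for.
    if (null === $x && null === $y) {
        return [];
    }

    $matches = [];
    foreach ($dataTm as $item) {
        [$tm, $text] = $item;
        $xTm = (float) $tm[4];
        $yTm = (float) $tm[5];

        $xOk = null === $x || ($xTm >= $x - $xError && $xTm <= $x + $xError);
        $yOk = null === $y || ($yTm >= $y - $yError && $yTm <= $y + $yError);

        if ($xOk && $yOk) {
            $matches[] = [$tm, $text];
        }
    }

    return $matches;
}

// Made-up sample data in the shape produced by getDataTm().
$dataTm = [
    [['1', '0', '0', '1', '72', '700'], 'Title'],
    [['1', '0', '0', '1', '72', '680'], 'First line'],
    [['1', '0', '0', '1', '300', '680'], 'Right column'],
];

// Everything on the row at Y = 680, allowing 1 unit of vertical drift.
$sameRow = filterNear($dataTm, null, 680.0, 0.0, 1.0);

// A single point query around (72, 700).
$atPoint = filterNear($dataTm, 72.0, 700.0, 0.5, 0.5);
```

Passing null for one coordinate turns the query into a row or column lookup, which is why the method treats the two coordinates independently.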