Completed
Pull Request — master (#27)
by Damian
06:51
created

PDFTextExtractor   A

Complexity

Total Complexity 13

Size/Duplication

Total Lines 123
Duplicated Lines 0 %

Coupling/Cohesion

Components 1
Dependencies 2

Importance

Changes 1
Bugs 0 Features 0
Metric   Value
wmc      13
c        1
b        0
f        0
lcom     1
cbo      2
dl       0
loc      123
rs       10

7 Methods

Rating   Name                  Duplication   Size   Complexity
A        isAvailable()         0             5      2
A        supportsExtension()   0             4      1
A        supportsMime()        0             12     1
A        bin()                 0             20     4
A        getContent()          0             8      2
A        getRawOutput()        0             12     2
A        cleanupLigatures()    0             13     1
<?php

/**
 * Text extractor that calls pdftotext to do the conversion.
 * @author mstephens
 *
 */
class PDFTextExtractor extends FileTextExtractor
Coding Style, Compatibility:
PSR1 recommends that each class must be in a namespace of at least one level to avoid collisions.

You can fix this by adding a namespace to your class:

namespace YourVendor;

class YourClass { }

When choosing a vendor namespace, try to pick something that is not too generic to avoid conflicts with other libraries.

{
    /**
     * Set to bin path this extractor can execute
     *
     * @var string
     */
    private static $binary_location = null;
Unused Code:
The property $binary_location is not used and could be removed.

This check marks private properties in classes that are never used. Those properties can be removed.


    /**
     * Used if binary_location isn't set.
     * List of locations to search for a given binary in
     *
     * @config
     * @var array
     */
    private static $search_binary_locations = array(
Unused Code:
The property $search_binary_locations is not used and could be removed.

This check marks private properties in classes that are never used. Those properties can be removed.

        '/usr/bin',
        '/usr/local/bin',
    );

    public function isAvailable()
    {
        $bin = $this->bin('pdftotext');
        return (file_exists($bin) && is_executable($bin));
    }

    public function supportsExtension($extension)
    {
        return strtolower($extension) === 'pdf';
    }

    public function supportsMime($mime)
    {
        return in_array(
            strtolower($mime),
            array(
                'application/pdf',
                'application/x-pdf',
                'application/x-bzpdf',
                'application/x-gzpdf'
            )
        );
    }

    /**
     * Accessor to get the location of the binary
     *
     * @param string $program Name of binary
     * @return string|null Path to the binary, or null if it was not found
     */
    protected function bin($program = '')
    {
        // Get list of allowed search paths
        if ($location = $this->config()->binary_location) {
            $locations = array($location);
        } else {
            $locations = $this->config()->search_binary_locations;
        }

        // Find program in each path
        foreach ($locations as $location) {
            $path = "{$location}/{$program}";
            if (file_exists($path)) {
                return $path;
            }
        }

        // Not found
        return null;
    }

    public function getContent($path)
    {
        if (!$path) {
            return "";
        } // no file
        $content = $this->getRawOutput($path);
        return $this->cleanupLigatures($content);
    }

    /**
     * Invoke pdftotext with the given path
     *
     * @param string $path
     * @return string Output
     * @throws FileTextExtractor_Exception
     */
    protected function getRawOutput($path)
    {
        exec(sprintf('%s %s - 2>&1', $this->bin('pdftotext'), escapeshellarg($path)), $content, $err);
        if ($err) {
Bug, Best Practice:
The expression $err of type integer|null is loosely compared to true; this is ambiguous if the integer can be zero. You might want to explicitly use !== null instead.

In PHP, under loose comparison (like ==, or !=, or switch conditions), values of different types might be equal.

For integer values, zero is a special case, in particular the following results might be unexpected:

0   == false // true
0   == null  // true
123 == false // false
123 == null  // false

// It is often better to use strict comparison
0 === false // false
0 === null  // false
            throw new FileTextExtractor_Exception(sprintf(
                'PDFTextExtractor->getContent() failed for %s: %s',
                $path,
                // $err is the integer exit status; the command output (including stderr) is in $content
                implode(PHP_EOL, $content)
            ));
        }
        return implode('', $content);
    }

    /**
     * Removes utf-8 ligatures.
     *
     * @link http://en.wikipedia.org/wiki/Typographic_ligature#Computer_typesetting
     *
     * @param string $input
     * @return string
     */
    protected function cleanupLigatures($input)
    {
        $mapping = array(
            'ﬀ' => 'ff',
            'ﬁ' => 'fi',
            'ﬂ' => 'fl',
            'ﬃ' => 'ffi',
            'ﬄ' => 'ffl',
            'ﬅ' => 'ft',
            'ﬆ' => 'st'
        );
        return str_replace(array_keys($mapping), array_values($mapping), $input);
    }
}
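
The Bug/Best Practice note on `$err` in getRawOutput() concerns the loose truthiness check on the exit status that exec() reports. A minimal sketch of a strict check follows, assuming a successful command exits with status 0. The function name `runCommand` and the use of `RuntimeException` are hypothetical and not part of PDFTextExtractor; they only illustrate the pattern.

```php
<?php

/**
 * Hypothetical helper (not part of PDFTextExtractor): runs a shell command
 * and compares its exit status strictly against 0.
 *
 * @param string $command Command line to run; the caller must escape arguments
 * @return string Combined output, lines joined with PHP_EOL
 * @throws RuntimeException If the exit status is anything other than 0
 */
function runCommand($command)
{
    $output = array();
    $status = null;

    // exec() appends each line of output to $output and stores the
    // command's exit status in $status.
    exec($command . ' 2>&1', $output, $status);

    // Strict comparison: only an integer 0 counts as success.
    if ($status !== 0) {
        throw new RuntimeException(sprintf(
            'Command failed with status %s: %s',
            var_export($status, true),
            implode(PHP_EOL, $output)
        ));
    }

    return implode(PHP_EOL, $output);
}
```

With this shape, the exit status and the captured output stay separate: the status decides success or failure, and the output array is what gets joined into the message or the return value.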