Completed
Pull Request — master (#45)
by Robbie
24:33 queued 17:24
created

SolrCellTextExtractor::getContent()   B

Complexity

Conditions 9
Paths 19

Size

Total Lines 57

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric Value
cc 9
nc 19
nop 1
dl 0
loc 57
rs 7.3826
c 0
b 0
f 0

How to fix: Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include Extract Method.
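As a minimal sketch of the Extract Method refactoring, the response-parsing step at the end of getContent() could move into its own function. The name parseExtractedText() and its signature are hypothetical and not part of the reviewed class; the regular expression mirrors the one in the listing.

```php
<?php

/**
 * Hypothetical helper extracted from getContent(): returns the text that a
 * Solr extract response holds for one file name, or null if none is found.
 */
function parseExtractedText(string $responseBody, string $fileName): ?string
{
    // Quote the file name so characters such as "." are matched literally.
    // Passing '/' as the delimiter also escapes slashes in the name.
    $pattern = sprintf(
        '/<str name="%s">(.*?)<\/str>/s',
        preg_quote($fileName, '/')
    );

    if (preg_match($pattern, $responseBody, $matches) === 1) {
        return $matches[1];
    }

    return null;
}

// Usage with a made-up response body.
$body = '<response><str name="report.pdf">Hello world</str></response>';
echo parseExtractedText($body, 'report.pdf'); // prints "Hello world"
```

With the parsing isolated like this, getContent() would only validate its input, send the request, and delegate to the helper. Passing the delimiter to preg_quote() is a change from the listing, made so that a slash in the file name cannot end the pattern early.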

<?php

namespace SilverStripe\TextExtraction\Extractor;

use Exception;
use GuzzleHttp\Client;
use InvalidArgumentException;
use Psr\Log\LoggerInterface;
use SilverStripe\Assets\File;
use SilverStripe\Core\Injector\Injector;

/**
 * Text extractor that calls an Apache Solr instance
 * and extracts content via the "ExtractingRequestHandler" endpoint.
 * Does not alter the Solr index itself, but uses it purely
 * for its file parsing abilities.
 *
 * @author ischommer
 * @see  http://wiki.apache.org/solr/ExtractingRequestHandler
 */
class SolrCellTextExtractor extends FileTextExtractor
{
    /**
     * Base URL to use for Solr text extraction.
     * E.g. http://localhost:8983/solr/update/extract
     *
     * @config
     * @var string
     */
    private static $base_url;
Issue: The private property $base_url is not used, and could be removed.

    /**
     * @var int
     * @config
     */
    private static $priority = 75;
Issue: The private property $priority is not used, and could be removed.

    /**
     * @var Client
     */
    protected $httpClient;

    /**
     * @return Client
     */
    public function getHttpClient()
    {
        if (!$this->httpClient) {
            $this->httpClient = new Client();
        }

        return $this->httpClient;
    }

    /**
     * @param  Client $client
     * @return $this
     */
    public function setHttpClient(Client $client)
    {
        $this->httpClient = $client;
        return $this;
    }

    /**
     * @return string
     */
    public function isAvailable()
    {
        $url = $this->config()->get('base_url');

        return (bool) $url;
Bug (best practice): The expression return (bool)$url returns the type boolean which is incompatible with the documented return type string.
    }

    /**
     * @param  string $extension
     * @return bool
     */
    public function supportsExtension($extension)
    {
        return in_array(
            strtolower($extension),
            [
                'pdf', 'doc', 'docx', 'xls', 'xlsx',
                'epub', 'rtf', 'odt', 'fodt', 'ods', 'fods',
                'ppt', 'pptx', 'odp', 'fodp', 'csv'
            ]
        );
    }

    /**
     * @param  string $mime
     * @return bool
     */
    public function supportsMime($mime)
    {
        // Rely on supportsExtension
        return false;
    }

    /**
     * @param File|string $file
     * @return string
     * @throws InvalidArgumentException
     */
    public function getContent($file)
    {
        if (!$file || (is_string($file) && !file_exists($file))) {
            // no file
            return '';
        }

        $fileName = $file instanceof File ? $file->getFilename() : basename($file);
        $client = $this->getHttpClient();

        // Get and validate base URL
        $baseUrl = $this->config()->get('base_url');
        if (!$this->config()->get('base_url')) {
            throw new InvalidArgumentException('SolrCellTextExtractor.base_url not specified');
        }

        try {
            $path = $this->getPathFromFile($file);
Bug: It seems like $file can also be of type string; however, parameter $file of SilverStripe\TextExtract...ctor::getPathFromFile() does only seem to accept SilverStripe\Assets\File, maybe add an additional type check? (Ignorable by Annotation)

If this is a false-positive, you can also ignore this issue in your code via the ignore-type annotation:

            $path = $this->getPathFromFile(/** @scrutinizer ignore-type */ $file);

            $request = $client
                ->post($baseUrl)
                ->addPostFields(['extractOnly' => 'true', 'extractFormat' => 'text'])
Bug: The method addPostFields() does not exist on Psr\Http\Message\ResponseInterface. (Ignorable by Annotation)

If this is a false-positive, you can also ignore this issue in your code via the ignore-call annotation:

                ->/** @scrutinizer ignore-call */ addPostFields(['extractOnly' => 'true', 'extractFormat' => 'text'])

This check looks for calls to methods that do not seem to exist on a given type. It looks for the method on the type itself as well as in inherited classes or implemented interfaces.

This is most likely a typographical error or the method has been renamed.

                ->addPostFiles(['myfile' => $path]);
            $response = $request->send();
        } catch (InvalidArgumentException $e) {
            $msg = sprintf(
                'Error extracting text from "%s" (message: %s)',
                $fileName,
                $e->getMessage()
            );
            Injector::inst()->get(LoggerInterface::class)->notice($msg);

            return null;
        } catch (Exception $e) {
            // Catch other errors that Tika can throw via Guzzle but are not caught and break Solr search
            // query in some cases.
            $msg = sprintf(
                'Tika server error attempting to extract from "%s" (message: %s)',
                $path,
                $e->getMessage()
            );

            Injector::inst()->get(LoggerInterface::class)->notice($msg);

            return null;
        }

        // Just initialise it, it doesn't take much.
        $matches = [];

        // Use preg match to avoid SimpleXML running out of memory on large text nodes
        preg_match(
            sprintf('/\<str name\="%s"\>(.*?)\<\/str\>/s', preg_quote($fileName)),
            (string)$response->getBody(),
            $matches
        );

        return $matches ? $matches[1] : null;
    }
}
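The type-check issue flagged on the getPathFromFile() call suggests handling the string case explicitly before the call. The sketch below shows one way that guard might look. The function resolveLocalPath() and its callback parameter are hypothetical; the callback stands in for getPathFromFile() so the example runs without SilverStripe installed.

```php
<?php

/**
 * Hypothetical guard: resolve a local path from either a path string or a
 * file object. $pathFromObject stands in for getPathFromFile() in the listing.
 */
function resolveLocalPath($file, callable $pathFromObject): string
{
    if (is_string($file)) {
        // A string is treated as a path that must already exist on disk.
        if ($file === '' || !file_exists($file)) {
            throw new InvalidArgumentException('File not found: ' . $file);
        }

        return $file;
    }

    if (is_object($file)) {
        // Only objects reach the callback, so its parameter type is safe.
        return $pathFromObject($file);
    }

    throw new InvalidArgumentException('Expected a file object or a path string');
}

// Usage with an existing path string; the callback is not reached here.
echo resolveLocalPath(__FILE__, function ($file) {
    return '/unused';
});
```

Because both failure cases throw InvalidArgumentException, the first catch block in getContent() would log them with $fileName, and the second catch block would no longer risk referencing an unset $path for this particular failure.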