Completed: Pull Request (master, #28), created 08:00 by unknown

HTMLTextExtractor   A

Complexity

Total Complexity 4

Size/Duplication

Total Lines 70
Duplicated Lines 0 %

Coupling/Cohesion

Components 0
Dependencies 1

Importance

Changes 6
Bugs 0 Features 1
Metric   Value
c        6
b        0
f        1
dl       0
loc      70
wmc      4
lcom     0
cbo      1
rs       10

4 Methods

Rating   Name                  Duplication   Size   Complexity
A        isAvailable()         0             4      1
A        supportsExtension()   0             7      1
A        supportsMime()        0             4      1
B        getContent()          0             33     1
<?php

/**
 * Text extractor that uses the PHP function strip_tags() to get just the text. OK for indexing, not the best for readable text.
 * @author mstephens
 *
 */
class HTMLTextExtractor extends FileTextExtractor
Issue (Coding Style, Compatibility):
PSR1 recommends that each class must be in a namespace of at least one level to avoid collisions.

You can fix this by adding a namespace to your class:

namespace YourVendor;

class YourClass { }

When choosing a vendor namespace, try to pick something that is not too generic to avoid conflicts with other libraries.

{
    public function isAvailable()
    {
        return true;
    }

    public function supportsExtension($extension)
    {
        return in_array(
            strtolower($extension),
            array("html", "htm", "xhtml")
        );
    }

    public function supportsMime($mime)
    {
        return strtolower($mime) === 'text/html';
    }

    /**
     * Lower priority because it's not the most clever HTML extraction. If there is something better, use it.
     *
     * @config
     * @var integer
     */
    private static $priority = 10;
Issue (Comprehensibility):
Consider using a different property name as you override a private property of the parent class.

Issue (Unused Code):
The property $priority is not used and could be removed.

This check marks private properties in classes that are never used. Those properties can be removed.

    /**
     * Extracts content using strip_tags(),
     * combined with regular expressions to remove non-content tags like <style> or <script>,
     * as well as adding line breaks after block tags.
     *
     * @param string $path
     * @return string
     */
    public function getContent($path)
    {
        $content = file_get_contents($path);
        // Yes, yes, regex'ing HTML is evil.
        // Since we don't care about well-formedness or markup here, it does the job.
        $content = preg_replace(
            array(
                // Remove invisible content
                '@<head[^>]*?>.*?</head>@siu',
                '@<style[^>]*?>.*?</style>@siu',
                '@<script[^>]*?.*?</script>@siu',
                '@<object[^>]*?.*?</object>@siu',
                '@<embed[^>]*?.*?</embed>@siu',
                '@<applet[^>]*?.*?</applet>@siu',
                '@<noframes[^>]*?.*?</noframes>@siu',
                '@<noscript[^>]*?.*?</noscript>@siu',
                '@<noembed[^>]*?.*?</noembed>@siu',
                // Add line breaks before and after blocks
                '@</?((address)|(blockquote)|(center)|(del))@iu',
                '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
                '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
                '@</?((table)|(th)|(td)|(caption))@iu',
                '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
                '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
                '@</?((frameset)|(frame)|(iframe))@iu',
            ),
            array(
                ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', "$0", "$0", "$0", "$0", "$0", "$0", "$0", "$0",
            ),
            $content
        );
        return strip_tags($content);
    }
}
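
The following is a minimal sketch of how the pattern and replacement arrays in getContent() behave, assuming a reduced pattern list and an HTML string as input instead of a file path. The function name sketchExtractText() and its $blockReplacement parameter are hypothetical and are not part of the pull request. With "$0" as the block replacement, as in the listing, the matched tag prefix is put back unchanged, so no line break is inserted. The "\n$0" variant is only an assumption about what the "Add line breaks" comment intends.

```php
<?php
// Hypothetical helper for illustration only; not part of the reviewed class.
// It applies a reduced version of the listing's patterns to an HTML string.
function sketchExtractText($html, $blockReplacement)
{
    $content = preg_replace(
        array(
            // Invisible content: each whole element is replaced by one space
            '@<head[^>]*?>.*?</head>@siu',
            '@<script[^>]*?.*?</script>@siu',
            // Block tags: only the "<p" or "</p" prefix is matched, not the whole tag
            '@</?((div)|(p))@iu',
        ),
        // Replacements pair up with the patterns by position
        array(' ', ' ', $blockReplacement),
        $html
    );

    // Remove whatever tags are left after the regex pass
    return strip_tags($content);
}

$html = '<head><title>T</title></head><p>One</p><script>x()</script><p>Two</p>';

// As in the listing: "$0" re-inserts the match, so the text stays on one line.
var_dump(sketchExtractText($html, '$0'));

// Assumed intent: a newline is placed before every matched block-tag prefix.
var_dump(sketchExtractText($html, "\n$0"));
```

Since the block patterns match only a tag prefix, "<p" also matches the start of longer tag names such as "<pre". That is harmless when the replacement is "$0" but worth keeping in mind if a newline is added.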