Completed
Push — master ( 7ba6fb...31ed5b )
by Dev
39:27 queued 38:12
created

Crawler::getCacheFolder()   A

Complexity

Conditions 1
Paths 1

Size

Total Lines 3
Code Lines 1

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 2
CRAP Score 1

Importance

Changes 0
Metric Value
cc 1
eloc 1
nc 1
nop 0
dl 0
loc 3
ccs 2
cts 2
cp 1
crap 1
rs 10
c 0
b 0
f 0
<?php

namespace PiedWeb\SeoPocketCrawler;

use PiedWeb\UrlHarvester\Harvest;
use PiedWeb\UrlHarvester\Indexable;
use Spatie\Robots\RobotsTxt;

class Crawler
{
    protected $userAgent;
    protected $project;
    protected $ignore;
    protected $limit;

    protected $currentClick = 0;

    protected $counter = 0;

    protected $base;
    protected $urls = [];

    public function __construct(string $startUrl, string $ignore, int $limit, string $userAgent)
    {
        $this->urls[$startUrl] = null;
        $this->base = Harvest::getDomainAndSchemeFrom($startUrl);
        $this->project = preg_replace("([^\w\s\d\-_~,;\[\]\(\).])", '', $startUrl);
        $this->ignore = new RobotsTxt($ignore);
        $this->userAgent = $userAgent;
        $this->limit = $limit;

        $this->initRecorderAndCache();
    }

    public function getDataFolder()
    {
        return __DIR__.'/../data/'.$this->project;
    }

    public function getCacheFolder()
    {
        return __DIR__.'/../cache/'.$this->project;
    }

    protected function initRecorderAndCache()
    {
        $this->recorder = new Recorder($this->getDataFolder());
Bug, Best Practice
The property recorder does not exist. Although not strictly required by PHP, it is generally a best practice to declare properties explicitly.
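A minimal sketch of what declaring the property could look like. RecorderStub and CrawlerSketch are hypothetical stand-ins so the example runs on its own; they are not the reviewed classes.

```php
<?php

// Hypothetical stand-in for the Recorder class, only so this sketch is self-contained.
class RecorderStub
{
    public $folder;

    public function __construct(string $folder)
    {
        $this->folder = $folder;
    }
}

class CrawlerSketch
{
    /** @var RecorderStub declared explicitly instead of being created on first assignment */
    protected $recorder;

    public function __construct(string $dataFolder)
    {
        $this->recorder = new RecorderStub($dataFolder);
    }

    // Small accessor added for the sketch so the declared property can be observed.
    public function getRecorderFolder(): string
    {
        return $this->recorder->folder;
    }
}
```

Declaring the property next to the other protected members keeps the class definition complete and lets static analysis infer its type.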

        exec('rm -rf '.$this->getDataFolder());
        exec('rm -rf '.$this->getCacheFolder());

        if (!file_exists($this->getDataFolder())) {
            mkdir($this->getDataFolder());
            mkdir($this->getDataFolder().'/links');
            mkdir($this->getCacheFolder());
        }
    }

    public function crawl(bool $debug = false)
    {
        $nothingUpdated = true;

        if ($debug) {
            echo PHP_EOL.PHP_EOL.'// -----'.PHP_EOL.'// '.$this->counter.' crawled / '
                        .count($this->urls).' found '.PHP_EOL.'// -----'.PHP_EOL;
        }

        foreach ($this->urls as $urlToParse => $url) {
            if (null !== $url && (false === $url->can_be_crawled || true === $url->can_be_crawled)) { // already crawled
                continue;
            }

            if ($debug) {
                echo '    '.$urlToParse.PHP_EOL;
            }

            $nothingUpdated = false;
            ++$this->counter;

            $this->harvest($urlToParse);
        }

        ++$this->currentClick;

        $record = $nothingUpdated || $this->currentClick >= $this->limit;

        return $record ? $this->recorder->record($this->urls) : $this->crawl($debug);
Bug
Are you sure about using the return value of $this->recorder->record($this->urls)? The targeted method PiedWeb\SeoPocketCrawler\Recorder::record() seems to always return null.

This check looks for function or method calls that always return null and whose return value is used.

class A
{
    function getObject()
    {
        return null;
    }
}

$a = new A();
if ($a->getObject()) {
}

The method getObject() can return nothing but null, so it makes no sense to use the return value.

The reason is most likely that a function or method is incomplete or has been reduced for debug purposes.
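One possible way to address this, sketched under assumptions: the recursion is replaced by a loop with the same stop condition, and record() is called as a plain statement so nothing is read from it. RecorderSketch, CrawlLoopSketch and crawlOnePass() are hypothetical names introduced for illustration only.

```php
<?php

// Hypothetical recorder whose record() returns nothing, as the check reports for Recorder::record().
class RecorderSketch
{
    public $recorded = [];

    public function record(array $urls): void
    {
        $this->recorded = array_keys($urls);
    }
}

class CrawlLoopSketch
{
    public $recorder;
    protected $urls;
    protected $limit;
    protected $currentClick = 0;

    public function __construct(array $urls, int $limit)
    {
        $this->urls = $urls;
        $this->limit = $limit;
        $this->recorder = new RecorderSketch();
    }

    // Placeholder for one pass over the URL list; returns true when no URL was left to crawl.
    protected function crawlOnePass(): bool
    {
        return true;
    }

    public function crawl(): void
    {
        // Same stop condition as the reviewed code: nothing updated, or click limit reached.
        do {
            $nothingUpdated = $this->crawlOnePass();
            ++$this->currentClick;
        } while (!$nothingUpdated && $this->currentClick < $this->limit);

        // The call stands on its own; no return value is read from it.
        $this->recorder->record($this->urls);
    }
}
```

Whether a loop or the original recursion is preferable is a design choice; the point of the sketch is only that a method returning nothing is not used inside a return expression.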
    }

    protected function cache($harvest)
    {
        if (false === strpos($harvest->getResponse()->getContentType(), 'text/html')) {
            return;
        }

        $url = ltrim($harvest->getAbsoluteInternalLink($harvest->getResponse()->getEffectiveUrl()), '/');
        $urlPart = explode('/', $url);
        $folder = $this->getCacheFolder();

        for ($i = 0; $i < count($urlPart); ++$i) {
Performance, Best Practice
It seems like you are calling the size function count() as part of the test condition. You might want to compute the size beforehand, and not on each iteration.

If the size of the collection does not change during the iteration, it is generally a good practice to compute it beforehand, and not on each iteration:

for ($i=0; $i<count($array); $i++) { // calls count() on each iteration
}

// Better
for ($i=0, $c=count($array); $i<$c; $i++) { // calls count() just once
}
            if ($i == count($urlPart) - 1) {
                $filename = empty($urlPart[$i]) ? 'index.html' : $urlPart[$i];
                @file_put_contents($folder.'/'.$filename, $harvest->getResponse()->getContent());
Security, Best Practice
It seems like you do not handle an error condition for file_put_contents(). This can introduce security issues, and is generally not recommended. (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-unhandled annotation:

/** @scrutinizer ignore-unhandled */ @file_put_contents($folder.'/'.$filename, $harvest->getResponse()->getContent());

If you suppress an error, we recommend checking for the error condition explicitly:

// For example instead of
@mkdir($dir);

// Better use
if (@mkdir($dir) === false) {
    throw new \RuntimeException('The directory '.$dir.' could not be created.');
}
            } else {
                $folder .= '/'.$urlPart[$i];
                if (!file_exists($folder)) {
                    @mkdir($folder);
Security, Best Practice
It seems like you do not handle an error condition for mkdir(). This can introduce security issues, and is generally not recommended. (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-unhandled annotation:

/** @scrutinizer ignore-unhandled */ @mkdir($folder);

If you suppress an error, we recommend checking for the error condition explicitly:

// For example instead of
@mkdir($dir);

// Better use
if (@mkdir($dir) === false) {
    throw new \RuntimeException('The directory '.$dir.' could not be created.');
}
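The count(), file_put_contents() and mkdir() notes on the cache loop could be addressed together. The following is a sketch only: writeToCache() is a hypothetical free function, not a method of the reviewed class, and it takes the relative URL and content as plain strings instead of a Harvest object. It also tests for an empty string instead of using empty(), which is a small behavioural difference.

```php
<?php

/**
 * Hypothetical helper: writes $content under $baseFolder, following the
 * path segments of $relativeUrl, and reports failures explicitly.
 */
function writeToCache(string $baseFolder, string $relativeUrl, string $content): string
{
    $parts = explode('/', ltrim($relativeUrl, '/'));
    $last = count($parts) - 1; // computed once, not on every iteration
    $folder = $baseFolder;

    for ($i = 0; $i < $last; ++$i) {
        $folder .= '/'.$parts[$i];
        // The second is_dir() covers the case where the directory appeared in the meantime.
        if (!is_dir($folder) && false === @mkdir($folder, 0777, true) && !is_dir($folder)) {
            throw new \RuntimeException('The directory '.$folder.' could not be created.');
        }
    }

    // A trailing slash leaves an empty last segment, which maps to index.html.
    $filename = '' === $parts[$last] ? 'index.html' : $parts[$last];
    $path = $folder.'/'.$filename;

    if (false === @file_put_contents($path, $content)) {
        throw new \RuntimeException('The file '.$path.' could not be written.');
    }

    return $path;
}
```

Throwing is one option among several; logging the failure and skipping the page may suit a crawler better, since a single unwritable path would otherwise stop the whole run.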
                }
            }
        }
    }

    protected function harvest(string $urlToParse)
    {
        $url = $this->urls[$urlToParse] = $this->urls[$urlToParse] ?? new Url($urlToParse, $this->currentClick);

        $url->updated_at = date('Ymd');
        $url->can_be_crawled = $this->ignore->allows($urlToParse, $this->userAgent);

        if (false === $url->can_be_crawled) {
            return;
        }

        $harvest = Harvest::fromUrl($urlToParse, $this->userAgent);

        if (!$harvest instanceof Harvest) {
            $url->indexable = Indexable::NOT_INDEXABLE_NETWORK_ERROR;

            return;
        }

        $url->indexable = $harvest->isIndexable();

        if (Indexable::NOT_INDEXABLE_3XX === $url->indexable) {
            if ($redir = false !== $harvest->getRedirection()) {
                $links = Harvest::LINK_INTERNAL === $harvest->getType($redir) ? [$redir] : [];
The condition PiedWeb\UrlHarvester\Har...arvest->getType($redir) is always false.

Bug
$redir of type true is incompatible with the type string expected by parameter $url of PiedWeb\UrlHarvester\Harvest::getType(). (Ignorable by annotation)

If this is a false positive, you can also ignore this issue in your code via the ignore-type annotation:

$links = Harvest::LINK_INTERNAL === $harvest->getType(/** @scrutinizer ignore-type */ $redir) ? [$redir] : [];

Unused Code
The assignment to $links is dead and can be removed.
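The first two notes most likely share one cause: in PHP, !== binds more tightly than =, so $redir receives the boolean result of the comparison and not the redirection target. A small sketch of the difference follows; getRedirectionStub() is a hypothetical stand-in that is assumed to return a URL string, or false when there is no redirection.

```php
<?php

// Hypothetical stand-in for getRedirection(): a target URL, or false when there is none.
function getRedirectionStub(bool $redirects)
{
    return $redirects ? 'https://example.com/new' : false;
}

// As in the reviewed line: the comparison is evaluated first, so a boolean is assigned.
$asReviewed = false !== getRedirectionStub(true);

// With parentheses, the URL itself is assigned first and then compared to false.
$links = [];
if (false !== ($redir = getRedirectionStub(true))) {
    $links = [$redir];
}
```

This only covers the type problem. The dead-assignment note would still apply until $links is actually used, for example by feeding it into the list of URLs to crawl.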
            }
        } else {
            $this->cache($harvest);

            $this->recorder->recordOutboundLink($url, $harvest->getLinks());

            $url->links = count($harvest->getLinks());
            $url->links_duplicate = $harvest->getNbrDuplicateLinks();
            $url->links_internal = count($harvest->getLinks(Harvest::LINK_INTERNAL));
            $url->links_self = count($harvest->getLinks(Harvest::LINK_SELF));
            $url->links_sub = count($harvest->getLinks(Harvest::LINK_SUB));
            $url->links_external = count($harvest->getLinks(Harvest::LINK_EXTERNAL));

            $url->ratio_text_code = $harvest->getRatioTxtCode();
            $url->load_time = $harvest->getResponse()->getInfo('total_time');
            $url->size = $harvest->getResponse()->getInfo('size_download');

            $breadcrumb = $harvest->getBreadCrumb();
            if (is_array($breadcrumb)) {
                $url->breadcrumb_level = count($breadcrumb);
Bug
The property breadcrumb_level does not seem to exist on PiedWeb\SeoPocketCrawler\Url.
                $url->breadcrumb_fisrt = isset($breadcrumb[1]) ? $breadcrumb[1]->getCleanName() : '';
Bug
The property breadcrumb_fisrt does not seem to exist on PiedWeb\SeoPocketCrawler\Url.
                $url->breadcrumb_text = $harvest->getBreadCrumb('//');
Bug
The property breadcrumb_text does not seem to exist on PiedWeb\SeoPocketCrawler\Url.
            }

            $url->title = $harvest->getUniqueTag('head title') ?? '';
            $url->kws = ','.implode(',', $harvest->getKws()).',';
            $url->h1 = $harvest->getUniqueTag('h1') ?? '';
        }

        foreach ($harvest->getLinks(Harvest::LINK_INTERNAL) as $link) {
            $linkUrl = $link->getPageUrl();
            $this->urls[$linkUrl] = $this->urls[$linkUrl] ?? new Url($linkUrl, ($this->currentClick + 1));
            $this->recorder->recordInboundLink($url, $this->urls[$linkUrl]);
            ++$this->urls[$linkUrl]->inboundlinks;
        }
    }

    public function recordLink(Url $from, To $url)
Bug
The type PiedWeb\SeoPocketCrawler\To was not found. Maybe you did not declare it correctly or list all dependencies?

The issue could also be caused by a filter entry in the build configuration. If the path has been excluded in your configuration, e.g. excluded_paths: ["lib/*"], you can move it to the dependency path list as follows:

filter:
    dependency_paths: ["lib/*"]

For further information see https://scrutinizer-ci.com/docs/tools/php/php-scrutinizer/#list-dependency-paths
    {
        file_put_contents($this->getDataFolder().'/links/To_'.$url->sha1.'.txt', $from->uri, FILE_APPEND);
    }
}
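For the missing To type reported on recordLink(), one possibility is that both parameters were meant to be Url objects. The sketch below assumes exactly that. UrlSketch is a hypothetical value object; its uri and sha1 members are guessed from how recordLink() reads them, since the reviewed Url class is not shown here. The data folder is passed in as a parameter, and a line break is appended after each entry, which the reviewed code does not do.

```php
<?php

// Hypothetical minimal Url value object; uri and sha1 are assumptions based on recordLink().
class UrlSketch
{
    public $uri;
    public $sha1;

    public function __construct(string $uri)
    {
        $this->uri = $uri;
        $this->sha1 = sha1($uri);
    }
}

// Both ends of the link use the same declared type instead of the missing To class.
function recordLink(string $dataFolder, UrlSketch $from, UrlSketch $to): void
{
    $file = $dataFolder.'/links/To_'.$to->sha1.'.txt';

    // Append the source URI and check the result instead of ignoring it.
    if (false === file_put_contents($file, $from->uri.PHP_EOL, FILE_APPEND)) {
        throw new \RuntimeException('The file '.$file.' could not be written.');
    }
}
```

If the method is no longer called anywhere, removing it may be the simpler fix, given that the Recorder already exposes recordInboundLink() and recordOutboundLink().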