Failed Conditions
Push — interwiki-remove-golucky ( 52fcdb...768be5 )
by Henry
15:30 queued 12:50
created

Doku_Indexer::tokenizer()   B

Complexity

Conditions 11
Paths 108

Size

Total Lines 38

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric  Value
cc      11
nc      108
nop     2
dl      0
loc     38
rs      7.25
c       0
b       0
f       0

How to fix   Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, that is usually a good sign that the commented part should be extracted into a new method, with the comment serving as a starting point for the new method's name.
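
As a rough illustration of the advice above, here is a minimal sketch of the Extract Method refactoring in PHP. All function names (`normalizeText`, `splitIntoWords`, `tokenize`) are hypothetical and are not taken from the analyzed code; the sketch assumes a long method whose body was previously split into sections by the comments "normalize the text" and "split into words and drop short ones".

```php
<?php
// Hypothetical example: these names are assumptions for illustration only
// and do not exist in the analyzed code base.

/**
 * Replace everything except letters, digits and whitespace with a space,
 * then trim and lowercase the result.
 * Extracted from the section formerly marked "// normalize the text".
 */
function normalizeText(string $text): string
{
    $stripped = preg_replace('/[^\p{L}\p{N}\s]+/u', ' ', $text) ?? '';
    return strtolower(trim($stripped));
}

/**
 * Split on whitespace and drop tokens shorter than $minLength bytes.
 * Extracted from "// split into words and drop short ones".
 */
function splitIntoWords(string $text, int $minLength): array
{
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_filter(
        $words,
        function (string $word) use ($minLength): bool {
            return strlen($word) >= $minLength;
        }
    ));
}

/** The formerly long method now reads as a short list of named steps. */
function tokenize(string $text, int $minLength = 2): array
{
    return splitIntoWords(normalizeText($text), $minLength);
}

print_r(tokenize('Small methods, good names!'));
```

Each extracted function is small enough to be tested on its own, and the comment that used to introduce the section has become its name.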

Commonly applied refactorings include:

1
<?php
2
/**
3
 * Functions to create the fulltext search index
4
 *
5
 * @license    GPL 2 (http://www.gnu.org/licenses/gpl.html)
6
 * @author     Andreas Gohr <[email protected]>
7
 * @author     Tom N Harris <[email protected]>
8
 */
9
10
use dokuwiki\Extension\Event;
11
12
// Version tag used to force rebuild on upgrade
13
define('INDEXER_VERSION', 8);
14
15
// set the minimum token length to use in the index (note, this doesn't apply to numeric tokens)
16
if (!defined('IDX_MINWORDLENGTH')) define('IDX_MINWORDLENGTH',2);
17
18
/**
19
 * Version of the indexer taking into consideration the external tokenizer.
20
 * The indexer is only compatible with data written by the same version.
21
 *
22
 * @triggers INDEXER_VERSION_GET
23
 * Plugins that modify what gets indexed should hook this event and
24
 * add their version info to the event data like so:
25
 *     $data[$plugin_name] = $plugin_version;
26
 *
27
 * @author Tom N Harris <[email protected]>
28
 * @author Michael Hamann <[email protected]>
29
 *
30
 * @return int|string
31
 */
32
function idx_get_version(){
33
    static $indexer_version = null;
34
    if ($indexer_version == null) {
35
        $version = INDEXER_VERSION;
36
37
        // DokuWiki version is included for the convenience of plugins
38
        $data = array('dokuwiki'=>$version);
39
        Event::createAndTrigger('INDEXER_VERSION_GET', $data, null, false);
40
        unset($data['dokuwiki']); // this needs to be first
41
        ksort($data);
42
        foreach ($data as $plugin=>$vers)
43
            $version .= '+'.$plugin.'='.$vers;
44
        $indexer_version = $version;
45
    }
46
    return $indexer_version;
47
}
48
49
/**
50
 * Measure the length of a string.
51
 * Differs from strlen in handling of asian characters.
52
 *
53
 * @author Tom N Harris <[email protected]>
54
 *
55
 * @param string $w
56
 * @return int
57
 */
58
function wordlen($w){
59
    $l = strlen($w);
60
    // If left alone, all chinese "words" will get put into w3.idx
61
    // So the "length" of a "word" is faked
62
    if(preg_match_all('/[\xE2-\xEF]/',$w,$leadbytes)) {
63
        foreach($leadbytes[0] as $b)
64
            $l += ord($b) - 0xE1;
65
    }
66
    return $l;
67
}
68
69
/**
70
 * Class that encapsulates operations on the indexer database.
71
 *
72
 * @author Tom N Harris <[email protected]>
73
 */
74
class Doku_Indexer {
75
    /**
76
     * @var array $pidCache Cache for getPID()
77
     */
78
    protected $pidCache = array();
79
80
    /**
81
     * Adds the contents of a page to the fulltext index
82
     *
83
     * The added text replaces previous words for the same page.
84
     * An empty value erases the page.
85
     *
86
     * @param string    $page   a page name
87
     * @param string    $text   the body of the page
88
     * @return string|boolean  the function completed successfully
89
     *
90
     * @author Tom N Harris <[email protected]>
91
     * @author Andreas Gohr <[email protected]>
92
     */
93
    public function addPageWords($page, $text) {
94
        if (!$this->lock())
95
            return "locked";
96
97
        // load known documents
98
        $pid = $this->getPIDNoLock($page);
99
        if ($pid === false) {
100
            $this->unlock();
101
            return false;
102
        }
103
104
        $pagewords = array();
105
        // get word usage in page
106
        $words = $this->getPageWords($text);
107
        if ($words === false) {
108
            $this->unlock();
109
            return false;
110
        }
111
112
        if (!empty($words)) {
113
            foreach (array_keys($words) as $wlen) {
114
                $index = $this->getIndex('i', $wlen);
115
                foreach ($words[$wlen] as $wid => $freq) {
116
                    $idx = ($wid<count($index)) ? $index[$wid] : '';
117
                    $index[$wid] = $this->updateTuple($idx, $pid, $freq);
It seems that $pid, defined by $this->getPIDNoLock($page) on line 98, can also be of type boolean; however, Doku_Indexer::updateTuple() only seems to accept string|integer. Consider adding an additional type check.

If a method or function can return values of several different types, and you cannot be sure that only one of them is possible in this context, we recommend adding a type check:

/**
 * @return array|string
 */
function returnsDifferentValues($x) {
    if ($x) {
        return 'foo';
    }

    return array();
}

$x = returnsDifferentValues($y);
if (is_array($x)) {
    // $x is an array.
}

If this is a common case that PHP Analyzer should handle natively, please let us know by opening an issue.

118
                    $pagewords[] = "$wlen*$wid";
119
                }
120
                if (!$this->saveIndex('i', $wlen, $index)) {
121
                    $this->unlock();
122
                    return false;
123
                }
124
            }
125
        }
126
127
        // Remove obsolete index entries
128
        $pageword_idx = $this->getIndexKey('pageword', '', $pid);
It seems that $pid, defined by $this->getPIDNoLock($page) on line 98, can also be of type boolean; however, Doku_Indexer::getIndexKey() only seems to accept integer. Consider adding an additional type check.

129
        if ($pageword_idx !== '') {
130
            $oldwords = explode(':',$pageword_idx);
131
            $delwords = array_diff($oldwords, $pagewords);
132
            $upwords = array();
133
            foreach ($delwords as $word) {
134
                if ($word != '') {
135
                    list($wlen,$wid) = explode('*', $word);
136
                    $wid = (int)$wid;
137
                    $upwords[$wlen][] = $wid;
138
                }
139
            }
140
            foreach ($upwords as $wlen => $widx) {
141
                $index = $this->getIndex('i', $wlen);
142
                foreach ($widx as $wid) {
143
                    $index[$wid] = $this->updateTuple($index[$wid], $pid, 0);
It seems that $pid, defined by $this->getPIDNoLock($page) on line 98, can also be of type boolean; however, Doku_Indexer::updateTuple() only seems to accept string|integer. Consider adding an additional type check.

144
                }
145
                $this->saveIndex('i', $wlen, $index);
146
            }
147
        }
148
        // Save the reverse index
149
        $pageword_idx = join(':', $pagewords);
150
        if (!$this->saveIndexKey('pageword', '', $pid, $pageword_idx)) {
It seems that $pid, defined by $this->getPIDNoLock($page) on line 98, can also be of type boolean; however, Doku_Indexer::saveIndexKey() only seems to accept integer. Consider adding an additional type check.

151
            $this->unlock();
152
            return false;
153
        }
154
155
        $this->unlock();
156
        return true;
157
    }
158
159
    /**
160
     * Split the words in a page and add them to the index.
161
     *
162
     * @param string    $text   content of the page
163
     * @return array            list of word IDs and number of times used
164
     *
165
     * @author Andreas Gohr <[email protected]>
166
     * @author Christopher Smith <[email protected]>
167
     * @author Tom N Harris <[email protected]>
168
     */
169
    protected function getPageWords($text) {
170
171
        $tokens = $this->tokenizer($text);
172
        $tokens = array_count_values($tokens);  // count the frequency of each token
173
174
        $words = array();
175
        foreach ($tokens as $w=>$c) {
176
            $l = wordlen($w);
177
            if (isset($words[$l])){
178
                $words[$l][$w] = $c + (isset($words[$l][$w]) ? $words[$l][$w] : 0);
179
            }else{
180
                $words[$l] = array($w => $c);
181
            }
182
        }
183
184
        // arrive here with $words = array(wordlen => array(word => frequency))
185
        $word_idx_modified = false;
186
        $index = array();   //resulting index
187
        foreach (array_keys($words) as $wlen) {
188
            $word_idx = $this->getIndex('w', $wlen);
189
            foreach ($words[$wlen] as $word => $freq) {
190
                $word = (string)$word;
191
                $wid = array_search($word, $word_idx, true);
192
                if ($wid === false) {
193
                    $wid = count($word_idx);
194
                    $word_idx[] = $word;
195
                    $word_idx_modified = true;
196
                }
197
                if (!isset($index[$wlen]))
198
                    $index[$wlen] = array();
199
                $index[$wlen][$wid] = $freq;
200
            }
201
            // save back the word index
202
            if ($word_idx_modified && !$this->saveIndex('w', $wlen, $word_idx))
203
                return false;
204
        }
205
206
        return $index;
207
    }
208
209
    /**
210
     * Add/update keys to/of the metadata index.
211
     *
212
     * Adding new keys does not remove other keys for the page.
213
     * An empty value will erase the key.
214
     * The $key parameter can be an array to add multiple keys. $value will
215
     * not be used if $key is an array.
216
     *
217
     * @param string    $page   a page name
218
     * @param mixed     $key    a key string or array of key=>value pairs
219
     * @param mixed     $value  the value or list of values
220
     * @return boolean|string     the function completed successfully
221
     *
222
     * @author Tom N Harris <[email protected]>
223
     * @author Michael Hamann <[email protected]>
224
     */
225
    public function addMetaKeys($page, $key, $value=null) {
226
        if (!is_array($key)) {
227
            $key = array($key => $value);
228
        } elseif (!is_null($value)) {
229
            // $key is array, but $value is not null
230
            trigger_error("array passed to addMetaKeys but value is not null", E_USER_WARNING);
231
        }
232
233
        if (!$this->lock())
234
            return "locked";
235
236
        // load known documents
237
        $pid = $this->getPIDNoLock($page);
238
        if ($pid === false) {
239
            $this->unlock();
240
            return false;
241
        }
242
243
        // Special handling for titles so the index file is simpler
244
        if (array_key_exists('title', $key)) {
245
            $value = $key['title'];
246
            if (is_array($value)) {
247
                $value = $value[0];
248
            }
249
            $this->saveIndexKey('title', '', $pid, $value);
It seems that $pid, defined by $this->getPIDNoLock($page) on line 237, can also be of type boolean; however, Doku_Indexer::saveIndexKey() only seems to accept integer. Consider adding an additional type check.

250
            unset($key['title']);
251
        }
252
253
        foreach ($key as $name => $values) {
254
            $metaname = idx_cleanName($name);
255
            $this->addIndexKey('metadata', '', $metaname);
256
            $metaidx = $this->getIndex($metaname.'_i', '');
257
            $metawords = $this->getIndex($metaname.'_w', '');
258
            $addwords = false;
259
260
            if (!is_array($values)) $values = array($values);
261
262
            $val_idx = $this->getIndexKey($metaname.'_p', '', $pid);
It seems that $pid, defined by $this->getPIDNoLock($page) on line 237, can also be of type boolean; however, Doku_Indexer::getIndexKey() only seems to accept integer. Consider adding an additional type check.

263
            if ($val_idx !== '') {
264
                $val_idx = explode(':', $val_idx);
265
                // -1 means remove, 0 keep, 1 add
266
                $val_idx = array_combine($val_idx, array_fill(0, count($val_idx), -1));
267
            } else {
268
                $val_idx = array();
269
            }
270
271
            foreach ($values as $val) {
272
                $val = (string)$val;
273
                if ($val !== "") {
274
                    $id = array_search($val, $metawords, true);
275
                    if ($id === false) {
276
                        // didn't find $val, so we'll add it to the end of metawords and create a placeholder in metaidx
277
                        $id = count($metawords);
278
                        $metawords[$id] = $val;
279
                        $metaidx[$id] = '';
280
                        $addwords = true;
281
                    }
282
                    // test if value is already in the index
283
                    if (isset($val_idx[$id]) && $val_idx[$id] <= 0){
284
                        $val_idx[$id] = 0;
285
                    } else { // else add it
286
                        $val_idx[$id] = 1;
287
                    }
288
                }
289
            }
290
291
            if ($addwords) {
292
                $this->saveIndex($metaname.'_w', '', $metawords);
293
            }
294
            $vals_changed = false;
295
            foreach ($val_idx as $id => $action) {
296
                if ($action == -1) {
297
                    $metaidx[$id] = $this->updateTuple($metaidx[$id], $pid, 0);
It seems that $pid, defined by $this->getPIDNoLock($page) on line 237, can also be of type boolean; however, Doku_Indexer::updateTuple() only seems to accept string|integer. Consider adding an additional type check.

298
                    $vals_changed = true;
299
                    unset($val_idx[$id]);
300
                } elseif ($action == 1) {
301
                    $metaidx[$id] = $this->updateTuple($metaidx[$id], $pid, 1);
It seems that $pid, defined by $this->getPIDNoLock($page) on line 237, can also be of type boolean; however, Doku_Indexer::updateTuple() only seems to accept string|integer. Consider adding an additional type check.

302
                    $vals_changed = true;
303
                }
304
            }
305
306
            if ($vals_changed) {
307
                $this->saveIndex($metaname.'_i', '', $metaidx);
308
                $val_idx = implode(':', array_keys($val_idx));
309
                $this->saveIndexKey($metaname.'_p', '', $pid, $val_idx);
It seems that $pid, defined by $this->getPIDNoLock($page) on line 237, can also be of type boolean; however, Doku_Indexer::saveIndexKey() only seems to accept integer. Consider adding an additional type check.

310
            }
311
312
            unset($metaidx);
313
            unset($metawords);
314
        }
315
316
        $this->unlock();
317
        return true;
318
    }
319
320
    /**
321
     * Rename a page in the search index without changing the indexed content. This function doesn't check if the
322
     * old or new name exists in the filesystem. It returns an error if the old page isn't in the page list of the
323
     * indexer and it deletes all previously indexed content of the new page.
324
     *
325
     * @param string $oldpage The old page name
326
     * @param string $newpage The new page name
327
     * @return string|bool If the page was successfully renamed, can be a message in the case of an error
328
     */
329
    public function renamePage($oldpage, $newpage) {
330
        if (!$this->lock()) return 'locked';
331
332
        $pages = $this->getPages();
333
334
        $id = array_search($oldpage, $pages, true);
335
        if ($id === false) {
336
            $this->unlock();
337
            return 'page is not in index';
338
        }
339
340
        $new_id = array_search($newpage, $pages, true);
341
        if ($new_id !== false) {
342
            // make sure the page is not in the index anymore
343
            if ($this->deletePageNoLock($newpage) !== true) {
344
                $this->unlock(); // release the lock taken at the top of renamePage() before bailing out
                return false;
345
            }
346
347
            $pages[$new_id] = 'deleted:'.time().rand(0, 9999);
348
        }
349
350
        $pages[$id] = $newpage;
351
352
        // update index
353
        if (!$this->saveIndex('page', '', $pages)) {
354
            $this->unlock();
355
            return false;
356
        }
357
358
        // reset the pid cache
359
        $this->pidCache = array();
360
361
        $this->unlock();
362
        return true;
363
    }
364
365
    /**
366
     * Renames a meta value in the index. This doesn't change the meta value in the pages, it assumes that all pages
367
     * will be updated.
368
     *
369
     * @param string $key       The metadata key of which a value shall be changed
370
     * @param string $oldvalue  The old value that shall be renamed
371
     * @param string $newvalue  The new value to which the old value shall be renamed, if exists values will be merged
372
     * @return bool|string      If renaming the value has been successful, false or error message on error.
373
     */
374
    public function renameMetaValue($key, $oldvalue, $newvalue) {
375
        if (!$this->lock()) return 'locked';
376
377
        // change the relation references index
378
        $metavalues = $this->getIndex($key, '_w');
379
        $oldid = array_search($oldvalue, $metavalues, true);
380
        if ($oldid !== false) {
381
            $newid = array_search($newvalue, $metavalues, true);
382
            if ($newid !== false) {
383
                // free memory
384
                unset ($metavalues);
385
386
                // okay, now we have two entries for the same value. we need to merge them.
387
                $indexline = $this->getIndexKey($key.'_i', '', $oldid);
388
                if ($indexline != '') {
389
                    $newindexline = $this->getIndexKey($key.'_i', '', $newid);
390
                    $pagekeys     = $this->getIndex($key.'_p', '');
391
                    $parts = explode(':', $indexline);
392
                    foreach ($parts as $part) {
393
                        list($id, $count) = explode('*', $part);
394
                        $newindexline =  $this->updateTuple($newindexline, $id, $count);
395
396
                        $keyline = explode(':', $pagekeys[$id]);
397
                        // remove old meta value
398
                        $keyline = array_diff($keyline, array($oldid));
399
                        // add new meta value when not already present
400
                        if (!in_array($newid, $keyline)) {
401
                            array_push($keyline, $newid);
402
                        }
403
                        $pagekeys[$id] = implode(':', $keyline);
404
                    }
405
                    $this->saveIndex($key.'_p', '', $pagekeys);
406
                    unset($pagekeys);
407
                    $this->saveIndexKey($key.'_i', '', $oldid, '');
408
                    $this->saveIndexKey($key.'_i', '', $newid, $newindexline);
409
                }
410
            } else {
411
                $metavalues[$oldid] = $newvalue;
412
                if (!$this->saveIndex($key.'_w', '', $metavalues)) {
413
                    $this->unlock();
414
                    return false;
415
                }
416
            }
417
        }
418
419
        $this->unlock();
420
        return true;
421
    }
422
423
    /**
424
     * Remove a page from the index
425
     *
426
     * Erases entries in all known indexes.
427
     *
428
     * @param string    $page   a page name
429
     * @return string|boolean  the function completed successfully
430
     *
431
     * @author Tom N Harris <[email protected]>
432
     */
433
    public function deletePage($page) {
434
        if (!$this->lock())
435
            return "locked";
436
437
        $result = $this->deletePageNoLock($page);
438
439
        $this->unlock();
440
441
        return $result;
442
    }
443
444
    /**
445
     * Remove a page from the index without locking the index, only use this function if the index is already locked
446
     *
447
     * Erases entries in all known indexes.
448
     *
449
     * @param string    $page   a page name
450
     * @return boolean          the function completed successfully
451
     *
452
     * @author Tom N Harris <[email protected]>
453
     */
454
    protected function deletePageNoLock($page) {
455
        // load known documents
456
        $pid = $this->getPIDNoLock($page);
457
        if ($pid === false) {
458
            return false;
459
        }
460
461
        // Remove obsolete index entries
462
        $pageword_idx = $this->getIndexKey('pageword', '', $pid);
It seems that $pid, defined by $this->getPIDNoLock($page) on line 456, can also be of type boolean; however, Doku_Indexer::getIndexKey() only seems to accept integer. Consider adding an additional type check.

463
        if ($pageword_idx !== '') {
464
            $delwords = explode(':',$pageword_idx);
465
            $upwords = array();
466
            foreach ($delwords as $word) {
467
                if ($word != '') {
468
                    list($wlen,$wid) = explode('*', $word);
469
                    $wid = (int)$wid;
470
                    $upwords[$wlen][] = $wid;
471
                }
472
            }
473
            foreach ($upwords as $wlen => $widx) {
474
                $index = $this->getIndex('i', $wlen);
475
                foreach ($widx as $wid) {
476
                    $index[$wid] = $this->updateTuple($index[$wid], $pid, 0);
It seems that $pid, defined by $this->getPIDNoLock($page) on line 456, can also be of type boolean; however, Doku_Indexer::updateTuple() only seems to accept string|integer. Consider adding an additional type check.

477
                }
478
                $this->saveIndex('i', $wlen, $index);
479
            }
480
        }
481
        // Save the reverse index
482
        if (!$this->saveIndexKey('pageword', '', $pid, "")) {
It seems that $pid, defined by $this->getPIDNoLock($page) on line 456, can also be of type boolean; however, Doku_Indexer::saveIndexKey() only seems to accept integer. Consider adding an additional type check.

483
            return false;
484
        }
485
486
        $this->saveIndexKey('title', '', $pid, "");
It seems that $pid, defined by $this->getPIDNoLock($page) on line 456, can also be of type boolean; however, Doku_Indexer::saveIndexKey() only seems to accept integer. Consider adding an additional type check.

487
        $keyidx = $this->getIndex('metadata', '');
488
        foreach ($keyidx as $metaname) {
489
            $val_idx = explode(':', $this->getIndexKey($metaname.'_p', '', $pid));
It seems that $pid, defined by $this->getPIDNoLock($page) on line 456, can also be of type boolean; however, Doku_Indexer::getIndexKey() only seems to accept integer. Consider adding an additional type check.

490
            $meta_idx = $this->getIndex($metaname.'_i', '');
491
            foreach ($val_idx as $id) {
492
                if ($id === '') continue;
493
                $meta_idx[$id] = $this->updateTuple($meta_idx[$id], $pid, 0);
It seems that $pid, defined by $this->getPIDNoLock($page) on line 456, can also be of type boolean; however, Doku_Indexer::updateTuple() only seems to accept string|integer. Consider adding an additional type check.

494
            }
495
            $this->saveIndex($metaname.'_i', '', $meta_idx);
496
            $this->saveIndexKey($metaname.'_p', '', $pid, '');
It seems that $pid, defined by $this->getPIDNoLock($page) on line 456, can also be of type boolean; however, Doku_Indexer::saveIndexKey() only seems to accept integer. Consider adding an additional type check.

497
        }
498
499
        return true;
500
    }
501
502
    /**
503
     * Clear the whole index
504
     *
505
     * @return bool If the index has been cleared successfully
506
     */
507
    public function clear() {
508
        global $conf;
509
510
        if (!$this->lock()) return false;
511
512
        @unlink($conf['indexdir'].'/page.idx');
513
        @unlink($conf['indexdir'].'/title.idx');
514
        @unlink($conf['indexdir'].'/pageword.idx');
515
        @unlink($conf['indexdir'].'/metadata.idx');
516
        $dir = @opendir($conf['indexdir']);
517
        if($dir!==false){
518
            while(($f = readdir($dir)) !== false){
519
                if(substr($f,-4)=='.idx' &&
520
                    (substr($f,0,1)=='i' || substr($f,0,1)=='w'
521
                        || substr($f,-6)=='_w.idx' || substr($f,-6)=='_i.idx' || substr($f,-6)=='_p.idx'))
522
                    @unlink($conf['indexdir']."/$f");
523
            }
524
        }
525
        @unlink($conf['indexdir'].'/lengths.idx');
526
527
        // clear the pid cache
528
        $this->pidCache = array();
529
530
        $this->unlock();
531
        return true;
532
    }
533
534
    /**
535
     * Split the text into words for fulltext search
536
     *
537
     * TODO: does this also need &$stopwords ?
538
     *
539
     * @triggers INDEXER_TEXT_PREPARE
540
     * This event allows plugins to modify the text before it gets tokenized.
541
     * Plugins intercepting this event should also intercept INDEXER_VERSION_GET
542
     *
543
     * @param string    $text   plain text
544
     * @param boolean   $wc     are wildcards allowed?
545
     * @return array            list of words in the text
546
     *
547
     * @author Tom N Harris <[email protected]>
548
     * @author Andreas Gohr <[email protected]>
549
     */
550
    public function tokenizer($text, $wc=false) {
551
        $wc = ($wc) ? '' : '\*';
552
        $stopwords =& idx_get_stopwords();
553
554
        // prepare the text to be tokenized
555
        $evt = new Event('INDEXER_TEXT_PREPARE', $text);
556
        if ($evt->advise_before(true)) {
557
            if (preg_match('/[^0-9A-Za-z ]/u', $text)) {
558
                $text = \dokuwiki\Utf8\Asian::separateAsianWords($text);
559
            }
560
        }
561
        $evt->advise_after();
562
        unset($evt);
563
564
        $text = strtr($text,
565
                       array(
566
                           "\r" => ' ',
567
                           "\n" => ' ',
568
                           "\t" => ' ',
569
                           "\xC2\xAD" => '', //soft-hyphen
570
                       )
571
                     );
572
        if (preg_match('/[^0-9A-Za-z ]/u', $text))
573
            $text = \dokuwiki\Utf8\Clean::stripspecials($text, ' ', '\._\-:'.$wc);
574
575
        $wordlist = explode(' ', $text);
576
        foreach ($wordlist as $i => $word) {
577
            $wordlist[$i] = (preg_match('/[^0-9A-Za-z]/u', $word)) ?
578
                \dokuwiki\Utf8\PhpString::strtolower($word) : strtolower($word);
579
        }
580
581
        foreach ($wordlist as $i => $word) {
582
            if ((!is_numeric($word) && strlen($word) < IDX_MINWORDLENGTH)
583
              || array_search($word, $stopwords, true) !== false)
584
                unset($wordlist[$i]);
585
        }
586
        return array_values($wordlist);
587
    }
588
589
    /**
590
     * Get the numeric PID of a page
591
     *
592
     * @param string $page The page to get the PID for
593
     * @return bool|int The page id on success, false on error
594
     */
595
    public function getPID($page) {
596
        // return PID without locking when it is in the cache
597
        if (isset($this->pidCache[$page])) return $this->pidCache[$page];
598
599
        if (!$this->lock())
600
            return false;
601
602
        // load known documents
603
        $pid = $this->getPIDNoLock($page);
604
        if ($pid === false) {
605
            $this->unlock();
606
            return false;
607
        }
608
609
        $this->unlock();
610
        return $pid;
611
    }
612
613
    /**
614
     * Get the numeric PID of a page without locking the index.
615
     * Only use this function when the index is already locked.
616
     *
617
     * @param string $page The page to get the PID for
618
     * @return bool|int The page id on success, false on error
619
     */
620
    protected function getPIDNoLock($page) {
621
        // avoid expensive addIndexKey operation for the most recently requested pages by using a cache
622
        if (isset($this->pidCache[$page])) return $this->pidCache[$page];
623
        $pid = $this->addIndexKey('page', '', $page);
Bug Compatibility introduced by
The expression $this->addIndexKey('page', '', $page); of type false|integer|string adds the type string to the return on line 628 which is incompatible with the return type documented by Doku_Indexer::getPIDNoLock of type boolean|integer.
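
One way to reconcile the inferred and documented types would be to narrow the value before it is cached and returned. The sketch below is an assumption about how that could be done; normalizeKey() is a hypothetical helper that does not exist in DokuWiki, and whether the string case can actually occur depends on how the index array is built.

```php
<?php
/**
 * Hypothetical helper: narrow an array_search() result to int|false.
 *
 * @param int|string|false $key result of array_search()
 * @return int|false
 */
function normalizeKey($key) {
    if ($key === false) {
        return false;
    }
    // a list read with file() has integer keys; the cast makes that explicit
    return (int)$key;
}
```

With such a helper, the value stored in the PID cache and returned on line 628 would match the documented boolean|integer type.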
624
        // limit cache to 10 entries by discarding the oldest element as in DokuWiki usually only the most recently
625
        // added item will be requested again
626
        if (count($this->pidCache) > 10) array_shift($this->pidCache);
627
        $this->pidCache[$page] = $pid;
628
        return $pid;
629
    }
630
631
    /**
632
     * Get the page id of a numeric PID
633
     *
634
     * @param int $pid The PID to get the page id for
635
     * @return string The page id
636
     */
637
    public function getPageFromPID($pid) {
638
        return $this->getIndexKey('page', '', $pid);
639
    }
640
641
    /**
642
     * Find pages in the fulltext index containing the words,
643
     *
644
     * The search words must be pre-tokenized, meaning only letters and
645
     * numbers with an optional wildcard
646
     *
647
     * The returned array will have the original tokens as keys. Each value
648
     * in the returned list is an array with the page names as keys and the
649
     * number of times that token appears on the page as value.
650
     *
651
     * @param array  $tokens list of words to search for
652
     * @return array         list of page names with usage counts
653
     *
654
     * @author Tom N Harris <[email protected]>
655
     * @author Andreas Gohr <[email protected]>
656
     */
657
    public function lookup(&$tokens) {
658
        $result = array();
659
        $wids = $this->getIndexWords($tokens, $result);
660
        if (empty($wids)) return array();
661
        // load known words and documents
662
        $page_idx = $this->getIndex('page', '');
663
        $docs = array();
664
        foreach (array_keys($wids) as $wlen) {
665
            $wids[$wlen] = array_unique($wids[$wlen]);
666
            $index = $this->getIndex('i', $wlen);
667
            foreach($wids[$wlen] as $ixid) {
668
                if ($ixid < count($index))
669
                    $docs["$wlen*$ixid"] = $this->parseTuples($page_idx, $index[$ixid]);
670
            }
671
        }
672
        // merge found pages into final result array
673
        $final = array();
674
        foreach ($result as $word => $res) {
675
            $final[$word] = array();
676
            foreach ($res as $wid) {
677
                // handle the case when ($ixid < count($index)) has been false
678
                // and thus $docs[$wid] hasn't been set.
679
                if (!isset($docs[$wid])) continue;
680
                $hits = &$docs[$wid];
681
                foreach ($hits as $hitkey => $hitcnt) {
682
                    // make sure the document still exists
683
                    if (!page_exists($hitkey, '', false)) continue;
684
                    if (!isset($final[$word][$hitkey]))
685
                        $final[$word][$hitkey] = $hitcnt;
686
                    else
687
                        $final[$word][$hitkey] += $hitcnt;
688
                }
689
            }
690
        }
691
        return $final;
692
    }
693
694
    /**
695
     * Find pages containing a metadata key.
696
     *
697
     * The metadata values are compared as case-sensitive strings. Pass a
698
     * callback function that returns true or false to use a different
699
     * comparison function. The function will be called with the $value being
700
     * searched for as the first argument, and the word in the index as the
701
     * second argument. The function preg_match can be used directly if the
702
     * values are regexes.
703
     *
704
     * @param string    $key    name of the metadata key to look for
705
     * @param string    $value  search term to look for, must be a string or array of strings
706
     * @param callback  $func   comparison function
707
     * @return array            lists with page names, keys are query values if $value is array
708
     *
709
     * @author Tom N Harris <[email protected]>
710
     * @author Michael Hamann <[email protected]>
711
     */
712
    public function lookupKey($key, &$value, $func=null) {
713
        if (!is_array($value))
714
            $value_array = array($value);
715
        else
716
            $value_array =& $value;
717
718
        // the matching ids for the provided value(s)
719
        $value_ids = array();
720
721
        $metaname = idx_cleanName($key);
722
723
        // get all words in order to search the matching ids
724
        if ($key == 'title') {
725
            $words = $this->getIndex('title', '');
726
        } else {
727
            $words = $this->getIndex($metaname.'_w', '');
728
        }
729
730
        if (!is_null($func)) {
731
            foreach ($value_array as $val) {
732
                foreach ($words as $i => $word) {
733
                    if (call_user_func_array($func, array($val, $word)))
734
                        $value_ids[$i][] = $val;
735
                }
736
            }
737
        } else {
738
            foreach ($value_array as $val) {
739
                $xval = $val;
740
                $caret = '^';
741
                $dollar = '$';
742
                // check for wildcards
743
                if (substr($xval, 0, 1) == '*') {
744
                    $xval = substr($xval, 1);
745
                    $caret = '';
746
                }
747
                if (substr($xval, -1, 1) == '*') {
748
                    $xval = substr($xval, 0, -1);
749
                    $dollar = '';
750
                }
751
                if (!$caret || !$dollar) {
752
                    $re = $caret.preg_quote($xval, '/').$dollar;
753
                    foreach(array_keys(preg_grep('/'.$re.'/', $words)) as $i)
754
                        $value_ids[$i][] = $val;
755
                } else {
756
                    if (($i = array_search($val, $words, true)) !== false)
757
                        $value_ids[$i][] = $val;
758
                }
759
            }
760
        }
761
762
        unset($words); // free the used memory
763
764
        // initialize the result so it won't be null
765
        $result = array();
766
        foreach ($value_array as $val) {
767
            $result[$val] = array();
768
        }
769
770
        $page_idx = $this->getIndex('page', '');
771
772
        // Special handling for titles
773
        if ($key == 'title') {
774
            foreach ($value_ids as $pid => $val_list) {
775
                $page = $page_idx[$pid];
776
                foreach ($val_list as $val) {
777
                    $result[$val][] = $page;
778
                }
779
            }
780
        } else {
781
            // load all lines and pages so the used lines can be taken and matched with the pages
782
            $lines = $this->getIndex($metaname.'_i', '');
783
784
            foreach ($value_ids as $value_id => $val_list) {
785
                // parse the tuples of the form page_id*1:page2_id*1 and so on, return value
786
                // is an array with page_id => 1, page2_id => 1 etc. so take the keys only
787
                $pages = array_keys($this->parseTuples($page_idx, $lines[$value_id]));
788
                foreach ($val_list as $val) {
789
                    $result[$val] = array_merge($result[$val], $pages);
790
                }
791
            }
792
        }
793
        if (!is_array($value)) $result = $result[$value];
794
        return $result;
795
    }
796
797
    /**
798
     * Find the index ID of each search term.
799
     *
800
     * The query terms should only contain valid characters, with a '*' at
801
     * either the beginning or end of the word (or both).
802
     * The $result parameter can be used to merge the index locations with
803
     * the appropriate query term.
804
     *
805
     * @param array  $words  The query terms.
806
     * @param array  $result Set to word => array("length*id" ...)
807
     * @return array         Set to length => array(id ...)
808
     *
809
     * @author Tom N Harris <[email protected]>
810
     */
811
    protected function getIndexWords(&$words, &$result) {
812
        $tokens = array();
813
        $tokenlength = array();
814
        $tokenwild = array();
815
        foreach ($words as $word) {
816
            $result[$word] = array();
817
            $caret = '^';
818
            $dollar = '$';
819
            $xword = $word;
820
            $wlen = wordlen($word);
821
822
            // check for wildcards
823
            if (substr($xword, 0, 1) == '*') {
824
                $xword = substr($xword, 1);
825
                $caret = '';
826
                $wlen -= 1;
827
            }
828
            if (substr($xword, -1, 1) == '*') {
829
                $xword = substr($xword, 0, -1);
830
                $dollar = '';
831
                $wlen -= 1;
832
            }
833
            if ($wlen < IDX_MINWORDLENGTH && $caret && $dollar && !is_numeric($xword))
834
                continue;
835
            if (!isset($tokens[$xword]))
836
                $tokenlength[$wlen][] = $xword;
837
            if (!$caret || !$dollar) {
838
                $re = $caret.preg_quote($xword, '/').$dollar;
839
                $tokens[$xword][] = array($word, '/'.$re.'/');
840
                if (!isset($tokenwild[$xword]))
841
                    $tokenwild[$xword] = $wlen;
842
            } else {
843
                $tokens[$xword][] = array($word, null);
844
            }
845
        }
846
        asort($tokenwild);
847
        // $tokens = array( base word => array( [ query term , regexp ] ... ) ... )
848
        // $tokenlength = array( base word length => base word ... )
849
        // $tokenwild = array( base word => base word length ... )
850
        $length_filter = empty($tokenwild) ? $tokenlength : min(array_keys($tokenlength));
851
        $indexes_known = $this->indexLengths($length_filter);
852
        if (!empty($tokenwild)) sort($indexes_known);
853
        // get word IDs
854
        $wids = array();
855
        foreach ($indexes_known as $ixlen) {
856
            $word_idx = $this->getIndex('w', $ixlen);
857
            // handle exact search
858
            if (isset($tokenlength[$ixlen])) {
859
                foreach ($tokenlength[$ixlen] as $xword) {
860
                    $wid = array_search($xword, $word_idx, true);
861
                    if ($wid !== false) {
862
                        $wids[$ixlen][] = $wid;
863
                        foreach ($tokens[$xword] as $w)
864
                            $result[$w[0]][] = "$ixlen*$wid";
865
                    }
866
                }
867
            }
868
            // handle wildcard search
869
            foreach ($tokenwild as $xword => $wlen) {
870
                if ($wlen >= $ixlen) break;
871
                foreach ($tokens[$xword] as $w) {
872
                    if (is_null($w[1])) continue;
873
                    foreach(array_keys(preg_grep($w[1], $word_idx)) as $wid) {
874
                        $wids[$ixlen][] = $wid;
875
                        $result[$w[0]][] = "$ixlen*$wid";
876
                    }
877
                }
878
            }
879
        }
880
        return $wids;
881
    }
882
883
    /**
884
     * Return a list of all pages
885
     * Warning: pages may not exist!
886
     *
887
     * @param string    $key    list only pages containing the metadata key (optional)
888
     * @return array            list of page names
889
     *
890
     * @author Tom N Harris <[email protected]>
891
     */
892
    public function getPages($key=null) {
893
        $page_idx = $this->getIndex('page', '');
894
        if (is_null($key)) return $page_idx;
895
896
        $metaname = idx_cleanName($key);
897
898
        // Special handling for titles
899
        if ($key == 'title') {
900
            $title_idx = $this->getIndex('title', '');
901
            array_splice($page_idx, count($title_idx));
902
            foreach ($title_idx as $i => $title)
903
                if ($title === "") unset($page_idx[$i]);
904
            return array_values($page_idx);
905
        }
906
907
        $pages = array();
908
        $lines = $this->getIndex($metaname.'_i', '');
909
        foreach ($lines as $line) {
910
            $pages = array_merge($pages, $this->parseTuples($page_idx, $line));
911
        }
912
        return array_keys($pages);
913
    }
914
915
    /**
916
     * Return a list of words sorted by number of times used
917
     *
918
     * @param int       $min    bottom frequency threshold
919
     * @param int       $max    upper frequency limit. No limit if $max<$min
920
     * @param int       $minlen minimum length of words to count
921
     * @param string    $key    metadata key to list. Uses the fulltext index if not given
922
     * @return array            list of words as the keys and frequency as values
923
     *
924
     * @author Tom N Harris <[email protected]>
925
     */
926
    public function histogram($min=1, $max=0, $minlen=3, $key=null) {
927
        if ($min < 1)
928
            $min = 1;
929
        if ($max < $min)
930
            $max = 0;
931
932
        $result = array();
933
934
        if ($key == 'title') {
935
            $index = $this->getIndex('title', '');
936
            $index = array_count_values($index);
937
            foreach ($index as $val => $cnt) {
938
                if ($cnt >= $min && (!$max || $cnt <= $max) && strlen($val) >= $minlen)
939
                    $result[$val] = $cnt;
940
            }
941
        }
942
        elseif (!is_null($key)) {
943
            $metaname = idx_cleanName($key);
944
            $index = $this->getIndex($metaname.'_i', '');
945
            $val_idx = array();
946
            foreach ($index as $wid => $line) {
947
                $freq = $this->countTuples($line);
948
                if ($freq >= $min && (!$max || $freq <= $max))
949
                    $val_idx[$wid] = $freq;
950
            }
951
            if (!empty($val_idx)) {
952
                $words = $this->getIndex($metaname.'_w', '');
953
                foreach ($val_idx as $wid => $freq) {
954
                    if (strlen($words[$wid]) >= $minlen)
955
                        $result[$words[$wid]] = $freq;
956
                }
957
            }
958
        }
959
        else {
960
            $lengths = idx_listIndexLengths();
961
            foreach ($lengths as $length) {
962
                if ($length < $minlen) continue;
963
                $index = $this->getIndex('i', $length);
964
                $words = null;
965
                foreach ($index as $wid => $line) {
966
                    $freq = $this->countTuples($line);
967
                    if ($freq >= $min && (!$max || $freq <= $max)) {
968
                        if ($words === null)
969
                            $words = $this->getIndex('w', $length);
970
                        $result[$words[$wid]] = $freq;
971
                    }
972
                }
973
            }
974
        }
975
976
        arsort($result);
977
        return $result;
978
    }
979
980
    /**
981
     * Lock the indexer.
982
     *
983
     * @author Tom N Harris <[email protected]>
984
     *
985
     * @return bool|string
986
     */
987
    protected function lock() {
988
        global $conf;
989
        $status = true;
990
        $run = 0;
991
        $lock = $conf['lockdir'].'/_indexer.lock';
992
        while (!@mkdir($lock, $conf['dmode'])) {
993
            usleep(50);
994
            if(is_dir($lock) && time()-@filemtime($lock) > 60*5){
995
                // looks like a stale lock - remove it
996
                if (!@rmdir($lock)) {
997
                    $status = "removing the stale lock failed";
Unused Code introduced by
$status is not used, you could remove the assignment.

This check looks for variable assignments that are either overwritten by other assignments or where the variable is not used subsequently.

$myVar = 'Value';
$higher = false;

if (rand(1, 6) > 3) {
    $higher = true;
} else {
    $higher = false;
}

Both the $myVar assignment in line 1 and the $higher assignment in line 2 are dead: the first because $myVar is never used, the second because $higher is always overwritten on every possible execution path.
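
For the assignment flagged on line 997, the fix is simply to drop it, since the following return discards the value. The sketch below is a hypothetical reduction of the stale-lock branch, not the real lock() method: handleStaleLock() and its $remove callback are invented for illustration, and the retry loop around the branch is left out.

```php
<?php
/**
 * Hypothetical reduction of the stale-lock branch.
 *
 * @param callable $remove returns true if the stale lock could be removed
 * @return bool|string false on failure, a status message on success
 */
function handleStaleLock(callable $remove) {
    if (!$remove()) {
        // no $status assignment here: the return below would discard it
        return false;
    }
    return 'stale lock removed';
}
```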

998
                    return false;
999
                } else {
1000
                    $status = "stale lock removed";
1001
                }
1002
            }elseif($run++ == 1000){
1003
                // give up after about 1000 failed attempts (each usleep(50) pauses only 50 microseconds)
1004
                return false;
1005
            }
1006
        }
1007
        if (!empty($conf['dperm'])) {
1008
            chmod($lock, $conf['dperm']);
1009
        }
1010
        return $status;
1011
    }
1012
1013
    /**
1014
     * Release the indexer lock.
1015
     *
1016
     * @author Tom N Harris <[email protected]>
1017
     *
1018
     * @return bool
1019
     */
1020
    protected function unlock() {
1021
        global $conf;
1022
        @rmdir($conf['lockdir'].'/_indexer.lock');
1023
        return true;
1024
    }
1025
1026
    /**
1027
     * Retrieve the entire index.
1028
     *
1029
     * The $suffix argument is for an index that is split into
1030
     * multiple parts. Different index files should use different
1031
     * base names.
1032
     *
1033
     * @param string    $idx    name of the index
1034
     * @param string    $suffix subpart identifier
1035
     * @return array            list of lines without CR or LF
1036
     *
1037
     * @author Tom N Harris <[email protected]>
1038
     */
1039
    protected function getIndex($idx, $suffix) {
1040
        global $conf;
1041
        $fn = $conf['indexdir'].'/'.$idx.$suffix.'.idx';
1042
        if (!file_exists($fn)) return array();
1043
        return file($fn, FILE_IGNORE_NEW_LINES);
1044
    }
1045
1046
    /**
1047
     * Replace the contents of the index with an array.
1048
     *
1049
     * @param string    $idx    name of the index
1050
     * @param string    $suffix subpart identifier
1051
     * @param array     $lines  list of lines without LF
1052
     * @return bool             If saving succeeded
1053
     *
1054
     * @author Tom N Harris <[email protected]>
1055
     */
1056
    protected function saveIndex($idx, $suffix, &$lines) {
1057
        global $conf;
1058
        $fn = $conf['indexdir'].'/'.$idx.$suffix;
1059
        $fh = @fopen($fn.'.tmp', 'w');
1060
        if (!$fh) return false;
1061
        fwrite($fh, join("\n", $lines));
1062
        if (!empty($lines))
1063
            fwrite($fh, "\n");
1064
        fclose($fh);
1065
        if (isset($conf['fperm']))
1066
            chmod($fn.'.tmp', $conf['fperm']);
1067
        io_rename($fn.'.tmp', $fn.'.idx');
1068
        return true;
1069
    }
1070
1071
    /**
1072
     * Retrieve a line from the index.
1073
     *
1074
     * @param string    $idx    name of the index
1075
     * @param string    $suffix subpart identifier
1076
     * @param int       $id     the line number
1077
     * @return string           a line with trailing whitespace removed
1078
     *
1079
     * @author Tom N Harris <[email protected]>
1080
     */
1081
    protected function getIndexKey($idx, $suffix, $id) {
1082
        global $conf;
1083
        $fn = $conf['indexdir'].'/'.$idx.$suffix.'.idx';
1084
        if (!file_exists($fn)) return '';
1085
        $fh = @fopen($fn, 'r');
1086
        if (!$fh) return '';
1087
        $ln = -1;
1088
        while (($line = fgets($fh)) !== false) {
1089
            if (++$ln == $id) break;
1090
        }
1091
        fclose($fh);
1092
        return rtrim((string)$line);
1093
    }
1094
1095
    /**
1096
     * Write a line into the index.
1097
     *
1098
     * @param string    $idx    name of the index
1099
     * @param string    $suffix subpart identifier
1100
     * @param int       $id     the line number
1101
     * @param string    $line   line to write
1102
     * @return bool             If saving succeeded
1103
     *
1104
     * @author Tom N Harris <[email protected]>
1105
     */
1106
    protected function saveIndexKey($idx, $suffix, $id, $line) {
1107
        global $conf;
1108
        if (substr($line, -1) != "\n")
1109
            $line .= "\n";
1110
        $fn = $conf['indexdir'].'/'.$idx.$suffix;
1111
        $fh = @fopen($fn.'.tmp', 'w');
1112
        if (!$fh) return false;
1113
        $ih = @fopen($fn.'.idx', 'r');
1114
        if ($ih) {
1115
            $ln = -1;
1116
            while (($curline = fgets($ih)) !== false) {
1117
                fwrite($fh, (++$ln == $id) ? $line : $curline);
1118
            }
1119
            if ($id > $ln) {
1120
                while ($id > ++$ln)
1121
                    fwrite($fh, "\n");
1122
                fwrite($fh, $line);
1123
            }
1124
            fclose($ih);
1125
        } else {
1126
            $ln = -1;
1127
            while ($id > ++$ln)
1128
                fwrite($fh, "\n");
1129
            fwrite($fh, $line);
1130
        }
1131
        fclose($fh);
1132
        if (isset($conf['fperm']))
1133
            chmod($fn.'.tmp', $conf['fperm']);
1134
        io_rename($fn.'.tmp', $fn.'.idx');
1135
        return true;
1136
    }
1137
1138
    /**
1139
     * Retrieve or insert a value in the index.
1140
     *
1141
     * @param string    $idx    name of the index
1142
     * @param string    $suffix subpart identifier
1143
     * @param string    $value  line to find in the index
1144
     * @return int|bool          line number of the value in the index or false if writing the index failed
1145
     *
1146
     * @author Tom N Harris <[email protected]>
1147
     */
1148
    protected function addIndexKey($idx, $suffix, $value) {
1149
        $index = $this->getIndex($idx, $suffix);
1150
        $id = array_search($value, $index, true);
1151
        if ($id === false) {
1152
            $id = count($index);
1153
            $index[$id] = $value;
1154
            if (!$this->saveIndex($idx, $suffix, $index)) {
1155
                trigger_error("Failed to write $idx index", E_USER_ERROR);
1156
                return false;
1157
            }
1158
        }
1159
        return $id;
1160
    }
1161
1162
    /**
1163
     * Get the list of lengths indexed in the wiki.
1164
     *
1165
     * Read the index directory or a cache file and returns
1166
     * a sorted array of lengths of the words used in the wiki.
1167
     *
1168
     * @author YoBoY <[email protected]>
1169
     *
1170
     * @return array
1171
     */
1172
    protected function listIndexLengths() {
1173
        return idx_listIndexLengths();
1174
    }
1175
1176
    /**
1177
     * Get the word lengths that have been indexed.
1178
     *
1179
     * Reads the index directory and returns an array of lengths
1180
     * that there are indices for.
1181
     *
1182
     * @author YoBoY <[email protected]>
1183
     *
1184
     * @param array|int $filter
1185
     * @return array
1186
     */
1187
    protected function indexLengths($filter) {
1188
        global $conf;
1189
        $idx = array();
1190
        if (is_array($filter)) {
1191
            // testing if index files exist only
1192
            $path = $conf['indexdir']."/i";
1193
            foreach ($filter as $key => $value) {
1194
                if (file_exists($path.$key.'.idx'))
1195
                    $idx[] = $key;
1196
            }
1197
        } else {
1198
            $lengths = idx_listIndexLengths();
1199
            foreach ($lengths as $key => $length) {
1200
                // keep all lengths greater than or equal to the filter
1201
                if ((int)$length >= (int)$filter)
1202
                    $idx[] = $length;
1203
            }
1204
        }
1205
        return $idx;
1206
    }
1207
1208
    /**
1209
     * Insert or replace a tuple in a line.
1210
     *
1211
     * @author Tom N Harris <[email protected]>
1212
     *
1213
     * @param string $line
1214
     * @param string|int $id
1215
     * @param int    $count
1216
     * @return string
1217
     */
1218
    protected function updateTuple($line, $id, $count) {
1219
        if ($line != ''){
1220
            $line = preg_replace('/(^|:)'.preg_quote($id,'/').'\*\d*/', '', $line);
1221
        }
1222
        $line = trim($line, ':');
1223
        if ($count) {
1224
            if ($line) {
1225
                return "$id*$count:".$line;
1226
            } else {
1227
                return "$id*$count";
1228
            }
1229
        }
1230
        return $line;
1231
    }
1232
1233
    /**
1234
     * Split a line into an array of tuples.
1235
     *
1236
     * @author Tom N Harris <[email protected]>
1237
     * @author Andreas Gohr <[email protected]>
1238
     *
1239
     * @param array $keys
1240
     * @param string $line
1241
     * @return array
1242
     */
1243
    protected function parseTuples(&$keys, $line) {
1244
        $result = array();
1245
        if ($line == '') return $result;
1246
        $parts = explode(':', $line);
1247
        foreach ($parts as $tuple) {
1248
            if ($tuple === '') continue;
1249
            list($key, $cnt) = explode('*', $tuple);
1250
            if (!$cnt) continue;
1251
            $key = $keys[$key];
1252
            if ($key === false || is_null($key)) continue;
1253
            $result[$key] = $cnt;
1254
        }
1255
        return $result;
1256
    }
1257
1258
    /**
1259
     * Sum the counts in a list of tuples.
1260
     *
1261
     * @author Tom N Harris <[email protected]>
1262
     *
1263
     * @param string $line
1264
     * @return int
1265
     */
1266
    protected function countTuples($line) {
1267
        $freq = 0;
1268
        $parts = explode(':', $line);
1269
        foreach ($parts as $tuple) {
1270
            if ($tuple === '') continue;
1271
            list(/* $pid */, $cnt) = explode('*', $tuple);
1272
            $freq += (int)$cnt;
1273
        }
1274
        return $freq;
1275
    }
1276
}
1277
1278
/**
1279
 * Create an instance of the indexer.
1280
 *
1281
 * @return Doku_Indexer    a Doku_Indexer
1282
 *
1283
 * @author Tom N Harris <[email protected]>
1284
 */
1285
function idx_get_indexer() {
1286
    static $Indexer;
1287
    if (!isset($Indexer)) {
1288
        $Indexer = new Doku_Indexer();
1289
    }
1290
    return $Indexer;
1291
}
1292
1293
/**
 * Returns words that will be ignored.
 *
 * @return array                list of stop words
 *
 * @author Tom N Harris <[email protected]>
 */
function & idx_get_stopwords() {
    static $stopwords = null;
    if (is_null($stopwords)) {
        global $conf;
        $swfile = DOKU_INC.'inc/lang/'.$conf['lang'].'/stopwords.txt';
        if(file_exists($swfile)){
            $stopwords = file($swfile, FILE_IGNORE_NEW_LINES);
        }else{
            $stopwords = array();
        }
    }
    return $stopwords;
}

/**
 * Adds/updates the search index for the given page
 *
 * Locking is handled internally.
 *
 * @param string        $page   name of the page to index
 * @param boolean       $verbose    print status messages
 * @param boolean       $force  force reindexing even when the index is up to date
 * @return string|boolean  whether the function completed successfully;
 *                         false if the page was skipped or the index was locked
 *
 * @author Tom N Harris <[email protected]>
 */
function idx_addPage($page, $verbose=false, $force=false) {
    $idxtag = metaFN($page,'.indexed');
    // check if page was deleted but is still in the index
    if (!page_exists($page)) {
        if (!file_exists($idxtag)) {
            if ($verbose) print("Indexer: $page does not exist, ignoring".DOKU_LF);
            return false;
        }
        $Indexer = idx_get_indexer();
        $result = $Indexer->deletePage($page);
        if ($result === "locked") {
            if ($verbose) print("Indexer: locked".DOKU_LF);
            return false;
        }
        @unlink($idxtag);
        return $result;
    }

    // check if indexing needed
    if(!$force && file_exists($idxtag)){
1346
        if(trim(io_readFile($idxtag)) == idx_get_version()){
1347
            $last = @filemtime($idxtag);
1348
            if($last > @filemtime(wikiFN($page))){
1349
                if ($verbose) print("Indexer: index for $page up to date".DOKU_LF);
1350
                return false;
1351
            }
1352
        }
1353
    }
1354
1355
    $indexenabled = p_get_metadata($page, 'internal index', METADATA_RENDER_UNLIMITED);
1356
    if ($indexenabled === false) {
1357
        $result = false;
1358
        if (file_exists($idxtag)) {
1359
            $Indexer = idx_get_indexer();
1360
            $result = $Indexer->deletePage($page);
1361
            if ($result === "locked") {
1362
                if ($verbose) print("Indexer: locked".DOKU_LF);
1363
                return false;
1364
            }
1365
            @unlink($idxtag);
1366
        }
1367
        if ($verbose) print("Indexer: index disabled for $page".DOKU_LF);
1368
        return $result;
1369
    }
1370
1371
    $Indexer = idx_get_indexer();
    $pid = $Indexer->getPID($page);
    if ($pid === false) {
        if ($verbose) print("Indexer: getting the PID failed for $page".DOKU_LF);
        return false;
    }
    $body = '';
    $metadata = array();
    $metadata['title'] = p_get_metadata($page, 'title', METADATA_RENDER_UNLIMITED);
    if (($references = p_get_metadata($page, 'relation references', METADATA_RENDER_UNLIMITED)) !== null)
        $metadata['relation_references'] = array_keys($references);
    else
        $metadata['relation_references'] = array();

    if (($media = p_get_metadata($page, 'relation media', METADATA_RENDER_UNLIMITED)) !== null)
        $metadata['relation_media'] = array_keys($media);
    else
        $metadata['relation_media'] = array();

    $data = compact('page', 'body', 'metadata', 'pid');
    $evt = new Event('INDEXER_PAGE_ADD', $data);
    if ($evt->advise_before()) $data['body'] = $data['body'] . " " . rawWiki($page);
    $evt->advise_after();
    unset($evt);
    extract($data);

    $result = $Indexer->addPageWords($page, $body);
    if ($result === "locked") {
        if ($verbose) print("Indexer: locked".DOKU_LF);
        return false;
    }

    if ($result) {
        $result = $Indexer->addMetaKeys($page, $metadata);
        if ($result === "locked") {
            if ($verbose) print("Indexer: locked".DOKU_LF);
            return false;
        }
    }

    if ($result)
        io_saveFile(metaFN($page,'.indexed'), idx_get_version());
    if ($verbose) {
        print("Indexer: finished".DOKU_LF);
        return true;
    }
    return $result;
}

/**
 * Find tokens in the fulltext index
 *
 * Takes an array of words and will return a list of matching
 * pages for each one.
 *
 * Important: No ACL checking is done here! All results are
 *            returned, regardless of permissions
 *
 * @param array      $words  list of words to search for
 * @return array             list of pages found, associated with the search terms
 */
function idx_lookup(&$words) {
    $Indexer = idx_get_indexer();
    return $Indexer->lookup($words);
}

/**
 * Split a string into tokens
 *
 * @param string $string
 * @param bool $wc
 *
 * @return array
 */
function idx_tokenizer($string, $wc=false) {
    $Indexer = idx_get_indexer();
    return $Indexer->tokenizer($string, $wc);
}

/* For compatibility */

/**
 * Read the list of words in an index (if it exists).
 *
 * @author Tom N Harris <[email protected]>
 *
 * @param string $idx
 * @param string $suffix
 * @return array
 */
function idx_getIndex($idx, $suffix) {
    global $conf;
    $fn = $conf['indexdir'].'/'.$idx.$suffix.'.idx';
    if (!file_exists($fn)) return array();
    return file($fn);
}

/**
 * Get the list of lengths indexed in the wiki.
 *
 * Reads the index directory or a cache file and returns
 * a sorted array of lengths of the words used in the wiki.
 *
 * @author YoBoY <[email protected]>
 *
 * @return array
 */
function idx_listIndexLengths() {
    global $conf;
    // decide whether to use (and refresh) the cache file
    if ($conf['readdircache'] == 0) {
        $docache = false;
    } else {
        clearstatcache();
        if (file_exists($conf['indexdir'].'/lengths.idx')
        && (time() < @filemtime($conf['indexdir'].'/lengths.idx') + $conf['readdircache'])) {
            if (
                ($lengths = @file($conf['indexdir'].'/lengths.idx', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES))
                !== false
            ) {
                $idx = array();
                foreach ($lengths as $length) {
                    $idx[] = (int)$length;
                }
                return $idx;
            }
        }
        $docache = true;
    }

    if ($conf['readdircache'] == 0 || $docache) {
        $dir = @opendir($conf['indexdir']);
        if ($dir === false)
            return array();
        $idx = array();
        while (($f = readdir($dir)) !== false) {
            if (substr($f, 0, 1) == 'i' && substr($f, -4) == '.idx') {
                $i = substr($f, 1, -4);
                if (is_numeric($i))
                    $idx[] = (int)$i;
            }
        }
        closedir($dir);
        sort($idx);
        // save the sorted lengths in the cache file
        if ($docache) {
            $handle = @fopen($conf['indexdir'].'/lengths.idx', 'w');
            @fwrite($handle, implode("\n", $idx));
            @fclose($handle);
        }
        return $idx;
    }

    return array();
}

/**
 * Get the word lengths that have been indexed.
 *
 * Reads the index directory and returns an array of lengths
 * that there are indices for.
 *
 * @author YoBoY <[email protected]>
 *
 * @param array|int $filter  array keyed by word length, or a minimum length
 * @return array
 */
function idx_indexLengths($filter) {
    global $conf;
    $idx = array();
    if (is_array($filter)) {
        // only test whether the index files exist
        $path = $conf['indexdir']."/i";
        foreach ($filter as $key => $value) {
            if (file_exists($path.$key.'.idx'))
                $idx[] = $key;
        }
    } else {
        $lengths = idx_listIndexLengths();
        foreach ($lengths as $key => $length) {
            // keep all lengths greater than or equal to the filter
            if ((int)$length >= (int)$filter)
                $idx[] = $length;
        }
    }
    return $idx;
}

/**
 * Clean a key name for use as a file name.
 *
 * Romanizes non-latin characters, then strips away anything that's
 * not a letter, number, or underscore, and lowercases the result.
 *
 * @author Tom N Harris <[email protected]>
 *
 * @param string $name
 * @return string
 */
function idx_cleanName($name) {
    $name = \dokuwiki\Utf8\Clean::romanize(trim((string)$name));
    $name = preg_replace('#[ \./\\:-]+#', '_', $name);
    $name = preg_replace('/[^A-Za-z0-9_]/', '', $name);
    return strtolower($name);
}

//Setup VIM: ex: et ts=4 :
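
The tuple format handled by `parseTuples()` and `countTuples()` in the listing can be easier to follow with a concrete line of input. The following is a minimal standalone sketch, not part of the file itself: the function names `sketch_parse_tuples` and `sketch_count_tuples` and the sample page list are hypothetical, and the `array_pad` guard against a tuple without `*` is an addition that the original methods do not have.

```php
<?php
// Standalone illustration of the "id*count:id*count" tuple format.
// All names below are hypothetical and do not exist in DokuWiki.

/**
 * Turn a line of colon-separated "id*count" tuples into a key => count map.
 */
function sketch_parse_tuples(array $keys, string $line): array
{
    $result = [];
    foreach (explode(':', $line) as $tuple) {
        if ($tuple === '') continue;                 // skip empty segments ("::")
        // assumption: pad so a tuple without '*' yields a count of 0
        [$id, $cnt] = array_pad(explode('*', $tuple, 2), 2, 0);
        if (!$cnt) continue;                         // zero counts are dropped
        $key = $keys[$id] ?? null;
        if ($key === null || $key === false) continue; // unknown or deleted key
        $result[$key] = (int) $cnt;
    }
    return $result;
}

/**
 * Sum the counts of all tuples in a line.
 */
function sketch_count_tuples(string $line): int
{
    $freq = 0;
    foreach (explode(':', $line) as $tuple) {
        if ($tuple === '') continue;
        [, $cnt] = array_pad(explode('*', $tuple, 2), 2, 0);
        $freq += (int) $cnt;
    }
    return $freq;
}

$pages = ['start', 'wiki:syntax', 'wiki:welcome']; // sample data
$line  = '0*3:2*1::1*0';

var_export(sketch_parse_tuples($pages, $line)); // 'start' => 3, 'wiki:welcome' => 1
echo PHP_EOL, sketch_count_tuples($line), PHP_EOL; // 4
```

The entry `1*0` is skipped by the parser because its count is zero, while the empty segment between the two colons is ignored by both functions.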