inc/fulltext.php (2 issues)

<?php
/**
 * DokuWiki fulltextsearch functions using the index
 *
 * @license    GPL 2 (http://www.gnu.org/licenses/gpl.html)
 * @author     Andreas Gohr <[email protected]>
 */

use dokuwiki\Extension\Event;
use dokuwiki\Utf8\Sort;

/**
 * create snippets for the first few results only
 */
if(!defined('FT_SNIPPET_NUMBER')) define('FT_SNIPPET_NUMBER',15);

/**
 * The fulltext search
 *
 * Returns a list of matching documents for the given query
 *
 * refactored into ft_pageSearch(), _ft_pageSearch() and trigger_event()
 *
 * @param string     $query
 * @param array      $highlight
 * @param string     $sort
 * @param int|string $after  only show results with mtime after this date, accepts timestamp or strtotime arguments
 * @param int|string $before only show results with mtime before this date, accepts timestamp or strtotime arguments
 *
 * @return array
 */
function ft_pageSearch($query,&$highlight, $sort = null, $after = null, $before = null){

    if ($sort === null) {
        $sort = 'hits';
    }
    $data = [
        'query' => $query,
        'sort' => $sort,
        'after' => $after,
        'before' => $before
    ];
    $data['highlight'] =& $highlight;

    return Event::createAndTrigger('SEARCH_QUERY_FULLPAGE', $data, '_ft_pageSearch');
}

/**
 * Returns a list of matching documents for the given query
 *
 * @author Andreas Gohr <[email protected]>
 * @author Kazutaka Miyasaka <[email protected]>
 *
 * @param array $data event data
 * @return array matching documents
 */
function _ft_pageSearch(&$data) {
    $Indexer = idx_get_indexer();

    // parse the given query
    $q = ft_queryParser($Indexer, $data['query']);
    $data['highlight'] = $q['highlight'];

    if (empty($q['parsed_ary'])) return array();

    // lookup all words found in the query
    $lookup = $Indexer->lookup($q['words']);

    // get all pages in this dokuwiki site (!: includes nonexistent pages)
    $pages_all = array();
    foreach ($Indexer->getPages() as $id) {
        $pages_all[$id] = 0; // base: 0 hit
    }

    // process the query
    $stack = array();
    foreach ($q['parsed_ary'] as $token) {
The expression $q['parsed_ary'] of type string|array is not guaranteed to be traversable. Consider adding a type check.

There are several ways to address this:

  1. To be on the safe side, add an explicit type check:

    $collection = json_decode($data, true);
    if ( ! is_array($collection)) {
        throw new \RuntimeException('$collection must be an array.');
    }

    foreach ($collection as $item) { /** ... */ }

  2. If you are sure that the expression is traversable, add a doc comment cast to improve IDE auto-completion and static analysis:

    /** @var array $collection */
    $collection = json_decode($data, true);

    foreach ($collection as $item) { /** ... */ }

  3. Mark the issue as a false positive if the check does not apply here.

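Applied to the loop flagged here, the first option might look like the following minimal sketch. The helper `example_tokens_or_empty()` is hypothetical and not part of DokuWiki, and the sample token list is invented for illustration. The sketch only assumes, as the analysis reports, that `parsed_ary` may be either a string or an array.

```php
<?php
// Hypothetical helper (not part of DokuWiki): normalizes a parser result
// so that the caller can always iterate over it.
function example_tokens_or_empty($parsed)
{
    // Anything that is not an array is treated as "no tokens".
    return is_array($parsed) ? $parsed : [];
}

// Invented sample data in the shape of the parsed token list.
$q = ['parsed_ary' => ['W+:wiki', 'W+:search', 'AND']];

foreach (example_tokens_or_empty($q['parsed_ary']) as $token) {
    // The first three characters carry the token type.
    echo substr($token, 0, 3), "\n";
}
```

Returning an empty array rather than throwing keeps the "no results" path identical to the existing early return on an empty query; whether that is preferable to an exception depends on how strictly callers should be checked.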
        switch (substr($token, 0, 3)) {
            case 'W+:':
            case 'W-:':
            case 'W_:': // word
                $word    = substr($token, 3);
                if(isset($lookup[$word])) {
                    $stack[] = (array)$lookup[$word];
                }
                break;
            case 'P+:':
            case 'P-:': // phrase
                $phrase = substr($token, 3);
                // since phrases are always parsed as ((W1)(W2)...(P)),
                // the end($stack) always points the pages that contain
                // all words in this phrase
                $pages  = end($stack);
                $pages_matched = array();
                foreach(array_keys($pages) as $id){
                    $evdata = array(
                        'id' => $id,
                        'phrase' => $phrase,
                        'text' => rawWiki($id)
                    );
                    $evt = new Event('FULLTEXT_PHRASE_MATCH',$evdata);
                    if ($evt->advise_before() && $evt->result !== true) {
                        $text = \dokuwiki\Utf8\PhpString::strtolower($evdata['text']);
                        if (strpos($text, $phrase) !== false) {
                            $evt->result = true;
                        }
                    }
                    $evt->advise_after();
                    if ($evt->result === true) {
                        $pages_matched[$id] = 0; // phrase: always 0 hit
                    }
                }
                $stack[] = $pages_matched;
                break;
            case 'N+:':
            case 'N-:': // namespace
                $ns = cleanID(substr($token, 3)) . ':';
                $pages_matched = array();
                foreach (array_keys($pages_all) as $id) {
                    if (strpos($id, $ns) === 0) {
                        $pages_matched[$id] = 0; // namespace: always 0 hit
                    }
                }
                $stack[] = $pages_matched;
                break;
            case 'AND': // and operation
                list($pages1, $pages2) = array_splice($stack, -2);
                $stack[] = ft_resultCombine(array($pages1, $pages2));
                break;
            case 'OR':  // or operation
                list($pages1, $pages2) = array_splice($stack, -2);
                $stack[] = ft_resultUnite(array($pages1, $pages2));
                break;
            case 'NOT': // not operation (unary)
                $pages   = array_pop($stack);
                $stack[] = ft_resultComplement(array($pages_all, $pages));
                break;
        }
    }
    $docs = array_pop($stack);

    if (empty($docs)) return array();

    // check: settings, acls, existence
    foreach (array_keys($docs) as $id) {
        if (isHiddenPage($id) || auth_quickaclcheck($id) < AUTH_READ || !page_exists($id, '', false)) {
            unset($docs[$id]);
        }
    }

    $docs = _ft_filterResultsByTime($docs, $data['after'], $data['before']);

    if ($data['sort'] === 'mtime') {
        uksort($docs, 'ft_pagemtimesorter');
    } else {
        // sort docs by count
        uksort($docs, 'ft_pagesorter');
        arsort($docs);
    }

    return $docs;
}

/**
 * Returns the backlinks for a given page
 *
 * Uses the metadata index.
 *
 * @param string $id           The id for which links shall be returned
 * @param bool   $ignore_perms Ignore the fact that pages are hidden or read-protected
 * @return array The pages that contain links to the given page
 */
function ft_backlinks($id, $ignore_perms = false){
    $result = idx_get_indexer()->lookupKey('relation_references', $id);

    if(!count($result)) return $result;

    // check ACL permissions
    foreach(array_keys($result) as $idx){
        if(($ignore_perms !== true && (
                isHiddenPage($result[$idx]) || auth_quickaclcheck($result[$idx]) < AUTH_READ
            )) || !page_exists($result[$idx], '', false)){
            unset($result[$idx]);
        }
    }

    Sort::sort($result);
    return $result;
}

/**
 * Returns the pages that use a given media file
 *
 * Uses the relation media metadata property and the metadata index.
 *
 * Note that before 2013-07-31 the second parameter was the maximum number of results and
 * permissions were ignored. That's why the parameter is now checked to be explicitly set
 * to true (with type bool) in order to be compatible with older uses of the function.
 *
 * @param string $id           The media id to look for
 * @param bool   $ignore_perms Ignore hidden pages and acls (optional, default: false)
 * @return array A list of pages that use the given media file
 */
function ft_mediause($id, $ignore_perms = false){
    $result = idx_get_indexer()->lookupKey('relation_media', $id);

    if(!count($result)) return $result;

    // check ACL permissions
    foreach(array_keys($result) as $idx){
        if(($ignore_perms !== true && (
                    isHiddenPage($result[$idx]) || auth_quickaclcheck($result[$idx]) < AUTH_READ
                )) || !page_exists($result[$idx], '', false)){
            unset($result[$idx]);
        }
    }

    Sort::sort($result);
    return $result;
}


/**
 * Quicksearch for pagenames
 *
 * By default it only matches the pagename and ignores the
 * namespace. This can be changed with the second parameter.
 * The third parameter allows searching in titles as well.
 *
 * The function always returns titles as well
 *
 * @triggers SEARCH_QUERY_PAGELOOKUP
 * @author   Andreas Gohr <[email protected]>
 * @author   Adrian Lang <[email protected]>
 *
 * @param string     $id       page id
 * @param bool       $in_ns    match against namespace as well?
 * @param bool       $in_title search in title?
 * @param int|string $after    only show results with mtime after this date, accepts timestamp or strtotime arguments
 * @param int|string $before   only show results with mtime before this date, accepts timestamp or strtotime arguments
 *
 * @return string[]
 */
function ft_pageLookup($id, $in_ns=false, $in_title=false, $after = null, $before = null){
    $data = [
        'id' => $id,
        'in_ns' => $in_ns,
        'in_title' => $in_title,
        'after' => $after,
        'before' => $before
    ];
    $data['has_titles'] = true; // for plugin backward compatibility check
    return Event::createAndTrigger('SEARCH_QUERY_PAGELOOKUP', $data, '_ft_pageLookup');
}

/**
 * Returns list of pages as array(pageid => First Heading)
 *
 * @param array &$data event data
 * @return string[]
 */
function _ft_pageLookup(&$data){
    // split out original parameters
    $id = $data['id'];
    $Indexer = idx_get_indexer();
    $parsedQuery = ft_queryParser($Indexer, $id);
    if (count($parsedQuery['ns']) > 0) {
        $ns = cleanID($parsedQuery['ns'][0]) . ':';
        $id = implode(' ', $parsedQuery['highlight']);
    }
    if (count($parsedQuery['notns']) > 0) {
        $notns = cleanID($parsedQuery['notns'][0]) . ':';
        $id = implode(' ', $parsedQuery['highlight']);
    }

    $in_ns    = $data['in_ns'];
    $in_title = $data['in_title'];
    $cleaned = cleanID($id);

    $Indexer = idx_get_indexer();
    $page_idx = $Indexer->getPages();

    $pages = array();
    if ($id !== '' && $cleaned !== '') {
        foreach ($page_idx as $p_id) {
            if ((strpos($in_ns ? $p_id : noNSorNS($p_id), $cleaned) !== false)) {
                if (!isset($pages[$p_id]))
                    $pages[$p_id] = p_get_first_heading($p_id, METADATA_DONT_RENDER);
            }
        }
        if ($in_title) {
            foreach ($Indexer->lookupKey('title', $id, '_ft_pageLookupTitleCompare') as $p_id) {
                if (!isset($pages[$p_id]))
                    $pages[$p_id] = p_get_first_heading($p_id, METADATA_DONT_RENDER);
            }
        }
    }

    if (isset($ns)) {
        foreach (array_keys($pages) as $p_id) {
            if (strpos($p_id, $ns) !== 0) {
                unset($pages[$p_id]);
            }
        }
    }
    if (isset($notns)) {
        foreach (array_keys($pages) as $p_id) {
            if (strpos($p_id, $notns) === 0) {
                unset($pages[$p_id]);
            }
        }
    }

    // discard hidden pages
    // discard nonexistent pages
    // check ACL permissions
    foreach(array_keys($pages) as $idx){
        if(!isVisiblePage($idx) || !page_exists($idx) ||
           auth_quickaclcheck($idx) < AUTH_READ) {
            unset($pages[$idx]);
        }
    }

    $pages = _ft_filterResultsByTime($pages, $data['after'], $data['before']);

    uksort($pages,'ft_pagesorter');
    return $pages;
}


/**
 * @param array      $results search results in the form pageid => value
 * @param int|string $after   only returns results with mtime after this date, accepts timestamp or strtotime arguments
 * @param int|string $before  only returns results with mtime before this date, accepts timestamp or strtotime arguments
 *
 * @return array
 */
function _ft_filterResultsByTime(array $results, $after, $before) {
    if ($after || $before) {
        $after = is_int($after) ? $after : strtotime($after);
        $before = is_int($before) ? $before : strtotime($before);

        foreach ($results as $id => $value) {
            $mTime = filemtime(wikiFN($id));
            if ($after && $after > $mTime) {
                unset($results[$id]);
                continue;
            }
            if ($before && $before < $mTime) {
                unset($results[$id]);
            }
        }
    }

    return $results;
}

/**
 * Tiny helper function for comparing the searched title with the title
 * from the search index. This function is a wrapper around stripos with
 * adapted argument order and return value.
 *
 * @param string $search searched title
 * @param string $title  title from index
 * @return bool
 */
function _ft_pageLookupTitleCompare($search, $title) {
    return stripos($title, $search) !== false;
}

/**
 * Sort pages based on their namespace level first, then on their string
 * values. This makes higher hierarchy pages rank higher than lower hierarchy
 * pages.
 *
 * @param string $a
 * @param string $b
 * @return int Returns < 0 if $a is less than $b; > 0 if $a is greater than $b, and 0 if they are equal.
 */
function ft_pagesorter($a, $b){
    $ac = count(explode(':',$a));
    $bc = count(explode(':',$b));
    if($ac < $bc){
        return -1;
    }elseif($ac > $bc){
        return 1;
    }
    return Sort::strcmp($a,$b);
}

/**
 * Sort pages by their mtime, from newest to oldest
 *
 * @param string $a
 * @param string $b
 *
 * @return int Returns < 0 if $a is newer than $b, > 0 if $b is newer than $a and 0 if they are of the same age
 */
function ft_pagemtimesorter($a, $b) {
    $mtimeA = filemtime(wikiFN($a));
    $mtimeB = filemtime(wikiFN($b));
    return $mtimeB - $mtimeA;
}

/**
 * Creates a snippet extract
 *
 * @author Andreas Gohr <[email protected]>
 * @triggers FULLTEXT_SNIPPET_CREATE
 *
 * @param string $id page id
 * @param array $highlight
 * @return mixed
 */
function ft_snippet($id,$highlight){
    $text = rawWiki($id);
    $text = str_replace("\xC2\xAD",'',$text); // remove soft-hyphens
    $evdata = array(
            'id'        => $id,
            'text'      => &$text,
            'highlight' => &$highlight,
            'snippet'   => '',
            );

    $evt = new Event('FULLTEXT_SNIPPET_CREATE',$evdata);
    if ($evt->advise_before()) {
        $match = array();
        $snippets = array();
        $utf8_offset = $offset = $end = 0;
        $len = \dokuwiki\Utf8\PhpString::strlen($text);

        // build a regexp from the phrases to highlight
        $re1 = '(' .
            join(
                '|',
                array_map(
                    'ft_snippet_re_preprocess',
                    array_map(
                        'preg_quote_cb',
                        array_filter((array) $highlight)
                    )
                )
            ) .
            ')';
        $re2 = "$re1.{0,75}(?!\\1)$re1";
        $re3 = "$re1.{0,45}(?!\\1)$re1.{0,45}(?!\\1)(?!\\2)$re1";

        for ($cnt=4; $cnt--;) {
            if (0) {
            } else if (preg_match('/'.$re3.'/iu',$text,$match,PREG_OFFSET_CAPTURE,$offset)) {
            } else if (preg_match('/'.$re2.'/iu',$text,$match,PREG_OFFSET_CAPTURE,$offset)) {
            } else if (preg_match('/'.$re1.'/iu',$text,$match,PREG_OFFSET_CAPTURE,$offset)) {
            } else {
                break;
            }

            list($str,$idx) = $match[0];

            // convert $idx (a byte offset) into a utf8 character offset
            $utf8_idx = \dokuwiki\Utf8\PhpString::strlen(substr($text,0,$idx));
            $utf8_len = \dokuwiki\Utf8\PhpString::strlen($str);

            // establish context, 100 bytes surrounding the match string
            // first look to see if we can go 100 either side,
            // then drop to 50 adding any excess if the other side can't go to 50,
            $pre = min($utf8_idx-$utf8_offset,100);
            $post = min($len-$utf8_idx-$utf8_len,100);

            if ($pre>50 && $post>50) {
                $pre = $post = 50;
            } else if ($pre>50) {
                $pre = min($pre,100-$post);
            } else if ($post>50) {
                $post = min($post, 100-$pre);
            } else if ($offset == 0) {
                // both are less than 50, means the context is the whole string
                // make it so and break out of this loop - there is no need for the
                // complex snippet calculations
                $snippets = array($text);
                break;
            }

            // establish context start and end points, try to append to previous
            // context if possible
            $start = $utf8_idx - $pre;
            $append = ($start < $end) ? $end : false;  // still the end of the previous context snippet
            $end = $utf8_idx + $utf8_len + $post;      // now set it to the end of this context

            if ($append) {
Bug / Best Practice
The expression $append of type integer|false is loosely compared to true; this is ambiguous if the integer can be zero. Because the non-integer value here is false, consider comparing explicitly with !== false.

In PHP, under loose comparison (such as ==, !=, or switch conditions), values of different types may compare as equal.

For integer values, zero is a special case; in particular, the following results may be unexpected:

0   == false // true
0   == null  // true
123 == false // false
123 == null  // false

// It is often better to use strict comparison
0 === false // false
0 === null  // false

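The difference between the loose and the strict check can be seen in the following minimal sketch. The function `example_append_point()` is a hypothetical stand-in for the `$append` computation, not DokuWiki code, and the offsets are invented purely to force the zero case; it is an assumption of this sketch that a zero offset can occur at all.

```php
<?php
// Hypothetical stand-in for the `$append` computation in ft_snippet():
// returns the end of the previous context when the new context starts
// before it, and false otherwise.
function example_append_point(int $start, int $end)
{
    return ($start < $end) ? $end : false;
}

// Invented values: a previous context that ended at character offset 0.
$append = example_append_point(-5, 0);

// Loose check: 0 is falsy, so `if ($append)` would skip the branch.
var_dump((bool) $append);      // bool(false)

// Strict check: 0 is an integer, not false, so the overlap is detected.
var_dump($append !== false);   // bool(true)
```

Comparing against `false` rather than `null` matters here: the expression never yields `null`, so `!== null` would be true on every iteration.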
                $snippets[count($snippets)-1] .= \dokuwiki\Utf8\PhpString::substr($text,$append,$end-$append);
            } else {
                $snippets[] = \dokuwiki\Utf8\PhpString::substr($text,$start,$end-$start);
            }

            // set $offset for next match attempt
            // continue matching after the current match
            // if the current match is not the longest possible match starting at the current offset
            // this prevents further matching of this snippet but for possible matches of length
            // smaller than match length + context (at least 50 characters) this match is part of the context
            $utf8_offset = $utf8_idx + $utf8_len;
            $offset = $idx + strlen(\dokuwiki\Utf8\PhpString::substr($text,$utf8_idx,$utf8_len));
            $offset = \dokuwiki\Utf8\Clean::correctIdx($text,$offset);
        }

        $m = "\1";
        $snippets = preg_replace('/'.$re1.'/iu',$m.'$1'.$m,$snippets);
        $snippet = preg_replace(
            '/' . $m . '([^' . $m . ']*?)' . $m . '/iu',
            '<strong class="search_hit">$1</strong>',
            hsc(join('... ', $snippets))
        );

        $evdata['snippet'] = $snippet;
    }
    $evt->advise_after();
    unset($evt);

    return $evdata['snippet'];
}

/**
 * Wraps a search term in regex boundary checks.
 *
 * @param string $term
 * @return string
 */
function ft_snippet_re_preprocess($term) {
    // do not process asian terms where word boundaries are not explicit
    if(\dokuwiki\Utf8\Asian::isAsianWords($term)) return $term;

    if (UTF8_PROPERTYSUPPORT) {
        // unicode word boundaries
        // see http://stackoverflow.com/a/2449017/172068
        $BL = '(?<!\pL)';
        $BR = '(?!\pL)';
    } else {
        // not as correct as above, but at least won't break
        $BL = '\b';
        $BR = '\b';
    }

    if(substr($term,0,2) == '\\*'){
        $term = substr($term,2);
    }else{
        $term = $BL.$term;
    }

    if(substr($term,-2,2) == '\\*'){
        $term = substr($term,0,-2);
    }else{
        $term = $term.$BR;
    }

    if($term == $BL || $term == $BR || $term == $BL.$BR) $term = '';
    return $term;
}

/**
 * Combine found documents and sum up their scores
 *
 * This function is used to combine searched words with a logical
 * AND. Only documents available in all arrays are returned.
 *
 * based upon PEAR's PHP_Compat function for array_intersect_key()
 *
 * @param array $args An array of page arrays
 * @return array
 */
function ft_resultCombine($args){
    $array_count = count($args);
    if($array_count == 1){
        return $args[0];
    }

    $result = array();
    if ($array_count > 1) {
        foreach ($args[0] as $key => $value) {
            $result[$key] = $value;
            for ($i = 1; $i !== $array_count; $i++) {
                if (!isset($args[$i][$key])) {
                    unset($result[$key]);
                    break;
                }
                $result[$key] += $args[$i][$key];
            }
        }
    }
    return $result;
}

/**
 * Unite found documents and sum up their scores
 *
 * based upon ft_resultCombine() function
 *
 * @param array $args An array of page arrays
 * @return array
 *
 * @author Kazutaka Miyasaka <[email protected]>
 */
function ft_resultUnite($args) {
    $array_count = count($args);
    if ($array_count === 1) {
        return $args[0];
    }

    $result = $args[0];
    for ($i = 1; $i !== $array_count; $i++) {
        foreach (array_keys($args[$i]) as $id) {
            // pages missing from the running result start at 0;
            // this avoids an undefined-key warning without changing the sum
            $result[$id] = ($result[$id] ?? 0) + $args[$i][$id];
        }
    }
    return $result;
}

/**
 * Computes the difference of documents using page id for comparison
 *
 * nearly identical to PHP5's array_diff_key()
 *
 * @param array $args An array of page arrays
 * @return array
 *
 * @author Kazutaka Miyasaka <[email protected]>
 */
function ft_resultComplement($args) {
    $array_count = count($args);
    if ($array_count === 1) {
        return $args[0];
    }

    $result = $args[0];
    foreach (array_keys($result) as $id) {
        for ($i = 1; $i !== $array_count; $i++) {
            if (isset($args[$i][$id])) unset($result[$id]);
        }
    }
    return $result;
}
640
641
/**
642
 * Parses a search query and builds an array of search formulas
643
 *
644
 * @author Andreas Gohr <[email protected]>
645
 * @author Kazutaka Miyasaka <[email protected]>
646
 *
647
 * @param dokuwiki\Search\Indexer $Indexer
648
 * @param string                  $query search query
649
 * @return array of search formulas
650
 */
651
function ft_queryParser($Indexer, $query){
652
    /**
653
     * parse a search query and transform it into intermediate representation
654
     *
655
     * in a search query, you can use the following expressions:
656
     *
657
     *   words:
658
     *     include
659
     *     -exclude
660
     *   phrases:
661
     *     "phrase to be included"
662
     *     -"phrase you want to exclude"
663
     *   namespaces:
664
     *     @include:namespace (or ns:include:namespace)
665
     *     ^exclude:namespace (or -ns:exclude:namespace)
666
     *   groups:
667
     *     ()
668
     *     -()
669
     *   operators:
670
     *     and ('and' is the default operator: you can always omit this)
671
     *     or  (or pipe symbol '|', lower precedence than 'and')
672
     *
673
     * e.g. a query [ aa "bb cc" @dd:ee ] means "search pages which contain
674
     *      a word 'aa', a phrase 'bb cc' and are within a namespace 'dd:ee'".
675
     *      this query is equivalent to [ -(-aa or -"bb cc" or -ns:dd:ee) ]
676
     *      as long as you don't mind hit counts.
677
     *
678
     * intermediate representation consists of the following parts:
679
     *
680
     *   ( )           - group
681
     *   AND           - logical and
682
     *   OR            - logical or
683
     *   NOT           - logical not
684
     *   W+:, W-:, W_: - word      (underscore: no need to highlight)
685
     *   P+:, P-:      - phrase    (minus sign: logically in NOT group)
686
     *   N+:, N-:      - namespace
687
     */
688
    $parsed_query = '';
689
    $parens_level = 0;
690
    $terms = preg_split('/(-?".*?")/u', \dokuwiki\Utf8\PhpString::strtolower($query),
691
        -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
692
693
    foreach ($terms as $term) {
694
        $parsed = '';
695
        if (preg_match('/^(-?)"(.+)"$/u', $term, $matches)) {
696
            // phrase-include and phrase-exclude
697
            $not = $matches[1] ? 'NOT' : '';
698
            $parsed = $not.ft_termParser($Indexer, $matches[2], false, true);
699
        } else {
700
            // fix incomplete phrase
701
            $term = str_replace('"', ' ', $term);
702
703
            // fix parentheses
704
            $term = str_replace(')'  , ' ) ', $term);
705
            $term = str_replace('('  , ' ( ', $term);
706
            $term = str_replace('- (', ' -(', $term);
707
708
            // treat pipe symbols as 'OR' operators
709
            $term = str_replace('|', ' or ', $term);
710
711
            // treat ideographic spaces (U+3000) as search term separators
712
            // FIXME: some more separators?
713
            $term = preg_replace('/[ \x{3000}]+/u', ' ',  $term);
714
            $term = trim($term);
715
            if ($term === '') continue;
716
717
            $tokens = explode(' ', $term);
            foreach ($tokens as $token) {
                if ($token === '(') {
                    // parenthesis-include-open
                    $parsed .= '(';
                    ++$parens_level;
                } elseif ($token === '-(') {
                    // parenthesis-exclude-open
                    $parsed .= 'NOT(';
                    ++$parens_level;
                } elseif ($token === ')') {
                    // parenthesis-any-close
                    if ($parens_level === 0) continue;
                    $parsed .= ')';
                    $parens_level--;
                } elseif ($token === 'and') {
                    // logical-and (do nothing)
                } elseif ($token === 'or') {
                    // logical-or
                    $parsed .= 'OR';
                } elseif (preg_match('/^(?:\^|-ns:)(.+)$/u', $token, $matches)) {
                    // namespace-exclude
                    $parsed .= 'NOT(N+:'.$matches[1].')';
                } elseif (preg_match('/^(?:@|ns:)(.+)$/u', $token, $matches)) {
                    // namespace-include
                    $parsed .= '(N+:'.$matches[1].')';
                } elseif (preg_match('/^-(.+)$/', $token, $matches)) {
                    // word-exclude
                    $parsed .= 'NOT('.ft_termParser($Indexer, $matches[1]).')';
                } else {
                    // word-include
                    $parsed .= ft_termParser($Indexer, $token);
                }
            }
        }
        $parsed_query .= $parsed;
    }

    // cleanup (very sensitive)
    $parsed_query .= str_repeat(')', $parens_level);
    do {
        $parsed_query_old = $parsed_query;
        $parsed_query = preg_replace('/(NOT)?\(\)/u', '', $parsed_query);
    } while ($parsed_query !== $parsed_query_old);
    $parsed_query = preg_replace('/(NOT|OR)+\)/u', ')'      , $parsed_query);
    $parsed_query = preg_replace('/(OR)+/u'      , 'OR'     , $parsed_query);
    $parsed_query = preg_replace('/\(OR/u'       , '('      , $parsed_query);
    $parsed_query = preg_replace('/^OR|OR$/u'    , ''       , $parsed_query);
    $parsed_query = preg_replace('/\)(NOT)?\(/u' , ')AND$1(', $parsed_query);

    // adjustment: make highlighting correct
    $parens_level     = 0;
    $notgrp_levels    = array();
    $parsed_query_new = '';
    $tokens = preg_split('/(NOT\(|[()])/u', $parsed_query, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        if ($token === 'NOT(') {
            $notgrp_levels[] = ++$parens_level;
        } elseif ($token === '(') {
            ++$parens_level;
        } elseif ($token === ')') {
            if ($parens_level-- === end($notgrp_levels)) array_pop($notgrp_levels);
        } elseif (count($notgrp_levels) % 2 === 1) {
            // turn the highlight flag off if terms are logically inside a "NOT" group
            // (an odd number of enclosing NOT groups means the term is negated)
            $token = preg_replace('/([WPN])\+\:/u', '$1-:', $token);
        }
        $parsed_query_new .= $token;
    }
    $parsed_query = $parsed_query_new;

    /**
     * convert the infix notation string into a postfix (Reverse Polish notation) array
     * using the shunting-yard algorithm
     *
     * see: http://en.wikipedia.org/wiki/Reverse_Polish_notation
     * see: http://en.wikipedia.org/wiki/Shunting-yard_algorithm
     */
    $parsed_ary     = array();
    $ope_stack      = array();
    $ope_precedence = array(')' => 1, 'OR' => 2, 'AND' => 3, 'NOT' => 4, '(' => 5);
    $ope_regex      = '/([()]|OR|AND|NOT)/u';

    $tokens = preg_split($ope_regex, $parsed_query, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        if (preg_match($ope_regex, $token)) {
            // operator: pop operators of equal or higher precedence, stopping at '('
            $last_ope = end($ope_stack);
            while ($last_ope !== false && $ope_precedence[$token] <= $ope_precedence[$last_ope] && $last_ope != '(') {
                $parsed_ary[] = array_pop($ope_stack);
                $last_ope = end($ope_stack);
            }
            if ($token == ')') {
                array_pop($ope_stack); // this array_pop always deletes '('
            } else {
                $ope_stack[] = $token;
            }
        } else {
            // operand: restore parentheses that ft_termParser() encoded as OP/CP
            $token_decoded = str_replace(array('OP', 'CP'), array('(', ')'), $token);
            $parsed_ary[] = $token_decoded;
        }
    }
    $parsed_ary = array_values(array_merge($parsed_ary, array_reverse($ope_stack)));

    // cleanup: each double "NOT" in RPN array actually does nothing
    $parsed_ary_count = count($parsed_ary);
    for ($i = 1; $i < $parsed_ary_count; ++$i) {
        // isset() skips an entry already removed by the previous iteration
        if (isset($parsed_ary[$i - 1]) && $parsed_ary[$i] === 'NOT' && $parsed_ary[$i - 1] === 'NOT') {
            unset($parsed_ary[$i], $parsed_ary[$i - 1]);
        }
    }
    $parsed_ary = array_values($parsed_ary);

    // build return value
    $q = array();
    $q['query']      = $query;
    $q['parsed_str'] = $parsed_query;
    $q['parsed_ary'] = $parsed_ary;

    foreach ($q['parsed_ary'] as $token) {
        // operators such as 'OR' are shorter than three characters; skip them
        // before reading offset 2
        if (strlen($token) < 3 || $token[2] !== ':') continue;
        $body = substr($token, 3);

        switch (substr($token, 0, 3)) {
            case 'N+:':
                     $q['ns'][]        = $body; // for backward compatibility
                     break;
            case 'N-:':
                     $q['notns'][]     = $body; // for backward compatibility
                     break;
            case 'W_:':
                     $q['words'][]     = $body;
                     break;
            case 'W-:':
                     $q['words'][]     = $body;
                     $q['not'][]       = $body; // for backward compatibility
                     break;
            case 'W+:':
                     $q['words'][]     = $body;
                     $q['highlight'][] = $body;
                     $q['and'][]       = $body; // for backward compatibility
                     break;
            case 'P-:':
                     $q['phrases'][]   = $body;
                     break;
            case 'P+:':
                     $q['phrases'][]   = $body;
                     $q['highlight'][] = $body;
                     break;
        }
    }
    foreach (array('words', 'phrases', 'highlight', 'ns', 'notns', 'and', 'not') as $key) {
        $q[$key] = empty($q[$key]) ? array() : array_values(array_unique($q[$key]));
    }

    return $q;
}

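// Illustrative sketch only, not part of the original source: the infix-to-RPN
// step of ft_queryParser() isolated into a hypothetical helper so it can be
// tried on its own. The function name is an assumption; the precedence table
// and operator regex are copied from ft_queryParser(). It assumes $infix is
// already normalized (explicit AND, balanced parentheses) and skips the OP/CP
// decoding of operands.
function ft_example_infixToRpn($infix) {
    $output     = array();
    $stack      = array();
    $precedence = array(')' => 1, 'OR' => 2, 'AND' => 3, 'NOT' => 4, '(' => 5);
    $regex      = '/([()]|OR|AND|NOT)/u';

    $tokens = preg_split($regex, $infix, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        if (!preg_match($regex, $token)) {
            $output[] = $token; // operand goes straight to the output
            continue;
        }
        // pop operators of equal or higher precedence, stopping at '('
        $top = end($stack);
        while ($top !== false && $top !== '(' && $precedence[$token] <= $precedence[$top]) {
            $output[] = array_pop($stack);
            $top = end($stack);
        }
        if ($token === ')') {
            array_pop($stack); // discard the matching '('
        } else {
            $stack[] = $token;
        }
    }
    // flush the remaining operators, most recently pushed first
    return array_merge($output, array_reverse($stack));
}
// Example with made-up terms:
//   ft_example_infixToRpn('(W+:foo)ANDNOT((W-:bar))')
//   returns array('W+:foo', 'W-:bar', 'NOT', 'AND')
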
/**
 * Transforms given search term into intermediate representation
 *
 * This function is used in ft_queryParser() and not for general purpose use.
 *
 * @author Kazutaka Miyasaka <[email protected]>
 *
 * @param dokuwiki\Search\Indexer $Indexer
 * @param string                  $term
 * @param bool                    $consider_asian
 * @param bool                    $phrase_mode
 * @return string
 */
function ft_termParser($Indexer, $term, $consider_asian = true, $phrase_mode = false) {
    $parsed = '';
    if ($consider_asian) {
        // successive asian characters need to be searched as a phrase
        $words = \dokuwiki\Utf8\Asian::splitAsianWords($term);
        foreach ($words as $word) {
            $phrase_mode = $phrase_mode ? true : \dokuwiki\Utf8\Asian::isAsianWords($word);
            $parsed .= ft_termParser($Indexer, $word, false, $phrase_mode);
        }
    } else {
        $term_noparen = str_replace(array('(', ')'), ' ', $term);
        $words = $Indexer->tokenizer($term_noparen, true);

        // W_: no need to highlight
        if (empty($words)) {
            $parsed = '()'; // important: do not remove
        } elseif ($words[0] === $term) {
            $parsed = '(W+:'.$words[0].')';
        } elseif ($phrase_mode) {
            $term_encoded = str_replace(array('(', ')'), array('OP', 'CP'), $term);
            $parsed = '((W_:'.implode(')(W_:', $words).')(P+:'.$term_encoded.'))';
        } else {
            $parsed = '((W+:'.implode(')(W+:', $words).'))';
        }
    }
    return $parsed;
}

/**
 * Recreate a search query string based on parsed parts.
 *
 * Negated phrases and `OR` searches are not supported.
 *
 * @param array $and
 * @param array $not
 * @param array $phrases
 * @param array $ns
 * @param array $notns
 *
 * @return string
 */
function ft_queryUnparser_simple(array $and, array $not, array $phrases, array $ns, array $notns) {
    $query = implode(' ', $and);
    if (!empty($not)) {
        $query .= ' -' . implode(' -', $not);
    }

    if (!empty($phrases)) {
        $query .= ' "' . implode('" "', $phrases) . '"';
    }

    if (!empty($ns)) {
        $query .= ' @' . implode(' @', $ns);
    }

    if (!empty($notns)) {
        $query .= ' ^' . implode(' ^', $notns);
    }

    return $query;
}

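// Usage sketch for ft_queryUnparser_simple() (illustrative values, not taken
// from the original source). The parts are concatenated in a fixed order:
// plain words, excluded words, phrases, namespaces, excluded namespaces.
//
//   $query = ft_queryUnparser_simple(
//       array('wiki', 'syntax'), // $and
//       array('draft'),          // $not
//       array('full text'),      // $phrases
//       array('docs'),           // $ns
//       array('archive')         // $notns
//   );
//   // $query === 'wiki syntax -draft "full text" @docs ^archive'
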
//Setup VIM: ex: et ts=4 :