Failed Conditions
Push — master ( cbaf27...ca549e )
by Andreas
08:53 queued 04:43
created

fulltext.php ➔ _ft_pageLookup()   D

Complexity

Conditions 18
Paths 48

Size

Total Lines 56

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric Value
cc 18
nc 48
nop 1
dl 0
loc 56
rs 4.8666
c 0
b 0
f 0

How to fix

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
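The advice above can be illustrated with a minimal sketch. It is not taken from fulltext.php: all function names below (`visiblePagesBefore`, `dropHiddenPages`, `sortByPageId`, `visiblePagesAfter`) are hypothetical and exist only to show how a commented step can become a named function.

```php
<?php
// Before: one function whose comments mark two separate steps.
function visiblePagesBefore(array $pages, array $hidden): array
{
    // drop hidden pages
    $result = [];
    foreach ($pages as $id => $title) {
        if (!in_array($id, $hidden, true)) {
            $result[$id] = $title;
        }
    }
    // sort by page id
    ksort($result);
    return $result;
}

// After: each commented step is extracted, and the comment becomes the name.
function dropHiddenPages(array $pages, array $hidden): array
{
    return array_filter(
        $pages,
        function ($id) use ($hidden) {
            return !in_array($id, $hidden, true);
        },
        ARRAY_FILTER_USE_KEY
    );
}

function sortByPageId(array $pages): array
{
    ksort($pages);
    return $pages;
}

function visiblePagesAfter(array $pages, array $hidden): array
{
    return sortByPageId(dropHiddenPages($pages, $hidden));
}
```

Both variants are meant to return the same result; the second one simply reads as a sequence of named steps.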

Commonly applied refactorings include:

<?php
/**
 * DokuWiki fulltextsearch functions using the index
 *
 * @license    GPL 2 (http://www.gnu.org/licenses/gpl.html)
 * @author     Andreas Gohr <[email protected]>
 */

if(!defined('DOKU_INC')) die('meh.');

/**
 * create snippets for the first few results only
 */
if(!defined('FT_SNIPPET_NUMBER')) define('FT_SNIPPET_NUMBER',15);

/**
 * The fulltext search
 *
 * Returns a list of matching documents for the given query
 *
 * refactored into ft_pageSearch(), _ft_pageSearch() and trigger_event()
 *
 * @param string     $query
 * @param array      $highlight
 * @param string     $sort
 * @param int|string $after  only show results with a modified time after this date, accepts timestamp or strtotime arguments
 * @param int|string $before only show results with a modified time before this date, accepts timestamp or strtotime arguments
 *
 * @return array
 */
function ft_pageSearch($query,&$highlight, $sort = null, $after = null, $before = null){

    if ($sort === null) {
        $sort = 'hits';
    }
    $data = [
        'query' => $query,
        'sort' => $sort,
        'after' => $after,
        'before' => $before
    ];
    $data['highlight'] =& $highlight;

    return trigger_event('SEARCH_QUERY_FULLPAGE', $data, '_ft_pageSearch');
}

/**
 * Returns a list of matching documents for the given query
 *
 * @author Andreas Gohr <[email protected]>
 * @author Kazutaka Miyasaka <[email protected]>
 *
 * @param array $data event data
 * @return array matching documents
 */
function _ft_pageSearch(&$data) {
    $Indexer = idx_get_indexer();

    // parse the given query
    $q = ft_queryParser($Indexer, $data['query']);
    $data['highlight'] = $q['highlight'];

    if (empty($q['parsed_ary'])) return array();

    // lookup all words found in the query
    $lookup = $Indexer->lookup($q['words']);

    // get all pages in this dokuwiki site (!: includes nonexistent pages)
    $pages_all = array();
    foreach ($Indexer->getPages() as $id) {
        $pages_all[$id] = 0; // base: 0 hit
    }

    // process the query
    $stack = array();
    foreach ($q['parsed_ary'] as $token) {

Bug: The expression $q['parsed_ary'] of type string|array is not guaranteed to be traversable. How about adding an additional type check?

There are several ways to fix this problem.

  1. If you want to be on the safe side, you can add an additional type check:

    $collection = json_decode($data, true);
    if ( ! is_array($collection)) {
        throw new \RuntimeException('$collection must be an array.');
    }

    foreach ($collection as $item) { /** ... */ }

  2. If you are sure that the expression is traversable, you might want to add a doc comment cast to improve IDE auto-completion and static analysis:

    /** @var array $collection */
    $collection = json_decode($data, true);

    foreach ($collection as $item) { /** .. */ }

        switch (substr($token, 0, 3)) {
            case 'W+:':
            case 'W-:':
            case 'W_:': // word
                $word    = substr($token, 3);
                $stack[] = (array) $lookup[$word];
                break;
            case 'P+:':
            case 'P-:': // phrase
                $phrase = substr($token, 3);
                // since phrases are always parsed as ((W1)(W2)...(P)),
                // the end($stack) always points to the pages that contain
                // all words in this phrase
                $pages  = end($stack);
                $pages_matched = array();
                foreach(array_keys($pages) as $id){
                    $evdata = array(
                        'id' => $id,
                        'phrase' => $phrase,
                        'text' => rawWiki($id)
                    );
                    $evt = new Doku_Event('FULLTEXT_PHRASE_MATCH',$evdata);
                    if ($evt->advise_before() && $evt->result !== true) {
                        $text = utf8_strtolower($evdata['text']);
                        if (strpos($text, $phrase) !== false) {
                            $evt->result = true;
                        }
                    }
                    $evt->advise_after();
                    if ($evt->result === true) {
                        $pages_matched[$id] = 0; // phrase: always 0 hit
                    }
                }
                $stack[] = $pages_matched;
                break;
            case 'N+:':
            case 'N-:': // namespace
                $ns = cleanID(substr($token, 3)) . ':';
                $pages_matched = array();
                foreach (array_keys($pages_all) as $id) {
                    if (strpos($id, $ns) === 0) {
                        $pages_matched[$id] = 0; // namespace: always 0 hit
                    }
                }
                $stack[] = $pages_matched;
                break;
            case 'AND': // and operation
                list($pages1, $pages2) = array_splice($stack, -2);
                $stack[] = ft_resultCombine(array($pages1, $pages2));
                break;
            case 'OR':  // or operation
                list($pages1, $pages2) = array_splice($stack, -2);
                $stack[] = ft_resultUnite(array($pages1, $pages2));
                break;
            case 'NOT': // not operation (unary)
                $pages   = array_pop($stack);
                $stack[] = ft_resultComplement(array($pages_all, $pages));
                break;
        }
    }
    $docs = array_pop($stack);

    if (empty($docs)) return array();

    // check: settings, acls, existence
    foreach (array_keys($docs) as $id) {
        if (isHiddenPage($id) || auth_quickaclcheck($id) < AUTH_READ || !page_exists($id, '', false)) {
            unset($docs[$id]);
        }
    }

    $docs = _ft_filterResultsByTime($docs, $data['after'], $data['before']);

    if ($data['sort'] === 'mtime') {
        uksort($docs, 'ft_pagemtimesorter');
    } else {
        // sort docs by count
        arsort($docs);
    }

    return $docs;
}

/**
 * Returns the backlinks for a given page
 *
 * Uses the metadata index.
 *
 * @param string $id           The id for which links shall be returned
 * @param bool   $ignore_perms Ignore the fact that pages are hidden or read-protected
 * @return array The pages that contain links to the given page
 */
function ft_backlinks($id, $ignore_perms = false){
    $result = idx_get_indexer()->lookupKey('relation_references', $id);

    if(!count($result)) return $result;

    // check ACL permissions
    foreach(array_keys($result) as $idx){
        if(($ignore_perms !== true && (
                isHiddenPage($result[$idx]) || auth_quickaclcheck($result[$idx]) < AUTH_READ
            )) || !page_exists($result[$idx], '', false)){
            unset($result[$idx]);
        }
    }

    sort($result);
    return $result;
}

/**
 * Returns the pages that use a given media file
 *
 * Uses the relation media metadata property and the metadata index.
 *
 * Note that before 2013-07-31 the second parameter was the maximum number of results and
 * permissions were ignored. That's why the parameter is now checked to be explicitly set
 * to true (with type bool) in order to be compatible with older uses of the function.
 *
 * @param string $id           The media id to look for
 * @param bool   $ignore_perms Ignore hidden pages and acls (optional, default: false)
 * @return array A list of pages that use the given media file
 */
function ft_mediause($id, $ignore_perms = false){
    $result = idx_get_indexer()->lookupKey('relation_media', $id);

    if(!count($result)) return $result;

    // check ACL permissions
    foreach(array_keys($result) as $idx){
        if(($ignore_perms !== true && (
                    isHiddenPage($result[$idx]) || auth_quickaclcheck($result[$idx]) < AUTH_READ
                )) || !page_exists($result[$idx], '', false)){
            unset($result[$idx]);
        }
    }

    sort($result);
    return $result;
}

/**
 * Quicksearch for pagenames
 *
 * By default it only matches the pagename and ignores the
 * namespace. This can be changed with the second parameter.
 * The third parameter allows searching in titles as well.
 *
 * The function always returns titles as well
 *
 * @triggers SEARCH_QUERY_PAGELOOKUP
 * @author   Andreas Gohr <[email protected]>
 * @author   Adrian Lang <[email protected]>
 *
 * @param string     $id       page id
 * @param bool       $in_ns    match against namespace as well?
 * @param bool       $in_title search in title?
 * @param int|string $after    only show results with a modified time after this date, accepts timestamp or strtotime arguments
 * @param int|string $before   only show results with a modified time before this date, accepts timestamp or strtotime arguments
 *
 * @return string[]
 */
function ft_pageLookup($id, $in_ns=false, $in_title=false, $after = null, $before = null){
    $data = [
        'id' => $id,
        'in_ns' => $in_ns,
        'in_title' => $in_title,
        'after' => $after,
        'before' => $before
    ];
    $data['has_titles'] = true; // for plugin backward compatibility check
    return trigger_event('SEARCH_QUERY_PAGELOOKUP', $data, '_ft_pageLookup');
}

/**
 * Returns list of pages as array(pageid => First Heading)
 *
 * @param array &$data event data
 * @return string[]
 */
function _ft_pageLookup(&$data){
    // split out original parameters
    $id = $data['id'];
    $Indexer = idx_get_indexer();
    $parsedQuery = ft_queryParser($Indexer, $id);
    if (count($parsedQuery['ns']) > 0) {
        $ns = cleanID($parsedQuery['ns'][0]) . ':';
        $id = implode(' ', $parsedQuery['highlight']);
    }

    $in_ns    = $data['in_ns'];
    $in_title = $data['in_title'];
    $cleaned = cleanID($id);

    $Indexer = idx_get_indexer();
    $page_idx = $Indexer->getPages();

    $pages = array();
    if ($id !== '' && $cleaned !== '') {
        foreach ($page_idx as $p_id) {
            if ((strpos($in_ns ? $p_id : noNSorNS($p_id), $cleaned) !== false)) {
                if (!isset($pages[$p_id]))
                    $pages[$p_id] = p_get_first_heading($p_id, METADATA_DONT_RENDER);
            }
        }
        if ($in_title) {
            foreach ($Indexer->lookupKey('title', $id, '_ft_pageLookupTitleCompare') as $p_id) {
                if (!isset($pages[$p_id]))
                    $pages[$p_id] = p_get_first_heading($p_id, METADATA_DONT_RENDER);
            }
        }
    }

    if (isset($ns)) {
        foreach (array_keys($pages) as $p_id) {
            if (strpos($p_id, $ns) !== 0) {
                unset($pages[$p_id]);
            }
        }
    }

    // discard hidden pages
    // discard nonexistent pages
    // check ACL permissions
    foreach(array_keys($pages) as $idx){
        if(!isVisiblePage($idx) || !page_exists($idx) ||
           auth_quickaclcheck($idx) < AUTH_READ) {
            unset($pages[$idx]);
        }
    }

    $pages = _ft_filterResultsByTime($pages, $data['after'], $data['before']);

    uksort($pages,'ft_pagesorter');
    return $pages;
}

/**
 * @param array      $results search results in the form pageid => value
 * @param int|string $after   only returns results with a modified time after this date, accepts timestamp or strtotime arguments
 * @param int|string $before  only returns results with a modified time before this date, accepts timestamp or strtotime arguments
 *
 * @return array
 */
function _ft_filterResultsByTime(array $results, $after, $before) {
    if ($after || $before) {
        $after = is_int($after) ? $after : strtotime($after);
        $before = is_int($before) ? $before : strtotime($before);

        foreach ($results as $id => $value) {
            $mTime = filemtime(wikiFN($id));
            if ($after && $after > $mTime) {
                unset($results[$id]);
                continue;
            }
            if ($before && $before < $mTime) {
                unset($results[$id]);
            }
        }
    }

    return $results;
}

/**
 * Tiny helper function for comparing the searched title with the title
 * from the search index. This function is a wrapper around stripos with
 * adapted argument order and return value.
 *
 * @param string $search searched title
 * @param string $title  title from index
 * @return bool
 */
function _ft_pageLookupTitleCompare($search, $title) {
    return stripos($title, $search) !== false;
}

/**
 * Sort pages based on their namespace level first, then on their string
 * values. This makes higher hierarchy pages rank higher than lower hierarchy
 * pages.
 *
 * @param string $a
 * @param string $b
 * @return int Returns < 0 if $a is less than $b; > 0 if $a is greater than $b, and 0 if they are equal.
 */
function ft_pagesorter($a, $b){
    $ac = count(explode(':',$a));
    $bc = count(explode(':',$b));
    if($ac < $bc){
        return -1;
    }elseif($ac > $bc){
        return 1;
    }
    return strcmp ($a,$b);
}

/**
 * Sort pages by their mtime, from newest to oldest
 *
 * @param string $a
 * @param string $b
 *
 * @return int Returns < 0 if $a is newer than $b, > 0 if $b is newer than $a and 0 if they are of the same age
 */
function ft_pagemtimesorter($a, $b) {
    $mtimeA = filemtime(wikiFN($a));
    $mtimeB = filemtime(wikiFN($b));
    return $mtimeB - $mtimeA;
}

/**
 * Creates a snippet extract
 *
 * @author Andreas Gohr <[email protected]>
 * @triggers FULLTEXT_SNIPPET_CREATE
 *
 * @param string $id page id
 * @param array $highlight
 * @return mixed
 */
function ft_snippet($id,$highlight){
    $text = rawWiki($id);
    $text = str_replace("\xC2\xAD",'',$text); // remove soft-hyphens
    $evdata = array(
            'id'        => $id,
            'text'      => &$text,
            'highlight' => &$highlight,
            'snippet'   => '',
            );

    $evt = new Doku_Event('FULLTEXT_SNIPPET_CREATE',$evdata);
    if ($evt->advise_before()) {
        $match = array();
        $snippets = array();
        $utf8_offset = $offset = $end = 0;
        $len = utf8_strlen($text);

        // build a regexp from the phrases to highlight
        $re1 = '('.join('|',array_map('ft_snippet_re_preprocess', array_map('preg_quote_cb',array_filter((array) $highlight)))).')';
        $re2 = "$re1.{0,75}(?!\\1)$re1";
        $re3 = "$re1.{0,45}(?!\\1)$re1.{0,45}(?!\\1)(?!\\2)$re1";

        for ($cnt=4; $cnt--;) {
            if (0) {
            } else if (preg_match('/'.$re3.'/iu',$text,$match,PREG_OFFSET_CAPTURE,$offset)) {
            } else if (preg_match('/'.$re2.'/iu',$text,$match,PREG_OFFSET_CAPTURE,$offset)) {
            } else if (preg_match('/'.$re1.'/iu',$text,$match,PREG_OFFSET_CAPTURE,$offset)) {
            } else {
                break;
            }

            list($str,$idx) = $match[0];

            // convert $idx (a byte offset) into a utf8 character offset
            $utf8_idx = utf8_strlen(substr($text,0,$idx));
            $utf8_len = utf8_strlen($str);

            // establish context, 100 bytes surrounding the match string
            // first look to see if we can go 100 either side,
            // then drop to 50 adding any excess if the other side can't go to 50,
            $pre = min($utf8_idx-$utf8_offset,100);
            $post = min($len-$utf8_idx-$utf8_len,100);

            if ($pre>50 && $post>50) {
                $pre = $post = 50;
            } else if ($pre>50) {
                $pre = min($pre,100-$post);
            } else if ($post>50) {
                $post = min($post, 100-$pre);
            } else if ($offset == 0) {
                // both are less than 50, means the context is the whole string
                // make it so and break out of this loop - there is no need for the
                // complex snippet calculations
                $snippets = array($text);
                break;
            }

            // establish context start and end points, try to append to previous
            // context if possible
            $start = $utf8_idx - $pre;
            $append = ($start < $end) ? $end : false;  // still the end of the previous context snippet
            $end = $utf8_idx + $utf8_len + $post;      // now set it to the end of this context

            if ($append) {

Bug / Best Practice: The expression $append of type integer|false is loosely compared to true; this is ambiguous if the integer can be zero. You might want to explicitly use !== false instead.

In PHP, under loose comparison (like ==, or !=, or switch conditions), values of different types might be equal.

For integer values, zero is a special case; in particular, the following results might be unexpected:

0   == false // true
0   == null  // true
123 == false // false
123 == null  // false

// It is often better to use strict comparison
0 === false // false
0 === null  // false

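To make the zero case concrete for a value of type integer|false, here is a minimal sketch. It is not part of fulltext.php: the helper name `contextAppendOffset` is hypothetical and only mirrors the shape of the flagged expression.

```php
<?php
// Hypothetical helper (an assumption for illustration) returning int|false.
function contextAppendOffset($start, $end)
{
    // Offset to append at when the contexts overlap, false otherwise.
    return ($start < $end) ? $end : false;
}

$append = contextAppendOffset(-1, 0); // overlap, but the offset is 0

// Loose check: 0 is treated like false, so the overlap goes unnoticed.
$looseSaysAppend = $append ? true : false;

// Strict check: only a real false means "do not append".
$strictSaysAppend = ($append !== false);
```

Whether an offset of zero can actually occur at this point depends on the surrounding loop; the sketch only shows why the two checks can disagree.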
                $snippets[count($snippets)-1] .= utf8_substr($text,$append,$end-$append);
            } else {
                $snippets[] = utf8_substr($text,$start,$end-$start);
            }

            // set $offset for next match attempt
            // continue matching after the current match
            // if the current match is not the longest possible match starting at the current offset
            // this prevents further matching of this snippet but for possible matches of length
            // smaller than match length + context (at least 50 characters) this match is part of the context
            $utf8_offset = $utf8_idx + $utf8_len;
            $offset = $idx + strlen(utf8_substr($text,$utf8_idx,$utf8_len));
            $offset = utf8_correctIdx($text,$offset);
        }

        $m = "\1";
        $snippets = preg_replace('/'.$re1.'/iu',$m.'$1'.$m,$snippets);
        $snippet = preg_replace('/'.$m.'([^'.$m.']*?)'.$m.'/iu','<strong class="search_hit">$1</strong>',hsc(join('... ',$snippets)));

        $evdata['snippet'] = $snippet;
    }
    $evt->advise_after();
    unset($evt);

    return $evdata['snippet'];
}

/**
 * Wraps a search term in regex boundary checks.
 *
 * @param string $term
 * @return string
 */
function ft_snippet_re_preprocess($term) {
    // do not process asian terms where word boundaries are not explicit
    if(preg_match('/'.IDX_ASIAN.'/u',$term)){
        return $term;
    }

    if (UTF8_PROPERTYSUPPORT) {
        // unicode word boundaries
        // see http://stackoverflow.com/a/2449017/172068
        $BL = '(?<!\pL)';
        $BR = '(?!\pL)';
    } else {
        // not as correct as above, but at least won't break
        $BL = '\b';
        $BR = '\b';
    }

    if(substr($term,0,2) == '\\*'){
        $term = substr($term,2);
    }else{
        $term = $BL.$term;
    }

    if(substr($term,-2,2) == '\\*'){
        $term = substr($term,0,-2);
    }else{
        $term = $term.$BR;
    }

    if($term == $BL || $term == $BR || $term == $BL.$BR) $term = '';
    return $term;
}

/**
 * Combine found documents and sum up their scores
 *
 * This function is used to combine searched words with a logical
 * AND. Only documents available in all arrays are returned.
 *
 * based upon PEAR's PHP_Compat function for array_intersect_key()
 *
 * @param array $args An array of page arrays
 * @return array
 */
function ft_resultCombine($args){
    $array_count = count($args);
    if($array_count == 1){
        return $args[0];
    }

    $result = array();
    if ($array_count > 1) {
        foreach ($args[0] as $key => $value) {
            $result[$key] = $value;
            for ($i = 1; $i !== $array_count; $i++) {
                if (!isset($args[$i][$key])) {
                    unset($result[$key]);
                    break;
                }
                $result[$key] += $args[$i][$key];
            }
        }
    }
    return $result;
}

/**
 * Unite found documents and sum up their scores
 *
 * based upon ft_resultCombine() function
 *
 * @param array $args An array of page arrays
 * @return array
 *
 * @author Kazutaka Miyasaka <[email protected]>
 */
function ft_resultUnite($args) {
    $array_count = count($args);
    if ($array_count === 1) {
        return $args[0];
    }

    $result = $args[0];
    for ($i = 1; $i !== $array_count; $i++) {
        foreach (array_keys($args[$i]) as $id) {
            $result[$id] += $args[$i][$id];
        }
    }
    return $result;
}

/**
 * Computes the difference of documents using page id for comparison
 *
 * nearly identical to PHP5's array_diff_key()
 *
 * @param array $args An array of page arrays
 * @return array
 *
 * @author Kazutaka Miyasaka <[email protected]>
 */
function ft_resultComplement($args) {
    $array_count = count($args);
    if ($array_count === 1) {
        return $args[0];
    }

    $result = $args[0];
    foreach (array_keys($result) as $id) {
        for ($i = 1; $i !== $array_count; $i++) {
            if (isset($args[$i][$id])) unset($result[$id]);
        }
    }
    return $result;
}

/**
 * Parses a search query and builds an array of search formulas
 *
 * @author Andreas Gohr <[email protected]>
 * @author Kazutaka Miyasaka <[email protected]>
 *
 * @param Doku_Indexer $Indexer
 * @param string $query search query
 * @return array of search formulas
 */
function ft_queryParser($Indexer, $query){
    /**
     * parse a search query and transform it into intermediate representation
     *
     * in a search query, you can use the following expressions:
     *
     *   words:
     *     include
     *     -exclude
     *   phrases:
     *     "phrase to be included"
     *     -"phrase you want to exclude"
     *   namespaces:
     *     @include:namespace (or ns:include:namespace)
     *     ^exclude:namespace (or -ns:exclude:namespace)
     *   groups:
     *     ()
     *     -()
     *   operators:
     *     and ('and' is the default operator: you can always omit this)
     *     or  (or pipe symbol '|', lower precedence than 'and')
     *
     * e.g. a query [ aa "bb cc" @dd:ee ] means "search pages which contain
     *      a word 'aa', a phrase 'bb cc' and are within a namespace 'dd:ee'".
     *      this query is equivalent to [ -(-aa or -"bb cc" or -ns:dd:ee) ]
     *      as long as you don't mind hit counts.
     *
     * intermediate representation consists of the following parts:
     *
     *   ( )           - group
     *   AND           - logical and
     *   OR            - logical or
     *   NOT           - logical not
     *   W+:, W-:, W_: - word      (underscore: no need to highlight)
     *   P+:, P-:      - phrase    (minus sign: logically in NOT group)
     *   N+:, N-:      - namespace
     */
    $parsed_query = '';
    $parens_level = 0;
    $terms = preg_split('/(-?".*?")/u', utf8_strtolower($query), -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    foreach ($terms as $term) {
        $parsed = '';
        if (preg_match('/^(-?)"(.+)"$/u', $term, $matches)) {
            // phrase-include and phrase-exclude
            $not = $matches[1] ? 'NOT' : '';
            $parsed = $not.ft_termParser($Indexer, $matches[2], false, true);
        } else {
            // fix incomplete phrase
            $term = str_replace('"', ' ', $term);

            // fix parentheses
            $term = str_replace(')'  , ' ) ', $term);
            $term = str_replace('('  , ' ( ', $term);
            $term = str_replace('- (', ' -(', $term);

            // treat pipe symbols as 'OR' operators
            $term = str_replace('|', ' or ', $term);

            // treat ideographic spaces (U+3000) as search term separators
            // FIXME: some more separators?
            $term = preg_replace('/[ \x{3000}]+/u', ' ',  $term);
            $term = trim($term);
            if ($term === '') continue;

            $tokens = explode(' ', $term);
            foreach ($tokens as $token) {
                if ($token === '(') {
                    // parenthesis-include-open
                    $parsed .= '(';
                    ++$parens_level;
                } elseif ($token === '-(') {
                    // parenthesis-exclude-open
                    $parsed .= 'NOT(';
                    ++$parens_level;
                } elseif ($token === ')') {
                    // parenthesis-any-close
                    if ($parens_level === 0) continue;
                    $parsed .= ')';
                    $parens_level--;
                } elseif ($token === 'and') {
                    // logical-and (do nothing)
                } elseif ($token === 'or') {
                    // logical-or
                    $parsed .= 'OR';
                } elseif (preg_match('/^(?:\^|-ns:)(.+)$/u', $token, $matches)) {
                    // namespace-exclude
                    $parsed .= 'NOT(N+:'.$matches[1].')';
                } elseif (preg_match('/^(?:@|ns:)(.+)$/u', $token, $matches)) {
                    // namespace-include
                    $parsed .= '(N+:'.$matches[1].')';
                } elseif (preg_match('/^-(.+)$/', $token, $matches)) {
                    // word-exclude
                    $parsed .= 'NOT('.ft_termParser($Indexer, $matches[1]).')';
                } else {
                    // word-include
                    $parsed .= ft_termParser($Indexer, $token);
                }
            }
        }
        $parsed_query .= $parsed;
    }

    // cleanup (very sensitive)
    $parsed_query .= str_repeat(')', $parens_level);
    do {
        $parsed_query_old = $parsed_query;
        $parsed_query = preg_replace('/(NOT)?\(\)/u', '', $parsed_query);
    } while ($parsed_query !== $parsed_query_old);
    $parsed_query = preg_replace('/(NOT|OR)+\)/u', ')'      , $parsed_query);
    $parsed_query = preg_replace('/(OR)+/u'      , 'OR'     , $parsed_query);
    $parsed_query = preg_replace('/\(OR/u'       , '('      , $parsed_query);
    $parsed_query = preg_replace('/^OR|OR$/u'    , ''       , $parsed_query);
    $parsed_query = preg_replace('/\)(NOT)?\(/u' , ')AND$1(', $parsed_query);

    // adjustment: make highlightings right
    $parens_level     = 0;
    $notgrp_levels    = array();
    $parsed_query_new = '';
    $tokens = preg_split('/(NOT\(|[()])/u', $parsed_query, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        if ($token === 'NOT(') {
            $notgrp_levels[] = ++$parens_level;
        } elseif ($token === '(') {
            ++$parens_level;
        } elseif ($token === ')') {
            if ($parens_level-- === end($notgrp_levels)) array_pop($notgrp_levels);
        } elseif (count($notgrp_levels) % 2 === 1) {
            // turn highlight-flag off if terms are logically in "NOT" group
            $token = preg_replace('/([WPN])\+\:/u', '$1-:', $token);
        }
        $parsed_query_new .= $token;
    }
    $parsed_query = $parsed_query_new;

    /**
     * convert infix notation string into postfix (Reverse Polish notation) array
     * by Shunting-yard algorithm
     *
     * see: http://en.wikipedia.org/wiki/Reverse_Polish_notation
     * see: http://en.wikipedia.org/wiki/Shunting-yard_algorithm
     */
    $parsed_ary     = array();
    $ope_stack      = array();
    $ope_precedence = array(')' => 1, 'OR' => 2, 'AND' => 3, 'NOT' => 4, '(' => 5);
    $ope_regex      = '/([()]|OR|AND|NOT)/u';

    $tokens = preg_split($ope_regex, $parsed_query, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        if (preg_match($ope_regex, $token)) {
            // operator
            $last_ope = end($ope_stack);
            while ($last_ope !== false && $ope_precedence[$token] <= $ope_precedence[$last_ope] && $last_ope != '(') {
                $parsed_ary[] = array_pop($ope_stack);
                $last_ope = end($ope_stack);
            }
            if ($token == ')') {
                array_pop($ope_stack); // this array_pop always deletes '('
            } else {
                $ope_stack[] = $token;
            }
        } else {
            // operand
            $token_decoded = str_replace(array('OP', 'CP'), array('(', ')'), $token);
            $parsed_ary[] = $token_decoded;
        }
    }
    $parsed_ary = array_values(array_merge($parsed_ary, array_reverse($ope_stack)));

    // cleanup: each double "NOT" in RPN array actually does nothing
    $parsed_ary_count = count($parsed_ary);
    for ($i = 1; $i < $parsed_ary_count; ++$i) {
        if ($parsed_ary[$i] === 'NOT' && $parsed_ary[$i - 1] === 'NOT') {
            unset($parsed_ary[$i], $parsed_ary[$i - 1]);
        }
    }
    $parsed_ary = array_values($parsed_ary);
800
801
    // build return value
802
    $q = array();
803
    $q['query']      = $query;
804
    $q['parsed_str'] = $parsed_query;
805
    $q['parsed_ary'] = $parsed_ary;
806
807
    foreach ($q['parsed_ary'] as $token) {
808
        if (strlen($token) < 3 || $token[2] !== ':') continue;
809
        $body = substr($token, 3);
810
811
        switch (substr($token, 0, 3)) {
812
            case 'N+:':
813
                     $q['ns'][]        = $body; // for backward compatibility
814
                     break;
815
            case 'N-:':
816
                     $q['notns'][]     = $body; // for backward compatibility
817
                     break;
818
            case 'W_:':
819
                     $q['words'][]     = $body;
820
                     break;
821
            case 'W-:':
822
                     $q['words'][]     = $body;
823
                     $q['not'][]       = $body; // for backward compatibility
824
                     break;
825
            case 'W+:':
826
                     $q['words'][]     = $body;
827
                     $q['highlight'][] = $body;
828
                     $q['and'][]       = $body; // for backward compatibility
829
                     break;
830
            case 'P-:':
831
                     $q['phrases'][]   = $body;
832
                     break;
833
            case 'P+:':
834
                     $q['phrases'][]   = $body;
835
                     $q['highlight'][] = $body;
836
                     break;
837
        }
838
    }
839
    foreach (array('words', 'phrases', 'highlight', 'ns', 'notns', 'and', 'not') as $key) {
840
        $q[$key] = empty($q[$key]) ? array() : array_values(array_unique($q[$key]));
841
    }
842
843
    return $q;
844
}
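
The infix-to-postfix step of ft_queryParser() can be tried in isolation. The sketch below copies the precedence table and the operator loop from the listing into a hypothetical helper, example_to_rpn(), which is not part of DokuWiki. It leaves out the OP/CP decoding and the double-NOT cleanup, and the input string is an invented example rather than real parser output.

```php
<?php
/**
 * Hypothetical helper: converts an infix query string into an RPN array,
 * using the same precedence table as the listing above.
 */
function example_to_rpn(string $parsed_query): array {
    $parsed_ary     = array();
    $ope_stack      = array();
    $ope_precedence = array(')' => 1, 'OR' => 2, 'AND' => 3, 'NOT' => 4, '(' => 5);
    $ope_regex      = '/([()]|OR|AND|NOT)/u';

    $tokens = preg_split($ope_regex, $parsed_query, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        if (preg_match($ope_regex, $token)) {
            // operator: pop operators of equal or higher precedence, stopping at '('
            $last_ope = end($ope_stack);
            while ($last_ope !== false && $ope_precedence[$token] <= $ope_precedence[$last_ope] && $last_ope != '(') {
                $parsed_ary[] = array_pop($ope_stack);
                $last_ope = end($ope_stack);
            }
            if ($token == ')') {
                array_pop($ope_stack); // discard the matching '('
            } else {
                $ope_stack[] = $token;
            }
        } else {
            // operand: goes straight to the output
            $parsed_ary[] = $token;
        }
    }
    // flush the remaining operators, top of the stack first
    return array_merge($parsed_ary, array_reverse($ope_stack));
}

print_r(example_to_rpn('(W+:foo)AND(NOT(W-:bar))'));
// yields the elements 'W+:foo', 'W-:bar', 'NOT', 'AND' in that order
```

The operands come out in their original order and each operator follows its operands, which is what lets the caller evaluate the array with a simple stack.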
845
846
/**
847
 * Transforms given search term into intermediate representation
848
 *
849
 * This function is used by ft_queryParser() and is not intended for general-purpose use.
850
 *
851
 * @author Kazutaka Miyasaka <[email protected]>
852
 *
853
 * @param Doku_Indexer $Indexer
854
 * @param string       $term
855
 * @param bool         $consider_asian
856
 * @param bool         $phrase_mode
857
 * @return string
858
 */
859
function ft_termParser($Indexer, $term, $consider_asian = true, $phrase_mode = false) {
860
    $parsed = '';
861
    if ($consider_asian) {
862
        // successive asian characters need to be searched as a phrase
863
        $words = preg_split('/('.IDX_ASIAN.'+)/u', $term, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
864
        foreach ($words as $word) {
865
            $phrase_mode = $phrase_mode ? true : preg_match('/'.IDX_ASIAN.'/u', $word);
866
            $parsed .= ft_termParser($Indexer, $word, false, $phrase_mode);
It seems that $phrase_mode, defined by $phrase_mode ? true : pr...DX_ASIAN . '/u', $word) on line 865, can also be of type integer; however, ft_termParser() only seems to accept boolean. Consider adding a type check.

If a method or function can return values of several different types, and you cannot be sure that only one of those types reaches this context, we recommend adding a type check:

/**
 * @return array|string
 */
function returnsDifferentValues($x) {
    if ($x) {
        return 'foo';
    }

    return array();
}

$x = returnsDifferentValues($y);
if (is_array($x)) {
    // $x is an array.
}

If this is a common case that PHP Analyzer should handle natively, please let us know by opening an issue.
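
For the $phrase_mode assignment on line 865, one possible way to keep the value strictly boolean is to compare the preg_match() result against 1. The sketch below is only an illustration: example_phrase_mode() and the EXAMPLE_ASIAN pattern are hypothetical stand-ins, since IDX_ASIAN is defined elsewhere in DokuWiki and is not shown in this listing.

```php
<?php
// Hypothetical stand-in for IDX_ASIAN; this range covers only
// CJK unified ideographs and is used purely for illustration.
const EXAMPLE_ASIAN = '[\x{4E00}-\x{9FFF}]';

/**
 * Returns a strict boolean: the incoming flag, or whether $word
 * contains a character matched by EXAMPLE_ASIAN.
 */
function example_phrase_mode(bool $phrase_mode, string $word): bool {
    // preg_match() returns 1, 0, or false; comparing with 1 always yields a bool
    return $phrase_mode || preg_match('/' . EXAMPLE_ASIAN . '/u', $word) === 1;
}

var_dump(example_phrase_mode(false, 'wiki')); // bool(false)
var_dump(example_phrase_mode(false, '検索')); // bool(true)
var_dump(example_phrase_mode(true, 'wiki'));  // bool(true)
```

Under this assumption the recursive call would always receive a boolean, and a failed match (false) would be treated the same as no match.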

867
        }
868
    } else {
869
        $term_noparen = str_replace(array('(', ')'), ' ', $term);
870
        $words = $Indexer->tokenizer($term_noparen, true);
871
872
        // W_: no need to highlight
873
        if (empty($words)) {
874
            $parsed = '()'; // important: do not remove
875
        } elseif ($words[0] === $term) {
876
            $parsed = '(W+:'.$words[0].')';
877
        } elseif ($phrase_mode) {
878
            $term_encoded = str_replace(array('(', ')'), array('OP', 'CP'), $term);
879
            $parsed = '((W_:'.implode(')(W_:', $words).')(P+:'.$term_encoded.'))';
880
        } else {
881
            $parsed = '((W+:'.implode(')(W+:', $words).'))';
882
        }
883
    }
884
    return $parsed;
885
}
886
887
/**
888
 * Recreate a search query string from its parsed parts; negated phrases and `OR` searches are not supported
889
 *
890
 * @param array $and
891
 * @param array $not
892
 * @param array $phrases
893
 * @param array $ns
894
 * @param array $notns
895
 *
896
 * @return string
897
 */
898
function ft_queryUnparser_simple(array $and, array $not, array $phrases, array $ns, array $notns) {
899
    $query = implode(' ', $and);
900
    if (!empty($not)) {
901
        $query .= ' -' . implode(' -', $not);
902
    }
903
904
    if (!empty($phrases)) {
905
        $query .= ' "' . implode('" "', $phrases) . '"';
906
    }
907
908
    if (!empty($ns)) {
909
        $query .= ' @' . implode(' @', $ns);
910
    }
911
912
    if (!empty($notns)) {
913
        $query .= ' ^' . implode(' ^', $notns);
914
    }
915
916
    return $query;
917
}
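
A short usage sketch for the unparser follows. The function body is repeated under a hypothetical name, example_query_unparse(), so the snippet runs on its own, and all argument values are invented for illustration.

```php
<?php
// Copy of ft_queryUnparser_simple() under a hypothetical name,
// so this example does not depend on the rest of fulltext.php.
function example_query_unparse(array $and, array $not, array $phrases, array $ns, array $notns) {
    $query = implode(' ', $and);
    if (!empty($not)) {
        $query .= ' -' . implode(' -', $not);
    }
    if (!empty($phrases)) {
        $query .= ' "' . implode('" "', $phrases) . '"';
    }
    if (!empty($ns)) {
        $query .= ' @' . implode(' @', $ns);
    }
    if (!empty($notns)) {
        $query .= ' ^' . implode(' ^', $notns);
    }
    return $query;
}

echo example_query_unparse(
    array('wiki', 'syntax'), // terms that must match
    array('draft'),          // excluded terms
    array('full text'),      // phrases
    array('docs'),           // namespaces to search
    array('playground')      // namespaces to exclude
);
// prints: wiki syntax -draft "full text" @docs ^playground
```

Note that when $and is empty the result starts with a space, so a caller may want to trim() the returned string.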
918
919
//Setup VIM: ex: et ts=4 :
920