Completed
Push — authpdo ( 7f89f0...388201 )
by Andreas
18:44 queued 12:59
created

fulltext.php ➔ ft_queryParser()   F

Complexity

Conditions 42
Paths 18900

Size

Total Lines 222
Code Lines 127

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric Value
cc 42
eloc 127
nc 18900
nop 2
dl 0
loc 222
rs 2
c 0
b 0
f 0

How to fix   Long Method    Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
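As an illustration of that advice, the commented steps in the non-phrase branch of ft_queryParser() could be pulled into a helper of their own. The following is only a sketch: the function name sketch_normalizeQueryTerm() is an assumption made for this example and is not part of DokuWiki; the body reuses the statements and comments from the listing.

```php
<?php
/**
 * Hypothetical helper (name assumed for illustration, not a DokuWiki function):
 * the term clean-up steps of ft_queryParser(), extracted into one small method.
 *
 * @param string $term raw, already lower-cased query fragment
 * @return string normalised fragment, tokens separated by single spaces
 */
function sketch_normalizeQueryTerm($term) {
    // fix incomplete phrase
    $term = str_replace('"', ' ', $term);

    // fix parentheses
    $term = str_replace(')', ' ) ', $term);
    $term = str_replace('(', ' ( ', $term);
    $term = str_replace('- (', ' -(', $term);

    // treat pipe symbols as 'OR' operators
    $term = str_replace('|', ' or ', $term);

    // treat ideographic spaces (U+3000) as search term separators
    $term = preg_replace('/[ \x{3000}]+/u', ' ', $term);

    return trim($term);
}
```

With such a helper, the loop body in the parser would shrink to a single call followed by the existing empty-string check, and the former comments survive as the helper's name and documentation.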

Commonly applied refactorings include:

1
<?php
2
/**
3
 * DokuWiki fulltextsearch functions using the index
4
 *
5
 * @license    GPL 2 (http://www.gnu.org/licenses/gpl.html)
6
 * @author     Andreas Gohr <[email protected]>
7
 */
8
9
if(!defined('DOKU_INC')) die('meh.');
10
11
/**
12
 * create snippets for the first few results only
13
 */
14
if(!defined('FT_SNIPPET_NUMBER')) define('FT_SNIPPET_NUMBER',15);
15
16
/**
17
 * The fulltext search
18
 *
19
 * Returns a list of matching documents for the given query
20
 *
21
 * refactored into ft_pageSearch(), _ft_pageSearch() and trigger_event()
22
 *
23
 * @param string $query
24
 * @param array $highlight
25
 * @return array
26
 */
27
function ft_pageSearch($query,&$highlight){
28
29
    $data = array();
30
    $data['query'] = $query;
31
    $data['highlight'] =& $highlight;
32
33
    return trigger_event('SEARCH_QUERY_FULLPAGE', $data, '_ft_pageSearch');
34
}
35
36
/**
37
 * Returns a list of matching documents for the given query
38
 *
39
 * @author Andreas Gohr <[email protected]>
40
 * @author Kazutaka Miyasaka <[email protected]>
41
 *
42
 * @param array $data event data
43
 * @return array matching documents
44
 */
45
function _ft_pageSearch(&$data) {
46
    $Indexer = idx_get_indexer();
47
48
    // parse the given query
49
    $q = ft_queryParser($Indexer, $data['query']);
50
    $data['highlight'] = $q['highlight'];
51
52
    if (empty($q['parsed_ary'])) return array();
53
54
    // lookup all words found in the query
55
    $lookup = $Indexer->lookup($q['words']);
56
57
    // get all pages in this dokuwiki site (!: includes nonexistent pages)
58
    $pages_all = array();
59
    foreach ($Indexer->getPages() as $id) {
60
        $pages_all[$id] = 0; // base: 0 hit
61
    }
62
63
    // process the query
64
    $stack = array();
65
    foreach ($q['parsed_ary'] as $token) {
Bug introduced by
The expression $q['parsed_ary'] of type string|array is not guaranteed to be traversable. How about adding an additional type check?

There are several ways to fix this problem.

  1. If you want to be on the safe side, you can add an additional type-check:

    $collection = json_decode($data, true);
    if ( ! is_array($collection)) {
        throw new \RuntimeException('$collection must be an array.');
    }
    
    foreach ($collection as $item) { /** ... */ }
    
  2. If you are sure that the expression is traversable, you might want to add a doc comment cast to improve IDE auto-completion and static analysis:

    /** @var array $collection */
    $collection = json_decode($data, true);
    
    foreach ($collection as $item) { /** .. */ }
    
  3. Mark the issue as a false-positive: hover over the remove button in the top-right corner of this issue for more options.

66
        switch (substr($token, 0, 3)) {
67
            case 'W+:':
68
            case 'W-:':
69
            case 'W_:': // word
70
                $word    = substr($token, 3);
71
                $stack[] = (array) $lookup[$word];
72
                break;
73
            case 'P+:':
74
            case 'P-:': // phrase
75
                $phrase = substr($token, 3);
76
                // since phrases are always parsed as ((W1)(W2)...(P)),
77
                // the end($stack) always points the pages that contain
78
                // all words in this phrase
79
                $pages  = end($stack);
80
                $pages_matched = array();
81
                foreach(array_keys($pages) as $id){
82
                    $evdata = array(
83
                        'id' => $id,
84
                        'phrase' => $phrase,
85
                        'text' => rawWiki($id)
86
                    );
87
                    $evt = new Doku_Event('FULLTEXT_PHRASE_MATCH',$evdata);
88
                    if ($evt->advise_before() && $evt->result !== true) {
89
                        $text = utf8_strtolower($evdata['text']);
90
                        if (strpos($text, $phrase) !== false) {
91
                            $evt->result = true;
92
                        }
93
                    }
94
                    $evt->advise_after();
95
                    if ($evt->result === true) {
96
                        $pages_matched[$id] = 0; // phrase: always 0 hit
97
                    }
98
                }
99
                $stack[] = $pages_matched;
100
                break;
101
            case 'N+:':
102
            case 'N-:': // namespace
103
                $ns = substr($token, 3);
104
                $pages_matched = array();
105
                foreach (array_keys($pages_all) as $id) {
106
                    if (strpos($id, $ns) === 0) {
107
                        $pages_matched[$id] = 0; // namespace: always 0 hit
108
                    }
109
                }
110
                $stack[] = $pages_matched;
111
                break;
112
            case 'AND': // and operation
113
                list($pages1, $pages2) = array_splice($stack, -2);
114
                $stack[] = ft_resultCombine(array($pages1, $pages2));
115
                break;
116
            case 'OR':  // or operation
117
                list($pages1, $pages2) = array_splice($stack, -2);
118
                $stack[] = ft_resultUnite(array($pages1, $pages2));
119
                break;
120
            case 'NOT': // not operation (unary)
121
                $pages   = array_pop($stack);
122
                $stack[] = ft_resultComplement(array($pages_all, $pages));
123
                break;
124
        }
125
    }
126
    $docs = array_pop($stack);
127
128
    if (empty($docs)) return array();
129
130
    // check: settings, acls, existence
131
    foreach (array_keys($docs) as $id) {
132
        if (isHiddenPage($id) || auth_quickaclcheck($id) < AUTH_READ || !page_exists($id, '', false)) {
133
            unset($docs[$id]);
134
        }
135
    }
136
137
    // sort docs by count
138
    arsort($docs);
139
140
    return $docs;
141
}
142
143
/**
144
 * Returns the backlinks for a given page
145
 *
146
 * Uses the metadata index.
147
 *
148
 * @param string $id           The id for which links shall be returned
149
 * @param bool   $ignore_perms Ignore the fact that pages are hidden or read-protected
150
 * @return array The pages that contain links to the given page
151
 */
152
function ft_backlinks($id, $ignore_perms = false){
153
    $result = idx_get_indexer()->lookupKey('relation_references', $id);
154
155
    if(!count($result)) return $result;
156
157
    // check ACL permissions
158
    foreach(array_keys($result) as $idx){
159
        if(($ignore_perms !== true && (
160
                isHiddenPage($result[$idx]) || auth_quickaclcheck($result[$idx]) < AUTH_READ
161
            )) || !page_exists($result[$idx], '', false)){
162
            unset($result[$idx]);
163
        }
164
    }
165
166
    sort($result);
167
    return $result;
168
}
169
170
/**
171
 * Returns the pages that use a given media file
172
 *
173
 * Uses the relation media metadata property and the metadata index.
174
 *
175
 * Note that before 2013-07-31 the second parameter was the maximum number of results and
176
 * permissions were ignored. That's why the parameter is now checked to be explicitly set
177
 * to true (with type bool) in order to be compatible with older uses of the function.
178
 *
179
 * @param string $id           The media id to look for
180
 * @param bool   $ignore_perms Ignore hidden pages and acls (optional, default: false)
181
 * @return array A list of pages that use the given media file
182
 */
183
function ft_mediause($id, $ignore_perms = false){
184
    $result = idx_get_indexer()->lookupKey('relation_media', $id);
185
186
    if(!count($result)) return $result;
187
188
    // check ACL permissions
189
    foreach(array_keys($result) as $idx){
190
        if(($ignore_perms !== true && (
191
                    isHiddenPage($result[$idx]) || auth_quickaclcheck($result[$idx]) < AUTH_READ
192
                )) || !page_exists($result[$idx], '', false)){
193
            unset($result[$idx]);
194
        }
195
    }
196
197
    sort($result);
198
    return $result;
199
}
200
201
202
203
/**
204
 * Quicksearch for pagenames
205
 *
206
 * By default it only matches the pagename and ignores the
207
 * namespace. This can be changed with the second parameter.
208
 * The third parameter allows searching in titles as well.
209
 *
210
 * The function always returns titles as well
211
 *
212
 * @triggers SEARCH_QUERY_PAGELOOKUP
213
 * @author Andreas Gohr <[email protected]>
214
 * @author Adrian Lang <[email protected]>
215
 *
216
 * @param string $id        page id
217
 * @param bool   $in_ns     match against namespace as well?
218
 * @param bool   $in_title  search in title?
219
 * @return string[]
220
 */
221
function ft_pageLookup($id, $in_ns=false, $in_title=false){
222
    $data = compact('id', 'in_ns', 'in_title');
223
    $data['has_titles'] = true; // for plugin backward compatibility check
224
    return trigger_event('SEARCH_QUERY_PAGELOOKUP', $data, '_ft_pageLookup');
225
}
226
227
/**
228
 * Returns list of pages as array(pageid => First Heading)
229
 *
230
 * @param array &$data event data
231
 * @return string[]
232
 */
233
function _ft_pageLookup(&$data){
234
    // split out original parameters
235
    $id = $data['id'];
236
    if (preg_match('/(?:^| )(?:@|ns:)([\w:]+)/', $id, $matches)) {
237
        $ns = cleanID($matches[1]) . ':';
238
        $id = str_replace($matches[0], '', $id);
239
    }
240
241
    $in_ns    = $data['in_ns'];
242
    $in_title = $data['in_title'];
243
    $cleaned = cleanID($id);
244
245
    $Indexer = idx_get_indexer();
246
    $page_idx = $Indexer->getPages();
247
248
    $pages = array();
249
    if ($id !== '' && $cleaned !== '') {
250
        foreach ($page_idx as $p_id) {
251
            if ((strpos($in_ns ? $p_id : noNSorNS($p_id), $cleaned) !== false)) {
252
                if (!isset($pages[$p_id]))
253
                    $pages[$p_id] = p_get_first_heading($p_id, METADATA_DONT_RENDER);
254
            }
255
        }
256
        if ($in_title) {
257
            foreach ($Indexer->lookupKey('title', $id, '_ft_pageLookupTitleCompare') as $p_id) {
258
                if (!isset($pages[$p_id]))
259
                    $pages[$p_id] = p_get_first_heading($p_id, METADATA_DONT_RENDER);
260
            }
261
        }
262
    }
263
264
    if (isset($ns)) {
265
        foreach (array_keys($pages) as $p_id) {
266
            if (strpos($p_id, $ns) !== 0) {
267
                unset($pages[$p_id]);
268
            }
269
        }
270
    }
271
272
    // discard hidden pages
273
    // discard nonexistent pages
274
    // check ACL permissions
275
    foreach(array_keys($pages) as $idx){
276
        if(!isVisiblePage($idx) || !page_exists($idx) ||
277
           auth_quickaclcheck($idx) < AUTH_READ) {
278
            unset($pages[$idx]);
279
        }
280
    }
281
282
    uksort($pages,'ft_pagesorter');
283
    return $pages;
284
}
285
286
/**
287
 * Tiny helper function for comparing the searched title with the title
288
 * from the search index. This function is a wrapper around stripos with
289
 * adapted argument order and return value.
290
 *
291
 * @param string $search searched title
292
 * @param string $title  title from index
293
 * @return bool
294
 */
295
function _ft_pageLookupTitleCompare($search, $title) {
296
    return stripos($title, $search) !== false;
297
}
298
299
/**
300
 * Sort pages based on their namespace level first, then on their string
301
 * values. This makes higher hierarchy pages rank higher than lower hierarchy
302
 * pages.
303
 *
304
 * @param string $a
305
 * @param string $b
306
 * @return int Returns < 0 if $a is less than $b; > 0 if $a is greater than $b, and 0 if they are equal.
307
 */
308
function ft_pagesorter($a, $b){
309
    $ac = count(explode(':',$a));
310
    $bc = count(explode(':',$b));
311
    if($ac < $bc){
312
        return -1;
313
    }elseif($ac > $bc){
314
        return 1;
315
    }
316
    return strcmp ($a,$b);
317
}
318
319
/**
320
 * Creates a snippet extract
321
 *
322
 * @author Andreas Gohr <[email protected]>
323
 * @triggers FULLTEXT_SNIPPET_CREATE
324
 *
325
 * @param string $id page id
326
 * @param array $highlight
327
 * @return mixed
328
 */
329
function ft_snippet($id,$highlight){
330
    $text = rawWiki($id);
331
    $text = str_replace("\xC2\xAD",'',$text); // remove soft-hyphens
332
    $evdata = array(
333
            'id'        => $id,
334
            'text'      => &$text,
335
            'highlight' => &$highlight,
336
            'snippet'   => '',
337
            );
338
339
    $evt = new Doku_Event('FULLTEXT_SNIPPET_CREATE',$evdata);
340
    if ($evt->advise_before()) {
341
        $match = array();
342
        $snippets = array();
343
        $utf8_offset = $offset = $end = 0;
344
        $len = utf8_strlen($text);
345
346
        // build a regexp from the phrases to highlight
347
        $re1 = '('.join('|',array_map('ft_snippet_re_preprocess', array_map('preg_quote_cb',array_filter((array) $highlight)))).')';
348
        $re2 = "$re1.{0,75}(?!\\1)$re1";
349
        $re3 = "$re1.{0,45}(?!\\1)$re1.{0,45}(?!\\1)(?!\\2)$re1";
350
351
        for ($cnt=4; $cnt--;) {
352
            if (0) {
353
            } else if (preg_match('/'.$re3.'/iu',$text,$match,PREG_OFFSET_CAPTURE,$offset)) {
354
            } else if (preg_match('/'.$re2.'/iu',$text,$match,PREG_OFFSET_CAPTURE,$offset)) {
355
            } else if (preg_match('/'.$re1.'/iu',$text,$match,PREG_OFFSET_CAPTURE,$offset)) {
356
            } else {
357
                break;
358
            }
359
360
            list($str,$idx) = $match[0];
361
362
            // convert $idx (a byte offset) into a utf8 character offset
363
            $utf8_idx = utf8_strlen(substr($text,0,$idx));
364
            $utf8_len = utf8_strlen($str);
365
366
            // establish context, 100 bytes surrounding the match string
367
            // first look to see if we can go 100 either side,
368
            // then drop to 50 adding any excess if the other side can't go to 50,
369
            $pre = min($utf8_idx-$utf8_offset,100);
370
            $post = min($len-$utf8_idx-$utf8_len,100);
371
372
            if ($pre>50 && $post>50) {
373
                $pre = $post = 50;
374
            } else if ($pre>50) {
375
                $pre = min($pre,100-$post);
376
            } else if ($post>50) {
377
                $post = min($post, 100-$pre);
378
            } else if ($offset == 0) {
379
                // both are less than 50, means the context is the whole string
380
                // make it so and break out of this loop - there is no need for the
381
                // complex snippet calculations
382
                $snippets = array($text);
383
                break;
384
            }
385
386
            // establish context start and end points, try to append to previous
387
            // context if possible
388
            $start = $utf8_idx - $pre;
389
            $append = ($start < $end) ? $end : false;  // still the end of the previous context snippet
390
            $end = $utf8_idx + $utf8_len + $post;      // now set it to the end of this context
391
392
            if ($append) {
Bug Best Practice introduced by
The expression $append of type integer|false is loosely compared to true; this is ambiguous if the integer can be zero. You might want to explicitly use !== false instead.

In PHP, under loose comparison (like ==, or !=, or switch conditions), values of different types might be equal.

For integer values, zero is a special case, in particular the following results might be unexpected:

0   == false // true
0   == null  // true
123 == false // false
123 == null  // false

// It is often better to use strict comparison
0 === false // false
0 === null  // false
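
Applied to the flagged `if ($append)` check, a strict comparison against false could look like the sketch below. Both function names are hypothetical and exist only for this example; the first mirrors the expression that assigns $append in the listing.

```php
<?php
/**
 * Hypothetical stand-in for the assignment to $append in ft_snippet():
 * the end of the previous context if the new context overlaps it, else false.
 */
function sketch_append_point($start, $end) {
    return ($start < $end) ? $end : false;
}

/**
 * Strict check: only the boolean false means "start a new snippet".
 * An integer offset of 0 still counts as an append point.
 */
function sketch_should_append($append) {
    return $append !== false;
}
```

Unlike the loose `if ($append)`, this version does not treat an offset of 0 as "no append point".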
393
                $snippets[count($snippets)-1] .= utf8_substr($text,$append,$end-$append);
394
            } else {
395
                $snippets[] = utf8_substr($text,$start,$end-$start);
396
            }
397
398
            // set $offset for next match attempt
399
            // continue matching after the current match
400
            // if the current match is not the longest possible match starting at the current offset
401
            // this prevents further matching of this snippet but for possible matches of length
402
            // smaller than match length + context (at least 50 characters) this match is part of the context
403
            $utf8_offset = $utf8_idx + $utf8_len;
Unused Code introduced by
$utf8_offset is not used, you could remove the assignment.

This check looks for variable assignments that are either overwritten by other assignments or whose variable is not used subsequently.

$myVar = 'Value';
$higher = false;

if (rand(1, 6) > 3) {
    $higher = true;
} else {
    $higher = false;
}

Both the $myVar assignment in line 1 and the $higher assignment in line 2 are dead: the first because $myVar is never used, and the second because $higher is always overwritten on every possible execution path.

404
            $offset = $idx + strlen(utf8_substr($text,$utf8_idx,$utf8_len));
405
            $offset = utf8_correctIdx($text,$offset);
Unused Code introduced by
$offset is not used, you could remove the assignment.

This check looks for variable assignments that are either overwritten by other assignments or whose variable is not used subsequently.

$myVar = 'Value';
$higher = false;

if (rand(1, 6) > 3) {
    $higher = true;
} else {
    $higher = false;
}

Both the $myVar assignment in line 1 and the $higher assignment in line 2 are dead: the first because $myVar is never used, and the second because $higher is always overwritten on every possible execution path.

406
        }
407
408
        $m = "\1";
409
        $snippets = preg_replace('/'.$re1.'/iu',$m.'$1'.$m,$snippets);
410
        $snippet = preg_replace('/'.$m.'([^'.$m.']*?)'.$m.'/iu','<strong class="search_hit">$1</strong>',hsc(join('... ',$snippets)));
411
412
        $evdata['snippet'] = $snippet;
413
    }
414
    $evt->advise_after();
415
    unset($evt);
416
417
    return $evdata['snippet'];
418
}
419
420
/**
421
 * Wraps a search term in regex boundary checks.
422
 *
423
 * @param string $term
424
 * @return string
425
 */
426
function ft_snippet_re_preprocess($term) {
427
    // do not process asian terms where word boundaries are not explicit
428
    if(preg_match('/'.IDX_ASIAN.'/u',$term)){
429
        return $term;
430
    }
431
432
    if (UTF8_PROPERTYSUPPORT) {
433
        // unicode word boundaries
434
        // see http://stackoverflow.com/a/2449017/172068
435
        $BL = '(?<!\pL)';
436
        $BR = '(?!\pL)';
437
    } else {
438
        // not as correct as above, but at least won't break
439
        $BL = '\b';
440
        $BR = '\b';
441
    }
442
443
    if(substr($term,0,2) == '\\*'){
444
        $term = substr($term,2);
445
    }else{
446
        $term = $BL.$term;
447
    }
448
449
    if(substr($term,-2,2) == '\\*'){
450
        $term = substr($term,0,-2);
451
    }else{
452
        $term = $term.$BR;
453
    }
454
455
    if($term == $BL || $term == $BR || $term == $BL.$BR) $term = '';
456
    return $term;
457
}
458
459
/**
460
 * Combine found documents and sum up their scores
461
 *
462
 * This function is used to combine searched words with a logical
463
 * AND. Only documents available in all arrays are returned.
464
 *
465
 * based upon PEAR's PHP_Compat function for array_intersect_key()
466
 *
467
 * @param array $args An array of page arrays
468
 * @return array
469
 */
470
function ft_resultCombine($args){
471
    $array_count = count($args);
472
    if($array_count == 1){
473
        return $args[0];
474
    }
475
476
    $result = array();
477
    if ($array_count > 1) {
478
        foreach ($args[0] as $key => $value) {
479
            $result[$key] = $value;
480
            for ($i = 1; $i !== $array_count; $i++) {
481
                if (!isset($args[$i][$key])) {
482
                    unset($result[$key]);
483
                    break;
484
                }
485
                $result[$key] += $args[$i][$key];
486
            }
487
        }
488
    }
489
    return $result;
490
}
491
492
/**
493
 * Unites found documents and sum up their scores
494
 *
495
 * based upon ft_resultCombine() function
496
 *
497
 * @param array $args An array of page arrays
498
 * @return array
499
 *
500
 * @author Kazutaka Miyasaka <[email protected]>
501
 */
502
function ft_resultUnite($args) {
503
    $array_count = count($args);
504
    if ($array_count === 1) {
505
        return $args[0];
506
    }
507
508
    $result = $args[0];
509
    for ($i = 1; $i !== $array_count; $i++) {
510
        foreach (array_keys($args[$i]) as $id) {
511
            $result[$id] += $args[$i][$id];
512
        }
513
    }
514
    return $result;
515
}
516
517
/**
518
 * Computes the difference of documents using page id for comparison
519
 *
520
 * nearly identical to PHP5's array_diff_key()
521
 *
522
 * @param array $args An array of page arrays
523
 * @return array
524
 *
525
 * @author Kazutaka Miyasaka <[email protected]>
526
 */
527
function ft_resultComplement($args) {
528
    $array_count = count($args);
529
    if ($array_count === 1) {
530
        return $args[0];
531
    }
532
533
    $result = $args[0];
534
    foreach (array_keys($result) as $id) {
535
        for ($i = 1; $i !== $array_count; $i++) {
536
            if (isset($args[$i][$id])) unset($result[$id]);
537
        }
538
    }
539
    return $result;
540
}
541
542
/**
543
 * Parses a search query and builds an array of search formulas
544
 *
545
 * @author Andreas Gohr <[email protected]>
546
 * @author Kazutaka Miyasaka <[email protected]>
547
 *
548
 * @param Doku_Indexer $Indexer
549
 * @param string $query search query
550
 * @return array of search formulas
551
 */
552
function ft_queryParser($Indexer, $query){
553
    /**
554
     * parse a search query and transform it into intermediate representation
555
     *
556
     * in a search query, you can use the following expressions:
557
     *
558
     *   words:
559
     *     include
560
     *     -exclude
561
     *   phrases:
562
     *     "phrase to be included"
563
     *     -"phrase you want to exclude"
564
     *   namespaces:
565
     *     @include:namespace (or ns:include:namespace)
566
     *     ^exclude:namespace (or -ns:exclude:namespace)
567
     *   groups:
568
     *     ()
569
     *     -()
570
     *   operators:
571
     *     and ('and' is the default operator: you can always omit this)
572
     *     or  (or pipe symbol '|', lower precedence than 'and')
573
     *
574
     * e.g. a query [ aa "bb cc" @dd:ee ] means "search pages which contain
575
     *      a word 'aa', a phrase 'bb cc' and are within a namespace 'dd:ee'".
576
     *      this query is equivalent to [ -(-aa or -"bb cc" or -ns:dd:ee) ]
577
     *      as long as you don't mind hit counts.
578
     *
579
     * intermediate representation consists of the following parts:
580
     *
581
     *   ( )           - group
582
     *   AND           - logical and
583
     *   OR            - logical or
584
     *   NOT           - logical not
585
     *   W+:, W-:, W_: - word      (underscore: no need to highlight)
586
     *   P+:, P-:      - phrase    (minus sign: logically in NOT group)
587
     *   N+:, N-:      - namespace
588
     */
589
    $parsed_query = '';
590
    $parens_level = 0;
591
    $terms = preg_split('/(-?".*?")/u', utf8_strtolower($query), -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
592
593
    foreach ($terms as $term) {
594
        $parsed = '';
595
        if (preg_match('/^(-?)"(.+)"$/u', $term, $matches)) {
596
            // phrase-include and phrase-exclude
597
            $not = $matches[1] ? 'NOT' : '';
598
            $parsed = $not.ft_termParser($Indexer, $matches[2], false, true);
599
        } else {
600
            // fix incomplete phrase
601
            $term = str_replace('"', ' ', $term);
602
603
            // fix parentheses
604
            $term = str_replace(')'  , ' ) ', $term);
605
            $term = str_replace('('  , ' ( ', $term);
606
            $term = str_replace('- (', ' -(', $term);
607
608
            // treat pipe symbols as 'OR' operators
609
            $term = str_replace('|', ' or ', $term);
610
611
            // treat ideographic spaces (U+3000) as search term separators
612
            // FIXME: some more separators?
613
            $term = preg_replace('/[ \x{3000}]+/u', ' ',  $term);
614
            $term = trim($term);
615
            if ($term === '') continue;
616
617
            $tokens = explode(' ', $term);
618
            foreach ($tokens as $token) {
619
                if ($token === '(') {
620
                    // parenthesis-include-open
621
                    $parsed .= '(';
622
                    ++$parens_level;
623
                } elseif ($token === '-(') {
624
                    // parenthesis-exclude-open
625
                    $parsed .= 'NOT(';
626
                    ++$parens_level;
627
                } elseif ($token === ')') {
628
                    // parenthesis-any-close
629
                    if ($parens_level === 0) continue;
630
                    $parsed .= ')';
631
                    $parens_level--;
632
                } elseif ($token === 'and') {
633
                    // logical-and (do nothing)
634
                } elseif ($token === 'or') {
635
                    // logical-or
636
                    $parsed .= 'OR';
637
                } elseif (preg_match('/^(?:\^|-ns:)(.+)$/u', $token, $matches)) {
638
                    // namespace-exclude
639
                    $parsed .= 'NOT(N+:'.$matches[1].')';
640
                } elseif (preg_match('/^(?:@|ns:)(.+)$/u', $token, $matches)) {
641
                    // namespace-include
642
                    $parsed .= '(N+:'.$matches[1].')';
643
                } elseif (preg_match('/^-(.+)$/', $token, $matches)) {
644
                    // word-exclude
645
                    $parsed .= 'NOT('.ft_termParser($Indexer, $matches[1]).')';
646
                } else {
647
                    // word-include
648
                    $parsed .= ft_termParser($Indexer, $token);
649
                }
650
            }
651
        }
652
        $parsed_query .= $parsed;
653
    }
654
655
    // cleanup (very sensitive)
656
    $parsed_query .= str_repeat(')', $parens_level);
657
    do {
658
        $parsed_query_old = $parsed_query;
659
        $parsed_query = preg_replace('/(NOT)?\(\)/u', '', $parsed_query);
660
    } while ($parsed_query !== $parsed_query_old);
661
    $parsed_query = preg_replace('/(NOT|OR)+\)/u', ')'      , $parsed_query);
662
    $parsed_query = preg_replace('/(OR)+/u'      , 'OR'     , $parsed_query);
663
    $parsed_query = preg_replace('/\(OR/u'       , '('      , $parsed_query);
664
    $parsed_query = preg_replace('/^OR|OR$/u'    , ''       , $parsed_query);
665
    $parsed_query = preg_replace('/\)(NOT)?\(/u' , ')AND$1(', $parsed_query);
666
667
    // adjustment: make highlightings right
668
    $parens_level     = 0;
669
    $notgrp_levels    = array();
670
    $parsed_query_new = '';
671
    $tokens = preg_split('/(NOT\(|[()])/u', $parsed_query, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
672
    foreach ($tokens as $token) {
673
        if ($token === 'NOT(') {
674
            $notgrp_levels[] = ++$parens_level;
675
        } elseif ($token === '(') {
676
            ++$parens_level;
677
        } elseif ($token === ')') {
678
            if ($parens_level-- === end($notgrp_levels)) array_pop($notgrp_levels);
679
        } elseif (count($notgrp_levels) % 2 === 1) {
680
            // turn highlight-flag off if terms are logically in "NOT" group
681
            $token = preg_replace('/([WPN])\+\:/u', '$1-:', $token);
682
        }
683
        $parsed_query_new .= $token;
684
    }
685
    $parsed_query = $parsed_query_new;
686
687
    /**
688
     * convert infix notation string into postfix (Reverse Polish notation) array
689
     * by Shunting-yard algorithm
690
     *
691
     * see: http://en.wikipedia.org/wiki/Reverse_Polish_notation
692
     * see: http://en.wikipedia.org/wiki/Shunting-yard_algorithm
693
     */
694
    $parsed_ary     = array();
695
    $ope_stack      = array();
696
    $ope_precedence = array(')' => 1, 'OR' => 2, 'AND' => 3, 'NOT' => 4, '(' => 5);
697
    $ope_regex      = '/([()]|OR|AND|NOT)/u';
698
699
    $tokens = preg_split($ope_regex, $parsed_query, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
700
    foreach ($tokens as $token) {
701
        if (preg_match($ope_regex, $token)) {
702
            // operator
703
            $last_ope = end($ope_stack);
704
            while ($last_ope !== false && $ope_precedence[$token] <= $ope_precedence[$last_ope] && $last_ope != '(') {
705
                $parsed_ary[] = array_pop($ope_stack);
706
                $last_ope = end($ope_stack);
707
            }
708
            if ($token == ')') {
709
                array_pop($ope_stack); // this array_pop always deletes '('
710
            } else {
711
                $ope_stack[] = $token;
712
            }
713
        } else {
714
            // operand
715
            $token_decoded = str_replace(array('OP', 'CP'), array('(', ')'), $token);
716
            $parsed_ary[] = $token_decoded;
717
        }
718
    }
719
    $parsed_ary = array_values(array_merge($parsed_ary, array_reverse($ope_stack)));
720
721
    // cleanup: each double "NOT" in RPN array actually does nothing
722
    $parsed_ary_count = count($parsed_ary);
723
    for ($i = 1; $i < $parsed_ary_count; ++$i) {
724
        if ($parsed_ary[$i] === 'NOT' && $parsed_ary[$i - 1] === 'NOT') {
725
            unset($parsed_ary[$i], $parsed_ary[$i - 1]);
726
        }
727
    }
728
    $parsed_ary = array_values($parsed_ary);
729
730
    // build return value
731
    $q = array();
732
    $q['query']      = $query;
733
    $q['parsed_str'] = $parsed_query;
734
    $q['parsed_ary'] = $parsed_ary;
735
736
    foreach ($q['parsed_ary'] as $token) {
737
        if ($token[2] !== ':') continue;
738
        $body = substr($token, 3);
739
740
        switch (substr($token, 0, 3)) {
741
            case 'N+:':
742
                     $q['ns'][]        = $body; // for backward compatibility
743
                     break;
744
            case 'N-:':
745
                     $q['notns'][]     = $body; // for backward compatibility
746
                     break;
747
            case 'W_:':
748
                     $q['words'][]     = $body;
749
                     break;
750
            case 'W-:':
751
                     $q['words'][]     = $body;
752
                     $q['not'][]       = $body; // for backward compatibility
753
                     break;
754
            case 'W+:':
755
                     $q['words'][]     = $body;
756
                     $q['highlight'][] = $body;
757
                     $q['and'][]       = $body; // for backward compatibility
758
                     break;
759
            case 'P-:':
760
                     $q['phrases'][]   = $body;
761
                     break;
762
            case 'P+:':
763
                     $q['phrases'][]   = $body;
764
                     $q['highlight'][] = $body;
765
                     break;
766
        }
767
    }
768
    foreach (array('words', 'phrases', 'highlight', 'ns', 'notns', 'and', 'not') as $key) {
769
        $q[$key] = empty($q[$key]) ? array() : array_values(array_unique($q[$key]));
770
    }
771
772
    return $q;
773
}
774
775
/**
776
 * Transforms given search term into intermediate representation
777
 *
778
 * This function is used in ft_queryParser() and not for general purpose use.
779
 *
780
 * @author Kazutaka Miyasaka <[email protected]>
781
 *
782
 * @param Doku_Indexer $Indexer
783
 * @param string       $term
784
 * @param bool         $consider_asian
785
 * @param bool         $phrase_mode
786
 * @return string
787
 */
788
function ft_termParser($Indexer, $term, $consider_asian = true, $phrase_mode = false) {
789
    $parsed = '';
790
    if ($consider_asian) {
791
        // successive asian characters need to be searched as a phrase
792
        $words = preg_split('/('.IDX_ASIAN.'+)/u', $term, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
793
        foreach ($words as $word) {
794
            $phrase_mode = $phrase_mode ? true : preg_match('/'.IDX_ASIAN.'/u', $word);
795
            $parsed .= ft_termParser($Indexer, $word, false, $phrase_mode);
Bug introduced by
It seems that $phrase_mode, defined by $phrase_mode ? true : pr...DX_ASIAN . '/u', $word) on line 794, can also be of type integer; however, ft_termParser() only seems to accept boolean. Maybe add an additional type check?

If a method or function can return values of several different types, and you are not sure that only one of them can occur in this context, we recommend adding an additional type check:

/**
 * @return array|string
 */
function returnsDifferentValues($x) {
    if ($x) {
        return 'foo';
    }

    return array();
}

$x = returnsDifferentValues($y);
if (is_array($x)) {
    // $x is an array.
}

If this is a common case that PHP Analyzer should handle natively, please let us know by opening an issue.
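
For the $phrase_mode case specifically, one possible fix is to turn the preg_match() result into a real boolean before passing it on. This is a sketch under stated assumptions: SKETCH_ASIAN is a hypothetical, much narrower stand-in for DokuWiki's IDX_ASIAN constant, and sketch_phrase_mode() is an illustrative name, not an existing function.

```php
<?php
// Hypothetical pattern standing in for IDX_ASIAN, which is defined elsewhere
// in DokuWiki; it only covers the hiragana and katakana blocks.
define('SKETCH_ASIAN', '[\x{3040}-\x{30FF}]');

/**
 * Returns a strict boolean, never the integer 0/1 that preg_match() yields.
 *
 * @param bool   $phrase_mode current phrase mode
 * @param string $word        word to inspect
 * @return bool
 */
function sketch_phrase_mode($phrase_mode, $word) {
    // preg_match() returns 1, 0 or false; comparing with 1 gives true/false
    return $phrase_mode ? true : (preg_match('/' . SKETCH_ASIAN . '/u', $word) === 1);
}
```

Comparing with `=== 1` also maps a preg_match() failure (false) to false instead of letting it propagate.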

796
        }
797
    } else {
798
        $term_noparen = str_replace(array('(', ')'), ' ', $term);
799
        $words = $Indexer->tokenizer($term_noparen, true);
800
801
        // W_: no need to highlight
802
        if (empty($words)) {
803
            $parsed = '()'; // important: do not remove
804
        } elseif ($words[0] === $term) {
805
            $parsed = '(W+:'.$words[0].')';
806
        } elseif ($phrase_mode) {
807
            $term_encoded = str_replace(array('(', ')'), array('OP', 'CP'), $term);
808
            $parsed = '((W_:'.implode(')(W_:', $words).')(P+:'.$term_encoded.'))';
809
        } else {
810
            $parsed = '((W+:'.implode(')(W+:', $words).'))';
811
        }
812
    }
813
    return $parsed;
814
}
815
816
//Setup VIM: ex: et ts=4 :
817
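
The infix-to-postfix step that the comments in ft_queryParser() describe (Shunting-yard, producing Reverse Polish notation) can be looked at in isolation, which is also one way to split the long method. The following is a minimal sketch, not DokuWiki's implementation: sketch_to_rpn() and its precedence table are assumptions for this example, the input is an already tokenised array, and operands are plain strings.

```php
<?php
/**
 * Hypothetical helper: convert infix boolean tokens to postfix (RPN).
 * Operators: OR < AND < NOT (NOT is a unary prefix operator).
 *
 * @param string[] $tokens e.g. array('a', 'AND', 'NOT', '(', 'b', 'OR', 'c', ')')
 * @return string[] tokens in postfix order
 */
function sketch_to_rpn(array $tokens) {
    $precedence = array('OR' => 1, 'AND' => 2, 'NOT' => 3);
    $output = array();
    $ops    = array();

    foreach ($tokens as $token) {
        if ($token === '(') {
            $ops[] = $token;
        } elseif ($token === ')') {
            // pop operators back to the matching '('
            while ($ops && end($ops) !== '(') {
                $output[] = array_pop($ops);
            }
            array_pop($ops); // discard the '(' itself
        } elseif (isset($precedence[$token])) {
            // pop operators that bind tighter; equal precedence is popped too,
            // except for NOT, which as a prefix operator must stay stacked
            while ($ops && end($ops) !== '(' && (
                $precedence[end($ops)] > $precedence[$token] ||
                ($precedence[end($ops)] === $precedence[$token] && $token !== 'NOT')
            )) {
                $output[] = array_pop($ops);
            }
            $ops[] = $token;
        } else {
            // operand
            $output[] = $token;
        }
    }

    // flush the remaining operators
    while ($ops) {
        $output[] = array_pop($ops);
    }
    return $output;
}
```

The resulting array has the same shape that _ft_pageSearch() consumes: operands push a result set onto a stack, and AND, OR and NOT pop their arguments from it.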