1289 Commits

Author SHA1 Message Date
Marty Schoch
2ba915b929 add additional parens to clarify logic 2017-02-10 20:22:32 -05:00
Marty Schoch
56a79528c3 update match_phrase query to handle multiple tokens in same pos
we now use a multiphrase query in all cases
internally it's optimized to be the same as regular phrase query
anyway, and we simply map all the tokens in the stream into
a multi-phrase query with the appropriate structure
2017-02-10 17:12:13 -05:00
Marty Schoch
a5d1d7974c add query support for multi-phrase
when parsing json, when we encounter the key "terms", we first
try to parse as traditional phrase query, then if that fails,
we also try parsing it as multi-phrase
2017-02-10 16:46:38 -05:00
Marty Schoch
c6085d8cdc address initial code review comments 2017-02-10 15:22:14 -05:00
Marty Schoch
09d00829db phrase searcher now supports multi-phrase
backwards compatibility maintained through previous constructor
very basic test added (not sufficient)
2017-02-10 15:17:50 -05:00
Marty Schoch
9c8e1e82de add initial low-level support for multi-phrase
this adds basic multi-phrase support,
a shim to keep the top-level working
and unit tests for new multi-phrase cases
2017-02-10 13:16:05 -05:00
Marty Schoch
4e38c49287 move phrase search logic into phrase searcher
the logic of how a phrase search works should be an internal
detail of the phrase searcher.  further, these changes will
allow proper scoring of phrase matches, which require access
to the underlying searcher objects, which were hidden in the
previous approach.
2017-02-10 12:05:01 -05:00
Marty Schoch
97a428f5b0 Merge pull request #531 from mschoch/losefloat
remove use of float64 to represent int things
2017-02-09 20:25:20 -05:00
Marty Schoch
8096d9fb90 remove use of float64 to represent int things
this originated from a misunderstanding of mine going back
several years.  the values need not be float64 just because
we plan to serialize them as json.

there are still larger questions about what the right type should
be, and where should any conversions go.  but, this commit
simply attempts to address the most egregious problems
2017-02-09 20:15:59 -05:00
Marty Schoch
0c87b7bff1 Merge pull request #527 from mschoch/recursive_phrase
refactor phrase search to be recursive
2017-02-09 18:10:33 -05:00
Marty Schoch
87df597b21 add 'used by' badge 2017-02-09 16:40:37 -05:00
Marty Schoch
232fc80dad add support for phrase slop to internals of phrase searcher
phrase slop is not yet supported on the frontend
added lots of tests around slop
2017-02-09 15:59:51 -05:00
Marty Schoch
4da7756f67 Merge pull request #530 from steveyen/master
optimizations around search / DocumentMatchPool
2017-02-09 15:44:04 -05:00
Steve Yen
0b70a1bcb8 use inlined prealloc'ed termFreqRow in upsidedown termFieldReader 2017-02-08 18:23:13 -08:00
Steve Yen
31fecc3663 avoid row alloc's in upsidedown termFieldReader constructor 2017-02-08 18:14:30 -08:00
Steve Yen
470516d973 DocumentMatchPool hits allocator outside of loop 2017-02-06 14:26:59 -08:00
Marty Schoch
50c43bfef6 Merge pull request #528 from bfontaine/syntax
README: Use go syntax highlighting
2017-02-04 09:52:58 -05:00
Baptiste Fontaine
7e2ce1cf9e README: Use go syntax highlighting 2017-02-04 12:20:56 +01:00
Marty Schoch
f82638c117 refactor phrase search to be recursive
a more correct solution that will enable us to extend in two
important ways:

1) support slop
2) support multi-phrase
2017-02-03 16:05:21 -05:00
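
To illustrate why a recursive formulation extends naturally to multi-phrase, here is a minimal sketch. It is not the repository's searcher: the `positions` type and both functions are hypothetical, the document is reduced to a map from term to token positions, and slop is not modeled (each slot must match at exactly the next position).

```go
package main

import "fmt"

// positions maps a term to the token positions where it occurs in one
// field of one document. Illustrative type, not taken from the repository.
type positions map[string][]int

// matchFrom reports whether the remaining phrase slots can be satisfied
// starting at position pos. Each slot lists the alternative terms accepted
// at that position; a single-element slot is an ordinary phrase term.
func matchFrom(slots [][]string, pos int, doc positions) bool {
	if len(slots) == 0 {
		return true // every slot was satisfied
	}
	for _, term := range slots[0] {
		for _, p := range doc[term] {
			if p == pos && matchFrom(slots[1:], pos+1, doc) {
				return true
			}
		}
	}
	return false
}

// matchPhrase tries every occurrence of the first slot's terms as a start.
func matchPhrase(slots [][]string, doc positions) bool {
	if len(slots) == 0 {
		return false
	}
	for _, term := range slots[0] {
		for _, p := range doc[term] {
			if matchFrom(slots[1:], p+1, doc) {
				return true
			}
		}
	}
	return false
}

func main() {
	doc := positions{"quick": {1}, "brown": {2}, "fox": {3}}
	fmt.Println(matchPhrase([][]string{{"quick", "fast"}, {"brown"}, {"fox"}}, doc))
	fmt.Println(matchPhrase([][]string{{"quick"}, {"fox"}}, doc))
}
```

Supporting slop in a sketch like this would mean relaxing the `p == pos` check to allow a bounded distance, which is one reason the recursive shape is convenient.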
Marty Schoch
101ecfe972 Merge pull request #526 from sreekanth-cb/improved_facet_range_validations
MB-20793: Validation for min/max/start/end params for numeric/date ra…
2017-02-03 14:34:19 -05:00
Sreekanth Sivasankaran
029d4c73d9 clean up of unit test. 2017-02-02 23:33:26 +05:30
Sreekanth Sivasankaran
c1d28bb2fc Moving the tests to the table driven test pattern. 2017-02-02 23:31:00 +05:30
Sreekanth Sivasankaran
2a09857657 MB-20793 : Validation for min/max/start/end params for numeric/date range facets
Updated the comments in UTs.
2017-02-02 13:10:09 +05:30
Sreekanth Sivasankaran
c6f96f081d MB-20793 : Validation for min/max/start/end params for numeric/date range facets
Added a few more cases in the unit tests.
2017-02-02 13:05:49 +05:30
Sreekanth Sivasankaran
78686c3fa3 MB-20793 : Validation for min/max/start/end params for numeric/date range facets
Corrected the validation and updated the unit tests.
2017-02-02 12:15:48 +05:30
Sreekanth Sivasankaran
f514ac7867 MB-20793: Validation for min/max/start/end params for numeric/date range facets
Improved the validations for date and numeric range queries for facets
2017-02-01 14:48:47 +05:30
Marty Schoch
12a7257b5f remove duplicate code suggested by review from @steveyen 2017-01-31 15:12:06 -05:00
Marty Schoch
a3ee71ddbb Merge pull request #525 from mschoch/refactor-phrase
refactor phrase search into separate methods
2017-01-31 15:06:42 -05:00
Marty Schoch
7fd8aeb50a refactor phrase search into separate methods
at the core, the Next() method moves another searcher forward
and checks each hit to see if it also satisfies the phrase
constraints.  the current implementation has 4 nested for loops.
these nested loops make it harder to read (indentation) and harder
to reason about (complexity).

this refactor does not remove any loops, it simply moves some
of the inner loops into separate methods so that one can
more easily reason about the parts separately.
2017-01-31 13:32:46 -05:00
Marty Schoch
d48d2b6c68 Merge pull request #524 from mschoch/fix523
fix edge ngram output when side=Back and input token len=max
2017-01-31 12:03:26 -05:00
Marty Schoch
782dbecfe1 fix edge ngram output when side=Back and input token len=max
edge condition was incorrectly checked
fixes #523
2017-01-30 20:29:20 -05:00
Marty Schoch
d40cfb0870 Merge pull request #521 from mschoch/improved-backindex-row
INDEX FORMAT CHANGE: change back index row value
2017-01-24 16:07:47 -05:00
Marty Schoch
606fd6344b INDEX FORMAT CHANGE: change back index row value
Previously term entries were encoded pairwise (field/term), so
you'd have data like:

F1/T1 F1/T2 F1/T3 F2/T4 F3/T5

As you can see, even though field 1 has 3 terms, we repeat the F1
part in the encoded data.  This is a bit wasteful.

In the new format we encode it as a list of terms for each field:

F1/T1,T2,T3 F2/T4 F3/T5

When fields have multiple terms, this saves space.  In unit
tests there is no additional waste even in the case that a field
has only a single value.

Here are the results of an indexing test case (beer-search):

$ benchcmp indexing-before.txt indexing-after.txt
benchmark               old ns/op       new ns/op       delta
BenchmarkIndexing-4     11275835988     10745514321     -4.70%

benchmark               old allocs     new allocs     delta
BenchmarkIndexing-4     25230685       22480494       -10.90%

benchmark               old bytes      new bytes      delta
BenchmarkIndexing-4     4802816224     4741641856     -1.27%

And here are the results of a MatchAll search building a facet
on the "abv" field:

$ benchcmp facet-before.txt facet-after.txt
benchmark             old ns/op     new ns/op     delta
BenchmarkFacets-4     439762100     228064575     -48.14%

benchmark             old allocs     new allocs     delta
BenchmarkFacets-4     9460208        3723286        -60.64%

benchmark             old bytes     new bytes     delta
BenchmarkFacets-4     260784261     151746483     -41.81%

Although we expect the index to be smaller in many cases, the
beer-search index is about the same size.  However, this may be
due to the underlying storage (boltdb).

Finally, the index version was bumped from 5 to 7, since smolder
also used version 6, which could lead to some confusion.
2017-01-24 15:39:38 -05:00
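
The difference between the two layouts in 606fd6344b can be sketched in a few lines. This is a text-only illustration: the `entry` type and both functions are hypothetical, and the real back index row is a binary encoding rather than the strings produced here.

```go
package main

import (
	"fmt"
	"strings"
)

// entry is one field/term pair. Illustrative type, not from the repository.
type entry struct{ field, term string }

// encodePairwise repeats the field for every term (the old layout).
func encodePairwise(es []entry) string {
	parts := make([]string, 0, len(es))
	for _, e := range es {
		parts = append(parts, e.field+"/"+e.term)
	}
	return strings.Join(parts, " ")
}

// encodeGrouped writes each field once, followed by its terms (the new
// layout). Fields are kept in first-seen order.
func encodeGrouped(es []entry) string {
	var order []string
	byField := make(map[string][]string)
	for _, e := range es {
		if _, seen := byField[e.field]; !seen {
			order = append(order, e.field)
		}
		byField[e.field] = append(byField[e.field], e.term)
	}
	parts := make([]string, 0, len(order))
	for _, f := range order {
		parts = append(parts, f+"/"+strings.Join(byField[f], ","))
	}
	return strings.Join(parts, " ")
}

func main() {
	es := []entry{{"F1", "T1"}, {"F1", "T2"}, {"F1", "T3"}, {"F2", "T4"}, {"F3", "T5"}}
	fmt.Println(encodePairwise(es)) // F1/T1 F1/T2 F1/T3 F2/T4 F3/T5
	fmt.Println(encodeGrouped(es))  // F1/T1,T2,T3 F2/T4 F3/T5
}
```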
Marty Schoch
f94a790156 Merge pull request #520 from mschoch/faster_regexp
improve performance of regular expression and wildcard queries
2017-01-18 16:31:49 -05:00
Marty Schoch
b55c9043b9 improve performance of regular expression and wildcard queries
While researching an observed performance issue with wildcard
queries, it was observed that the LiteralPrefix() method on
the regexp.Regexp struct did not always behave as expected.

In particular, when the pattern starts with ^, AND involves
some backtracking, the LiteralPrefix() seems to always be the
empty string.

This matters because we rely on having a helpful prefix to
reduce the number of terms in the term dictionary that need to
be visited.

This change now makes the searcher enforce start/end on the term
directly, by using FindStringIndex() instead of Match().
Next, we also modified WildcardQuery and RegexpQuery to no
longer include the ^ and $ modifiers.

Documentation was also updated to instruct users that they should
not include the ^ and $ modifiers in their patterns.
2017-01-18 16:22:16 -05:00
Marty Schoch
72731336bf Merge pull request #517 from minagawa-sho/fix-confusing-variable-name
fix the confusing variable name
2017-01-14 09:00:34 -05:00
Sho Minagawa
5537688394 fix the confusing variable name 2017-01-14 20:26:08 +09:00
Marty Schoch
269cc302e3 Merge pull request #514 from steveyen/master
more upsidedown optimizations
2017-01-10 09:15:04 -05:00
Steve Yen
5927224e15 optimize mergeOldAndNew for case of first time a doc is seen 2017-01-09 22:48:58 -08:00
Steve Yen
790f2e3e32 optimize by alloc'ing arrays of TermFrequencyRow/TermVector 2017-01-09 22:42:00 -08:00
Marty Schoch
8cd6040b63 Merge pull request #512 from steveyen/master
API change: optional SearchRequest.IncludeLocations flag
2017-01-09 14:19:17 -05:00
Marty Schoch
ae219d6397 Merge pull request #489 from Shugyousha/refactorphrasesearch
Refactor PhraseSearcher
2017-01-09 14:13:22 -05:00
Steve Yen
8f4726ab10 use struct{}{} idiom instead of additional mark var 2017-01-09 10:17:26 -08:00
Marty Schoch
d081ed712a Merge pull request #513 from mosuka/master
renamed detect_lang to detectlang
2017-01-09 09:17:57 -05:00
Minoru Osuka
63c0d9a4d2 renamed detect_lang to detectlang
renamed detect_lang to detectlang.
2017-01-09 16:51:48 +09:00
Steve Yen
302cac72c4 optimize mergeOldAndNew when non-update case 2017-01-08 17:59:49 -08:00
Steve Yen
931d133024 go fmt and go vet 2017-01-07 22:14:22 -08:00
Steve Yen
40780254ae optimize upsidedown mergeOldAndNew existing key maps
The optimization is to provide a better initial size to the map
constructor and to use a 0-byte-sized struct{} as the map values.
2017-01-07 22:05:55 -08:00
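
The two ideas in 40780254ae (a size hint for the map and zero-byte `struct{}` values) look roughly like this in isolation. `keySet` is a hypothetical helper, not the code in mergeOldAndNew.

```go
package main

import "fmt"

// keySet is a hypothetical helper. It builds a set of keys, passing
// len(keys) as a size hint to make and using struct{} as the value type
// so the values themselves occupy no space.
func keySet(keys []string) map[string]struct{} {
	set := make(map[string]struct{}, len(keys))
	for _, k := range keys {
		set[k] = struct{}{}
	}
	return set
}

func main() {
	existing := keySet([]string{"a", "b", "a"})
	_, ok := existing["b"]
	fmt.Println(len(existing), ok)
}
```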
Steve Yen
c2bafa2a51 optimize term vectors/locations via preallocated arrays
The change should hit the allocator less often when processing term
vectors/locations as it preallocates larger, contiguous arrays of
records upfront.
2017-01-07 12:34:06 -08:00
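
The preallocation pattern described in c2bafa2a51 can be sketched as follows. The `termVector` type and `allocBatch` function are illustrative stand-ins, not the repository's types. The point is that one backing array replaces one allocation per record.

```go
package main

import "fmt"

// termVector is an illustrative record type, not the repository's struct.
type termVector struct {
	field uint16
	pos   uint64
}

// allocBatch allocates a single contiguous backing array for n records and
// returns pointers into it, instead of calling new(termVector) n times.
func allocBatch(n int) []*termVector {
	backing := make([]termVector, n) // one allocation for all records
	out := make([]*termVector, n)
	for i := range backing {
		out[i] = &backing[i]
	}
	return out
}

func main() {
	vs := allocBatch(3)
	vs[1].pos = 42
	fmt.Println(len(vs), vs[1].pos)
}
```

A trade-off worth noting in a sketch like this: as long as any one of the returned pointers is still referenced, the whole backing array stays reachable.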
Steve Yen
8b140d84c4 minor optimization of upsidedown backIndexRowForDoc
This change might allow a smart enough golang compiler to perhaps
allocate a backIndexRow on the stack rather than the heap.
2017-01-07 11:49:42 -08:00