some downstream applications need to know which sub-commands
may alter the index. this new function allows them to be
identified, while not prescribing any particular behavior.
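a rough sketch of what such a helper could look like (the function
name and the command names below are made up for illustration, not
the actual API):

    // canMutateIndex reports whether the named sub-command may alter
    // the index. the command names listed here are placeholders only.
    func canMutateIndex(name string) bool {
        switch name {
        case "index", "create", "delete", "bulk":
            return true
        }
        return false
    }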
we now use a multi-phrase query in all cases. internally it's
optimized to be the same as a regular phrase query anyway, and we
simply map all the tokens in the stream into a multi-phrase query
with the appropriate structure.
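roughly, the mapping works like the sketch below; the token type and
the terms layout here are simplified stand-ins for the real token
stream and query structures:

    // token is a simplified stand-in for an analyzed token.
    type token struct {
        Term     string
        Position int
    }

    // buildMultiPhraseTerms groups analyzed tokens by position, so
    // tokens that share a position (e.g. synonyms) end up in the same
    // slot of the multi-phrase structure. positions are assumed to
    // start at 1.
    func buildMultiPhraseTerms(stream []token) [][]string {
        byPos := map[int][]string{}
        maxPos := 0
        for _, tok := range stream {
            byPos[tok.Position] = append(byPos[tok.Position], tok.Term)
            if tok.Position > maxPos {
                maxPos = tok.Position
            }
        }
        terms := make([][]string, 0, maxPos)
        for pos := 1; pos <= maxPos; pos++ {
            terms = append(terms, byPos[pos])
        }
        return terms
    }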
when parsing json and we encounter the key "terms", we first try to
parse it as a traditional phrase query; if that fails, we then try
parsing it as a multi-phrase query.
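a simplified sketch of that fallback, assuming the two shapes are a
flat list of terms versus a list of term lists (the helper name is
made up):

    import "encoding/json"

    // unmarshalTerms first tries the traditional phrase form
    // ([]string); if that fails, it tries the multi-phrase form
    // ([][]string), normalizing both to the multi-phrase shape.
    func unmarshalTerms(data []byte) ([][]string, error) {
        var single []string
        if err := json.Unmarshal(data, &single); err == nil {
            terms := make([][]string, len(single))
            for i, t := range single {
                terms[i] = []string{t}
            }
            return terms, nil
        }
        var multi [][]string
        if err := json.Unmarshal(data, &multi); err != nil {
            return nil, err
        }
        return multi, nil
    }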
the logic of how a phrase search works should be an internal
detail of the phrase searcher. further, these changes will
allow proper scoring of phrase matches, which requires access to the
underlying searcher objects that were hidden in the previous
approach.
this originated from a misunderstanding of mine going back
several years. the values need not be float64 just because
we plan to serialize them as json.
there are still larger questions about what the right type should
be, and where any conversions should go, but this commit simply
attempts to address the most egregious problems.
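as a small illustration of the point, encoding/json serializes
integer types directly; float64 only becomes unavoidable when
unmarshaling a number into an interface{}:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    func main() {
        // both produce the same JSON text: {"count":42}
        a, _ := json.Marshal(map[string]float64{"count": 42})
        b, _ := json.Marshal(map[string]uint64{"count": 42})
        fmt.Println(string(a), string(b))
    }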
at the core, the Next() method moves another searcher forward
and checks each hit to see if it also satisfies the phrase
constraints. the current implementation has 4 nested for loops.
these nested loops make the code harder to read (indentation) and
harder to reason about (complexity).
this refactor does not remove any loops; it simply moves some of the
inner loops into separate methods so that one can more easily reason
about the parts separately.
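the spirit of the change looks like the sketch below; the types and
function names are a simplified stand-in, not the actual searcher
code:

    // candidate is a stand-in for a candidate document: for every term
    // in the phrase it records the positions at which that term occurs.
    type candidate struct {
        termPositions [][]int
    }

    // nextMatch advances through candidates and returns the first one
    // satisfying the phrase constraint; the constraint check itself now
    // lives in its own function instead of being nested inline.
    func nextMatch(candidates []candidate) *candidate {
        for i := range candidates {
            if satisfiesPhrase(candidates[i].termPositions) {
                return &candidates[i]
            }
        }
        return nil
    }

    // satisfiesPhrase holds what used to be the inner loops: it looks
    // for a starting position of the first term such that each later
    // term occurs at the next consecutive position.
    func satisfiesPhrase(termPositions [][]int) bool {
        if len(termPositions) == 0 {
            return false
        }
        for _, start := range termPositions[0] {
            ok := true
            for i := 1; i < len(termPositions); i++ {
                if !containsPos(termPositions[i], start+i) {
                    ok = false
                    break
                }
            }
            if ok {
                return true
            }
        }
        return false
    }

    func containsPos(positions []int, p int) bool {
        for _, v := range positions {
            if v == p {
                return true
            }
        }
        return false
    }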
Previously term entries were encoded pairwise (field/term), so
you'd have data like:
F1/T1 F1/T2 F1/T3 F2/T4 F3/T5
As you can see, even though field 1 has 3 terms, we repeat the F1
part in the encoded data. This is a bit wasteful.
In the new format we encode it as a list of terms for each field:
F1/T1,T2,T3 F2/T4 F3/T5
When fields have multiple terms, this saves space. Unit tests
confirm there is no additional waste even in the case where a field
has only a single value.
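A rough sketch of the difference in shape (the types and helper below
are simplified stand-ins for illustration; the real code encodes
these into the index's row format):

    // old shape: one (field, term) pair per entry, repeating the field
    type termEntry struct {
        Field uint16
        Term  string
    }

    // new shape: one entry per field, carrying all of that field's terms
    type fieldTerms struct {
        Field uint16
        Terms []string
    }

    // groupByField converts the pairwise form into the per-field form,
    // so a field with N terms stores its field number once instead of
    // N times. entries are assumed to be sorted by field, as in the
    // example above.
    func groupByField(entries []termEntry) []fieldTerms {
        var out []fieldTerms
        for _, e := range entries {
            if n := len(out); n > 0 && out[n-1].Field == e.Field {
                out[n-1].Terms = append(out[n-1].Terms, e.Term)
                continue
            }
            out = append(out, fieldTerms{Field: e.Field, Terms: []string{e.Term}})
        }
        return out
    }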
Here are the results of an indexing test case (beer-search):
$ benchcmp indexing-before.txt indexing-after.txt
benchmark              old ns/op      new ns/op      delta
BenchmarkIndexing-4    11275835988    10745514321    -4.70%

benchmark              old allocs     new allocs     delta
BenchmarkIndexing-4    25230685       22480494       -10.90%

benchmark              old bytes      new bytes      delta
BenchmarkIndexing-4    4802816224     4741641856     -1.27%
And here are the results of a MatchAll search building a facet
on the "abv" field:
$ benchcmp facet-before.txt facet-after.txt
benchmark            old ns/op    new ns/op    delta
BenchmarkFacets-4    439762100    228064575    -48.14%

benchmark            old allocs   new allocs   delta
BenchmarkFacets-4    9460208      3723286      -60.64%

benchmark            old bytes    new bytes    delta
BenchmarkFacets-4    260784261    151746483    -41.81%
Although we expect the index to be smaller in many cases, the
beer-search index is about the same size here; this may be due to the
underlying storage (boltdb).
Finally, the index version was bumped from 5 to 7, since smolder
also used version 6, which could lead to some confusion.
While researching an observed performance issue with wildcard
queries, we found that the LiteralPrefix() method on the
regexp.Regexp struct does not always behave as expected. In
particular, when the pattern starts with ^ and involves some
backtracking, LiteralPrefix() seems to always return the empty
string.
This is a problem because we rely on having a helpful prefix to
reduce the number of terms in the term dictionary that need to be
visited; with an empty prefix, every term has to be checked.
This change makes the searcher enforce the start/end anchoring on
the term directly, by using FindStringIndex() instead of Match(). We
also modified WildcardQuery and RegexpQuery to no longer include the
^ and $ modifiers.
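A small sketch of that enforcement, assuming the compiled pattern no
longer carries the ^/$ anchors (the helper name is illustrative):

    import "regexp"

    // termMatches reports whether re matches the entire term, enforcing
    // the start/end anchoring in code rather than in the pattern itself.
    func termMatches(re *regexp.Regexp, term string) bool {
        loc := re.FindStringIndex(term)
        return loc != nil && loc[0] == 0 && loc[1] == len(term)
    }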
Documentation was also updated to instruct users that they should
not include the ^ and $ modifiers in their patterns.