Previously, the PrefixIterator() for moss was implemented by comparing
the prefix bytes on every Next().
With this optimization, the next larger endKeyExclusive is computed at
the iterator's initialization, which allows us to avoid all those
prefix comparisons.
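A minimal sketch of how an exclusive end key can be derived from a prefix. The function name prefixEndExclusive is hypothetical and this is not moss's actual code; it only illustrates computing the bound once so that a plain range iterator can stand in for per-Next() prefix checks.

```go
package main

import "fmt"

// prefixEndExclusive returns the smallest key that sorts after every key
// starting with prefix. It returns nil when no such key exists (the prefix
// is empty or consists only of 0xff bytes), meaning "iterate to the end".
// Hypothetical helper for illustration only.
func prefixEndExclusive(prefix []byte) []byte {
	for i := len(prefix) - 1; i >= 0; i-- {
		if prefix[i] < 0xff {
			// Copy up to and including byte i, then bump that byte.
			end := make([]byte, i+1)
			copy(end, prefix[:i+1])
			end[i]++
			return end
		}
	}
	return nil
}

func main() {
	fmt.Printf("%q\n", prefixEndExclusive([]byte("ab")))
	fmt.Printf("%q\n", prefixEndExclusive([]byte("a\xff")))
	fmt.Println(prefixEndExclusive([]byte{0xff, 0xff}) == nil)
}
```

With the bound in hand, iteration over [prefix, end) needs only the range check the iterator already performs.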
This optimization changes the index.TermFieldReader.Next() interface
API, adding an optional, pre-allocated *TermFieldDoc parameter, which
can help prevent garbage creation.
This optimization changes the search.Search.Next() interface API,
adding an optional, pre-allocated *DocumentMatch parameter.
When it's non-nil, the TermSearcher and TermQueryScorer will use that
pre-allocated *DocumentMatch, instead of allocating a brand new
DocumentMatch instance.
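The optional pre-allocated parameter described in the two changes above can be sketched roughly as follows. The types termFieldDoc and reader are simplified, hypothetical stand-ins, not the real index.TermFieldDoc or TermFieldReader, and the real signatures may differ.

```go
package main

import "fmt"

// termFieldDoc is a simplified, hypothetical stand-in for index.TermFieldDoc.
type termFieldDoc struct {
	ID   string
	Freq uint64
}

// reader is a hypothetical reader over an in-memory list of doc IDs.
type reader struct {
	ids []string
	pos int
}

// Next fills preAlloced when it is non-nil, and allocates a fresh value
// otherwise. It returns nil once the reader is exhausted.
func (r *reader) Next(preAlloced *termFieldDoc) *termFieldDoc {
	if r.pos >= len(r.ids) {
		return nil
	}
	rv := preAlloced
	if rv == nil {
		rv = &termFieldDoc{}
	}
	rv.ID = r.ids[r.pos]
	rv.Freq = 1
	r.pos++
	return rv
}

func main() {
	r := &reader{ids: []string{"a", "b", "c"}}
	var scratch termFieldDoc // allocated once, reused for every call
	for tfd := r.Next(&scratch); tfd != nil; tfd = r.Next(&scratch) {
		fmt.Println(tfd.ID, tfd == &scratch)
	}
}
```

Callers that need to keep a result past the next call must copy it, since the same memory is overwritten each time.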
Before this change, upside down's reader would alloc a new
TermFrequencyRow on every Next(), which would be immediately
transformed into an index.TermFieldDoc{}. This change reuses a
pre-allocated TermFrequencyRow that's a field in the reader.
From some bleve-query perf profiling, term field vectors appeared to
be alloc'ed, which was unnecessary as term field vectors are disabled
in the bleve-blast/bleve-query tests.
these searchers incorrectly called Next() on their underlying
searcher, instead of Advance(). this can cause values to be
returned with an ID less than the one that was Advanced() to,
which violates the contract, and causes other incorrect behavior.
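a small sketch of the contract violation, using hypothetical types (sliceSearcher and wrapper are not bleve types). a wrapper whose Advance() delegates to Next() can hand back an ID below the requested one:

```go
package main

import (
	"fmt"
	"sort"
)

// sliceSearcher is a hypothetical searcher over IDs sorted ascending.
type sliceSearcher struct {
	ids []string
	pos int
}

// Next returns the next ID in order, or "" when exhausted.
func (s *sliceSearcher) Next() string {
	if s.pos >= len(s.ids) {
		return ""
	}
	id := s.ids[s.pos]
	s.pos++
	return id
}

// Advance skips to the first remaining ID >= target and returns it,
// or "" when no such ID is left.
func (s *sliceSearcher) Advance(target string) string {
	s.pos += sort.SearchStrings(s.ids[s.pos:], target)
	return s.Next()
}

// wrapper is a hypothetical searcher that delegates to an inner one.
type wrapper struct{ inner *sliceSearcher }

// BuggyAdvance ignores target and calls Next(), as in the bug described.
func (w *wrapper) BuggyAdvance(target string) string { return w.inner.Next() }

// Advance forwards the target to the inner searcher's Advance().
func (w *wrapper) Advance(target string) string { return w.inner.Advance(target) }

func main() {
	ids := []string{"a", "b", "c", "d"}
	buggy := &wrapper{inner: &sliceSearcher{ids: ids}}
	fixed := &wrapper{inner: &sliceSearcher{ids: ids}}
	fmt.Println(buggy.BuggyAdvance("c")) // an ID below the target
	fmt.Println(fixed.Advance("c"))
}
```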
fixes #342
the behavior has been defined in a way that is compatible with
encoding/json. this behavior is as follows:
- anonymous fields which are structs will have their struct fields get
field names as if they were directly in the parent struct.
- anonymous fields which are not structs, or which are interfaces
(which may or may not point to structs), will get field names that
correspond to the name of the type.
- the exception to the rules above is that you can always override
this behavior by using a JSON struct tag.
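for reference, this is how encoding/json itself names embedded fields. the sketch below only exercises the standard library, not bleve's mapping code, and the types Address, Score, Tagged and Person and the helper keysOf are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

type Address struct{ City string }

type Score int

type Tagged struct{ Note string }

type Person struct {
	Name    string
	Address        // embedded struct: City is promoted to the top level
	Score          // embedded non-struct: keyed by its type name, "Score"
	Tagged  `json:"extra"` // the tag overrides the default naming
}

// keysOf marshals v with encoding/json and returns the sorted
// top-level keys of the resulting JSON object.
func keysOf(v interface{}) ([]string, error) {
	b, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	var m map[string]json.RawMessage
	if err := json.Unmarshal(b, &m); err != nil {
		return nil, err
	}
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys, nil
}

func main() {
	p := Person{
		Name:    "ann",
		Address: Address{City: "Oslo"},
		Score:   7,
		Tagged:  Tagged{Note: "hi"},
	}
	keys, err := keysOf(p)
	if err != nil {
		panic(err)
	}
	fmt.Println(keys) // top-level keys: City, Name, Score, extra
}
```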
fixes #101
change cjk bigram analyzer to work with multi-rune terms
add cjk width filter, which replaces full unicode normalization
these changes make the cjk analyzer behave more like elasticsearch
they also remove the dependency on the whitespace analyzer,
which is now free to also behave more like lucene/es
fixes #33
fixes #378
this bug was introduced by:
f2aba116c4
theory of operation for this collector (top N, skip K)
- collect the highest scoring N+K results
- if K > 0, skip K and return the next N
internal details
- the top N+K are kept in a list
- the list is ordered from lowest scoring (first) to highest scoring (last)
- as a hit comes in, we find where this new hit would fit into this list
- if this caused the list to get too big, trim off the head (lowest scoring hit)
theory of the optimization
- we were not tracking the lowest score in the list
- so if the score was lower than the lowest score, we would add/remove it
- by keeping track of the lowest score in the list, we can avoid these ops
problem with the optimization
- the optimization worked by returning early
- by returning early there was a subtle change to documents which had the same score
- the reason is that which docs end up in the top N+K changed by returning early
- why was that? docs are coming in, in order by key ascending
- when finding the correct position to insert a hit into the list, we checked <, not <= the score
- this has the subtle effect that docs with the same score end up in reverse order
for example consider the following in progress list:
doc ids [ c a b ]
scores  [ 1 5 9 ]
if we now see doc d with score 5, we get:
doc ids [ c a d b ]
scores  [ 1 5 5 9 ]
While that appears in order (a, d) it is actually reverse order, because when we
produce the top N we start at the end.
theory of the fix
- previous pagination depended on later hits with the same score "bumping" earlier
hits with the same score off the bottom of the list
- however, if we change the logic to <= instead of <, now the list in the previous
example would look like:
doc ids [ c d a b ]
scores  [ 1 5 5 9 ]
- this small change means that earlier (lower id) hits with the same score now
rank higher, and thus we no longer depend on later hits bumping things down,
which means returning early is a valid thing to do
NOTE: this does depend on the hits coming back in order by ID. this is not
something strictly guaranteed, but it was the same assumption that allowed the
original behavior
This also has the side-effect that 2 hits with the same score come back in
ascending ID order, which is somehow more pleasing to me than reverse order.
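a minimal sketch of the insertion logic described above, assuming a plain slice kept lowest-score-first. the names hit, insert, collect and ids are hypothetical and this is not the collector's actual code; the orEqual flag switches between the old < comparison and the new <=.

```go
package main

import "fmt"

// hit is a hypothetical stand-in for a scored search result.
type hit struct {
	ID    string
	Score float64
}

// insert places h into list, which is ordered lowest score first.
// With orEqual false it stops at the first entry scoring strictly higher
// (the old < check); with orEqual true it also stops on ties (<=).
func insert(list []hit, h hit, orEqual bool) []hit {
	i := 0
	for ; i < len(list); i++ {
		if h.Score < list[i].Score || (orEqual && h.Score == list[i].Score) {
			break
		}
	}
	list = append(list, hit{})
	copy(list[i+1:], list[i:])
	list[i] = h
	return list
}

// collect keeps at most size hits using the <= rule. Because ties now sort
// below earlier hits, anything scoring <= the current lowest entry of a
// full list can be skipped without inserting it.
func collect(list []hit, h hit, size int) []hit {
	if len(list) >= size && h.Score <= list[0].Score {
		return list // early return: h could never survive the trim
	}
	list = insert(list, h, true)
	if len(list) > size {
		list = list[1:] // trim the head (lowest scoring hit)
	}
	return list
}

// ids joins the IDs of list in order, for easy printing.
func ids(list []hit) string {
	s := ""
	for _, h := range list {
		s += h.ID
	}
	return s
}

func main() {
	start := func() []hit { return []hit{{"c", 1}, {"a", 5}, {"b", 9}} }
	d := hit{"d", 5}
	fmt.Println(ids(insert(start(), d, false))) // old behavior: cadb
	fmt.Println(ids(insert(start(), d, true)))  // new behavior: cdab
}
```

since the top N is read from the end of the list, the second ordering yields a before d, i.e. ascending ID order among equal scores.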