On my dev laptop, the bleve-query benchmark of the query-string
"+text:afternoon +text:coffee" (which gets parsed into a conjunction of
disjunctions) had a throughput of 308qps before this change and 342qps
after it.
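For context, a minimal example of running this kind of query-string query through the public bleve API (the index path is hypothetical, and the import path follows the v1-era layout):

    package main

    import (
        "fmt"
        "log"

        "github.com/blevesearch/bleve"
    )

    func main() {
        // hypothetical pre-built index over documents with a "text" field
        index, err := bleve.Open("wiki.bleve")
        if err != nil {
            log.Fatal(err)
        }
        defer index.Close()

        // both terms are required ("+"), and as noted above this parses
        // into a conjunction of disjunctions
        q := bleve.NewQueryStringQuery("+text:afternoon +text:coffee")
        req := bleve.NewSearchRequest(q)
        res, err := index.Search(req)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(res.Total, "matches")
    }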
This change simplifies and removes the DisjunctionSearcher.currentID
tracking, and instead uses the matching/matchingIdxs slices to track the
required information.
At the core of the optimization, the previous code used two loop passes
to compare the internal IDs against the currentID field. This commit
instead uses a single pass that both compares the internal IDs and
maintains the matching/matchingIdxs slices.
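A self-contained sketch of that single-pass bookkeeping, using simplified stand-in types rather than bleve's actual DisjunctionSearcher internals (field and method names are illustrative):

    package main

    import (
        "bytes"
        "fmt"
    )

    // simplified stand-ins for the searcher's cursor state
    type documentMatch struct {
        internalID []byte
    }

    type disjunctionState struct {
        currs        []*documentMatch // current match from each child searcher
        matching     []*documentMatch // children sitting on the lowest internal ID
        matchingIdxs []int            // their positions within currs
    }

    // updateMatches rebuilds matching/matchingIdxs in one pass over currs,
    // tracking the lowest internal ID as it goes, instead of comparing each
    // curr against a separately maintained currentID in a second loop.
    func (s *disjunctionState) updateMatches() {
        s.matching = s.matching[:0]
        s.matchingIdxs = s.matchingIdxs[:0]
        var lowest []byte
        for i, curr := range s.currs {
            if curr == nil {
                continue
            }
            cmp := -1 // the first non-nil curr becomes the new lowest
            if lowest != nil {
                cmp = bytes.Compare(curr.internalID, lowest)
            }
            if cmp < 0 {
                // found a smaller internal ID: restart the matching set
                lowest = curr.internalID
                s.matching = s.matching[:0]
                s.matchingIdxs = s.matchingIdxs[:0]
            }
            if cmp <= 0 {
                s.matching = append(s.matching, curr)
                s.matchingIdxs = append(s.matchingIdxs, i)
            }
        }
    }

    func main() {
        s := &disjunctionState{
            currs: []*documentMatch{
                {internalID: []byte{0x02}},
                nil,
                {internalID: []byte{0x01}},
                {internalID: []byte{0x01}},
            },
        }
        s.updateMatches()
        fmt.Println(s.matchingIdxs) // prints [2 3]
    }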
On my dev box, using a bleve-query benchmark on a wiki dataset with a
query-string of "text:afternoon text:coffee", the previous code had a
throughput of 958qps, and this commit reaches 1174qps.
A common search case is when a user performs a query-string query, such
as "the lazy dog". That would be parsed into a boolean query
with a nil Must child, a nil MustNot child, and a non-nil Should child
(a disjunction query for "the", "lazy", "dog").
The optimization in this case is to return just the Should child
directly, skipping any additional Must and MustNot overhead.
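A minimal sketch of that short-circuit, using a stand-in Searcher interface since the exact constructor isn't shown here (names and signatures are illustrative, not bleve's actual API):

    package searcher

    // Searcher is a minimal stand-in for bleve's searcher interface.
    type Searcher interface {
        Close() error
    }

    // newBooleanSearcher: when a boolean query has no Must and no MustNot
    // children, the Should child's searcher is returned directly, skipping
    // the boolean searcher's per-document bookkeeping entirely.
    func newBooleanSearcher(must, should, mustNot Searcher) (Searcher, error) {
        if must == nil && mustNot == nil && should != nil {
            return should, nil
        }
        // otherwise fall back to constructing the full boolean searcher
        return newFullBooleanSearcher(must, should, mustNot)
    }

    func newFullBooleanSearcher(must, should, mustNot Searcher) (Searcher, error) {
        // full implementation omitted in this sketch
        return nil, nil
    }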
In a bleve-query benchmark on my dev box against a wiki index, with a
query-string of "text:afternoon text:coffee", throughput was previously
873qps and with this change reaches 940qps.
Optimization for the DisjunctionSearcher: an extra matchingIdxs slice
tracks which currs were matching, which avoids the previous code's second
loop through the currs slice.
This commit modifies the upside_down TermFrequencyRow parseKDoc() to
skip the ByteSeparator (0xFF) scan, as we already know the term's
length in the UpsideDownCouchTermFieldReader.
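A sketch of the idea, assuming the upside_down key layout of type byte, 2-byte field, term, 0xff separator, then doc ID; the helper name and signature below are illustrative, not bleve's actual parseKDoc:

    package upsidedown

    import "fmt"

    const ByteSeparator byte = 0xff

    // parseKDocKnownTermLen slices the doc ID out of a term frequency row key
    // directly, using the already-known term length plus a cheap sanity check,
    // instead of scanning the key for the 0xff separator.
    func parseKDocKnownTermLen(key []byte, termLen int) ([]byte, error) {
        docStart := 1 + 2 + termLen + 1 // type byte + field + term + separator
        if docStart >= len(key) || key[docStart-1] != ByteSeparator {
            return nil, fmt.Errorf("invalid term frequency key, no doc id")
        }
        return key[docStart:], nil
    }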
On my dev box, results from a bleve-query test on high-frequency terms
went from 107qps previously to 124qps.
The DumpXXX() methods were always documented as internal and
unsupported. However, now they are being removed from the
public top-level API. They are still available on the internal
IndexReader, which can be accessed using the Advanced() method.
The DocCount() and DumpXXX() methods on the internal index
have moved to the internal index reader, since they logically
operate on a snapshot of an index.
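A self-contained sketch of where things land after these changes; the interfaces below are illustrative stand-ins, since the exact Advanced() signature and reader interfaces have varied across bleve versions:

    package sketch

    // Stand-in for the internal index reader (a snapshot of the index). After
    // this change, DocCount() and the DumpXXX() methods live here rather than
    // on the public bleve.Index or on the internal index itself.
    type InternalIndexReader interface {
        DocCount() (uint64, error)
        DumpAll() chan interface{}
        DumpDoc(id string) chan interface{}
        DumpFields() chan interface{}
        Close() error
    }

    // Stand-in for the internal index, reachable from the public API via the
    // Advanced() method; per-snapshot operations require obtaining a reader.
    type InternalIndex interface {
        Reader() (InternalIndexReader, error)
    }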
The test had incorrectly been updated to compare the internal document
IDs, but these are opaque and may not be the expected IDs in some cases.
The test should simply check that the results correspond to the correct
external IDs.
1. the porter stemmer offers a method that does NOT lowercase; however,
to use it we must first convert the term to runes ourselves, so we now do
that
2. we can now invoke the version that skips lowercasing, since we already
lowercase before stemming via a separate filter
Because the stemmer modifies the runes in place, we have no way to know
whether there were changes, so we must always encode back into the term
byte slice (see the sketch below).
Added a unit test which catches the problem found.
NOTE: this uses analysis.BuildTermFromRunes, so the perf gain is only
visible with the other PR also merged.
Future gains are possible if we update the stemmer to let us know whether
changes were made, allowing us to skip re-encoding to []byte when no
changes were actually made.
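A rough sketch of the filter step described above, assuming the v1-era bleve analysis package layout and go-porterstemmer's no-lowercasing entry point (StemWithoutLowerCasing); treat the exact names and paths as assumptions:

    package porter

    import (
        "bytes"

        "github.com/blevesearch/bleve/analysis"
        "github.com/blevesearch/go-porterstemmer"
    )

    // Filter stems each token without lowercasing, since a separate lowercase
    // filter has already run earlier in the analysis chain. The stemmer mutates
    // the rune slice in place and gives no signal about whether anything
    // changed, so the runes are always re-encoded into the token's term bytes.
    func Filter(input analysis.TokenStream) analysis.TokenStream {
        for _, token := range input {
            runes := bytes.Runes(token.Term)
            stemmed := porterstemmer.StemWithoutLowerCasing(runes)
            token.Term = analysis.BuildTermFromRunes(stemmed)
        }
        return input
    }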
avoid allocating an unnecessary intermediate buffer
also introduce a new method that lets a caller optimistically try to
encode back into an existing buffer; if it isn't large enough, a new one
is silently allocated and returned
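A self-contained sketch of such an optimistic helper; the name and exact semantics here are illustrative, not necessarily the final bleve API:

    package analysisutil

    import "unicode/utf8"

    // buildTermFromRunesOptimistic tries to encode the runes back into buf so
    // the common case needs no new allocation. If buf turns out to be too
    // small, it silently falls back to a freshly allocated buffer and returns
    // that instead.
    func buildTermFromRunesOptimistic(buf []byte, runes []rune) []byte {
        rv := buf[:0]
        used := 0
        for _, r := range runes {
            nextLen := utf8.RuneLen(r)
            if nextLen < 0 {
                nextLen = 3 // invalid runes encode as U+FFFD (3 bytes)
            }
            if used+nextLen > cap(buf) {
                // not enough room in the caller's buffer: give up and allocate
                return buildTermFromRunes(runes)
            }
            rv = rv[:used+nextLen]
            utf8.EncodeRune(rv[used:], r)
            used += nextLen
        }
        return rv
    }

    // buildTermFromRunes always allocates; included only so the sketch compiles.
    func buildTermFromRunes(runes []rune) []byte {
        return []byte(string(runes))
    }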