Optimization for DisjunctionSearcher: an extra matchingIdxs slice
tracks which currs were matching. This avoids the previous
code's second loop through the currs slice.
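a minimal sketch of the idea, with stand-in types (illustrative,
not the actual searcher code):

    package sketch

    import "bytes"

    type docMatch struct{ id []byte }

    type disjunctionSearcher struct {
        currs        []*docMatch
        matching     []*docMatch
        matchingIdxs []int
    }

    // record which currs match the lowest ID in the same pass that
    // builds matching, instead of scanning currs a second time
    func (s *disjunctionSearcher) updateMatches(lowestID []byte) {
        s.matching = s.matching[:0]
        s.matchingIdxs = s.matchingIdxs[:0]
        for i, curr := range s.currs {
            if curr != nil && bytes.Equal(curr.id, lowestID) {
                s.matching = append(s.matching, curr)
                s.matchingIdxs = append(s.matchingIdxs, i)
            }
        }
    }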
This commit modifies the upside_down TermFrequencyRow parseKDoc() to
skip the ByteSeparator (0xFF) scan, as we already know the term's
length in the UpsideDownCouchTermFieldReader.
On my dev box, results from the bleve-query test on high frequency
terms went from the previous 107 qps to 124 qps.
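a hedged sketch of the fast path; the key layout in the comment
follows the upside_down encoding ('t' prefix, uint16 field, term,
0xFF separator, doc id), but this is an illustration rather than
the verbatim code:

    package sketch

    import "fmt"

    // when the caller (the term field reader) already knows the term,
    // jump straight past it instead of scanning for the 0xFF
    // ByteSeparator
    func parseKDoc(key []byte, term []byte) ([]byte, error) {
        docStart := 3 + len(term) + 1 // prefix + field + term + separator
        if docStart >= len(key) {
            return nil, fmt.Errorf("invalid term frequency key, no doc id")
        }
        return key[docStart:], nil
    }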
The DumpXXX() methods were always documented as internal and
unsupported. However, now they are being removed from the
public top-level API. They are still available on the internal
IndexReader, which can be accessed using the Advanced() method.
The DocCount() and DumpXXX() methods on the internal index
have moved to the internal index reader, since they logically
operate on a snapshot of an index.
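a hedged usage sketch of reaching the internal reader after this
change (exact signatures may vary by bleve version):

    package sketch

    import (
        "fmt"

        "github.com/blevesearch/bleve"
    )

    func printDocCount(idx bleve.Index) error {
        internalIndex, _, err := idx.Advanced() // internal index and kv store
        if err != nil {
            return err
        }
        reader, err := internalIndex.Reader() // a snapshot of the index
        if err != nil {
            return err
        }
        defer reader.Close()
        count, err := reader.DocCount() // DocCount now lives here
        if err != nil {
            return err
        }
        fmt.Println("doc count:", count)
        return nil
    }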
the test had incorrectly been updated to compare the internal
document ids, but these are opaque and may not be the expected
ids in some cases; the test should simply check that they
correspond to the correct external ids
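a hedged sketch of the corrected assertion, assuming the
IndexReader's ExternalID method of that era; names are
illustrative:

    package sketch

    import (
        "testing"

        "github.com/blevesearch/bleve/index"
        "github.com/blevesearch/bleve/search"
    )

    // assert on the external doc id rather than the opaque internal one
    func checkHitID(t *testing.T, reader index.IndexReader, hit *search.DocumentMatch, want string) {
        extID, err := reader.ExternalID(hit.IndexInternalID)
        if err != nil {
            t.Fatal(err)
        }
        if extID != want {
            t.Errorf("expected doc id %s, got %s", want, extID)
        }
    }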
1. the porter stemmer offers a method to NOT do lowercasing; however,
to use it we must first convert to runes ourselves, so we now do that
2. we now invoke the version that skips lowercasing, since we
already lowercase before stemming through a separate filter
because the stemmer modifies the runes in place,
we have no way to know whether anything changed, so we must
always encode back into the term byte slice
added a unit test which catches the problem found
NOTE: this uses analysis.BuildTermFromRunes, so the perf gain is
only visible with the other PR also merged; a sketch of the
resulting filter loop follows below
future gains are possible if we update the stemmer to let us
know whether changes were made, thus skipping the re-encode to
[]byte when no changes were actually made
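a hedged sketch of the resulting filter loop, assuming
go-porterstemmer's StemWithoutLowerCasing entry point;
BuildTermFromRunes is named above, and the surrounding type is
illustrative:

    package sketch

    import (
        "bytes"

        "github.com/blevesearch/bleve/analysis"
        "github.com/blevesearch/go-porterstemmer"
    )

    type porterStemmerFilter struct{}

    func (f *porterStemmerFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
        for _, token := range input {
            // convert to runes ourselves, as the no-lowercasing entry
            // point requires
            runes := bytes.Runes(token.Term)
            stemmed := porterstemmer.StemWithoutLowerCasing(runes)
            // the stemmer mutates the runes in place with no change
            // signal, so always encode back into the term byte slice
            token.Term = analysis.BuildTermFromRunes(stemmed)
        }
        return input
    }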
avoid allocating an unnecessary intermediate buffer
also introduce a new method that lets a user optimistically
try to encode back into an existing buffer; if it isn't
large enough, it silently allocates a new one and returns it
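a hedged sketch of such an optimistic encoder; the function name
here is illustrative:

    package sketch

    import "unicode/utf8"

    // encode runes into buf when it has room; otherwise silently
    // allocate a new slice and return it
    func buildTermFromRunesOptimistic(buf []byte, runes []rune) []byte {
        size := 0
        for _, r := range runes {
            size += utf8.RuneLen(r)
        }
        if size > cap(buf) {
            buf = make([]byte, size)
        }
        buf = buf[:size]
        n := 0
        for _, r := range runes {
            n += utf8.EncodeRune(buf[n:], r)
        }
        return buf
    }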
the previous impl always did a full utf8 decode of the token;
if we assume most tokens are not possessive this is unnecessary,
and even when they are, we only need to chop off the last two runes
so now we only decode the last rune of the token, and if it looks
like s/S we proceed to decode the second-to-last rune; only if
that looks like any form of apostrophe do we make any
changes to the token, again by just reslicing the original to chop
off the possessive suffix
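a hedged sketch of the reverse-decode approach; the apostrophe set
is illustrative:

    package sketch

    import "unicode/utf8"

    const (
        apostrophe          = '\''
        rightSingleQuote    = '\u2019'
        fullWidthApostrophe = '\uff07'
    )

    // decode at most the last two runes; reslice only when the token
    // actually ends in an apostrophe followed by s/S
    func stripPossessive(term []byte) []byte {
        lastRune, lastSize := utf8.DecodeLastRune(term)
        if lastRune != 's' && lastRune != 'S' {
            return term
        }
        prevRune, prevSize := utf8.DecodeLastRune(term[:len(term)-lastSize])
        if prevRune == apostrophe || prevRune == rightSingleQuote || prevRune == fullWidthApostrophe {
            return term[:len(term)-lastSize-prevSize]
        }
        return term
    }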
the token stream resulting from the removal of stop words must
be shorter or the same length as the original, so we just
reuse it and truncate it at the end.
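a minimal sketch of the in-place rewrite, with a stand-in token
type:

    package sketch

    type token struct{ term []byte }

    // copy the keepers toward the front of the same slice, then
    // truncate; the result can never be longer than the input
    func removeStopTokens(input []*token, stopWords map[string]bool) []*token {
        j := 0
        for _, tok := range input {
            if !stopWords[string(tok.term)] {
                input[j] = tok
                j++
            }
        }
        return input[:j]
    }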
This change depends on the mossStore Stats() API recently introduced
in github.com/couchbase/moss commit 564bdbc0, so the gvt manifest
for moss has been updated as part of this change.
Most of the change involves propagating the mossStore instance (via
the statsFunc callback) so that it's accessible to the
KVStore.Stats() method.
See also: http://review.couchbase.org/#/c/67524/
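an illustrative sketch of the plumbing only; the actual moss and
bleve types differ:

    package sketch

    // statsFunc is the callback handed down when the lower-level
    // mossStore is opened, kept so the KVStore's Stats() can reach
    // its statistics later
    type statsFunc func() map[string]interface{}

    type mossKVStore struct {
        stats statsFunc
    }

    func (s *mossKVStore) Stats() map[string]interface{} {
        if s.stats != nil {
            return s.stats()
        }
        return nil
    }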
this improvement was started to increase code coverage,
but it also improves performance and adds support for escaping
escaping:
The following quoted string enumerates the characters which
may be escaped.
"+-=&|><!(){}[]^\"~*?:\\/ "
Note that this list includes space.
In order to escape these characters, they are prefixed with the \
(backslash) character. In all cases, using the escaped version
produces the character itself and is not interpreted by the
lexer.
Two simple examples:
my\ name
will be interpreted as a single argument to a match query
with the value "my name".
"contains a\" character"
will be interpreted as a single argument to a phrase query
with the value `contains a " character`.
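the same two examples as a hedged Go usage sketch:

    package sketch

    import "github.com/blevesearch/bleve"

    func exampleQueries() {
        // a single match query argument with the value "my name"
        q1 := bleve.NewQueryStringQuery(`my\ name`)
        // a single phrase query argument containing a literal " character
        q2 := bleve.NewQueryStringQuery(`"contains a\" character"`)
        _, _ = q1, q2
    }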
Performance:
before$ go test -v -run=xxx -bench=BenchmarkLexer
BenchmarkLexer-4 100000 13991 ns/op
PASS
ok github.com/blevesearch/bleve 1.570s
after$ go test -v -run=xxx -bench=BenchmarkLexer
BenchmarkLexer-4 500000 3387 ns/op
PASS
ok github.com/blevesearch/bleve 1.740s
the collector has optimizations to avoid allocation and reslicing
during the common case of searching for top hits
however, in some cases users request a very large number of
search hits to be returned (attempting to get them all); this
caused unnecessary allocation of RAM.
to address this we introduce a new package-level variable,
PreAllocSizeSkipCap, which defaults to 1000. if your size+skip is
less than this cap, you get the optimized behavior. if your
size+skip is greater, we cap the preallocations at this lower
value; additional space is acquired on an as-needed basis by
growing the DocumentMatchPool and reslicing the
collector's backing slice
applications can change the value of PreAllocSizeSkipCap to suit
their own needs
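a hedged sketch of the capping logic in the collector constructor
(the variable name matches the text above; the surrounding code is
illustrative):

    package sketch

    import "github.com/blevesearch/bleve/search/collector"

    func backingSizeFor(size, skip int) int {
        backingSize := size + skip + 1
        if size+skip > collector.PreAllocSizeSkipCap {
            // preallocate only up to the cap; the DocumentMatchPool
            // and the backing slice grow on demand past this
            backingSize = collector.PreAllocSizeSkipCap + 1
        }
        return backingSize
    }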
fixes #408