the previous impl always did a full utf8 decode of every rune
if we assume most tokens are not possessive this is unnecessary
and even if they are, we only need to chop off the last two runes
so, now we only decode the last rune of the token, and if it looks like
s/S then we proceed to decode the second-to-last rune, and then
only if it looks like any form of apostrophe do we make any
changes to the token, again by just reslicing the original to chop
off the possessive extension
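the approach above can be sketched roughly as follows; the function and variable names here are illustrative, not the actual bleve identifiers:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// stripPossessive decodes only the last rune of the token; only if it is
// s/S does it decode the second-to-last rune, and only if that is an
// apostrophe does it reslice the original token. No new allocation occurs.
func stripPossessive(token []byte) []byte {
	last, lastSize := utf8.DecodeLastRune(token)
	if last != 's' && last != 'S' {
		return token // common case: not possessive, nothing decoded further
	}
	rest := token[:len(token)-lastSize]
	apos, aposSize := utf8.DecodeLastRune(rest)
	// accept the ASCII apostrophe and common Unicode apostrophe forms
	if apos == '\'' || apos == '\u2019' || apos == '\uFF07' {
		return rest[:len(rest)-aposSize]
	}
	return token
}

func main() {
	fmt.Println(string(stripPossessive([]byte("john's")))) // john
	fmt.Println(string(stripPossessive([]byte("city’s")))) // city
	fmt.Println(string(stripPossessive([]byte("glass"))))  // glass
}
```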
this change was started to improve code coverage,
but it also improves performance and adds support for escaping
escaping:
The following quoted string enumerates the characters which
may be escaped.
"+-=&|><!(){}[]^\"~*?:\\/ "
Note that this list includes space.
In order to escape these characters, they are prefixed with the \
(backslash) character. In all cases, using the escaped version
produces the character itself and is not interpreted by the
lexer.
Two simple examples:
my\ name
Will be interpreted as a single argument to a match query
with the value "my name".
"contains a\" character"
Will be interpreted as a single argument to a phrase query
with the value `contains a " character`.
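the unescaping rule can be sketched as a small helper; this is an illustrative sketch, not the actual bleve lexer code:

```go
package main

import (
	"fmt"
	"strings"
)

// escapableChars mirrors the quoted set above; a backslash before any of
// these yields the literal character itself.
const escapableChars = `+-=&|><!(){}[]^"~*?:\/ `

// unescape resolves backslash escapes for the characters above, leaving
// any other byte (including unrelated backslashes) untouched.
func unescape(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		if s[i] == '\\' && i+1 < len(s) && strings.IndexByte(escapableChars, s[i+1]) >= 0 {
			i++ // skip the backslash, emit the escaped character literally
		}
		b.WriteByte(s[i])
	}
	return b.String()
}

func main() {
	fmt.Println(unescape(`my\ name`))              // my name
	fmt.Println(unescape(`contains a\" character`)) // contains a" character
}
```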
Performance:
before$ go test -v -run=xxx -bench=BenchmarkLexer
BenchmarkLexer-4 100000 13991 ns/op
PASS
ok github.com/blevesearch/bleve 1.570s
after$ go test -v -run=xxx -bench=BenchmarkLexer
BenchmarkLexer-4 500000 3387 ns/op
PASS
ok github.com/blevesearch/bleve 1.740s
the collector has optimizations to avoid allocation and reslicing
during the common case of searching for top hits
however, in some cases users request a very large number of
search hits to be returned (attempting to get them all), which
caused unnecessary allocation of RAM
to address this we introduce a new constant PreAllocSizeSkipCap
which defaults to a value of 1000. if your size+skip is less than
this constant, you get the optimized behavior. if your
size+skip is greater than this, we cap the preallocations at
this lower value. additional space is acquired on an as-needed
basis by growing the DocumentMatchPool and reslicing the
collector's backing slice
applications can change the value of PreAllocSizeSkipCap to suit
their own needs
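the capping logic might look roughly like this; the constant name matches the commit, but the surrounding function is an illustrative sketch:

```go
package main

import "fmt"

// PreAllocSizeSkipCap caps how much the collector preallocates when
// size+skip is very large; applications can override it.
var PreAllocSizeSkipCap = 1000

// backingSize returns the number of DocumentMatch slots to preallocate
// for a request returning `size` hits after skipping `skip` hits.
func backingSize(size, skip int) int {
	if size+skip > PreAllocSizeSkipCap {
		// cap the preallocation; additional space is grown on demand
		return PreAllocSizeSkipCap + 1
	}
	return size + skip + 1
}

func main() {
	fmt.Println(backingSize(10, 0))    // 11: small request, fully preallocated
	fmt.Println(backingSize(10000, 0)) // 1001: capped
}
```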
fixes #408
counter-intuitively, the list impl was faster than the heap
the theory was that the heap did more comparisons and swapping,
so even though it benefited from avoiding the interface and from
some cache locality, it was still slower
the idea here is to just use a raw slice kept in order
this avoids the need for the interface, but can take the same
comparison approach as the list
it seems to work out:
go test -run=xxx -bench=. -benchmem -cpuprofile=cpu.out
BenchmarkTop10of100000Scores-4 5000 299959 ns/op 2600 B/op 36 allocs/op
BenchmarkTop100of100000Scores-4 2000 601104 ns/op 20720 B/op 216 allocs/op
BenchmarkTop10of1000000Scores-4 500 3450196 ns/op 2616 B/op 36 allocs/op
BenchmarkTop100of1000000Scores-4 500 3874276 ns/op 20856 B/op 216 allocs/op
PASS
ok github.com/blevesearch/bleve/search/collectors 7.440s
the TopNCollector can now use either a heap or a list
i did not code it to use an interface, because this is a very hot
loop during searching. rather, it lets bleve developers easily
toggle between the two (or other ideas) by changing 2 lines
The list is faster in the benchmark, but causes more allocations.
The list is once again the default (for now).
To switch to the heap implementation, change:
store *collectStoreList
to
store *collectStoreHeap
and
newStoreList(...
to
newStoreHeap(...
the primary change is going back to sort values being []string
and not []interface{}; this avoids the allocations required to
convert into interface{}
that sounds obvious, so why didn't we just do that first?
because a common (default) sort is on score, which is naturally
a number, not a string (like terms). converting the number was
also expensive, and it was the common case.
so, this solution also makes the change to NOT put the score
into the sort value list. instead you see the dummy value
"_score". this is just a placeholder; the actual sort impl
knows that field of the sort is the score, and will sort
using the actual score.
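the "_score" placeholder trick can be sketched as follows; the types here stand in for search.DocumentMatch and are not the actual bleve code:

```go
package main

import "fmt"

// hit stands in for search.DocumentMatch: sort values are plain []string,
// with the placeholder "_score" marking the position where the numeric
// score should be compared, so scores are never converted to strings.
type hit struct {
	Score float64
	Sort  []string
}

// less compares two hits on sort field i, consulting the actual score
// only when the placeholder is present.
func less(a, b hit, i int) bool {
	if a.Sort[i] == "_score" {
		return a.Score > b.Score // higher score sorts first
	}
	return a.Sort[i] < b.Sort[i]
}

func main() {
	a := hit{Score: 2.5, Sort: []string{"_score"}}
	b := hit{Score: 1.0, Sort: []string{"_score"}}
	fmt.Println(less(a, b, 0)) // true: a outranks b on score
}
```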
also, several other aspects of the benchmark were cleaned up
so that unnecessary allocations do not pollute the cpu profiles
Here are the updated benchmarks:
$ go test -run=xxx -bench=. -benchmem -cpuprofile=cpu.out
BenchmarkTop10of100000Scores-4 3000 465809 ns/op 2548 B/op 33 allocs/op
BenchmarkTop100of100000Scores-4 2000 626488 ns/op 21484 B/op 213 allocs/op
BenchmarkTop10of1000000Scores-4 300 5107658 ns/op 2560 B/op 33 allocs/op
BenchmarkTop100of1000000Scores-4 300 5275403 ns/op 21624 B/op 213 allocs/op
PASS
ok github.com/blevesearch/bleve/search/collectors 7.188s
Prior to this PR, master reported:
$ go test -run=xxx -bench=. -benchmem
BenchmarkTop10of100000Scores-4 3000 453269 ns/op 360161 B/op 42 allocs/op
BenchmarkTop100of100000Scores-4 2000 519131 ns/op 388275 B/op 219 allocs/op
BenchmarkTop10of1000000Scores-4 200 7459004 ns/op 4628236 B/op 52 allocs/op
BenchmarkTop100of1000000Scores-4 200 8064864 ns/op 4656596 B/op 232 allocs/op
PASS
ok github.com/blevesearch/bleve/search/collectors 7.385s
So, we're pretty close on the smaller datasets, and we scale better on the larger datasets.
We also show fewer allocations and bytes in all cases (some of this is artificial due to test cleanup).
this change means simple sort requirements no longer require
importing the search package (high-level API goal)
also the sort test at the top-level was changed to use this form
previously, from JSON we would just deserialize strings like
"-abv" or "city" or "_id" or "_score" as simple sorts
on fields, ids, or scores respectively
while this is simple and compact, it can be ambiguous (for
example, if you have a field starting with - or if you have a field
named "_id" already). also, this simple syntax doesn't allow us
to specify more complex options to deal with type/mode/missing
we keep support for the simple string syntax, but now also
recognize a more expressive syntax like:
{
"by": "field",
"field": "abv",
"desc": true,
"type": "string",
"mode": "min",
"missing": "first"
}
type, mode and missing are optional and default to
"auto", "default", and "last" respectively
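deserializing this expressive form with its defaults can be sketched like so; the struct and function names are illustrative, not the actual bleve types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sortField mirrors the expressive JSON form above; the struct tags match
// the JSON keys exactly.
type sortField struct {
	By      string `json:"by"`
	Field   string `json:"field"`
	Desc    bool   `json:"desc"`
	Type    string `json:"type"`
	Mode    string `json:"mode"`
	Missing string `json:"missing"`
}

// parseSortField applies the documented defaults for the optional keys
// before unmarshaling, so absent keys keep their default values.
func parseSortField(data []byte) (sortField, error) {
	sf := sortField{Type: "auto", Mode: "default", Missing: "last"}
	err := json.Unmarshal(data, &sf)
	return sf, err
}

func main() {
	sf, _ := parseSortField([]byte(`{"by":"field","field":"abv","desc":true}`))
	fmt.Printf("%+v\n", sf)
	// {By:field Field:abv Desc:true Type:auto Mode:default Missing:last}
}
```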
The syntax used is an array of strings. The strings "_id" and
"_score" are special and reserved to mean sorting on the document
id and score respectively. All other strings refer to the literal
field name with that value. If the string is prefixed with "-",
the order of that sort is descending; without it, it defaults to
ascending.
Examples:
"sort":["-abv","-_score"]
This will sort results in decreasing order of the "abv" field.
Results which have the same value of the "abv" field will then
be sorted by their score, also decreasing.
If no value for "sort" is provided in the search request, the
default sorting is the same as before, which is decreasing score.
the motivation for this commit is long and detailed and has been
documented externally here:
https://gist.github.com/mschoch/5cc5c9cf4669a5fe8512cb7770d3c1a2
the core of the changes are:
1. recognize that the collector/searcher need only a fixed number
of DocumentMatch instances, and this number can be determined
from the structure of the query, not the size of the data
2. knowing this, instances can be allocated in bulk, up front,
and they can be reused without locking (since all search
operations take place in a single goroutine)
3. combined with previous commits which enabled reuse of
the IndexInternalID []byte, this allows for no allocation/copy
of these bytes as well (by using the DocumentMatch Reset() method
when returning entries to the pool)
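the three points above can be sketched together; the types here are illustrative stand-ins (the real one is search.DocumentMatchPool):

```go
package main

import "fmt"

// docMatch stands in for search.DocumentMatch; Reset clears it for reuse
// while keeping the IndexInternalID backing []byte to avoid reallocation.
type docMatch struct {
	IndexInternalID []byte
	Score           float64
}

func (d *docMatch) Reset() *docMatch {
	d.IndexInternalID = d.IndexInternalID[:0] // keep capacity, drop contents
	d.Score = 0
	return d
}

// matchPool hands out instances allocated in bulk, up front; since all
// search operations run in one goroutine, no locking is needed.
type matchPool struct {
	avail []*docMatch
}

func newMatchPool(size int) *matchPool {
	p := &matchPool{avail: make([]*docMatch, size)}
	for i := range p.avail {
		p.avail[i] = &docMatch{}
	}
	return p
}

func (p *matchPool) Get() *docMatch {
	if len(p.avail) == 0 {
		return &docMatch{} // grow on demand if the estimate was too low
	}
	d := p.avail[len(p.avail)-1]
	p.avail = p.avail[:len(p.avail)-1]
	return d
}

func (p *matchPool) Put(d *docMatch) {
	p.avail = append(p.avail, d.Reset())
}

func main() {
	pool := newMatchPool(2)
	d := pool.Get()
	d.IndexInternalID = append(d.IndexInternalID, 0x01)
	pool.Put(d)
	fmt.Println(len(pool.Get().IndexInternalID)) // 0: reset on return
}
```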
in a recent commit, we changed the code to reuse
TermFrequencyRow objects instead of constantly allocating new
ones. unfortunately, one of the original methods was not coded
with this reuse in mind, and a lazy initialization caused us to
leak data from previous uses of the same object.
in particular, this caused term vector information from previous
hits to still be applied to subsequent hits. eventually this
caused the highlighter to try to highlight invalid regions
of a slice.
fixes #404
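the failure mode described above can be sketched generically; the type and functions here are illustrative, not the actual TermFrequencyRow code:

```go
package main

import "fmt"

// row stands in for a reused TermFrequencyRow; only the vectors field
// matters for this sketch.
type row struct {
	vectors []int
}

// parseBuggy shows the leak: it lazily appends without clearing, so a
// reused row keeps term vectors from the previous hit.
func parseBuggy(r *row, vectors []int) {
	r.vectors = append(r.vectors, vectors...)
}

// parseFixed truncates first, so reuse never leaks prior term vectors.
func parseFixed(r *row, vectors []int) {
	r.vectors = r.vectors[:0]
	r.vectors = append(r.vectors, vectors...)
}

func main() {
	r := &row{}
	parseBuggy(r, []int{1, 2})
	parseBuggy(r, []int{3})
	fmt.Println(r.vectors) // [1 2 3]: stale vectors leak into the second hit

	r = &row{}
	parseFixed(r, []int{1, 2})
	parseFixed(r, []int{3})
	fmt.Println(r.vectors) // [3]
}
```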