1155 Commits

Author SHA1 Message Date
Marty Schoch
f531835d5c Merge pull request #420 from steveyen/MB-20590
index/store/moss KV backend propagates mossStore's Stats()
2016-09-11 20:28:29 -04:00
Marty Schoch
5cf50ec338 Merge pull request #418 from dtylman/master
fix for #416
2016-09-11 20:26:24 -04:00
Marty Schoch
ee61b2e866 Merge pull request #425 from mschoch/porterfaster
improve perf of porter stemmer
2016-09-11 20:22:23 -04:00
Marty Schoch
f8e8c9d065 Merge pull request #426 from mschoch/fasterbuildterms
encode runes directly into buffer
2016-09-11 20:19:09 -04:00
Marty Schoch
44ff6ced8a improve perf of porter stemmer
1.  porter stemmer offers method to NOT do lowercasing, however
to use this we must convert to runes first ourself, so we did this

2.  now we can invoke the version that skips lowercasing, we
already do this ourselves before stemming through separate filter

due to the fact that the stemmer modifies the runes in place
we have no way to know if there were changes, thus we must
always encode back into the term byte slice

added unit test which catches the problem found

NOTE this uses analysis.BuildTermFromRunes so perf gain is
only visible with other PR also merged

future gains are possible if we update the stemmer to let us
know if changes were made, thus skipping re-encoding to
[]byte when no changes were actually made
2016-09-11 20:13:15 -04:00
Marty Schoch
c13626be45 encode runes directly into buffer
avoid allocating unnecessary intermediate buffer

also introduce new method to let a user optimistically
try and encode back into an existing buffer, if it isn't
large enough, it silently allocates a new one and returns it
2016-09-11 20:10:03 -04:00
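The "optimistically encode back into an existing buffer" idea from c13626be45 can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `encodeRunes` is a hypothetical name, and the sketch relies on `utf8.AppendRune` (Go 1.18+), which grows the slice only when the supplied buffer is too small.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeRunes is a hypothetical helper (not bleve's API). It encodes runes
// into buf's backing array when there is enough capacity; otherwise append
// allocates a larger slice, and that new slice is what gets returned.
func encodeRunes(runes []rune, buf []byte) []byte {
	out := buf[:0] // reuse the caller's storage, discarding old contents
	for _, r := range runes {
		out = utf8.AppendRune(out, r)
	}
	return out
}

func main() {
	buf := make([]byte, 0, 32)
	term := encodeRunes([]rune("stemmer"), buf)
	fmt.Printf("%s\n", term)
}
```

The caller must always use the returned slice, since it may or may not share storage with the buffer that was passed in.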
Marty Schoch
56c7b9f831 Merge pull request #423 from mschoch/stopfilterfaster
avoid allocation in stop token filter
2016-09-11 13:59:31 -04:00
Marty Schoch
5ed9f67b0b Merge pull request #424 from mschoch/possessivefaster
speed up english possessive filter
2016-09-11 13:26:50 -04:00
Marty Schoch
9e9f172f81 speed up english possessive filter
previous impl always did full utf8 decode of rune
if we assume most tokens are not possessive this is unnecessary
and even if they are, we only need to chop off last two runes
so, now we only decode last rune of token, and if it looks like
s/S then we proceed to decode second to last rune, and then
only if it looks like any form of apostrophe, do we make any
changes to token, again by just reslicing original to chop
off the possessive extension
2016-09-11 12:55:03 -04:00
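The approach described in 9e9f172f81 (decode only the trailing runes, then reslice) might look something like the sketch below. The function name and the exact set of apostrophe characters are assumptions made for illustration; the commit only says "any form of apostrophe".

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// stripPossessive is a hypothetical sketch. It inspects at most the last two
// runes of term and reslices the original bytes; nothing is copied.
func stripPossessive(term []byte) []byte {
	last, lastSize := utf8.DecodeLastRune(term)
	if last != 's' && last != 'S' {
		return term // common case: only one rune was decoded
	}
	prev, prevSize := utf8.DecodeLastRune(term[:len(term)-lastSize])
	switch prev {
	// Assumed apostrophe forms: ASCII, right single quote, fullwidth.
	case '\'', '\u2019', '\uFF07':
		return term[:len(term)-lastSize-prevSize]
	}
	return term
}

func main() {
	fmt.Printf("%s\n", stripPossessive([]byte("marty's")))
}
```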
Marty Schoch
faa07ac3a6 avoid allocation in stop token filter
the token stream resulting from the removal of stop words must
be shorter or the same length as the original, so we just
reuse it and truncate it at the end.
2016-09-11 12:29:33 -04:00
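The reuse-and-truncate technique from faa07ac3a6 is the usual in-place filter idiom in Go. A minimal sketch, with a simplified `Token` type and a plain map standing in for the real stop word set (both are assumptions, not bleve's types):

```go
package main

import "fmt"

// Token is a simplified stand-in for an analysis token.
type Token struct {
	Term []byte
}

// removeStopWords filters tokens in place. The output can never be longer
// than the input, so surviving tokens are written back into the same slice
// and the slice is truncated at the end.
func removeStopWords(tokens []*Token, stop map[string]bool) []*Token {
	j := 0
	for _, tok := range tokens {
		if !stop[string(tok.Term)] {
			tokens[j] = tok
			j++
		}
	}
	return tokens[:j]
}

func main() {
	stop := map[string]bool{"the": true, "a": true}
	tokens := []*Token{{[]byte("the")}, {[]byte("quick")}, {[]byte("a")}, {[]byte("fox")}}
	for _, tok := range removeStopWords(tokens, stop) {
		fmt.Printf("%s\n", tok.Term)
	}
}
```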
Steve Yen
e8cc3c6bdd index/store/moss KV backend propagates mossStore's Stats()
This change depends on the recently introduced mossStore Stats() API
in github.com/couchbase/moss 564bdbc0 commit.  So, gvt for moss has
been updated as part of this change.

Most of the change involves propagating the mossStore instance (the
statsFunc callback) so that it's accessible to the KVStore.Stats()
method.

See also: http://review.couchbase.org/#/c/67524/
2016-09-08 17:12:04 -07:00
Danny Tylman
6c52907f2b fixes #416:
panic in collector_heap
2016-09-08 11:40:53 +03:00
Marty Schoch
b961d742c1 Merge branch 'bcampbell-sedtweak' 2016-09-01 13:56:11 -04:00
Marty Schoch
67755618e9 Merge branch 'sedtweak' of https://github.com/bcampbell/bleve into bcampbell-sedtweak 2016-09-01 13:55:15 -04:00
Marty Schoch
5023993895 replaced nex lexer with custom lexer
this improvement was started to improve code coverage
but also improves performance and adds support for escaping

escaping:

The following quoted string enumerates the characters which
may be escaped.

"+-=&|><!(){}[]^\"~*?:\\/ "

Note that this list includes space.

In order to escape these characters, they are prefixed with the \
(backslash) character.  In all cases, using the escaped version
produces the character itself and is not interpreted by the
lexer.

Two simple examples:

my\ name

Will be interpreted as a single argument to a match query
with the value "my name".

"contains a\" character"

Will be interpreted as a single argument to a phrase query
with the value `contains a " character`.

Performance:

before$ go test -v -run=xxx -bench=BenchmarkLexer
BenchmarkLexer-4   	  100000	     13991 ns/op
PASS
ok  	github.com/blevesearch/bleve	1.570s

after$ go test -v -run=xxx -bench=BenchmarkLexer
BenchmarkLexer-4   	  500000	      3387 ns/op
PASS
ok  	github.com/blevesearch/bleve	1.740s
2016-09-01 13:16:07 -04:00
Marty Schoch
46f70bfa12 streamline boost just like tilde 2016-08-31 22:10:44 -04:00
Marty Schoch
37d3750157 simplify parser rules 2016-08-31 21:57:44 -04:00
Marty Schoch
bb285cd0f2 more lexer/parser simplification 2016-08-31 21:53:49 -04:00
Marty Schoch
6c75b7c646 tightening up lexer/parser to prep future work 2016-08-31 21:23:03 -04:00
Marty Schoch
c5465eccb1 change from const to var so apps can adjust value 2016-08-31 16:43:50 -04:00
Marty Schoch
521003d543 Merge pull request #415 from mschoch/fixbug408
cap preallocation by the collector to reasonable value
2016-08-31 16:05:30 -04:00
Marty Schoch
60efecc8e9 cap preallocation by the collector to reasonable value
the collector has optimizations to avoid allocation and reslicing
during the common case of searching for top hits

however, in some cases users request a very large number of
search hits to be returned (attempting to get them all).  this
caused unnecessary allocation of ram.

to address this we introduce a new constant PreAllocSizeSkipCap
it defaults to the value of 1000.  if your search+skip is less than
this constant, you get the optimized behavior.  if your
search+skip is greater than this, we cap the preallocations to
this lower value.  additional space is acquired on an as needed
basis by growing the DocumentMatchPool and reslicing the
collector backing slice

applications can change the value of PreAllocSizeSkipCap to suit
their own needs

fixes #408
2016-08-31 15:25:17 -04:00
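The capping behavior described in 60efecc8e9 amounts to something like the following. Only the name `PreAllocSizeSkipCap` and its default of 1000 come from the commit message; the helper `preallocSize` and the way it is used here are assumptions for illustration.

```go
package main

import "fmt"

// PreAllocSizeSkipCap mirrors the name and default from the commit message.
// It is a variable so that applications can adjust it.
var PreAllocSizeSkipCap = 1000

// preallocSize is a hypothetical helper: it returns how many slots to
// allocate up front for a request of size+skip hits.
func preallocSize(size, skip int) int {
	n := size + skip
	if n > PreAllocSizeSkipCap {
		n = PreAllocSizeSkipCap
	}
	return n
}

func main() {
	// A request for a million hits only preallocates up to the cap;
	// append grows the slice later if that many hits really arrive.
	hits := make([]int, 0, preallocSize(1000000, 0))
	fmt.Println(cap(hits))
	for i := 0; i < 5000; i++ {
		hits = append(hits, i)
	}
	fmt.Println(len(hits))
}
```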
Marty Schoch
a771e344ae don't count code coverage support tool in project coverage 2016-08-31 13:52:19 -04:00
Marty Schoch
81282b3c06 remove unused code 2016-08-31 13:52:02 -04:00
Marty Schoch
83a3eecb22 don't count kv store test against code coverage 2016-08-31 13:27:12 -04:00
Marty Schoch
ae4b354c72 Merge pull request #411 from steveyen/master
tighter moss KV store iterator handling
2016-08-27 08:00:45 -04:00
Marty Schoch
56d7bbfe1c fix benchmark names to match values used 2016-08-26 18:09:03 -04:00
Marty Schoch
4a25034ddd Merge branch 'sort-by-field-try2' 2016-08-26 17:58:38 -04:00
Marty Schoch
b1b93d5ff9 remove unneeded code
fixes code review comment from @steveyen
2016-08-26 17:27:19 -04:00
Marty Schoch
c9310b906d introduced new collector store impl based on slice
counter-intuitively the list impl was faster than the heap
the theory was the heap did more comparisons and swapping
so even though it benefited from no interface and some cache
locality, it was still slower

the idea was to just use a raw slice kept in order
this avoids the need for interface, but can take same comparison
approach as the list

it seems to work out:

 go test -run=xxx -bench=. -benchmem -cpuprofile=cpu.out
BenchmarkTop10of100000Scores-4     	    5000	    299959 ns/op	    2600 B/op	      36 allocs/op
BenchmarkTop100of100000Scores-4    	    2000	    601104 ns/op	   20720 B/op	     216 allocs/op
BenchmarkTop10of1000000Scores-4    	     500	   3450196 ns/op	    2616 B/op	      36 allocs/op
BenchmarkTop100of1000000Scores-4   	     500	   3874276 ns/op	   20856 B/op	     216 allocs/op
PASS
ok  	github.com/blevesearch/bleve/search/collectors	7.440s
2016-08-26 11:52:49 -04:00
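The "raw slice kept in order" idea from c9310b906d can be sketched as below. The type and method names are hypothetical, hits are reduced to an id and a score, and using a binary search to find the insertion point is a choice made for this sketch; the commit message does not say how the position is located.

```go
package main

import (
	"fmt"
	"sort"
)

type hit struct {
	id    string
	score float64
}

// storeSlice keeps at most limit hits, ordered by descending score, in a
// plain slice with no interface values involved.
type storeSlice struct {
	hits  []hit
	limit int
}

func (s *storeSlice) add(h hit) {
	// Index of the first stored hit that scores lower than h.
	i := sort.Search(len(s.hits), func(i int) bool {
		return s.hits[i].score < h.score
	})
	if i >= s.limit {
		return // the store is full and h ranks below everything in it
	}
	s.hits = append(s.hits, hit{})
	copy(s.hits[i+1:], s.hits[i:])
	s.hits[i] = h
	if len(s.hits) > s.limit {
		s.hits = s.hits[:s.limit] // drop the lowest-ranked hit
	}
}

func main() {
	s := &storeSlice{limit: 3}
	for i, score := range []float64{1, 5, 3, 4, 2} {
		s.add(hit{id: fmt.Sprintf("doc%d", i), score: score})
	}
	fmt.Println(s.hits)
}
```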
Marty Schoch
47c239ca7b refactored data structure out of collector
the TopNCollector now can either use a heap or a list

i did not code it to use an interface, because this is a very hot
loop during searching.  rather, it lets bleve developers easily
toggle between the two (or other ideas) by changing 2 lines

The list is faster in the benchmark, but causes more allocations.
The list is once again the default (for now).

To switch to the heap implementation, change:

store *collectStoreList
to
store *collectStoreHeap

and

newStoreList(...
to
newStoreHeap(...
2016-08-26 10:29:50 -04:00
Marty Schoch
3f8757c05b slight fixup to last change to set the sort value
i'd like the sort value to be correct even with the optimizations
not using it
2016-08-25 23:13:22 -04:00
Marty Schoch
931ec677c4 completely avoid dynamic dispatch if only sorting on score 2016-08-25 22:59:08 -04:00
Marty Schoch
127f37212b cache values to avoid dynamic dispatch inside hot loop 2016-08-25 16:24:26 -04:00
Marty Schoch
60750c1614 improved implementation to address perf regressions
primary change is going back to sort values being []string
and not []interface{}, this avoids allocations converting
into the interface{}

that sounds obvious, so why didn't we just do that first?
because a common (default) sort is score, which is naturally
a number, not a string (like terms).  converting into the
number was also expensive, and the common case.

so, this solution also makes the change to NOT put the score
into the sort value list.  instead you see the dummy value
"_score".  this is just a placeholder, the actual sort impl
knows that field of the sort is the score, and will sort
using the actual score.

also, several other aspects of the benchmark were cleaned up
so that unnecessary allocations do not pollute the cpu profiles

Here are the updated benchmarks:

$ go test -run=xxx -bench=. -benchmem -cpuprofile=cpu.out
BenchmarkTop10of100000Scores-4     	    3000	    465809 ns/op	    2548 B/op	      33 allocs/op
BenchmarkTop100of100000Scores-4    	    2000	    626488 ns/op	   21484 B/op	     213 allocs/op
BenchmarkTop10of1000000Scores-4    	     300	   5107658 ns/op	    2560 B/op	      33 allocs/op
BenchmarkTop100of1000000Scores-4   	     300	   5275403 ns/op	   21624 B/op	     213 allocs/op
PASS
ok  	github.com/blevesearch/bleve/search/collectors	7.188s

Prior to this PR, master reported:

$ go test -run=xxx -bench=. -benchmem
BenchmarkTop10of100000Scores-4          3000        453269 ns/op      360161 B/op         42 allocs/op
BenchmarkTop100of100000Scores-4         2000        519131 ns/op      388275 B/op        219 allocs/op
BenchmarkTop10of1000000Scores-4          200       7459004 ns/op     4628236 B/op         52 allocs/op
BenchmarkTop100of1000000Scores-4         200       8064864 ns/op     4656596 B/op        232 allocs/op
PASS
ok      github.com/blevesearch/bleve/search/collectors  7.385s

So, we're pretty close on the smaller datasets, and we scale better on the larger datasets.
We also show fewer allocations and bytes in all cases (some of this is artificial due to test cleanup).
2016-08-25 15:47:07 -04:00
Marty Schoch
ce0b299d6f switch sort impl to use interface
this improves perf in the case where we're not doing any sorting
as we avoid allocating memory and converting scores into
numeric terms
2016-08-24 19:02:22 -04:00
Marty Schoch
5e94145cf4 apply same collector benchmark change 2016-08-24 15:56:26 -04:00
Marty Schoch
94489fa778 change collector benchmark to not reuse collector instances
they are never reused in practice
and the original design did not consider reuse
future alternate implementations are not reusable
2016-08-24 15:14:40 -04:00
Marty Schoch
0322ecd441 adjust new sort functionality to also work with MultiSearch 2016-08-24 14:07:10 -04:00
Marty Schoch
1ae938b781 add integration tests for sorting 2016-08-20 14:45:53 -04:00
Steve Yen
eaa59621ff tighter moss KV store iterator handling 2016-08-19 09:10:03 -07:00
Marty Schoch
2311d060d1 add example usage of SortBy and SortByCustom 2016-08-18 13:03:48 -07:00
Marty Schoch
27f5c6ec92 expose simple string slice based sorting to top-level bleve
this change means simple sort requirements no longer require
importing the search package (high-level API goal)

also the sort test at the top-level was changed to use this form
2016-08-17 14:49:06 -07:00
Marty Schoch
27ba6187bc adds support for more complex field sorts with object (not string)
previously from JSON we would just deserialize strings like
"-abv" or "city" or "_id" or "_score" as simple sorts
on fields, ids or scores respectively

while this is simple and compact, it can be ambiguous (for
example if you have a field starting with - or if you have a field
named "_id" already).  also, this simple syntax doesn't allow us
to specify more complex options to deal with type/mode/missing

we keep support for the simple string syntax, but now also
recognize a more expressive syntax like:

{
  "by": "field",
  "field": "abv",
  "desc": true,
  "type": "string",
  "mode": "min",
  "missing": "first"
}

type, mode and missing are optional and default to
"auto", "default", and "last" respectively
2016-08-17 14:33:51 -07:00
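One way to get the optional-with-defaults behavior described in 27ba6187bc is to fill in the defaults before unmarshaling, since `json.Unmarshal` leaves fields untouched when their keys are absent. The struct and function below are hypothetical and only mirror the JSON keys shown in the commit message; they are not bleve's types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fieldSort mirrors the JSON keys from the commit message.
type fieldSort struct {
	By      string `json:"by"`
	Field   string `json:"field"`
	Desc    bool   `json:"desc"`
	Type    string `json:"type"`
	Mode    string `json:"mode"`
	Missing string `json:"missing"`
}

// parseFieldSort applies the documented defaults ("auto", "default", "last")
// and then lets the JSON input override whichever keys it provides.
func parseFieldSort(data []byte) (fieldSort, error) {
	fs := fieldSort{Type: "auto", Mode: "default", Missing: "last"}
	err := json.Unmarshal(data, &fs)
	return fs, err
}

func main() {
	fs, err := parseFieldSort([]byte(`{"by": "field", "field": "abv", "desc": true}`))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", fs)
}
```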
Marty Schoch
750e0ac16c change sort field impl to use indexed values not stored values 2016-08-17 09:20:44 -07:00
Marty Schoch
0d873916f0 support JSON marshal/unmarshal of search request sort
The syntax used is an array of strings.  The strings "_id" and
"_score" are special and reserved to mean sorting on the document
id and score respectively.  All other strings refer to the literal
field name with that value.  If the string is prefixed with "-"
the order of that sort is descending, without it, it defaults to
ascending.

Examples:

"sort":["-abv","-_score"]

This will sort results in decreasing order of the "abv" field.
Results which have the same value of the "abv" field will then
be sorted by their score, also decreasing.

If no value for "sort" is provided in the search request the
default sorting is the same as before, which is decreasing score.
2016-08-12 19:16:24 -04:00
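The string syntax described in 0d873916f0 can be interpreted along these lines. The `sortSpec` type and `parseSort` function are hypothetical and exist only to make the rules concrete: a leading "-" means descending, and "_id" and "_score" are reserved names.

```go
package main

import (
	"fmt"
	"strings"
)

// sortSpec is a hypothetical representation of one parsed sort entry.
type sortSpec struct {
	Kind  string // "id", "score" or "field"
	Field string // set only when Kind is "field"
	Desc  bool
}

// parseSort interprets one string from the "sort" array.
func parseSort(s string) sortSpec {
	desc := strings.HasPrefix(s, "-")
	name := strings.TrimPrefix(s, "-")
	switch name {
	case "_id":
		return sortSpec{Kind: "id", Desc: desc}
	case "_score":
		return sortSpec{Kind: "score", Desc: desc}
	default:
		return sortSpec{Kind: "field", Field: name, Desc: desc}
	}
}

func main() {
	for _, s := range []string{"-abv", "-_score"} {
		fmt.Printf("%+v\n", parseSort(s))
	}
}
```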
Marty Schoch
be56380833 fix SearchRequest parsing to default to proper default sort order 2016-08-12 14:49:22 -04:00
Marty Schoch
0bb69a9a1c Merge branch 'master' of https://github.com/dtylman/bleve into sort-by-field-try2 2016-08-12 14:23:55 -04:00
Danny Tylman
b585c5786b removing mock data generation packages from unit-tests
fixing wrong sort order on certain fields
2016-08-11 11:35:08 +03:00
Marty Schoch
5f1454106d Merge pull request #402 from mschoch/indexapiwork
Index/Search API work
2016-08-10 12:41:51 -04:00