the previous impl always did a full utf8 decode of every rune
if we assume most tokens are not possessive this is unnecessary
and even if they are, we only need to chop off the last two runes
so, now we only decode the last rune of the token, and if it looks like
s/S then we proceed to decode the second-to-last rune, and then
only if it looks like any form of apostrophe do we make any
changes to the token, again by just reslicing the original to chop
off the possessive extension
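the approach above can be sketched roughly as follows; the function and variable names here are illustrative, not the actual bleve identifiers:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// stripPossessive decodes only the last rune of the token; only if it is
// s/S does it decode the second-to-last rune, and only if that is an
// apostrophe does it reslice the original token. No new allocation occurs.
func stripPossessive(token []byte) []byte {
	last, lastSize := utf8.DecodeLastRune(token)
	if last != 's' && last != 'S' {
		return token // common case: not possessive, nothing decoded further
	}
	rest := token[:len(token)-lastSize]
	apos, aposSize := utf8.DecodeLastRune(rest)
	// accept the ASCII apostrophe and common Unicode apostrophe forms
	if apos == '\'' || apos == '\u2019' || apos == '\uFF07' {
		return rest[:len(rest)-aposSize]
	}
	return token
}

func main() {
	fmt.Println(string(stripPossessive([]byte("john's")))) // john
	fmt.Println(string(stripPossessive([]byte("city’s")))) // city
	fmt.Println(string(stripPossessive([]byte("glass"))))  // glass
}
```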
this change was started to improve code coverage,
but it also improves performance and adds support for escaping
escaping:
The following quoted string enumerates the characters which
may be escaped.
"+-=&|><!(){}[]^\"~*?:\\/ "
Note that this list includes space.
In order to escape these characters, they are prefixed with the \
(backslash) character. In all cases, using the escaped version
produces the character itself and is not interpreted by the
lexer.
Two simple examples:
my\ name
Will be interpreted as a single argument to a match query
with the value "my name".
"contains a\" character"
Will be interpreted as a single argument to a phrase query
with the value `contains a " character`.
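the unescaping rule can be sketched as a small helper; this is an illustrative sketch, not the actual bleve lexer code:

```go
package main

import (
	"fmt"
	"strings"
)

// escapableChars mirrors the quoted set above; a backslash before any of
// these yields the literal character itself.
const escapableChars = `+-=&|><!(){}[]^"~*?:\/ `

// unescape resolves backslash escapes for the characters above, leaving
// any other byte (including unrelated backslashes) untouched.
func unescape(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		if s[i] == '\\' && i+1 < len(s) && strings.IndexByte(escapableChars, s[i+1]) >= 0 {
			i++ // skip the backslash, emit the escaped character literally
		}
		b.WriteByte(s[i])
	}
	return b.String()
}

func main() {
	fmt.Println(unescape(`my\ name`))              // my name
	fmt.Println(unescape(`contains a\" character`)) // contains a" character
}
```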
Performance:
before$ go test -v -run=xxx -bench=BenchmarkLexer
BenchmarkLexer-4 100000 13991 ns/op
PASS
ok github.com/blevesearch/bleve 1.570s
after$ go test -v -run=xxx -bench=BenchmarkLexer
BenchmarkLexer-4 500000 3387 ns/op
PASS
ok github.com/blevesearch/bleve 1.740s
the collector has optimizations to avoid allocation and reslicing
during the common case of searching for top hits
however, in some cases users request a very large number of
search hits to be returned (attempting to get them all), which
caused unnecessary allocation of RAM
to address this we introduce a new constant PreAllocSizeSkipCap
which defaults to a value of 1000. if your size+skip is less than
this constant, you get the optimized behavior. if your
size+skip is greater than this, we cap the preallocations at
this lower value. additional space is acquired on an as-needed
basis by growing the DocumentMatchPool and reslicing the
collector's backing slice
applications can change the value of PreAllocSizeSkipCap to suit
their own needs
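the capping logic might look roughly like this; the constant name matches the commit, but the surrounding function is an illustrative sketch:

```go
package main

import "fmt"

// PreAllocSizeSkipCap caps how much the collector preallocates when
// size+skip is very large; applications can override it.
var PreAllocSizeSkipCap = 1000

// backingSize returns the number of DocumentMatch slots to preallocate
// for a request returning `size` hits after skipping `skip` hits.
func backingSize(size, skip int) int {
	if size+skip > PreAllocSizeSkipCap {
		// cap the preallocation; additional space is grown on demand
		return PreAllocSizeSkipCap + 1
	}
	return size + skip + 1
}

func main() {
	fmt.Println(backingSize(10, 0))    // 11: small request, fully preallocated
	fmt.Println(backingSize(10000, 0)) // 1001: capped
}
```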
fixes #408
counter-intuitively, the list impl was faster than the heap
the theory was that the heap did more comparisons and swapping,
so even though it benefited from avoiding the interface and from
some cache locality, it was still slower
the idea here is to just use a raw slice kept in order
this avoids the need for the interface, but can take the same
comparison approach as the list
it seems to work out:
go test -run=xxx -bench=. -benchmem -cpuprofile=cpu.out
BenchmarkTop10of100000Scores-4 5000 299959 ns/op 2600 B/op 36 allocs/op
BenchmarkTop100of100000Scores-4 2000 601104 ns/op 20720 B/op 216 allocs/op
BenchmarkTop10of1000000Scores-4 500 3450196 ns/op 2616 B/op 36 allocs/op
BenchmarkTop100of1000000Scores-4 500 3874276 ns/op 20856 B/op 216 allocs/op
PASS
ok github.com/blevesearch/bleve/search/collectors 7.440s
the TopNCollector can now use either a heap or a list
i did not code it to use an interface, because this is a very hot
loop during searching. rather, it lets bleve developers easily
toggle between the two (or other ideas) by changing 2 lines
The list is faster in the benchmark, but causes more allocations.
The list is once again the default (for now).
To switch to the heap implementation, change:
store *collectStoreList
to
store *collectStoreHeap
and
newStoreList(...
to
newStoreHeap(...
the primary change is going back to sort values being []string
and not []interface{}; this avoids the allocations required to
convert into interface{}
that sounds obvious, so why didn't we just do that first?
because a common (default) sort is on score, which is naturally
a number, not a string (like terms). converting the number was
also expensive, and it was the common case.
so, this solution also makes the change to NOT put the score
into the sort value list. instead you see the dummy value
"_score". this is just a placeholder; the actual sort impl
knows that field of the sort is the score, and will sort
using the actual score.
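the "_score" placeholder trick can be sketched as follows; the types here stand in for search.DocumentMatch and are not the actual bleve code:

```go
package main

import "fmt"

// hit stands in for search.DocumentMatch: sort values are plain []string,
// with the placeholder "_score" marking the position where the numeric
// score should be compared, so scores are never converted to strings.
type hit struct {
	Score float64
	Sort  []string
}

// less compares two hits on sort field i, consulting the actual score
// only when the placeholder is present.
func less(a, b hit, i int) bool {
	if a.Sort[i] == "_score" {
		return a.Score > b.Score // higher score sorts first
	}
	return a.Sort[i] < b.Sort[i]
}

func main() {
	a := hit{Score: 2.5, Sort: []string{"_score"}}
	b := hit{Score: 1.0, Sort: []string{"_score"}}
	fmt.Println(less(a, b, 0)) // true: a outranks b on score
}
```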
also, several other aspects of the benchmark were cleaned up
so that unnecessary allocations do not pollute the cpu profiles
Here are the updated benchmarks:
$ go test -run=xxx -bench=. -benchmem -cpuprofile=cpu.out
BenchmarkTop10of100000Scores-4 3000 465809 ns/op 2548 B/op 33 allocs/op
BenchmarkTop100of100000Scores-4 2000 626488 ns/op 21484 B/op 213 allocs/op
BenchmarkTop10of1000000Scores-4 300 5107658 ns/op 2560 B/op 33 allocs/op
BenchmarkTop100of1000000Scores-4 300 5275403 ns/op 21624 B/op 213 allocs/op
PASS
ok github.com/blevesearch/bleve/search/collectors 7.188s
Prior to this PR, master reported:
$ go test -run=xxx -bench=. -benchmem
BenchmarkTop10of100000Scores-4 3000 453269 ns/op 360161 B/op 42 allocs/op
BenchmarkTop100of100000Scores-4 2000 519131 ns/op 388275 B/op 219 allocs/op
BenchmarkTop10of1000000Scores-4 200 7459004 ns/op 4628236 B/op 52 allocs/op
BenchmarkTop100of1000000Scores-4 200 8064864 ns/op 4656596 B/op 232 allocs/op
PASS
ok github.com/blevesearch/bleve/search/collectors 7.385s
So, we're pretty close on the smaller datasets, and we scale better on the larger datasets.
We also show fewer allocations and bytes in all cases (some of this is artificial due to test cleanup).
this change means simple sort requirements no longer require
importing the search package (high-level API goal)
also the sort test at the top-level was changed to use this form
previously, from JSON we would just deserialize strings like
"-abv" or "city" or "_id" or "_score" as simple sorts
on fields, ids, or scores respectively
while this is simple and compact, it can be ambiguous (for
example, if you have a field starting with - or if you have a field
named "_id" already). also, this simple syntax doesn't allow us
to specify more complex options to deal with type/mode/missing
we keep support for the simple string syntax, but now also
recognize a more expressive syntax like:
{
"by": "field",
"field": "abv",
"desc": true,
"type": "string",
"mode": "min",
"missing": "first"
}
type, mode and missing are optional and default to
"auto", "default", and "last" respectively
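deserializing this expressive form with its defaults can be sketched like so; the struct and function names are illustrative, not the actual bleve types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sortField mirrors the expressive JSON form above; the struct tags match
// the JSON keys exactly.
type sortField struct {
	By      string `json:"by"`
	Field   string `json:"field"`
	Desc    bool   `json:"desc"`
	Type    string `json:"type"`
	Mode    string `json:"mode"`
	Missing string `json:"missing"`
}

// parseSortField applies the documented defaults for the optional keys
// before unmarshaling, so absent keys keep their default values.
func parseSortField(data []byte) (sortField, error) {
	sf := sortField{Type: "auto", Mode: "default", Missing: "last"}
	err := json.Unmarshal(data, &sf)
	return sf, err
}

func main() {
	sf, _ := parseSortField([]byte(`{"by":"field","field":"abv","desc":true}`))
	fmt.Printf("%+v\n", sf)
	// {By:field Field:abv Desc:true Type:auto Mode:default Missing:last}
}
```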
The syntax used is an array of strings. The strings "_id" and
"_score" are special and reserved to mean sorting on the document
id and score respectively. All other strings refer to the literal
field name with that value. If the string is prefixed with "-",
the order of that sort is descending; without it, it defaults to
ascending.
Examples:
"sort":["-abv","-_score"]
This will sort results in decreasing order of the "abv" field.
Results which have the same value of the "abv" field will then
be sorted by their score, also decreasing.
If no value for "sort" is provided in the search request, the
default sorting is the same as before, which is decreasing score.
the motivation for this commit is long and detailed and has been
documented externally here:
https://gist.github.com/mschoch/5cc5c9cf4669a5fe8512cb7770d3c1a2
the core of the changes are:
1. recognize that the collector/searcher need only a fixed number
of DocumentMatch instances, and this number can be determined
from the structure of the query, not the size of the data
2. knowing this, instances can be allocated in bulk, up front,
and they can be reused without locking (since all search
operations take place in a single goroutine)
3. combined with previous commits which enabled reuse of
the IndexInternalID []byte, this allows for no allocation/copy
of these bytes as well (by using the DocumentMatch Reset() method
when returning entries to the pool)
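the three points above can be sketched together; the types here are illustrative stand-ins (the real one is search.DocumentMatchPool):

```go
package main

import "fmt"

// docMatch stands in for search.DocumentMatch; Reset clears it for reuse
// while keeping the IndexInternalID backing []byte to avoid reallocation.
type docMatch struct {
	IndexInternalID []byte
	Score           float64
}

func (d *docMatch) Reset() *docMatch {
	d.IndexInternalID = d.IndexInternalID[:0] // keep capacity, drop contents
	d.Score = 0
	return d
}

// matchPool hands out instances allocated in bulk, up front; since all
// search operations run in one goroutine, no locking is needed.
type matchPool struct {
	avail []*docMatch
}

func newMatchPool(size int) *matchPool {
	p := &matchPool{avail: make([]*docMatch, size)}
	for i := range p.avail {
		p.avail[i] = &docMatch{}
	}
	return p
}

func (p *matchPool) Get() *docMatch {
	if len(p.avail) == 0 {
		return &docMatch{} // grow on demand if the estimate was too low
	}
	d := p.avail[len(p.avail)-1]
	p.avail = p.avail[:len(p.avail)-1]
	return d
}

func (p *matchPool) Put(d *docMatch) {
	p.avail = append(p.avail, d.Reset())
}

func main() {
	pool := newMatchPool(2)
	d := pool.Get()
	d.IndexInternalID = append(d.IndexInternalID, 0x01)
	pool.Put(d)
	fmt.Println(len(pool.Get().IndexInternalID)) // 0: reset on return
}
```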
in a recent commit, we changed the code to reuse
TermFrequencyRow objects instead of constantly allocating new
ones. unfortunately, one of the original methods was not coded
with this reuse in mind, and a lazy initialization caused us to
leak data from previous uses of the same object.
in particular, this caused term vector information from previous
hits to still be applied to subsequent hits. eventually this
caused the highlighter to try to highlight invalid regions
of a slice.
fixes #404
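the failure mode described above can be sketched generically; the type and functions here are illustrative, not the actual TermFrequencyRow code:

```go
package main

import "fmt"

// row stands in for a reused TermFrequencyRow; only the vectors field
// matters for this sketch.
type row struct {
	vectors []int
}

// parseBuggy shows the leak: it lazily appends without clearing, so a
// reused row keeps term vectors from the previous hit.
func parseBuggy(r *row, vectors []int) {
	r.vectors = append(r.vectors, vectors...)
}

// parseFixed truncates first, so reuse never leaks prior term vectors.
func parseFixed(r *row, vectors []int) {
	r.vectors = r.vectors[:0]
	r.vectors = append(r.vectors, vectors...)
}

func main() {
	r := &row{}
	parseBuggy(r, []int{1, 2})
	parseBuggy(r, []int{3})
	fmt.Println(r.vectors) // [1 2 3]: stale vectors leak into the second hit

	r = &row{}
	parseFixed(r, []int{1, 2})
	parseFixed(r, []int{3})
	fmt.Println(r.vectors) // [3]
}
```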