bleve

Author	SHA1	Message	Date
Marty Schoch	47c239ca7b	refactored data structure out of collector the TopNCollector now can either use a heap or a list i did not code it to use an interface, because this is a very hot loop during searching. rather, it lets bleve developers easily toggle between the two (or other ideas) by changing 2 lines The list is faster in the benchmark, but causes more allocations. The list is once again the default (for now). To switch to the heap implementation, change: store collectStoreList to store collectStoreHeap and newStoreList(... to newStoreHeap(...	2016-08-26 10:29:50 -04:00
Marty Schoch	3f8757c05b	slight fixup to last change to set the sort value i'd like the sort value to be correct even with the optimizations not using it	2016-08-25 23:13:22 -04:00
Marty Schoch	931ec677c4	completely avoid dynamic dispatch if only sorting on score	2016-08-25 22:59:08 -04:00
Marty Schoch	127f37212b	cache values to avoid dynamic dispatch inside hot loop	2016-08-25 16:24:26 -04:00
Marty Schoch	60750c1614	improved implementation to address perf regressions primary change is going back to sort values be []string and not []interface{}, this avoid allocatiosn converting into the interface{} that sounds obvious, so why didn't we just do that first? because a common (default) sort is score, which is naturally a number, not a string (like terms). converting into the number was also expensive, and the common case. so, this solution also makes the change to NOT put the score into the sort value list. instead you see the dummy value "_score". this is just a placeholder, the actual sort impl knows that field of the sort is the score, and will sort using the actual score. also, several other aspets of the benchmark were cleaned up so that unnecessary allocations do not pollute the cpu profiles Here are the updated benchmarks: $ go test -run=xxx -bench=. -benchmem -cpuprofile=cpu.out BenchmarkTop10of100000Scores-4 3000 465809 ns/op 2548 B/op 33 allocs/op BenchmarkTop100of100000Scores-4 2000 626488 ns/op 21484 B/op 213 allocs/op BenchmarkTop10of1000000Scores-4 300 5107658 ns/op 2560 B/op 33 allocs/op BenchmarkTop100of1000000Scores-4 300 5275403 ns/op 21624 B/op 213 allocs/op PASS ok github.com/blevesearch/bleve/search/collectors 7.188s Prior to this PR, master reported: $ go test -run=xxx -bench=. -benchmem BenchmarkTop10of100000Scores-4 3000 453269 ns/op 360161 B/op 42 allocs/op BenchmarkTop100of100000Scores-4 2000 519131 ns/op 388275 B/op 219 allocs/op BenchmarkTop10of1000000Scores-4 200 7459004 ns/op 4628236 B/op 52 allocs/op BenchmarkTop100of1000000Scores-4 200 8064864 ns/op 4656596 B/op 232 allocs/op PASS ok github.com/blevesearch/bleve/search/collectors 7.385s So, we're pretty close on the smaller datasets, and we scale better on the larger datasets. We also show fewer allocations and bytes in all cases (some of this is artificial due to test cleanup).	2016-08-25 15:47:07 -04:00
Marty Schoch	ce0b299d6f	switch sort impl to use interface this improves perf in the case where we're not doing any sorting as we avoid allocating memory and converting scores into numeric terms	2016-08-24 19:02:22 -04:00
Marty Schoch	5e94145cf4	apply same colletor benchmark change	2016-08-24 15:56:26 -04:00
Marty Schoch	94489fa778	change collector benchmark to not reuse collector instances they are never reused in practice and the original design did not consider reuse future alternate implementations are not reusable	2016-08-24 15:14:40 -04:00
Marty Schoch	0322ecd441	adjust new sort functionality to also work with MultiSearch	2016-08-24 14:07:10 -04:00
Marty Schoch	1ae938b781	add integration tests for sorting	2016-08-20 14:45:53 -04:00
Steve Yen	eaa59621ff	tighter moss KV store iterator handling	2016-08-19 09:10:03 -07:00
Marty Schoch	2311d060d1	add example usage of SortBy and SortByCustom	2016-08-18 13:03:48 -07:00
Marty Schoch	27f5c6ec92	expose simple string slice based sorting to top-level bleve this change means simple sort requirements no longer require importing the search package (high-level API goal) also the sort test at the top-level was changed to use this form	2016-08-17 14:49:06 -07:00
Marty Schoch	27ba6187bc	adds support for more complex field sorts with object (not string) previously from JSON we would just deserialize strings like "-abv" or "city" or "_id" or "_score" as simple sorts on fields, ids or scores respectively while this is simple and compact, it can be ambiguous (for example if you have a field starting with - or if you have a field named "_id" already. also, this simple syntax doesnt allow us to specify more cmoplex options to deal with type/mode/missing we keep support for the simple string syntax, but now also recognize a more expressive syntax like: { "by": "field", "field": "abv", "desc": true, "type": "string", "mode": "min", "missing": "first" } type, mode and missing are optional and default to "auto", "default", and "last" respectively	2016-08-17 14:33:51 -07:00
Marty Schoch	750e0ac16c	change sort field impl to use indexed values not stored values	2016-08-17 09:20:44 -07:00
Marty Schoch	0d873916f0	support JSON marshal/unmarshal of search request sort The syntax used is an array of strings. The strings "_id" and "_score" are special and reserved to mean sorting on the document id and score repsectively. All other strings refer to the literal field name with that value. If the string is prefixed with "-" the order of that sort is descending, without it, it defaults to ascending. Examples: "sort":["-abv","-_score"] This will sort results in decreasing order of the "abv" field. Results which have the same value of the "abv" field will then be sorted by their score, also decreasing. If no value for "sort" is provided in the search request the default soring is the same as before, which is decreasing score.	2016-08-12 19:16:24 -04:00
Marty Schoch	be56380833	fix SearchRequest parsing to default to proper default sort order	2016-08-12 14:49:22 -04:00
Marty Schoch	0bb69a9a1c	Merge branch 'master' of https://github.com/dtylman/bleve into sort-by-field-try2	2016-08-12 14:23:55 -04:00
Danny Tylman	b585c5786b	removing mock data generation packages from unit-tests fixing wrong sort order on certain fields	2016-08-11 11:35:08 +03:00
Marty Schoch	5f1454106d	Merge pull request #402 from mschoch/indexapiwork Index/Search API work	2016-08-10 12:41:51 -04:00
Danny Tylman	7d4b7fb67d	Adding sort to SearchRequest.	2016-08-10 16:16:58 +03:00
Danny Tylman	0d6a2b565f	closes #110	2016-08-10 11:44:31 +03:00
Danny Tylman	cb8a1e95a8	closes #110	2016-08-10 11:41:16 +03:00
Danny Tylman	176a6e7a0d	Adding sort to SearchRequest.	2016-08-10 11:35:47 +03:00
Danny Tylman	154d1b904b	Adding sort to SearchRequest.	2016-08-10 11:13:38 +03:00
Marty Schoch	9333bac2c8	added test case for DocumentMatchPool	2016-08-09 11:35:12 -04:00
Danny Tylman	5164e70f6e	Adding sort to SearchRequest.	2016-08-09 16:18:53 +03:00
Marty Schoch	24a2b57e29	refactor search package to reuse DocumentMatch and ID []byte's the motivation for this commit is long and detailed and has been documented externally here: https://gist.github.com/mschoch/5cc5c9cf4669a5fe8512cb7770d3c1a2 the core of the changes are: 1. recognize that collector/searcher need only a fixed number of DocumentMatch instances, and this number can be determined from the structure of the query, not the size of the data 2. knowing this, instances can be allocated in bulk, up front and they can be reused without locking (since all search operations take place in a single goroutine 3. combined with previous commits which enabled reuse of the IndexInternalID []byte, this allows for no allocation/copy of these bytes as well (by using DocumentMatch Reset() method when returning entries to the pool	2016-08-08 22:21:47 -04:00
Marty Schoch	aa3ae3d39c	enable read_only mode for boltdb indexes fixes #405	2016-08-06 10:47:34 -04:00
Marty Schoch	da794d3762	fix bug introduced by reuse of TermFrequencyRow values in a recent commit, we changed the code to reuse TermFrequencyRow objects intsead of constantly allocating new ones. unfortunately, one of the original methods was not coded with this reuse in mind, and a lazy initialization cause us to leak data from previous uses of the same object. in particular this caused term vector information from previous hits to still be applied to subsequent hits. eventually this causes the highlighter to try and highlight invalid regions of a slice. fixes #404	2016-08-05 08:33:04 -04:00
Marty Schoch	b857769217	document Reset behavior as its non-obvious	2016-08-03 17:16:15 -04:00
Marty Schoch	d7405a4d79	updated attempt to reuse []byte previous attempt was flawed (but maked by Reset() method) new approach is to do this work in the Reset() method itself, logically this is where it belongs. but further we acknowledge that IndexInternalID []byte lifetime lives beyond the TermFieldDoc, so another copy is made into the DocumentMatch. Although this introduces yet another copy the theory being tested is that it allows each of these structuress to reuse memory without additional allocation.	2016-08-03 17:01:27 -04:00
Marty Schoch	89d83cb5a1	reuse memory already allocated for copies of docids when the term field reader is copying ID values out of the kv store's iterator, it is already attempting to reuse the term frequency row data structure. this change allows us to also attempt to reuse the []byte allocated for previous copies of the docid. we reset the slice length to zero then copy the data into the existing slice, avoiding new allocation and garbage collection in the cases where there is already enough space	2016-08-03 13:45:48 -04:00
Marty Schoch	4b1b866e0f	remove commented out old code	2016-08-02 16:48:00 -04:00
Marty Schoch	36de4a7097	cleaner fix for the TermFrequencyRow reuse bug reset to nil first, let remaining logic work as before	2016-08-01 17:17:29 -04:00
Marty Schoch	cfce9c5fc5	initialize term vector list in parseV otherwise reusing previous term frequency row causes us to keep tacking on to one gigantic list	2016-08-01 17:01:34 -04:00
Marty Schoch	172ca7e69e	need to copy the doc ID for it to survive past next iteration	2016-08-01 17:01:04 -04:00
Marty Schoch	e188fe35f7	switch back to single DocumentMatch struct instead of separate DocumentMatch/DocumentMatchInternal rules are simple, everything operates on the IndexInternalID field until the results are returned, then ID is set correctly the IndexInternalID field is not exported to JSON	2016-08-01 14:58:02 -04:00
Marty Schoch	1aacd9bad5	changed approach IndexInternalID is now []byte this is still opaque, and should still work for any future index implementations as it is a least common denominator choice, all implementations must internally represent the id as []byte at some point for storage to disk	2016-08-01 14:26:50 -04:00
Marty Schoch	5aa9e95468	major refactor of index/search API index id's are now opaque (until finally returned to top-level user) - the TermFieldDoc's returned by TermFieldReader no longer contain doc id - instead they return an opaque IndexInternalID - items returned are still in the "natural index order" - but that is no longer guaranteed to be "doc id order" - correct behavior requires that they all follow the same order - but not any particular order - new API FinalizeDocID which converts index internal ID's to public string ID - APIs used internally which previously took doc id now take IndexInternalID - that is DocumentFieldTerms() and DocumentFieldTermsForFields() - however, APIs that are used externally do not reflect this change - that is Document() - DocumentIDReader follows the same changes, but this is less obvious - behavior clarified, used to iterate doc ids, BUT NOT in doc id order - method STILL available to iterate doc ids in range - but again, you won't get them in any meaningful order - new method to iterate actual doc ids from list of possible ids - this was introduced to make the DocIDSearcher continue working searchers now work with the new opaque index internal doc ids - they return new DocumentMatchInternal (which does not have string ID) scorerers also work with these opaque index internal doc ids - they return DocumentMatchInternal (which does not have string ID) collectors now also perform a final step of converting the final result - they STILL return traditional DocumentMatch (with string ID) - but they now also require an IndexReader (so that they can do the conversion)	2016-07-31 13:46:18 -04:00
Marty Schoch	47ee69ae82	term field reader supports optionally omitting 3 details at the time you create the term field reader, you can specify that you don't need the term freq, the norm, or the term vectors in that case, the index implementation can choose to not return them in its subsequently returned values this is advisory only, some simple implementations may ignore this and continue to return the values anyway (as the current impl of upside_down does today) this change will allow future index implementations the opportunity to do less work when it isn't required	2016-07-30 10:26:42 -04:00
Marty Schoch	389e18a779	attempt to support google app engine the default configuration, which sets the default kv engine to boltdb is now done in file protected with the !appengine build tag. this at least lets the analysis-wizzard app run locally in the appengine simulator. this still has not been tested on the real appengine, and further changes may be required.	2016-07-29 21:29:05 -04:00
Marty Schoch	b158fb147d	Merge pull request #400 from steveyen/WIP-search-optimizations search optimizations	2016-07-29 17:30:35 -04:00
Steve Yen	4822cff63a	optimize Advance() with pre-allocated in-out param This perf-related change helps the code and API reach more similarity with the Next() methods, which now take a pre-allocate param.	2016-07-29 14:15:00 -07:00
Steve Yen	3c82086805	optimize upside_down reader & 64-bit struct alignments The UpsideDownCouchTermFieldReader.Next() only needs the doc ID from the key, so this change provides a specialized parseKDoc() method for that optimization. Additionally, fields in various structs are more 64-bit aligned, in an attempt to reduce the invocations of runtime.typedmemmove() and runtime.heapBitsBulkBarrier(), which the go compiler seems to automatically insert to transparently handle misaligned data.	2016-07-23 10:37:40 -07:00
Steve Yen	e33ae65cd2	optimize SqrtCache as just-an-array	2016-07-21 19:41:33 -07:00
Steve Yen	5094d2d097	optimize moss PrefixIterator Previously, the PrefixIterator() for moss was implemented by comparing the prefix bytes on every Next(). With this optimization, the next larger endKeyExclusive is computed at the iterator's initialization, which allows us to avoid all those prefix comparisons.	2016-07-21 18:33:34 -07:00
Steve Yen	5271a0f62b	optimize termFieldVectorsFromTermVectors when empty	2016-07-21 11:46:14 -07:00
Steve Yen	b8c8478783	optimize collector to check ctx.Done() only occasionally	2016-07-21 11:10:49 -07:00
Steve Yen	cbb174b074	optimize moss iterator Next() done/k/v maintenance	2016-07-21 11:10:49 -07:00

... 15 16 17 18 19 ...

1825 Commits