the collector has optimizations to avoid allocation and reslicing
in the common case of searching for top hits.
however, in some cases users request a very large number of
search hits to be returned (attempting to get them all), and this
caused unnecessary allocation of RAM.
to address this we introduce a new constant PreAllocSizeSkipCap,
which defaults to 1000. if your size+skip is less than
this constant, you get the optimized behavior. if your
size+skip is greater than this, we cap the preallocations at
this lower value. additional space is acquired on an as-needed
basis by growing the DocumentMatchPool and reslicing the
collector backing slice.
applications can change the value of PreAllocSizeSkipCap to suit
their own needs
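as an illustration, a minimal sketch of how the cap might be applied
when sizing the preallocations (names are illustrative of the
pattern, not the exact bleve code):

    // PreAllocSizeSkipCap caps up-front preallocation for deep result sets.
    var PreAllocSizeSkipCap = 1000

    // backingSize returns how many DocumentMatch slots to preallocate
    // for a search requesting size hits with a skip offset.
    func backingSize(size, skip int) int {
        needed := size + skip + 1 // one extra slot for the incoming candidate
        if size+skip > PreAllocSizeSkipCap {
            // cap the up-front allocation; the pool and the collector
            // backing slice grow on demand if more space is needed
            needed = PreAllocSizeSkipCap + 1
        }
        return needed
    }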
fixes #408
counter-intuitively, the list impl was faster than the heap.
the theory was that the heap did more comparisons and swapping,
so even though it benefited from avoiding the interface and from
some cache locality, it was still slower.
the idea was to just use a raw slice kept in order:
this avoids the need for an interface, but can take the same
comparison approach as the list.
it seems to work out:
$ go test -run=xxx -bench=. -benchmem -cpuprofile=cpu.out
BenchmarkTop10of100000Scores-4 5000 299959 ns/op 2600 B/op 36 allocs/op
BenchmarkTop100of100000Scores-4 2000 601104 ns/op 20720 B/op 216 allocs/op
BenchmarkTop10of1000000Scores-4 500 3450196 ns/op 2616 B/op 36 allocs/op
BenchmarkTop100of1000000Scores-4 500 3874276 ns/op 20856 B/op 216 allocs/op
PASS
ok github.com/blevesearch/bleve/search/collectors 7.440s
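for illustration, the ordered-slice approach boils down to an insert
like the following (a minimal sketch over bare scores; the real store
holds full hits):

    // insertSorted keeps scores ordered lowest-first, placing the new
    // score at its sorted position; no interface dispatch, and the
    // single backing array stays cache-friendly.
    func insertSorted(scores []float64, score float64) []float64 {
        i := 0
        for i < len(scores) && scores[i] < score {
            i++
        }
        scores = append(scores, 0)     // grow by one slot
        copy(scores[i+1:], scores[i:]) // shift the tail right
        scores[i] = score
        return scores
    }

when the slice exceeds N+K entries, trimming the lowest-scoring hit
is just a reslice off the front (or a copy-down to reuse the array).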
the TopNCollector can now use either a heap or a list.
i did not code it to use an interface, because this is a very hot
loop during searching. rather, it lets bleve developers easily
toggle between the two (or other ideas) by changing 2 lines.
The list is faster in the benchmark, but causes more allocations.
The list is once again the default (for now).
To switch to the heap implementation, change:
store *collectStoreList
to
store *collectStoreHeap
and
newStoreList(...
to
newStoreHeap(...
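in context, the toggle pattern looks roughly like this (a sketch;
the surrounding fields and constructor arguments are elided or
illustrative):

    type collectStoreList struct{ /* ordered-slice impl elided */ }
    type collectStoreHeap struct{ /* heap impl elided */ }

    func newStoreList(capacity int) *collectStoreList { return &collectStoreList{} }
    func newStoreHeap(capacity int) *collectStoreHeap { return &collectStoreHeap{} }

    type TopNCollector struct {
        store *collectStoreList // change this field type to *collectStoreHeap ...
    }

    func NewTopNCollector(size int) *TopNCollector {
        return &TopNCollector{
            store: newStoreList(size), // ... and this call to newStoreHeap(size)
        }
    }

because the field is a concrete pointer type, the hot loop makes
direct method calls with no interface dispatch.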
primary change is going back to sort values being []string
and not []interface{}; this avoids allocations converting
into the interface{}.
that sounds obvious, so why didn't we just do that first?
because a common (default) sort is score, which is naturally
a number, not a string (like terms). converting that number
into a string was also expensive, and it was the common case.
so, this solution also makes the change to NOT put the score
into the sort value list. instead you see the dummy value
"_score". this is just a placeholder; the actual sort impl
knows which field of the sort is the score, and will sort
using the actual score.
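a minimal sketch of the placeholder idea (hypothetical types and
names, not the exact bleve sort code):

    // "_score" marks a sort field whose value lives in the hit's
    // float score, so the number is never stringified into the
    // sort values.
    const scorePlaceholder = "_score"

    type hit struct {
        Score      float64
        SortValues []string
    }

    // less compares two hits on sort field i, honoring the placeholder.
    func less(a, b *hit, i int) bool {
        if a.SortValues[i] == scorePlaceholder {
            return a.Score > b.Score // by convention, higher score sorts first
        }
        return a.SortValues[i] < b.SortValues[i]
    }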
also, several other aspects of the benchmark were cleaned up
so that unnecessary allocations do not pollute the cpu profiles.
Here are the updated benchmarks:
$ go test -run=xxx -bench=. -benchmem -cpuprofile=cpu.out
BenchmarkTop10of100000Scores-4 3000 465809 ns/op 2548 B/op 33 allocs/op
BenchmarkTop100of100000Scores-4 2000 626488 ns/op 21484 B/op 213 allocs/op
BenchmarkTop10of1000000Scores-4 300 5107658 ns/op 2560 B/op 33 allocs/op
BenchmarkTop100of1000000Scores-4 300 5275403 ns/op 21624 B/op 213 allocs/op
PASS
ok github.com/blevesearch/bleve/search/collectors 7.188s
Prior to this PR, master reported:
$ go test -run=xxx -bench=. -benchmem
BenchmarkTop10of100000Scores-4 3000 453269 ns/op 360161 B/op 42 allocs/op
BenchmarkTop100of100000Scores-4 2000 519131 ns/op 388275 B/op 219 allocs/op
BenchmarkTop10of1000000Scores-4 200 7459004 ns/op 4628236 B/op 52 allocs/op
BenchmarkTop100of1000000Scores-4 200 8064864 ns/op 4656596 B/op 232 allocs/op
PASS
ok github.com/blevesearch/bleve/search/collectors 7.385s
So, we're pretty close on the smaller datasets, and we scale better on the larger datasets.
We also show fewer allocations and bytes in all cases (some of this is artificial due to test cleanup).
the motivation for this commit is long and detailed and has been
documented externally here:
https://gist.github.com/mschoch/5cc5c9cf4669a5fe8512cb7770d3c1a2
the core of the changes are:
1. recognize that collector/searcher need only a fixed number
of DocumentMatch instances, and this number can be determined
from the structure of the query, not the size of the data
2. knowing this, instances can be allocated in bulk, up front,
and they can be reused without locking, since all search
operations take place in a single goroutine (sketched below)
3. combined with previous commits which enabled reuse of
the IndexInternalID []byte, this allows for no allocation/copy
of these bytes as well (by using the DocumentMatch Reset() method
when returning entries to the pool)
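a minimal sketch of the pool idea (illustrative, not the exact bleve
implementation):

    type DocumentMatch struct {
        IndexInternalID []byte
        Score           float64
    }

    // Reset clears the match for reuse; reslicing IndexInternalID to
    // zero length keeps its backing array, so the bytes are reused
    // without a fresh allocation or copy.
    func (dm *DocumentMatch) Reset() *DocumentMatch {
        dm.IndexInternalID = dm.IndexInternalID[:0]
        dm.Score = 0
        return dm
    }

    // DocumentMatchPool hands out instances with no locking; a single
    // search runs entirely in one goroutine.
    type DocumentMatchPool struct {
        avail []*DocumentMatch
    }

    func NewDocumentMatchPool(size int) *DocumentMatchPool {
        block := make([]DocumentMatch, size) // one bulk allocation up front
        avail := make([]*DocumentMatch, size)
        for i := range block {
            avail[i] = &block[i]
        }
        return &DocumentMatchPool{avail: avail}
    }

    func (p *DocumentMatchPool) Get() *DocumentMatch {
        n := len(p.avail)
        if n == 0 {
            return &DocumentMatch{} // exhausted: fall back to allocating
        }
        dm := p.avail[n-1]
        p.avail = p.avail[:n-1]
        return dm
    }

    func (p *DocumentMatchPool) Put(dm *DocumentMatch) {
        p.avail = append(p.avail, dm.Reset())
    }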
instead of separate DocumentMatch/DocumentMatchInternal,
the rules are simple: everything operates on the IndexInternalID
field until the results are returned, then ID is set correctly.
the IndexInternalID field is not exported to JSON.
IndexInternalID is now []byte.
this is still opaque, and should still work for any future
index implementations, as it is a least-common-denominator
choice: all implementations must internally represent the
id as []byte at some point for storage to disk.
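as a sketch, the type amounts to the following (the Equals helper is
shown for illustration):

    import "bytes"

    // IndexInternalID is an opaque document identifier; []byte is the
    // least common denominator, since every index implementation must
    // represent the id as bytes at some point to store it on disk.
    type IndexInternalID []byte

    // Equals reports whether two internal ids refer to the same doc.
    func (id IndexInternalID) Equals(other IndexInternalID) bool {
        return bytes.Equal(id, other)
    }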
index ids are now opaque (until finally returned to the top-level user)
- the TermFieldDoc's returned by TermFieldReader no longer contain doc id
- instead they return an opaque IndexInternalID
- items returned are still in the "natural index order"
- but that is no longer guaranteed to be "doc id order"
- correct behavior requires that they all follow the same order
- but not any particular order
- new API FinalizeDocID converts index internal IDs to the public string ID
- APIs used internally which previously took doc id now take IndexInternalID
- that is DocumentFieldTerms() and DocumentFieldTermsForFields()
- however, APIs that are used externally do not reflect this change
- that is Document()
- DocumentIDReader follows the same changes, but this is less obvious
- behavior clarified: it iterates doc ids, BUT NOT in doc id order
- method STILL available to iterate doc ids in range
- but again, you won't get them in any meaningful order
- new method to iterate actual doc ids from list of possible ids
- this was introduced to make the DocIDSearcher continue working
searchers now work with the new opaque index internal doc ids
- they return the new DocumentMatchInternal (which does not have a string ID)
scorers also work with these opaque index internal doc ids
- they return DocumentMatchInternal (which does not have a string ID)
collectors now also perform a final step of converting the final results
- they STILL return the traditional DocumentMatch (with string ID)
- but they now also require an IndexReader (so that they can do the conversion; see the sketch below)
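a sketch of that final conversion step (minimal hypothetical reader
interface; the real API surface differs):

    // IndexReader is the one capability the collector needs here:
    // translating an opaque internal id back to the public string id.
    type IndexReader interface {
        FinalizeDocID(internal []byte) (string, error)
    }

    // finalize converts the collected internal ids into public ids as
    // the last step before results are returned to the caller.
    func finalize(r IndexReader, internalIDs [][]byte) ([]string, error) {
        out := make([]string, 0, len(internalIDs))
        for _, id := range internalIDs {
            ext, err := r.FinalizeDocID(id)
            if err != nil {
                return nil, err
            }
            out = append(out, ext)
        }
        return out, nil
    }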
This optimization changes the search.Search.Next() interface API,
adding an optional, pre-allocated *DocumentMatch parameter.
When it's non-nil, the TermSearcher and TermQueryScorer will use that
pre-allocated *DocumentMatch, instead of allocating a brand new
DocumentMatch instance.
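a sketch of the changed signature and the caller-side reuse
(illustrative; the exact interface may differ):

    type DocumentMatch struct {
        ID    string
        Score float64
    }

    // Next accepts an optional pre-allocated match; when non-nil, the
    // implementation fills it in rather than allocating a fresh one.
    type Searcher interface {
        Next(preAllocated *DocumentMatch) (*DocumentMatch, error)
    }

    // typical caller loop reusing a single instance:
    //
    //    dm := &DocumentMatch{}
    //    for {
    //        next, err := s.Next(dm)
    //        if err != nil || next == nil {
    //            break
    //        }
    //        // ... consume next, then clear dm for the next iteration
    //        *dm = DocumentMatch{}
    //    }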
fixes #378
this bug was introduced by:
f2aba116c4
theory of operation for this collector (top N, skip K)
- collect the highest scoring N+K results
- if K > 0, skip K and return the next N
internal details
- the top N+K are kept in a list
- the list is ordered from lowest scoring (first) to highest scoring (last)
- as a hit comes in, we find where this new hit would fit into this list
- if this caused the list to get too big, trim off the head (lowest scoring hit)
theory of the optimization
- we were not tracking the lowest score in the list
- so even when a hit scored lower than the lowest entry, we would add it and then immediately trim it
- by keeping track of the lowest score in the list, we can avoid these ops
problem with the optimization
- the optimization worked by returning early
- by returning early there was a subtle change to documents which had the same score
- the reason is that which docs end up in the top N+K changed by returning early
- why was that? docs are coming in, in order by key ascending
- when finding the correct position to insert a hit into the list, we checked <, not <=, against the score
- this has the subtle effect that docs with the same score end up in reverse order
for example consider the following in progress list:
doc ids [ c a b ]
scores [ 1 5 9 ]
if we now see doc d with score 5, we get:
doc ids [ c a d b ]
scores [ 1 5 5 9 ]
While that appears in order (a, d), it is actually reverse order, because when we
produce the top N we start at the end.
theory of the fix
- previous pagination depended on later hits with the same score "bumping" earlier
hits with the same score off the bottom of the list
- however, if we change the logic to <= instead of <, now the list in the previous
example would look like:
doc ids [ c d a b ]
scores [ 1 5 5 9 ]
- this small change means that earlier hits (lower ids) now sort higher, and
thus we no longer depend on later hits bumping things down, which means returning
early is a valid thing to do
NOTE: this does depend on the hits coming back in order by ID. this is not
something strictly guaranteed, but it was the same assumption that allowed the
original behavior
This also has the side-effect that 2 hits with the same score come back in
ascending ID order, which is somehow more pleasing to me than reverse order.
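a minimal sketch of the insertion scan and the one-character fix
(over bare scores for clarity):

    // findInsertPos returns the index where a new hit with the given
    // score goes in the lowest-first list: before the first existing
    // entry it does not beat. comparing with <= (was: <) makes an
    // incoming equal-score hit land BEFORE existing equal-score hits,
    // so earlier (lower id) hits keep the positions nearer the end of
    // the list, and returning early becomes safe.
    func findInsertPos(scores []float64, score float64) int {
        for i, existing := range scores {
            if score <= existing { // was: score < existing
                return i
            }
        }
        return len(scores)
    }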
our implementation uses: golang.org/x/net/context
New method SearchInContext() allows the user to run a search
in the provided context. If that context is cancelled or
exceeds its deadline Bleve will attempt to stop and return
as soon as possible. This is a *best effort* attempt at this
time and may *not* be in a timely manner. If the caller must
return very near the timeout, the call should also be wrapped
in a goroutine.
The IndexAlias implementation is affected in a slightly more
complex way. In order to return partial results when a timeout
occurs on some indexes, the timeout is strictly enforced, and
at the moment this does introduce an additional goroutine.
The Bleve implementation honoring the context is currently
very coarse-grained. Specifically, we check the Done() channel
between each DocumentMatch produced during the search. In the
future we will propagate the context deeper into the internals
of Bleve, and this will allow finer-grained timeout behavior.
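a sketch of that coarse-grained check (illustrative of the pattern;
types are simplified):

    import "golang.org/x/net/context"

    type DocumentMatch struct{ Score float64 }

    type Searcher interface {
        Next(preAllocated *DocumentMatch) (*DocumentMatch, error)
    }

    // collect drains the searcher, polling the context between each
    // DocumentMatch; cancellation latency is bounded by roughly the
    // cost of producing one hit.
    func collect(ctx context.Context, s Searcher) error {
        for {
            select {
            case <-ctx.Done():
                return ctx.Err() // cancelled or deadline exceeded
            default:
            }
            dm, err := s.Next(nil)
            if err != nil {
                return err
            }
            if dm == nil {
                return nil // iteration complete
            }
            // ... accumulate dm into the collector's store (elided)
        }
    }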
this started initially as an effort to relocate highlighting into
a self-contained package, which would then also use
the registry.
however, it turned into a much larger refactor in
order to avoid cyclic imports.
now facets, searchers, scorers and collectors
are also broken out into subpackages of search.