0
0
Commit Graph

36 Commits

Author SHA1 Message Date
Marty Schoch
56d7bbfe1c fix benchmark names to match values used 2016-08-26 18:09:03 -04:00
Marty Schoch
b1b93d5ff9 remove unneeded code
fixes code review comment from @steveyen
2016-08-26 17:27:19 -04:00
Marty Schoch
c9310b906d introduced new collector store impl based on slice
counter-intuitively the list impl was faster than the heap
the theory was the heap did more comparisons and swapping
so even though it benefited from no interface and some cache
locality, it was still slower

the idea was to just use a raw slice kept in order
this avoids the need for interface, but can take same comparison
approach as the list

it seems to work out:

 go test -run=xxx -bench=. -benchmem -cpuprofile=cpu.out
BenchmarkTop10of100000Scores-4     	    5000	    299959 ns/op	    2600 B/op	      36 allocs/op
BenchmarkTop100of100000Scores-4    	    2000	    601104 ns/op	   20720 B/op	     216 allocs/op
BenchmarkTop10of1000000Scores-4    	     500	   3450196 ns/op	    2616 B/op	      36 allocs/op
BenchmarkTop100of1000000Scores-4   	     500	   3874276 ns/op	   20856 B/op	     216 allocs/op
PASS
ok  	github.com/blevesearch/bleve/search/collectors	7.440s
2016-08-26 11:52:49 -04:00
Marty Schoch
47c239ca7b refactored data structure out of collector
the TopNCollector now can either use a heap or a list

i did not code it to use an interface, because this is a very hot
loop during searching.  rather, it lets bleve developers easily
toggle between the two (or other ideas) by changing 2 lines

The list is faster in the benchmark, but causes more allocations.
The list is once again the default (for now).

To switch to the heap implementation, change:

store *collectStoreList
to
store *collectStoreHeap

and

newStoreList(...
to
newStoreHeap(...
2016-08-26 10:29:50 -04:00
Marty Schoch
3f8757c05b slight fixup to last change to set the sort value
i'd like the sort value to be correct even with the optimizations
not using it
2016-08-25 23:13:22 -04:00
Marty Schoch
931ec677c4 completely avoid dynamic dispatch if only sorting on score 2016-08-25 22:59:08 -04:00
Marty Schoch
127f37212b cache values to avoid dynamic dispatch inside hot loop 2016-08-25 16:24:26 -04:00
Marty Schoch
60750c1614 improved implementation to address perf regressions
primary change is going back to sort values be []string
and not []interface{}, this avoid allocatiosn converting
into the interface{}

that sounds obvious, so why didn't we just do that first?
because a common (default) sort is score, which is naturally
a number, not a string (like terms).  converting into the
number was also expensive, and the common case.

so, this solution also makes the change to NOT put the score
into the sort value list.  instead you see the dummy value
"_score".  this is just a placeholder, the actual sort impl
knows that field of the sort is the score, and will sort
using the actual score.

also, several other aspets of the benchmark were cleaned up
so that unnecessary allocations do not pollute the cpu profiles

Here are the updated benchmarks:

$ go test -run=xxx -bench=. -benchmem -cpuprofile=cpu.out
BenchmarkTop10of100000Scores-4     	    3000	    465809 ns/op	    2548 B/op	      33 allocs/op
BenchmarkTop100of100000Scores-4    	    2000	    626488 ns/op	   21484 B/op	     213 allocs/op
BenchmarkTop10of1000000Scores-4    	     300	   5107658 ns/op	    2560 B/op	      33 allocs/op
BenchmarkTop100of1000000Scores-4   	     300	   5275403 ns/op	   21624 B/op	     213 allocs/op
PASS
ok  	github.com/blevesearch/bleve/search/collectors	7.188s

Prior to this PR, master reported:

$ go test -run=xxx -bench=. -benchmem
BenchmarkTop10of100000Scores-4          3000        453269 ns/op      360161 B/op         42 allocs/op
BenchmarkTop100of100000Scores-4         2000        519131 ns/op      388275 B/op        219 allocs/op
BenchmarkTop10of1000000Scores-4          200       7459004 ns/op     4628236 B/op         52 allocs/op
BenchmarkTop100of1000000Scores-4         200       8064864 ns/op     4656596 B/op        232 allocs/op
PASS
ok      github.com/blevesearch/bleve/search/collectors  7.385s

So, we're pretty close on the smaller datasets, and we scale better on the larger datasets.
We also show fewer allocations and bytes in all cases (some of this is artificial due to test cleanup).
2016-08-25 15:47:07 -04:00
Marty Schoch
ce0b299d6f switch sort impl to use interface
this improves perf in the case where we're not doing any sorting
as we avoid allocating memory and converting scores into
numeric terms
2016-08-24 19:02:22 -04:00
Marty Schoch
5e94145cf4 apply same colletor benchmark change 2016-08-24 15:56:26 -04:00
Marty Schoch
0322ecd441 adjust new sort functionality to also work with MultiSearch 2016-08-24 14:07:10 -04:00
Marty Schoch
750e0ac16c change sort field impl to use indexed values not stored values 2016-08-17 09:20:44 -07:00
Marty Schoch
0bb69a9a1c Merge branch 'master' of https://github.com/dtylman/bleve into sort-by-field-try2 2016-08-12 14:23:55 -04:00
Danny Tylman
b585c5786b removing mock data generation packages from unit-tests
fixing wrong sort order on certain fields
2016-08-11 11:35:08 +03:00
Danny Tylman
7d4b7fb67d Adding sort to SearchRequest. 2016-08-10 16:16:58 +03:00
Danny Tylman
cb8a1e95a8 closes #110 2016-08-10 11:41:16 +03:00
Danny Tylman
176a6e7a0d Adding sort to SearchRequest. 2016-08-10 11:35:47 +03:00
Danny Tylman
154d1b904b Adding sort to SearchRequest. 2016-08-10 11:13:38 +03:00
Danny Tylman
5164e70f6e Adding sort to SearchRequest. 2016-08-09 16:18:53 +03:00
Marty Schoch
24a2b57e29 refactor search package to reuse DocumentMatch and ID []byte's
the motivation for this commit is long and detailed and has been
documented externally here:

https://gist.github.com/mschoch/5cc5c9cf4669a5fe8512cb7770d3c1a2

the core of the changes are:

1.  recognize that collector/searcher need only a fixed number
of DocumentMatch instances, and this number can be determined
from the structure of the query, not the size of the data

2.  knowing this, instances can be allocated in bulk, up front
and they can be reused without locking (since all search
operations take place in a single goroutine

3.  combined with previous commits which enabled reuse of
the IndexInternalID []byte, this allows for no allocation/copy
of these bytes as well (by using DocumentMatch Reset() method
when returning entries to the pool
2016-08-08 22:21:47 -04:00
Marty Schoch
e188fe35f7 switch back to single DocumentMatch struct
instead of separate DocumentMatch/DocumentMatchInternal

rules are simple, everything operates on the IndexInternalID field
until the results are returned, then ID is set correctly
the IndexInternalID field is not exported to JSON
2016-08-01 14:58:02 -04:00
Marty Schoch
1aacd9bad5 changed approach
IndexInternalID is now []byte
this is still opaque, and should still work for any future
index implementations as it is a least common denominator
choice, all implementations must internally represent the
id as []byte at some point for storage to disk
2016-08-01 14:26:50 -04:00
Marty Schoch
5aa9e95468 major refactor of index/search API
index id's are now opaque (until finally returned to top-level user)
 - the TermFieldDoc's returned by TermFieldReader no longer contain doc id
 - instead they return an opaque IndexInternalID
 - items returned are still in the "natural index order"
 - but that is no longer guaranteed to be "doc id order"
 - correct behavior requires that they all follow the same order
 - but not any particular order

 - new API FinalizeDocID which converts index internal ID's to public string ID

 - APIs used internally which previously took doc id now take IndexInternalID
     - that is DocumentFieldTerms() and DocumentFieldTermsForFields()
 - however, APIs that are used externally do not reflect this change
     - that is Document()

 - DocumentIDReader follows the same changes, but this is less obvious
     - behavior clarified, used to iterate doc ids, BUT NOT in doc id order
     - method STILL available to iterate doc ids in range
     - but again, you won't get them in any meaningful order
     - new method to iterate actual doc ids from list of possible ids
         - this was introduced to make the DocIDSearcher continue working

searchers now work with the new opaque index internal doc ids
 - they return new DocumentMatchInternal (which does not have string ID)
scorerers also work with these opaque index internal doc ids
 - they return DocumentMatchInternal (which does not have string ID)
collectors now also perform a final step of converting the final result
 - they STILL return traditional DocumentMatch (with string ID)
 - but they now also require an IndexReader (so that they can do the conversion)
2016-07-31 13:46:18 -04:00
Steve Yen
4822cff63a optimize Advance() with pre-allocated in-out param
This perf-related change helps the code and API reach more similarity
with the Next() methods, which now take a pre-allocate param.
2016-07-29 14:15:00 -07:00
Steve Yen
b8c8478783 optimize collector to check ctx.Done() only occasionally 2016-07-21 11:10:49 -07:00
Steve Yen
b564ebbfbe optimization comments on DocumentMatch instance reuse 2016-07-21 11:10:49 -07:00
Steve Yen
988ca62182 optimize upside_down reader Next() with doc match reuse
This optimization changes the search.Search.Next() interface API,
adding an optional, pre-allocated *DocumentMatch parameter.

When it's non-nil, the TermSearcher and TermQueryScorer will use that
pre-allocated *DocumentMatch, instead of allocating a brand new
DocumentMatch instance.
2016-07-21 11:10:49 -07:00
Marty Schoch
2043bb4bf8 fix pagination bug introduced by collector optimization
fixes #378

this bug was introduced by:
f2aba116c4

theory of operation for this collector (top N, skip K)

- collect the highest scoring N+K results
- if K > 0, skip K and return the next N

internal details

- the top N+K are kept in a list
- the list is ordered from lowest scoring (first) to highest scoring (last)
- as a hit comes in, we find where this new hit would fit into this list
- if this caused the list to get too big, trim off the head (lowest scoring hit)

theory of the optimization

- we were not tracking the lowest score in the list
- so if the score was lower than the lowest score, we would add/remove it
- by keeping track of the lowest score in the list, we can avoid these ops

problem with the optimization
- the optimization worked by returning early
- by returning early there was a subtle change to documents which had the same score
- the reason is that which docs end up in the top N+K changed by returning early
- why was that? docs are coming in, in order by key ascending
- when finding the correct position to insert a hit into the list, we checked <, not <= the score
- this has the subtle effect that docs with the same score end up in reverse order

for example consider the following in progress list:

doc ids [   c    a    b  ]
scores  [   1    5    9  ]

if we now see doc d with score 5, we get:

doc ids [   c    a    d    b  ]
scores  [   1    5    5    9  ]

While that appears in order (a, d) it is actually reverse order, because when we
produce the top N we start at the end.

theory of the fix

- previous pagination depended on later hits with the same score "bumping" earlier
hits with the same score off the bottom of the list
- however, if we change the logic to <= instead of <, now the list in the previous
example would look like:

doc ids [   c    d    a    b  ]
scores  [   1    5    5    9  ]

- this small change means that now earlier (lower id) will score higher, and
thus we no longer depend on later hits bumping things down, which means returning
early is a valid thing to do

NOTE: this does depend on the hits coming back in order by ID.  this is not
something strictly guaranteed, but it was the same assumption that allowed the
original behavior

This also has the side-effect that 2 hits with the same score come back in
ascending ID order, which is somehow more pleasing to me than reverse order.
2016-06-01 11:35:18 -04:00
slavikm
f2aba116c4 Make top score collector about 7 times faster 2016-04-29 09:46:47 -07:00
Marty Schoch
0b2380d9bf introduce ability for searches to timeout or be cancelled
our implementation uses: golang.org/x/net/context

New method SearchInContext() allows the user to run a search
in the provided context.  If that context is cancelled or
exceeds its deadline Bleve will attempt to stop and return
as soon as possible.  This is a *best effort* attempt at this
time and may *not* be in a timely manner.  If the caller must
return very near the timeout, the call should also be wrapped
in a goroutine.

The IndexAlias implementation is affected in a slightly more
complex way.  In order to return partial results when a timeout
occurs on some indexes, the timeout is strictly enforced, and
at the moment this does introduce an additional goroutine.

The Bleve implementation honoring the context is currently
very course-grained.  Specifically we check the Done() channel
between each DocumentMatch produced during the search.  In the
future we will propogate the context deeper into the internals
of Bleve, and this will allow finer-grained timeout behavior.
2016-03-02 17:30:21 -05:00
Marty Schoch
c8d974048a fix issues identified by errcheck
part of #169
2015-04-07 14:59:35 -04:00
Marty Schoch
300ec79c96 first pass at checking errors that were ignored
part of #169
2015-03-06 14:46:29 -05:00
Marty Schoch
7284c10020 added benchmark to collector 2015-03-06 12:59:44 -05:00
Marty Schoch
e1b77956d4 more golint cleanups 2014-09-03 18:47:02 -04:00
Marty Schoch
7a7eb2e94c add newline between license and package
this avoids cluttering godocs with the license
2014-09-02 10:54:50 -04:00
Marty Schoch
2ee7289bc8 major refactor of search package
this started initially to relocate highlighting into
a self contained package, which would then also use
the registry
however, it turned into a much larger refactor in
order to avoid cyclic imports
now facets, searchers, scorers and collectors
are also broken out into subpackages of search
2014-09-01 11:15:38 -04:00