0
0
Commit Graph

1133 Commits

Author SHA1 Message Date
Marty Schoch
a771e344ae dont count code coverage support tool in project coverage 2016-08-31 13:52:19 -04:00
Marty Schoch
81282b3c06 remove unused code 2016-08-31 13:52:02 -04:00
Marty Schoch
83a3eecb22 don't count kv store test against code coverage 2016-08-31 13:27:12 -04:00
Marty Schoch
ae4b354c72 Merge pull request #411 from steveyen/master
tighter moss KV store iterator handling
2016-08-27 08:00:45 -04:00
Marty Schoch
56d7bbfe1c fix benchmark names to match values used 2016-08-26 18:09:03 -04:00
Marty Schoch
4a25034ddd Merge branch 'sort-by-field-try2' 2016-08-26 17:58:38 -04:00
Marty Schoch
b1b93d5ff9 remove unneeded code
fixes code review comment from @steveyen
2016-08-26 17:27:19 -04:00
Marty Schoch
c9310b906d introduced new collector store impl based on slice
counter-intuitively the list impl was faster than the heap
the theory was the heap did more comparisons and swapping
so even though it benefited from no interface and some cache
locality, it was still slower

the idea was to just use a raw slice kept in order
this avoids the need for interface, but can take same comparison
approach as the list

it seems to work out:

 go test -run=xxx -bench=. -benchmem -cpuprofile=cpu.out
BenchmarkTop10of100000Scores-4     	    5000	    299959 ns/op	    2600 B/op	      36 allocs/op
BenchmarkTop100of100000Scores-4    	    2000	    601104 ns/op	   20720 B/op	     216 allocs/op
BenchmarkTop10of1000000Scores-4    	     500	   3450196 ns/op	    2616 B/op	      36 allocs/op
BenchmarkTop100of1000000Scores-4   	     500	   3874276 ns/op	   20856 B/op	     216 allocs/op
PASS
ok  	github.com/blevesearch/bleve/search/collectors	7.440s
2016-08-26 11:52:49 -04:00
Marty Schoch
47c239ca7b refactored data structure out of collector
the TopNCollector now can either use a heap or a list

i did not code it to use an interface, because this is a very hot
loop during searching.  rather, it lets bleve developers easily
toggle between the two (or other ideas) by changing 2 lines

The list is faster in the benchmark, but causes more allocations.
The list is once again the default (for now).

To switch to the heap implementation, change:

store *collectStoreList
to
store *collectStoreHeap

and

newStoreList(...
to
newStoreHeap(...
2016-08-26 10:29:50 -04:00
Marty Schoch
3f8757c05b slight fixup to last change to set the sort value
i'd like the sort value to be correct even with the optimizations
not using it
2016-08-25 23:13:22 -04:00
Marty Schoch
931ec677c4 completely avoid dynamic dispatch if only sorting on score 2016-08-25 22:59:08 -04:00
Marty Schoch
127f37212b cache values to avoid dynamic dispatch inside hot loop 2016-08-25 16:24:26 -04:00
Marty Schoch
60750c1614 improved implementation to address perf regressions
primary change is going back to sort values be []string
and not []interface{}, this avoid allocatiosn converting
into the interface{}

that sounds obvious, so why didn't we just do that first?
because a common (default) sort is score, which is naturally
a number, not a string (like terms).  converting into the
number was also expensive, and the common case.

so, this solution also makes the change to NOT put the score
into the sort value list.  instead you see the dummy value
"_score".  this is just a placeholder, the actual sort impl
knows that field of the sort is the score, and will sort
using the actual score.

also, several other aspets of the benchmark were cleaned up
so that unnecessary allocations do not pollute the cpu profiles

Here are the updated benchmarks:

$ go test -run=xxx -bench=. -benchmem -cpuprofile=cpu.out
BenchmarkTop10of100000Scores-4     	    3000	    465809 ns/op	    2548 B/op	      33 allocs/op
BenchmarkTop100of100000Scores-4    	    2000	    626488 ns/op	   21484 B/op	     213 allocs/op
BenchmarkTop10of1000000Scores-4    	     300	   5107658 ns/op	    2560 B/op	      33 allocs/op
BenchmarkTop100of1000000Scores-4   	     300	   5275403 ns/op	   21624 B/op	     213 allocs/op
PASS
ok  	github.com/blevesearch/bleve/search/collectors	7.188s

Prior to this PR, master reported:

$ go test -run=xxx -bench=. -benchmem
BenchmarkTop10of100000Scores-4          3000        453269 ns/op      360161 B/op         42 allocs/op
BenchmarkTop100of100000Scores-4         2000        519131 ns/op      388275 B/op        219 allocs/op
BenchmarkTop10of1000000Scores-4          200       7459004 ns/op     4628236 B/op         52 allocs/op
BenchmarkTop100of1000000Scores-4         200       8064864 ns/op     4656596 B/op        232 allocs/op
PASS
ok      github.com/blevesearch/bleve/search/collectors  7.385s

So, we're pretty close on the smaller datasets, and we scale better on the larger datasets.
We also show fewer allocations and bytes in all cases (some of this is artificial due to test cleanup).
2016-08-25 15:47:07 -04:00
Marty Schoch
ce0b299d6f switch sort impl to use interface
this improves perf in the case where we're not doing any sorting
as we avoid allocating memory and converting scores into
numeric terms
2016-08-24 19:02:22 -04:00
Marty Schoch
5e94145cf4 apply same colletor benchmark change 2016-08-24 15:56:26 -04:00
Marty Schoch
94489fa778 change collector benchmark to not reuse collector instances
they are never reused in practice
and the original design did not consider reuse
future alternate implementations are not reusable
2016-08-24 15:14:40 -04:00
Marty Schoch
0322ecd441 adjust new sort functionality to also work with MultiSearch 2016-08-24 14:07:10 -04:00
Marty Schoch
1ae938b781 add integration tests for sorting 2016-08-20 14:45:53 -04:00
Steve Yen
eaa59621ff tighter moss KV store iterator handling 2016-08-19 09:10:03 -07:00
Marty Schoch
2311d060d1 add example usage of SortBy and SortByCustom 2016-08-18 13:03:48 -07:00
Marty Schoch
27f5c6ec92 expose simple string slice based sorting to top-level bleve
this change means simple sort requirements no longer require
importing the search package (high-level API goal)

also the sort test at the top-level was changed to use this form
2016-08-17 14:49:06 -07:00
Marty Schoch
27ba6187bc adds support for more complex field sorts with object (not string)
previously from JSON we would just deserialize strings like
"-abv" or "city" or "_id" or "_score" as simple sorts
on fields, ids or scores respectively

while this is simple and compact, it can be ambiguous (for
example if you have a field starting with - or if you have a field
named "_id" already.  also, this simple syntax doesnt allow us
to specify more cmoplex options to deal with type/mode/missing

we keep support for the simple string syntax, but now also
recognize a more expressive syntax like:

{
  "by": "field",
  "field": "abv",
  "desc": true,
  "type": "string",
  "mode": "min",
  "missing": "first"
}

type, mode and missing are optional and default to
"auto", "default", and "last" respectively
2016-08-17 14:33:51 -07:00
Marty Schoch
750e0ac16c change sort field impl to use indexed values not stored values 2016-08-17 09:20:44 -07:00
Marty Schoch
0d873916f0 support JSON marshal/unmarshal of search request sort
The syntax used is an array of strings.  The strings "_id" and
"_score" are special and reserved to mean sorting on the document
id and score repsectively.  All other strings refer to the literal
field name with that value.  If the string is prefixed with "-"
the order of that sort is descending, without it, it defaults to
ascending.

Examples:

"sort":["-abv","-_score"]

This will sort results in decreasing order of the "abv" field.
Results which have the same value of the "abv" field will then
be sorted by their score, also decreasing.

If no value for "sort" is provided in the search request the
default soring is the same as before, which is decreasing score.
2016-08-12 19:16:24 -04:00
Marty Schoch
be56380833 fix SearchRequest parsing to default to proper default sort order 2016-08-12 14:49:22 -04:00
Marty Schoch
0bb69a9a1c Merge branch 'master' of https://github.com/dtylman/bleve into sort-by-field-try2 2016-08-12 14:23:55 -04:00
Danny Tylman
b585c5786b removing mock data generation packages from unit-tests
fixing wrong sort order on certain fields
2016-08-11 11:35:08 +03:00
Marty Schoch
5f1454106d Merge pull request #402 from mschoch/indexapiwork
Index/Search API work
2016-08-10 12:41:51 -04:00
Danny Tylman
7d4b7fb67d Adding sort to SearchRequest. 2016-08-10 16:16:58 +03:00
Danny Tylman
0d6a2b565f closes #110 2016-08-10 11:44:31 +03:00
Danny Tylman
cb8a1e95a8 closes #110 2016-08-10 11:41:16 +03:00
Danny Tylman
176a6e7a0d Adding sort to SearchRequest. 2016-08-10 11:35:47 +03:00
Danny Tylman
154d1b904b Adding sort to SearchRequest. 2016-08-10 11:13:38 +03:00
Marty Schoch
9333bac2c8 added test case for DocumentMatchPool 2016-08-09 11:35:12 -04:00
Danny Tylman
5164e70f6e Adding sort to SearchRequest. 2016-08-09 16:18:53 +03:00
Marty Schoch
24a2b57e29 refactor search package to reuse DocumentMatch and ID []byte's
the motivation for this commit is long and detailed and has been
documented externally here:

https://gist.github.com/mschoch/5cc5c9cf4669a5fe8512cb7770d3c1a2

the core of the changes are:

1.  recognize that collector/searcher need only a fixed number
of DocumentMatch instances, and this number can be determined
from the structure of the query, not the size of the data

2.  knowing this, instances can be allocated in bulk, up front
and they can be reused without locking (since all search
operations take place in a single goroutine

3.  combined with previous commits which enabled reuse of
the IndexInternalID []byte, this allows for no allocation/copy
of these bytes as well (by using DocumentMatch Reset() method
when returning entries to the pool
2016-08-08 22:21:47 -04:00
Marty Schoch
aa3ae3d39c enable read_only mode for boltdb indexes
fixes #405
2016-08-06 10:47:34 -04:00
Marty Schoch
da794d3762 fix bug introduced by reuse of TermFrequencyRow values
in a recent commit, we changed the code to reuse
TermFrequencyRow objects intsead of constantly allocating new
ones.  unfortunately, one of the original methods was not coded
with this reuse in mind, and a lazy initialization cause us to
leak data from previous uses of the same object.

in particular this caused term vector information from previous
hits to still be applied to subsequent hits.  eventually this
causes the highlighter to try and highlight invalid regions
of a slice.

fixes #404
2016-08-05 08:33:04 -04:00
Marty Schoch
b857769217 document Reset behavior as its non-obvious 2016-08-03 17:16:15 -04:00
Marty Schoch
d7405a4d79 updated attempt to reuse []byte
previous attempt was flawed (but maked by Reset() method)
new approach is to do this work in the Reset() method itself,
logically this is where it belongs.

but further we acknowledge that IndexInternalID []byte lifetime
lives beyond the TermFieldDoc, so another copy is made into
the DocumentMatch.  Although this introduces yet another copy
the theory being tested is that it allows each of these
structuress to reuse memory without additional allocation.
2016-08-03 17:01:27 -04:00
Marty Schoch
89d83cb5a1 reuse memory already allocated for copies of docids
when the term field reader is copying ID values out of the
kv store's iterator, it is already attempting to reuse the
term frequency row data structure.  this change allows us
to also attempt to reuse the []byte allocated for previous
copies of the docid.  we reset the slice length to zero
then copy the data into the existing slice, avoiding
new allocation and garbage collection in the cases where
there is already enough space
2016-08-03 13:45:48 -04:00
Marty Schoch
4b1b866e0f remove commented out old code 2016-08-02 16:48:00 -04:00
Marty Schoch
36de4a7097 cleaner fix for the TermFrequencyRow reuse bug
reset to nil first, let remaining logic work as before
2016-08-01 17:17:29 -04:00
Marty Schoch
cfce9c5fc5 initialize term vector list in parseV
otherwise reusing previous term frequency row causes us to
keep tacking on to one gigantic list
2016-08-01 17:01:34 -04:00
Marty Schoch
172ca7e69e need to copy the doc ID for it to survive past next iteration 2016-08-01 17:01:04 -04:00
Marty Schoch
e188fe35f7 switch back to single DocumentMatch struct
instead of separate DocumentMatch/DocumentMatchInternal

rules are simple, everything operates on the IndexInternalID field
until the results are returned, then ID is set correctly
the IndexInternalID field is not exported to JSON
2016-08-01 14:58:02 -04:00
Marty Schoch
1aacd9bad5 changed approach
IndexInternalID is now []byte
this is still opaque, and should still work for any future
index implementations as it is a least common denominator
choice, all implementations must internally represent the
id as []byte at some point for storage to disk
2016-08-01 14:26:50 -04:00
Marty Schoch
5aa9e95468 major refactor of index/search API
index id's are now opaque (until finally returned to top-level user)
 - the TermFieldDoc's returned by TermFieldReader no longer contain doc id
 - instead they return an opaque IndexInternalID
 - items returned are still in the "natural index order"
 - but that is no longer guaranteed to be "doc id order"
 - correct behavior requires that they all follow the same order
 - but not any particular order

 - new API FinalizeDocID which converts index internal ID's to public string ID

 - APIs used internally which previously took doc id now take IndexInternalID
     - that is DocumentFieldTerms() and DocumentFieldTermsForFields()
 - however, APIs that are used externally do not reflect this change
     - that is Document()

 - DocumentIDReader follows the same changes, but this is less obvious
     - behavior clarified, used to iterate doc ids, BUT NOT in doc id order
     - method STILL available to iterate doc ids in range
     - but again, you won't get them in any meaningful order
     - new method to iterate actual doc ids from list of possible ids
         - this was introduced to make the DocIDSearcher continue working

searchers now work with the new opaque index internal doc ids
 - they return new DocumentMatchInternal (which does not have string ID)
scorerers also work with these opaque index internal doc ids
 - they return DocumentMatchInternal (which does not have string ID)
collectors now also perform a final step of converting the final result
 - they STILL return traditional DocumentMatch (with string ID)
 - but they now also require an IndexReader (so that they can do the conversion)
2016-07-31 13:46:18 -04:00
Marty Schoch
47ee69ae82 term field reader supports optionally omitting 3 details
at the time you create the term field reader, you can specify
that you don't need the term freq, the norm, or the term vectors

in that case, the index implementation can choose to not return
them in its subsequently returned values

this is advisory only, some simple implementations may ignore this
and continue to return the values anyway (as the current impl of
upside_down does today)

this change will allow future index implementations the
opportunity to do less work when it isn't required
2016-07-30 10:26:42 -04:00
Marty Schoch
389e18a779 attempt to support google app engine
the default configuration, which sets the default kv engine
to boltdb is now done in file protected with the !appengine
build tag.  this at least lets the analysis-wizzard app
run locally in the appengine simulator.

this still has not been tested on the real appengine, and further
changes may be required.
2016-07-29 21:29:05 -04:00