37 Commits

Author SHA1 Message Date
abhinavdangeti 7e36109b3c MB-28162: Provide API to estimate memory needed to run a search query
This API (unexported) will estimate the amount of memory needed to execute
a search query over an index before the collector begins data collection.

Sample estimates for certain queries:
{Size: 10, BenchmarkUpsidedownSearchOverhead}
                                                           ESTIMATE    BENCHMEM
TermQuery                                                  4616        4796
MatchQuery                                                 5210        5405
DisjunctionQuery (Match queries)                           7700        8447
DisjunctionQuery (Term queries)                            6514        6591
ConjunctionQuery (Match queries)                           7524        8175
Nested disjunction query (disjunction of disjunctions)     10306       10708
…
2018-03-06 13:53:42 -08:00
Steve Yen f6b506134b import couchbase/vellum instead of couchbaselabs/vellum
Also, scrubbed an old couchbaselabs/moss reference in comments.

Also, go fmt.
2017-12-19 10:49:57 -08:00
Casey Muller 68b07c9e09 Review feedback 2017-05-25 08:32:10 -07:00
Casey Muller 875e19ebd9 Add comments to Location struct
Closes #596
2017-05-25 08:23:39 -07:00
Marty Schoch 0eba2a3f0c reduce garbage created while processing facets
previously we parsed/returned large sections of the documents
back index row in order to compute facet information.  this
would require parsing the protobuf of the entire back index row.
unfortunately this creates considerable garbage.

this new version introduces a visitor/callback approach to
working with data inside the back index row.  the benefit
of this approach is that we can let the higher-level code
see values, prior to any copies of data being made or
intermediate garbage being created.  implementations of
the callback must copy any value which they would like to
retain beyond the callback.

NOTE: this approach duplicates code from the
automatically generated protobuf code

NOTE: this approach assumes that the "field" field be serialized
before the "terms" field.  This is guaranteed by our currently
generated protobuf encoder, and is recommended by the protobuf
spec.  But, decoders SHOULD support them occurring in any order,
which we do not.
2017-03-02 17:00:46 -05:00
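The visitor/callback approach described in commit 0eba2a3f0c can be illustrated with a minimal sketch. This is not the bleve implementation and does not decode protobuf: the row encoding (terms separated by a zero byte) and the name visitTerms are assumptions made up for the example. It only shows the contract the commit message states: the callback sees slices of the original buffer, and must copy anything it wants to keep.

```go
package main

import "fmt"

// visitTerms is a hypothetical visitor over an encoded row.
// Assumption for this sketch: terms are separated by a zero byte.
// Each term is passed to visit as a sub-slice of row, so no copy
// is made and no intermediate garbage is created by the visitor.
func visitTerms(row []byte, visit func(term []byte)) {
	start := 0
	for i, b := range row {
		if b == 0 {
			visit(row[start:i])
			start = i + 1
		}
	}
	if start < len(row) {
		visit(row[start:])
	}
}

func main() {
	row := []byte("beer\x00ale\x00stout")

	var kept []string
	visitTerms(row, func(term []byte) {
		// The string conversion copies the bytes, so the value
		// stays valid after the callback returns.
		kept = append(kept, string(term))
	})

	// Overwriting the buffer does not affect the copied values.
	for i := range row {
		row[i] = 'x'
	}
	fmt.Println(kept)
}
```

A callback that stored the term slice itself, rather than a copy, would observe the overwritten bytes once the buffer is reused.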
Marty Schoch 8096d9fb90 remove use of float64 to represent int things
this originated from a misunderstanding of mine going back
several years.  the values need not be float64 just because
we plan to serialize them as json.

there are still larger questions about what the right type should
be, and where should any conversions go.  but, this commit
simply attempts to address the most egregious problems
2017-02-09 20:15:59 -05:00
Marty Schoch 232fc80dad add support for phrase slop to internals of phrase searcher
phrase slop is not yet supported on the frontend
added lots of tests around slop
2017-02-09 15:59:51 -05:00
Steve Yen 89a1cefde1 API change: optional SearchRequest.IncludeLocations flag
This is a change in search result behavior in that location
information is no longer provided by default with search results.

Although this looks like a wide-ranging change, it's mostly a
mechanical replacement of the explain bool flag with a new
search.SearcherOptions struct, which holds both the Explain bool flag
and the IncludeTermVectors bool flag.
2017-01-05 21:11:22 -08:00
Steve Yen e72c8be353 simplify TermLocationMap.AddLocation() 2016-10-11 12:15:28 -07:00
Marty Schoch 2332455bd2 nicer formatting of license header 2016-10-02 10:13:14 -04:00
Marty Schoch 60750c1614 improved implementation to address perf regressions
primary change is going back to sort values being []string
and not []interface{}, this avoids allocations converting
into the interface{}

that sounds obvious, so why didn't we just do that first?
because a common (default) sort is score, which is naturally
a number, not a string (like terms).  converting into the
number was also expensive, and the common case.

so, this solution also makes the change to NOT put the score
into the sort value list.  instead you see the dummy value
"_score".  this is just a placeholder, the actual sort impl
knows that field of the sort is the score, and will sort
using the actual score.

also, several other aspects of the benchmark were cleaned up
so that unnecessary allocations do not pollute the cpu profiles

Here are the updated benchmarks:

$ go test -run=xxx -bench=. -benchmem -cpuprofile=cpu.out
BenchmarkTop10of100000Scores-4     	    3000	    465809 ns/op	    2548 B/op	      33 allocs/op
BenchmarkTop100of100000Scores-4    	    2000	    626488 ns/op	   21484 B/op	     213 allocs/op
BenchmarkTop10of1000000Scores-4    	     300	   5107658 ns/op	    2560 B/op	      33 allocs/op
BenchmarkTop100of1000000Scores-4   	     300	   5275403 ns/op	   21624 B/op	     213 allocs/op
PASS
ok  	github.com/blevesearch/bleve/search/collectors	7.188s

Prior to this PR, master reported:

$ go test -run=xxx -bench=. -benchmem
BenchmarkTop10of100000Scores-4          3000        453269 ns/op      360161 B/op         42 allocs/op
BenchmarkTop100of100000Scores-4         2000        519131 ns/op      388275 B/op        219 allocs/op
BenchmarkTop10of1000000Scores-4          200       7459004 ns/op     4628236 B/op         52 allocs/op
BenchmarkTop100of1000000Scores-4         200       8064864 ns/op     4656596 B/op        232 allocs/op
PASS
ok      github.com/blevesearch/bleve/search/collectors  7.385s

So, we're pretty close on the smaller datasets, and we scale better on the larger datasets.
We also show fewer allocations and bytes in all cases (some of this is artificial due to test cleanup).
2016-08-25 15:47:07 -04:00
Marty Schoch ce0b299d6f switch sort impl to use interface
this improves perf in the case where we're not doing any sorting
as we avoid allocating memory and converting scores into
numeric terms
2016-08-24 19:02:22 -04:00
Marty Schoch 0322ecd441 adjust new sort functionality to also work with MultiSearch 2016-08-24 14:07:10 -04:00
Marty Schoch 750e0ac16c change sort field impl to use indexed values not stored values 2016-08-17 09:20:44 -07:00
Marty Schoch 0d873916f0 support JSON marshal/unmarshal of search request sort
The syntax used is an array of strings.  The strings "_id" and
"_score" are special and reserved to mean sorting on the document
id and score respectively.  All other strings refer to the literal
field name with that value.  If the string is prefixed with "-"
the order of that sort is descending, without it, it defaults to
ascending.

Examples:

"sort":["-abv","-_score"]

This will sort results in decreasing order of the "abv" field.
Results which have the same value of the "abv" field will then
be sorted by their score, also decreasing.

If no value for "sort" is provided in the search request the
default sorting is the same as before, which is decreasing score.
2016-08-12 19:16:24 -04:00
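The sort syntax described in commit 0d873916f0 can be sketched as a small parser. This is only an illustration of the rules stated in the message (a leading "-" means descending, "_id" and "_score" are reserved names, and a missing "sort" falls back to decreasing score). The sortKey type and parseSort function are hypothetical names for this sketch, not bleve's own API.

```go
package main

import (
	"fmt"
	"strings"
)

// sortKey is a hypothetical type for this sketch.
type sortKey struct {
	Field      string // a field name, or the reserved "_id" / "_score"
	Descending bool
}

// parseSort turns the array-of-strings syntax into sort keys.
// A leading "-" selects descending order; without it the order
// is ascending.
func parseSort(in []string) []sortKey {
	// No "sort" provided: default to decreasing score.
	if len(in) == 0 {
		return []sortKey{{Field: "_score", Descending: true}}
	}
	keys := make([]sortKey, 0, len(in))
	for _, s := range in {
		keys = append(keys, sortKey{
			Field:      strings.TrimPrefix(s, "-"),
			Descending: strings.HasPrefix(s, "-"),
		})
	}
	return keys
}

func main() {
	for _, k := range parseSort([]string{"-abv", "-_score"}) {
		fmt.Printf("%s descending=%v\n", k.Field, k.Descending)
	}
}
```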
Marty Schoch 0bb69a9a1c Merge branch 'master' of https://github.com/dtylman/bleve into sort-by-field-try2 2016-08-12 14:23:55 -04:00
Danny Tylman 5164e70f6e Adding sort to SearchRequest. 2016-08-09 16:18:53 +03:00
Marty Schoch 24a2b57e29 refactor search package to reuse DocumentMatch and ID []byte's
the motivation for this commit is long and detailed and has been
documented externally here:

https://gist.github.com/mschoch/5cc5c9cf4669a5fe8512cb7770d3c1a2

the core of the changes are:

1.  recognize that collector/searcher need only a fixed number
of DocumentMatch instances, and this number can be determined
from the structure of the query, not the size of the data

2.  knowing this, instances can be allocated in bulk, up front
and they can be reused without locking (since all search
operations take place in a single goroutine)

3.  combined with previous commits which enabled reuse of
the IndexInternalID []byte, this allows for no allocation/copy
of these bytes as well (by using DocumentMatch Reset() method
when returning entries to the pool)
2016-08-08 22:21:47 -04:00
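The three points in commit 24a2b57e29 can be sketched as a small pool. This is a simplified illustration, not the bleve code: documentMatch, matchPool and their methods are hypothetical stand-ins, and the fallback allocation when the pool is empty is a choice made for this sketch. It shows instances allocated in bulk up front, reused without locking (assuming a single goroutine), and a reset that keeps the ID byte slice's backing array.

```go
package main

import "fmt"

// documentMatch is a simplified stand-in for this sketch.
type documentMatch struct {
	internalID []byte
	score      float64
}

// reset clears the match but keeps the internalID backing array,
// so the next user can append into it without allocating.
func (dm *documentMatch) reset() {
	id := dm.internalID[:0]
	*dm = documentMatch{}
	dm.internalID = id
}

// matchPool hands out instances allocated up front. There is no
// locking: the sketch assumes all use is from one goroutine.
type matchPool struct {
	free []*documentMatch
}

func newMatchPool(size int) *matchPool {
	backing := make([]documentMatch, size) // one bulk allocation
	p := &matchPool{free: make([]*documentMatch, 0, size)}
	for i := range backing {
		p.free = append(p.free, &backing[i])
	}
	return p
}

func (p *matchPool) get() *documentMatch {
	if len(p.free) == 0 {
		// Sketch-only fallback if the size estimate was too small.
		return &documentMatch{}
	}
	dm := p.free[len(p.free)-1]
	p.free = p.free[:len(p.free)-1]
	return dm
}

func (p *matchPool) put(dm *documentMatch) {
	dm.reset()
	p.free = append(p.free, dm)
}

func main() {
	pool := newMatchPool(2)

	a := pool.get()
	a.internalID = append(a.internalID, "doc-1"...)
	a.score = 1.5
	pool.put(a)

	b := pool.get() // the same instance comes back, already reset
	fmt.Println(a == b, len(b.internalID), b.score)
}
```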
Marty Schoch b857769217 document Reset behavior as it's non-obvious 2016-08-03 17:16:15 -04:00
Marty Schoch d7405a4d79 updated attempt to reuse []byte
previous attempt was flawed (but masked by Reset() method)
new approach is to do this work in the Reset() method itself,
logically this is where it belongs.

but further we acknowledge that IndexInternalID []byte lifetime
lives beyond the TermFieldDoc, so another copy is made into
the DocumentMatch.  Although this introduces yet another copy
the theory being tested is that it allows each of these
structures to reuse memory without additional allocation.
2016-08-03 17:01:27 -04:00
Marty Schoch e188fe35f7 switch back to single DocumentMatch struct
instead of separate DocumentMatch/DocumentMatchInternal

rules are simple, everything operates on the IndexInternalID field
until the results are returned, then ID is set correctly
the IndexInternalID field is not exported to JSON
2016-08-01 14:58:02 -04:00
Marty Schoch 5aa9e95468 major refactor of index/search API
index id's are now opaque (until finally returned to top-level user)
 - the TermFieldDoc's returned by TermFieldReader no longer contain doc id
 - instead they return an opaque IndexInternalID
 - items returned are still in the "natural index order"
 - but that is no longer guaranteed to be "doc id order"
 - correct behavior requires that they all follow the same order
 - but not any particular order

 - new API FinalizeDocID which converts index internal ID's to public string ID

 - APIs used internally which previously took doc id now take IndexInternalID
     - that is DocumentFieldTerms() and DocumentFieldTermsForFields()
 - however, APIs that are used externally do not reflect this change
     - that is Document()

 - DocumentIDReader follows the same changes, but this is less obvious
     - behavior clarified, used to iterate doc ids, BUT NOT in doc id order
     - method STILL available to iterate doc ids in range
     - but again, you won't get them in any meaningful order
     - new method to iterate actual doc ids from list of possible ids
         - this was introduced to make the DocIDSearcher continue working

searchers now work with the new opaque index internal doc ids
 - they return new DocumentMatchInternal (which does not have string ID)
scorers also work with these opaque index internal doc ids
 - they return DocumentMatchInternal (which does not have string ID)
collectors now also perform a final step of converting the final result
 - they STILL return traditional DocumentMatch (with string ID)
 - but they now also require an IndexReader (so that they can do the conversion)
2016-07-31 13:46:18 -04:00
Steve Yen 4822cff63a optimize Advance() with pre-allocated in-out param
This perf-related change helps the code and API reach more similarity
with the Next() methods, which now take a pre-allocated param.
2016-07-29 14:15:00 -07:00
Steve Yen 988ca62182 optimize upside_down reader Next() with doc match reuse
This optimization changes the search.Search.Next() interface API,
adding an optional, pre-allocated *DocumentMatch parameter.

When it's non-nil, the TermSearcher and TermQueryScorer will use that
pre-allocated *DocumentMatch, instead of allocating a brand new
DocumentMatch instance.
2016-07-21 11:10:49 -07:00
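The optional pre-allocated parameter described in commit 988ca62182 can be sketched as follows. The types and the lowercase next method here are hypothetical simplifications for illustration, not the actual bleve interface: when the caller passes a non-nil match it is filled in and returned, otherwise a new one is allocated.

```go
package main

import "fmt"

// documentMatch and termSearcher are simplified stand-ins.
type documentMatch struct {
	id    string
	score float64
}

type termSearcher struct {
	ids []string
	pos int
}

// next returns the next match, or nil when the searcher is
// exhausted. If preAllocated is non-nil it is reused instead of
// allocating a brand new documentMatch.
func (s *termSearcher) next(preAllocated *documentMatch) *documentMatch {
	if s.pos >= len(s.ids) {
		return nil
	}
	dm := preAllocated
	if dm == nil {
		dm = &documentMatch{}
	}
	dm.id = s.ids[s.pos]
	dm.score = 1.0 // placeholder score for the sketch
	s.pos++
	return dm
}

func main() {
	s := &termSearcher{ids: []string{"a", "b", "c"}}

	// One scratch instance serves every iteration of the loop.
	scratch := &documentMatch{}
	for dm := s.next(scratch); dm != nil; dm = s.next(scratch) {
		fmt.Println(dm.id, dm == scratch)
	}
}
```

Because the scratch instance is overwritten on each call, a caller that needs to retain a match past the next call has to copy it first.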
Marty Schoch d73beac3b9 search result hits now have a field with the name of the index
this allows you to figure out where a result actually came
from when using aliases
2015-12-08 13:55:04 -05:00
Marty Schoch b4d4ee2fff fix incorrect results returned by phrase search
previously phrase searcher would not validate that consecutive
terms were actually occurring in the same array position

fixes #292
2015-12-06 15:55:00 -05:00
Patrick Mezard ee8af9cfa3 doc: document field values storage and retrieval 2015-10-04 11:25:58 +02:00
Silvan Jegen 650d48427d Refactor AddFieldValue method
Removing one level of nesting makes the method easier to read.
2015-09-21 21:14:15 +02:00
Marty Schoch 3682c25467 update to correctly work with composite fields
also updated search results to return array positions
2015-07-31 11:16:11 -04:00
Marty Schoch 300ec79c96 first pass at checking errors that were ignored
part of #169
2015-03-06 14:46:29 -05:00
Marty Schoch eb16b3c563 properly return multi-value fields in an array 2014-11-19 15:58:15 -05:00
Marty Schoch 51a59cb05c initial impl of Index Aliases
an IndexAlias allows you to easily work with one logical Index
while changing the actual Index it's pointing to behind the scenes
Changing which actual Index is backing an IndexAlias can be done
atomically so that your application smoothly transitions from
one Index to another.
A separate use of IndexAlias is allowed when the IndexAlias is
defined to point to multiple Indexes.  In this case only the
Search() operation is supported, but the Search will be run
on each of the underlying indexes in parallel, and the results
will be merged.
2014-10-29 09:22:11 -04:00
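The two uses of an alias described in commit 51a59cb05c can be sketched together. This is an illustration under stated assumptions, not bleve's IndexAlias: the index interface, staticIndex, hit, and the Swap and Search methods are hypothetical, the swap is guarded with a sync.RWMutex as one simple way to make it appear atomic to readers, and merging by descending score is an assumed policy.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

type hit struct {
	ID    string
	Score float64
}

// index is a hypothetical minimal search interface.
type index interface {
	Search(query string) []hit
}

// staticIndex returns fixed hits; it exists only for the demo.
type staticIndex []hit

func (s staticIndex) Search(string) []hit { return s }

// indexAlias points at one or more indexes that can be swapped.
type indexAlias struct {
	mu      sync.RWMutex
	indexes []index
}

// Swap replaces the backing indexes; readers see either the old
// set or the new set, never a mix.
func (a *indexAlias) Swap(in ...index) {
	a.mu.Lock()
	a.indexes = in
	a.mu.Unlock()
}

// Search runs the query on every backing index in parallel and
// merges the hits by descending score.
func (a *indexAlias) Search(query string) []hit {
	a.mu.RLock()
	targets := a.indexes
	a.mu.RUnlock()

	results := make([][]hit, len(targets))
	var wg sync.WaitGroup
	for i, idx := range targets {
		wg.Add(1)
		go func(i int, idx index) {
			defer wg.Done()
			results[i] = idx.Search(query) // each goroutine writes its own slot
		}(i, idx)
	}
	wg.Wait()

	var merged []hit
	for _, r := range results {
		merged = append(merged, r...)
	}
	sort.SliceStable(merged, func(i, j int) bool {
		return merged[i].Score > merged[j].Score
	})
	return merged
}

func main() {
	alias := &indexAlias{}
	alias.Swap(staticIndex{{"a", 0.9}, {"b", 0.2}}, staticIndex{{"c", 0.5}})
	fmt.Println(alias.Search("beer"))

	alias.Swap(staticIndex{{"d", 1.0}}) // transition to a new index
	fmt.Println(alias.Search("beer"))
}
```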
Marty Schoch 7a7eb2e94c add newline between license and package
this avoids cluttering godocs with the license
2014-09-02 10:54:50 -04:00
Marty Schoch f2c781fa21 refactor to make all the query classes private 2014-08-29 18:14:12 -04:00
Marty Schoch 41d4f67ee2 fix storing/retrieving numeric and date fields
also includes new ability to request stored fields be returned with results

closes #55 and closes #56 and closes #58
2014-08-06 13:52:20 -04:00
Marty Schoch 238f3af4bd change highlight api to store in document match 2014-07-03 14:53:44 -04:00
Marty Schoch 3d842dfaf2 initial commit 2014-04-17 16:55:53 -04:00