bleve

Author	SHA1	Message	Date
Steve Yen	46a46357a7	simplify BooleanSearcher mustSearcher else logic	2016-09-20 18:11:04 -07:00
Steve Yen	3acad78875	optimize boolean search Next() with fewer id comparisons This change to the BooleanSearcher.Next() tries to perform fewer internal id comparisons.	2016-09-20 09:43:01 -07:00
Steve Yen	26b621e916	reuse backing array of matches for boolean searcher The reused backing array of constituent matches should help avoid additional memory allocations.	2016-09-18 10:43:29 -07:00
Steve Yen	dd7cb14a56	disjunction searcher avoids second ID.Equals() comparison Optimization for DisjunctionSearcher, where an extra matchingIdxs helps track the currs that were matching. This avoids the previous code's second loop through the currs slice.	2016-09-18 10:43:16 -07:00
Steve Yen	090c08eb46	upside_down disjunction searcher reuses matching slice	2016-09-18 10:43:16 -07:00
Marty Schoch	3fd2a64872	BREAKING CHANGE - removed DumpXXX() methods from bleve.Index The DumpXXX() methods were always documented as internal and unsupported. However, now they are being removed from the public top-level API. They are still available on the internal IndexReader, which can be accessed using the Advanced() method. The DocCount() and DumpXXX() methods on the internal index have moved to the internal index reader, since they logically operate on a snapshot of an index.	2016-09-13 12:40:01 -04:00
Marty Schoch	1b68c4ec5b	make backindex rows more compact, fix bug counting docs on start	2016-09-11 20:29:15 -04:00
Marty Schoch	ce0b299d6f	switch sort impl to use interface this improves perf in the case where we're not doing any sorting as we avoid allocating memory and converting scores into numeric terms	2016-08-24 19:02:22 -04:00
Marty Schoch	24a2b57e29	refactor search package to reuse DocumentMatch and ID []byte's the motivation for this commit is long and detailed and has been documented externally here: https://gist.github.com/mschoch/5cc5c9cf4669a5fe8512cb7770d3c1a2 the core of the changes are: 1. recognize that collector/searcher need only a fixed number of DocumentMatch instances, and this number can be determined from the structure of the query, not the size of the data 2. knowing this, instances can be allocated in bulk, up front and they can be reused without locking (since all search operations take place in a single goroutine) 3. combined with previous commits which enabled reuse of the IndexInternalID []byte, this allows for no allocation/copy of these bytes as well (by using DocumentMatch Reset() method when returning entries to the pool)	2016-08-08 22:21:47 -04:00
Marty Schoch	4b1b866e0f	remove commented out old code	2016-08-02 16:48:00 -04:00
Marty Schoch	e188fe35f7	switch back to single DocumentMatch struct instead of separate DocumentMatch/DocumentMatchInternal rules are simple, everything operates on the IndexInternalID field until the results are returned, then ID is set correctly the IndexInternalID field is not exported to JSON	2016-08-01 14:58:02 -04:00
Marty Schoch	1aacd9bad5	changed approach IndexInternalID is now []byte this is still opaque, and should still work for any future index implementations as it is a least common denominator choice, all implementations must internally represent the id as []byte at some point for storage to disk	2016-08-01 14:26:50 -04:00
Marty Schoch	5aa9e95468	major refactor of index/search API index id's are now opaque (until finally returned to top-level user) - the TermFieldDoc's returned by TermFieldReader no longer contain doc id - instead they return an opaque IndexInternalID - items returned are still in the "natural index order" - but that is no longer guaranteed to be "doc id order" - correct behavior requires that they all follow the same order - but not any particular order - new API FinalizeDocID which converts index internal ID's to public string ID - APIs used internally which previously took doc id now take IndexInternalID - that is DocumentFieldTerms() and DocumentFieldTermsForFields() - however, APIs that are used externally do not reflect this change - that is Document() - DocumentIDReader follows the same changes, but this is less obvious - behavior clarified, used to iterate doc ids, BUT NOT in doc id order - method STILL available to iterate doc ids in range - but again, you won't get them in any meaningful order - new method to iterate actual doc ids from list of possible ids - this was introduced to make the DocIDSearcher continue working searchers now work with the new opaque index internal doc ids - they return new DocumentMatchInternal (which does not have string ID) scorers also work with these opaque index internal doc ids - they return DocumentMatchInternal (which does not have string ID) collectors now also perform a final step of converting the final result - they STILL return traditional DocumentMatch (with string ID) - but they now also require an IndexReader (so that they can do the conversion)	2016-07-31 13:46:18 -04:00
Marty Schoch	47ee69ae82	term field reader supports optionally omitting 3 details at the time you create the term field reader, you can specify that you don't need the term freq, the norm, or the term vectors in that case, the index implementation can choose to not return them in its subsequently returned values this is advisory only, some simple implementations may ignore this and continue to return the values anyway (as the current impl of upside_down does today) this change will allow future index implementations the opportunity to do less work when it isn't required	2016-07-30 10:26:42 -04:00
Steve Yen	4822cff63a	optimize Advance() with pre-allocated in-out param This perf-related change helps the code and API reach more similarity with the Next() methods, which now take a pre-allocated param.	2016-07-29 14:15:00 -07:00
Steve Yen	3c82086805	optimize upside_down reader & 64-bit struct alignments The UpsideDownCouchTermFieldReader.Next() only needs the doc ID from the key, so this change provides a specialized parseKDoc() method for that optimization. Additionally, fields in various structs are more 64-bit aligned, in an attempt to reduce the invocations of runtime.typedmemmove() and runtime.heapBitsBulkBarrier(), which the go compiler seems to automatically insert to transparently handle misaligned data.	2016-07-23 10:37:40 -07:00
Steve Yen	39d3e2f028	optimize upside_down reader Next() with TermFieldDoc reuse This optimization changes the index.TermFieldReader.Next() interface API, adding an optional, pre-allocated *TermFieldDoc parameter, which can help prevent garbage creation.	2016-07-21 11:10:49 -07:00
Steve Yen	988ca62182	optimize upside_down reader Next() with doc match reuse This optimization changes the search.Search.Next() interface API, adding an optional, pre-allocated DocumentMatch parameter. When it's non-nil, the TermSearcher and TermQueryScorer will use that pre-allocated DocumentMatch, instead of allocating a brand new DocumentMatch instance.	2016-07-21 11:10:49 -07:00
Marty Schoch	54b06ce0f6	fix bug in regexp, prefix and fuzzy searchers these searchers incorrectly called Next() on their underlying searcher, instead of Advance(). this can cause values to be returned with an ID less than the one that was Advanced() to, which violates the contract, and causes other incorrect behavior. fixes #342	2016-06-21 09:00:05 -04:00
Marty Schoch	53f7eb2891	multi-term searches check DisjunctionMaxClauseCount earlier regexp, fuzzy and numeric range searchers now check to see if they will be exceeding a configured DisjunctionMaxClauseCount and stop work earlier, this does a better job of avoiding situations which consume all available memory for an operation they cannot complete	2016-04-18 10:06:34 -04:00
Marty Schoch	2a703376ea	fix ineffectual assignments	2016-04-02 22:42:56 -04:00
Marty Schoch	194ee82c80	gofmt simplifications	2016-04-02 21:54:33 -04:00
Marty Schoch	74a52f94bb	prefix,regexp, and fuzzy searchers failed to close fieldDict	2016-02-20 15:41:12 -05:00
Marty Schoch	ebb7d2d076	added ability to limit the max number of disjunction clauses set DisjunctionMaxClauseCount to a non-zero value to enforce the limit	2016-02-08 17:21:03 -05:00
slavikm	c1ce8910d7	pass the boost value into the term searcher	2016-02-03 14:49:11 -08:00
opennota	8517feb1c6	Fix some typos	2016-01-15 05:46:27 +07:00
Silvan Jegen	84c755cdb0	Add tests for fuzzy search	2015-12-20 17:00:46 +01:00
Marty Schoch	f7698f1f15	support match_all, match_none and docid queries via JSON also fixed bug in docIDQuery execution which would cause not matching the highest docID passed in if it was in fact a valid ID	2015-12-16 14:53:14 -05:00
Marty Schoch	b4d4ee2fff	fix incorrect results returned by phrase search previously phrase searcher would not validate that consecutive terms were actually occurring in the same array position fixes #292	2015-12-06 15:55:00 -05:00
Patrick Mezard	19230b2f8a	searcher_docid: catch DocIDReader.Close() possible error	2015-11-04 19:24:01 +01:00
Patrick Mezard	ff7234d893	query_docid: add DocIDQuery to filter by document identifiers	2015-11-04 18:41:16 +01:00
Marty Schoch	900f1b4a67	major kvstore interface and impl overhaul clarified the interface contract	2015-09-23 11:25:47 -07:00
Marty Schoch	dbb93b75a4	refactoring to allow pluggable index encodings this lays the foundation for supporting the new firestorm indexing scheme. i'm merging these changes ahead of the rest of the firestorm branch so i can continue to make changes to the analysis pipeline in parallel	2015-09-02 13:12:08 -04:00
Marty Schoch	a9c07acbfa	refactor of kvstore api to support native merge in rocksdb refactor to share code in emulated batch refactor to share code in emulated merge refactor index kvstore benchmarks to share more code refactor index kvstore benchmarks to be more repeatable	2015-04-24 17:13:50 -04:00
Marty Schoch	539aeb8dc7	fix errors identified by errcheck part of #169	2015-04-07 18:05:41 -04:00
Marty Schoch	c8d974048a	fix issues identified by errcheck part of #169	2015-04-07 14:59:35 -04:00
Marty Schoch	867110e03b	major improvements to index row encoding improvements uncovered some issues with how k/v data was copied or not. to address this, kv abstraction layer now lets impl specify if the bytes returned are safe to use after a reader (or writer since writers are also readers) is closed See index/store/KVReader - BytesSafeAfterClose() bool false is the safe value if you're not sure it will cause index impls to copy the data Some kv impls already have created a copy at the C-api barrier in which case they can safely return true. Overall this yields ~25% speedup for searches with leveldb. It yields ~10% speedup for boltdb. Returning stored fields is now slower with boltdb, as previously we were returning unsafe bytes.	2015-04-03 16:50:48 -04:00
Marty Schoch	a41f229b14	added regexp and wildcard queries fixes #152	2015-03-11 16:57:22 -04:00
Marty Schoch	183fcd4b14	added a missing check for errors	2015-03-11 16:56:01 -04:00
Marty Schoch	522f9d5cc7	significant change to index format, support dictionary rows this introduces disk format v4 now the summary rows for a term are stored in their own "dictionary row" format, previously the same information was stored in special term frequency rows this now allows us to easily iterate all the terms for a field in sorted order (useful for many other fuzzy data structures) at the top-level of bleve you can now browse terms within a field using the following api on the Index interface: FieldDict(field string) (index.FieldDict, error) FieldDictRange(field string, startTerm []byte, endTerm []byte) (index.FieldDict, error) FieldDictPrefix(field string, termPrefix []byte) (index.FieldDict, error) fixes #127	2015-03-10 16:22:19 -04:00
Marty Schoch	300ec79c96	first pass at checking errors that were ignored part of #169	2015-03-06 14:46:29 -05:00
Marty Schoch	0ed47f5343	fix advance logic to not skip over result	2015-01-22 09:56:40 -05:00
Marty Schoch	5a09ceeac8	fix traversal logic when not in expected order	2015-01-22 09:56:21 -05:00
Silvan Jegen	ef18dfe4cd	Fix typos in comments and strings	2014-12-18 18:43:12 +01:00
Marty Schoch	fc33752c80	moved levenshtein code outside of fuzzy searcher should allow easier reuse	2014-12-12 13:23:06 -05:00
Marty Schoch	67beaca6d6	fix to phrase/phrase match search involving stop words closes #122	2014-11-25 10:07:54 -05:00
Marty Schoch	c7443fe52b	refactored API a bit more things can return error now in a couple of places we had to swallow errors because they didn't fit the existing API. in these cases and proactively in a few others we now return error as well. also the batch API has been updated to allow performing set/delete internal within the batch	2014-10-31 09:40:23 -04:00
Marty Schoch	3a0263bb72	finished initial impl of fuzzy search you can do a manual fuzzy term search using the FuzzyQuery struct or, more suitable for most users the MatchQuery now supports some fuzzy options. Here you can specify fuzziness and prefix_length, to turn the underlying term search into a fuzzy term search. This has the benefit that analysis is performed on your input, just like the analyzed field, prior to computing the fuzzy variants. closes #82	2014-10-24 13:39:48 -04:00
Marty Schoch	d485b0ef26	initial impl of fuzzy search	2014-10-23 13:02:29 -04:00
Marty Schoch	97902e2619	text analysis now moved out of index write lock onto goroutine 1. text analysis is now done before the write lock is acquired 2. there is now a pool of analysis workers 3. the size of this pool is configurable 4. this allows for documents in a batch to be analyzed concurrently as a part of benchmarking these changes i've also introduced a new null storage implementation. this should never be used, as it does not actually build an index. it does however let us go through all the normal indexing machinery, without incurring any indexing I/O. this is very helpful in measuring improvements made to the text analysis pipeline, which are often overshadowed by indexing times in benchmarks actually building an index.	2014-09-24 08:13:14 -04:00
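
Several entries in this log (24a2b57e29, 988ca62182, 39d3e2f028) describe the same idea: allocate a fixed number of match objects up front and reuse them, resetting each one when it goes back to the pool so that the ID's backing array is kept. The Go sketch below illustrates that pattern only. It is not bleve's code: the type names, fields, and methods here are simplified, hypothetical stand-ins, and the pool is deliberately unsynchronized on the assumption (stated in 24a2b57e29) that a search runs in a single goroutine.

```go
package main

import "fmt"

// DocumentMatch is a simplified, hypothetical stand-in for a search hit.
type DocumentMatch struct {
	IndexInternalID []byte
	Score           float64
}

// Reset clears the match for reuse while keeping the ID's backing array,
// so a later append can avoid a fresh allocation.
func (dm *DocumentMatch) Reset() {
	dm.IndexInternalID = dm.IndexInternalID[:0]
	dm.Score = 0
}

// DocumentMatchPool hands out pre-allocated matches.
// It is NOT safe for concurrent use (assumes one goroutine per search).
type DocumentMatchPool struct {
	avail []*DocumentMatch
}

// NewDocumentMatchPool allocates all instances in one bulk allocation.
func NewDocumentMatchPool(size int) *DocumentMatchPool {
	backing := make([]DocumentMatch, size)
	avail := make([]*DocumentMatch, size)
	for i := range backing {
		avail[i] = &backing[i]
	}
	return &DocumentMatchPool{avail: avail}
}

// Get returns a pooled instance, falling back to a fresh allocation
// if the pool was sized too small.
func (p *DocumentMatchPool) Get() *DocumentMatch {
	if len(p.avail) == 0 {
		return &DocumentMatch{}
	}
	dm := p.avail[len(p.avail)-1]
	p.avail = p.avail[:len(p.avail)-1]
	return dm
}

// Put resets the instance and makes it available again.
func (p *DocumentMatchPool) Put(dm *DocumentMatch) {
	dm.Reset()
	p.avail = append(p.avail, dm)
}

func main() {
	pool := NewDocumentMatchPool(2)

	dm := pool.Get()
	dm.IndexInternalID = append(dm.IndexInternalID, "doc-1"...)
	dm.Score = 1.5
	pool.Put(dm)

	// The same instance comes back, emptied but with its capacity intact.
	again := pool.Get()
	fmt.Println(again == dm, len(again.IndexInternalID), cap(again.IndexInternalID) >= 5)
}
```

In this sketch the pool size is passed in by the caller; the commit message says the required number can be derived from the structure of the query, which is not modeled here.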


57 Commits