Optimization for DisjunctionSearcher, where an extra matchingIdxs
helps track the currs that were matching. This avoids the previous
code's second loop through the currs slice.
The DumpXXX() methods were always documented as internal and
unsupported. However, now they are being removed from the
public top-level API. They are still available on the internal
IndexReader, which can be accessed using the Advanced() method.
The DocCount() and DumpXXX() methods on the internal index
have moved to the internal index reader, since they logically
operate on a snapshot of an index.
the motivation for this commit is long and detailed and has been
documented externally here:
https://gist.github.com/mschoch/5cc5c9cf4669a5fe8512cb7770d3c1a2
the core of the changes are:
1. recognize that collector/searcher need only a fixed number
of DocumentMatch instances, and this number can be determined
from the structure of the query, not the size of the data
2. knowing this, instances can be allocated in bulk, up front
and they can be reused without locking (since all search
operations take place in a single goroutine
3. combined with previous commits which enabled reuse of
the IndexInternalID []byte, this allows for no allocation/copy
of these bytes as well (by using DocumentMatch Reset() method
when returning entries to the pool
instead of separate DocumentMatch/DocumentMatchInternal
rules are simple, everything operates on the IndexInternalID field
until the results are returned, then ID is set correctly
the IndexInternalID field is not exported to JSON
IndexInternalID is now []byte
this is still opaque, and should still work for any future
index implementations as it is a least common denominator
choice, all implementations must internally represent the
id as []byte at some point for storage to disk
index id's are now opaque (until finally returned to top-level user)
- the TermFieldDoc's returned by TermFieldReader no longer contain doc id
- instead they return an opaque IndexInternalID
- items returned are still in the "natural index order"
- but that is no longer guaranteed to be "doc id order"
- correct behavior requires that they all follow the same order
- but not any particular order
- new API FinalizeDocID which converts index internal ID's to public string ID
- APIs used internally which previously took doc id now take IndexInternalID
- that is DocumentFieldTerms() and DocumentFieldTermsForFields()
- however, APIs that are used externally do not reflect this change
- that is Document()
- DocumentIDReader follows the same changes, but this is less obvious
- behavior clarified, used to iterate doc ids, BUT NOT in doc id order
- method STILL available to iterate doc ids in range
- but again, you won't get them in any meaningful order
- new method to iterate actual doc ids from list of possible ids
- this was introduced to make the DocIDSearcher continue working
searchers now work with the new opaque index internal doc ids
- they return new DocumentMatchInternal (which does not have string ID)
scorerers also work with these opaque index internal doc ids
- they return DocumentMatchInternal (which does not have string ID)
collectors now also perform a final step of converting the final result
- they STILL return traditional DocumentMatch (with string ID)
- but they now also require an IndexReader (so that they can do the conversion)
at the time you create the term field reader, you can specify
that you don't need the term freq, the norm, or the term vectors
in that case, the index implementation can choose to not return
them in its subsequently returned values
this is advisory only, some simple implementations may ignore this
and continue to return the values anyway (as the current impl of
upside_down does today)
this change will allow future index implementations the
opportunity to do less work when it isn't required
The UpsideDownCouchTermFieldReader.Next() only needs the doc ID from
the key, so this change provides a specialized parseKDoc() method for
that optimization.
Additionally, fields in various structs are more 64-bit aligned, in an
attempt to reduce the invocations of runtime.typedmemmove() and
runtime.heapBitsBulkBarrier(), which the go compiler seems to
automatically insert to transparently handle misaligned data.
This optimization changes the index.TermFieldReader.Next() interface
API, adding an optional, pre-allocated *TermFieldDoc parameter, which
can help prevent garbage creation.
This optimization changes the search.Search.Next() interface API,
adding an optional, pre-allocated *DocumentMatch parameter.
When it's non-nil, the TermSearcher and TermQueryScorer will use that
pre-allocated *DocumentMatch, instead of allocating a brand new
DocumentMatch instance.
these searchers incorrectly called Next() on their underlying
searcher, instead of Advance(). this can cause values to be
returned with an ID less than the one that was Advanced() to,
which violates the contract, and causes other incorrect behavior.
fixes#342
regexp, fuzzy and numeric range searchers now check to see if
they will be exceeding a configured DisjunctionMaxClauseCount
and stop work earlier, this does a better job of avoiding
situations which consume all available memory for an operation
they cannot complete
this lays the foundation for supporting the new firestorm
indexing scheme. i'm merging these changes ahead of
the rest of the firestorm branch so i can continue
to make changes to the analysis pipeline in parallel
refactor to share code in emulated batch
refactor to share code in emulated merge
refactor index kvstore benchmarks to share more code
refactor index kvstore benchmarks to be more repeatable
improvements uncovered some issues with how k/v data was copied
or not. to address this, kv abstraction layer now lets impl
specify if the bytes returned are safe to use after a reader
(or writer since writers are also readers) are closed
See index/store/KVReader - BytesSafeAfterClose() bool
false is the safe value if you're not sure
it will cause index impls to copy the data
Some kv impls already have created a copy a the C-api barrier
in which case they can safely return true.
Overall this yields ~25% speedup for searches with leveldb.
It yields ~10% speedup for boltdb.
Returning stored fields is now slower with boltdb, as previously
we were returning unsafe bytes.
this introduces disk format v4
now the summary rows for a term are stored in their own
"dictionary row" format, previously the same information
was stored in special term frequency rows
this now allows us to easily iterate all the terms for a field
in sorted order (useful for many other fuzzy data structures)
at the top-level of bleve you can now browse terms within a field
using the following api on the Index interface:
FieldDict(field string) (index.FieldDict, error)
FieldDictRange(field string, startTerm []byte, endTerm []byte) (index.FieldDict, error)
FieldDictPrefix(field string, termPrefix []byte) (index.FieldDict, error)
fixes#127
more things can return error now
in a couple of places we had to swallow errors because they didn't
fit the existing API. in these case and proactively in a few
others we now return error as well.
also the batch API has been updated to allow performing
set/delete internal within the batch
you can do a manual fuzzy term search using the FuzzyQuery struct
or, more suitable for most users the MatchQuery now supports
some fuzzy options. Here you can specify fuzziness and
prefix_length, to turn the underlying term search into a fuzzy
term search. This has the benefit that analysis is performed
on your input, just like the analyzed field, prior to computing
the fuzzy variants.
closes#82
1. text analysis is now done before the write lock is acquired
2. there is now a pool of analysis workers
3. the size of this pool is configurable
4. this allows for documents in a batch to be analyzed concurrently
as a part of benchmarking these changes i've also introduce a new
null storage implementation. this should never be used, as it
does not actualy build an index. it does however let us go
through all the normal indexing machinery, without incuring
any indexing I/O. this is very helpful in measuring improvements
made to the text analsysis pipeline, which are often overshadowed
by indexing times in benchmarks actually building an index.