previous attempt was flawed (but maked by Reset() method)
new approach is to do this work in the Reset() method itself,
logically this is where it belongs.
but further we acknowledge that IndexInternalID []byte lifetime
lives beyond the TermFieldDoc, so another copy is made into
the DocumentMatch. Although this introduces yet another copy
the theory being tested is that it allows each of these
structuress to reuse memory without additional allocation.
IndexInternalID is now []byte
this is still opaque, and should still work for any future
index implementations as it is a least common denominator
choice, all implementations must internally represent the
id as []byte at some point for storage to disk
index id's are now opaque (until finally returned to top-level user)
- the TermFieldDoc's returned by TermFieldReader no longer contain doc id
- instead they return an opaque IndexInternalID
- items returned are still in the "natural index order"
- but that is no longer guaranteed to be "doc id order"
- correct behavior requires that they all follow the same order
- but not any particular order
- new API FinalizeDocID which converts index internal ID's to public string ID
- APIs used internally which previously took doc id now take IndexInternalID
- that is DocumentFieldTerms() and DocumentFieldTermsForFields()
- however, APIs that are used externally do not reflect this change
- that is Document()
- DocumentIDReader follows the same changes, but this is less obvious
- behavior clarified, used to iterate doc ids, BUT NOT in doc id order
- method STILL available to iterate doc ids in range
- but again, you won't get them in any meaningful order
- new method to iterate actual doc ids from list of possible ids
- this was introduced to make the DocIDSearcher continue working
searchers now work with the new opaque index internal doc ids
- they return new DocumentMatchInternal (which does not have string ID)
scorerers also work with these opaque index internal doc ids
- they return DocumentMatchInternal (which does not have string ID)
collectors now also perform a final step of converting the final result
- they STILL return traditional DocumentMatch (with string ID)
- but they now also require an IndexReader (so that they can do the conversion)
at the time you create the term field reader, you can specify
that you don't need the term freq, the norm, or the term vectors
in that case, the index implementation can choose to not return
them in its subsequently returned values
this is advisory only, some simple implementations may ignore this
and continue to return the values anyway (as the current impl of
upside_down does today)
this change will allow future index implementations the
opportunity to do less work when it isn't required
This optimization changes the index.TermFieldReader.Next() interface
API, adding an optional, pre-allocated *TermFieldDoc parameter, which
can help prevent garbage creation.
Currently bleve batch is build by user goroutine
Then read by bleve gourinte
This is still safe when used correctly
However, Reset() will modify the map, which is now a data race
This fix is to simply make batch.Reset() alloc new maps.
This provides a data-access pattern that can be used safely.
Also, this thread argues that creating a new map may be faster
than trying to reuse an existing one:
https://groups.google.com/d/msg/golang-nuts/UvUm3LA1u8g/jGv_FobNpN0J
Separate but related, I have opted to remove the "unsafe batch"
checking that we did. This was always limited anyway, and now
users of Go 1.6 are just as likely to get a panic from the
runtime for concurrent map access anyway. So, the price paid
by us (additional mutex) is not worth it.
fixes#360 and #260
this lays the foundation for supporting the new firestorm
indexing scheme. i'm merging these changes ahead of
the rest of the firestorm branch so i can continue
to make changes to the analysis pipeline in parallel
in some limited cases we can detect unsafe usage
in these cases, do not trip over ourselves and panic
instead return a strongly typed error upside_down.UnsafeBatchUseDetected
also, introduced Batch.Reset() to allow batch reuse
this is currently still experimental
closes#195
this introduces disk format v4
now the summary rows for a term are stored in their own
"dictionary row" format, previously the same information
was stored in special term frequency rows
this now allows us to easily iterate all the terms for a field
in sorted order (useful for many other fuzzy data structures)
at the top-level of bleve you can now browse terms within a field
using the following api on the Index interface:
FieldDict(field string) (index.FieldDict, error)
FieldDictRange(field string, startTerm []byte, endTerm []byte) (index.FieldDict, error)
FieldDictPrefix(field string, termPrefix []byte) (index.FieldDict, error)
fixes#127
more things can return error now
in a couple of places we had to swallow errors because they didn't
fit the existing API. in these case and proactively in a few
others we now return error as well.
also the batch API has been updated to allow performing
set/delete internal within the batch
In the index/store package
introduce KVReader
creates snapshot
all read operations consistent from this snapshot
must close to release
introduce KVWriter
only one writer active
access to all operations
allows for consisten read-modify-write
must close to release
introduce AssociativeMerge operation on batch
allows efficient read-modify-write
for associative operations
used to consolidate updates to the term summary rows
saves 1 set and 1 get op per shared instance of term in field
In the index package
introduced an IndexReader
exposes a consisten snapshot of the index for searching
At top level
All searches now operate on a consisten snapshot of the index
ultimately this is make it more convenient for us to wire up
different elements of the analysis pipeline, without having to
preload everything into memory before we need it
separately the index layer now has a mechanism for storing
internal key/value pairs. this is expected to be used to
store the mapping, and possibly other pieces of data by the
top layer, but not exposed to the user at the top.
this change was then exposed at the higher levels
also the beer-sample app was upgraded to index in batches of 100
by default. this yieled an indexing speed up from 27s to 16s.
closes#57