Previously bleve allowed you to create a memory-only index by
simply passing "" as the path argument to the New() method.
This was not clear when reading the code, and led to some
problematic error cases as well.
Now, to create a memory-only index one should use the
NewMemOnly() method. Passing "" as the path argument
to the New() method will now return os.ErrInvalid.
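A minimal sketch of the new behavior (the mapping setup is
illustrative, not part of the change):

    import (
        "fmt"
        "os"

        "github.com/blevesearch/bleve"
    )

    func example() {
        mapping := bleve.NewIndexMapping()

        // memory-only indexes now have a dedicated constructor
        memIdx, err := bleve.NewMemOnly(mapping)
        if err != nil {
            panic(err)
        }
        defer memIdx.Close()

        // passing "" to New() is now rejected rather than silently
        // creating a memory-only index
        _, err = bleve.New("", mapping)
        fmt.Println(err == os.ErrInvalid) // true
    }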
Advanced users calling NewUsing() can create disk-based or
memory-only indexes, but the change here is that passing ""
as the path argument no longer defaults you into getting
a memory-only index. Instead, the KV store is selected
manually, just as it is for the disk-based solutions.
Here is an example use of the NewUsing() method to create
a memory-only index:
NewUsing("", indexMapping, Config.DefaultIndexType,
Config.DefaultMemKVStore, nil)
Config.DefaultMemKVStore is just a new default value
added to the configuration; it currently points to
gtreap.Name (which could have been used directly
instead for more control).
closes #427
On a dev laptop, the bleve-query benchmark on the wiki dataset using
a query string of "+text:afternoon +text:coffee" previously had a
throughput of 1222qps, and with this change hits 1940qps.
With this change, the upside_down term-field-reader no longer moves
the underlying iterator forward preemptively. Instead, it invokes
Next() on the underlying iterator only when the caller invokes the
term-field-reader's Next().
There's a special case to handle the situation on the first Next()
invocation after the term-field-reader is created.
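A simplified sketch of the lazy-advance pattern (the iterator
interface and names are illustrative, not the exact upside_down
types):

    // iterator stands in for the underlying KV store iterator,
    // which is already positioned on its first entry at creation
    type iterator interface {
        Next()
        Current() (key, val []byte, valid bool)
    }

    type termFieldReader struct {
        iter    iterator
        started bool
    }

    // Next advances the underlying iterator only on demand; the
    // first call is special-cased because the iterator already
    // points at the first entry when the reader is created
    func (r *termFieldReader) Next() ([]byte, []byte, bool) {
        if r.started {
            r.iter.Next()
        } else {
            r.started = true
        }
        return r.iter.Current()
    }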
This commit modifies the upside_down TermFrequencyRow parseKDoc() to
skip the ByteSeparator (0xFF) scan, as we already know the term's
length in the UpsideDownCouchTermFieldReader.
On my dev box, results from the bleve-query test on high-frequency
terms went from 107qps to 124qps.
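A sketch of the idea (the key layout shown is illustrative):

    const byteSeparator byte = 0xff

    // key layout: 't' | field (2 bytes) | term | 0xff | doc id

    // before: scan the key for the separator to find the doc id
    func parseDocIDScan(key []byte) []byte {
        for i := 3; i < len(key); i++ {
            if key[i] == byteSeparator {
                return key[i+1:]
            }
        }
        return nil
    }

    // after: the reader already knows the term's length, so the
    // doc id starts at a fixed offset and no scan is needed
    func parseDocIDKnown(key []byte, termLen int) []byte {
        return key[3+termLen+1:]
    }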
The DumpXXX() methods were always documented as internal and
unsupported. However, now they are being removed from the
public top-level API. They are still available on the internal
IndexReader, which can be accessed using the Advanced() method.
The DocCount() and DumpXXX() methods on the internal index
have moved to the internal index reader, since they logically
operate on a snapshot of an index.
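A sketch of reaching those methods now (assuming this era's
three-value Advanced() signature, which may differ by version):

    import (
        "fmt"

        "github.com/blevesearch/bleve"
    )

    func dumpAll(idx bleve.Index) error {
        internal, _, err := idx.Advanced() // internal index, KV store, error
        if err != nil {
            return err
        }
        reader, err := internal.Reader()
        if err != nil {
            return err
        }
        defer reader.Close()

        // DocCount() and the DumpXXX() methods now live on this
        // snapshot; they remain internal and unsupported
        for row := range reader.DumpAll() {
            fmt.Printf("%v\n", row)
        }
        return nil
    }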
the test had incorrectly been updated to compare the internal
document ids, but these are opaque and may not be the expected
ids in some cases; the test should simply check that the result
corresponds to the correct external ids
This change depends on the recently introduced mossStore Stats() API
from github.com/couchbase/moss commit 564bdbc0, so the gvt vendoring
of moss has been updated as part of this change.
Most of the change involves propagating the mossStore instance (via
the statsFunc callback) so that it's accessible to the KVStore.Stats()
method.
See also: http://review.couchbase.org/#/c/67524/
previously from JSON we would just deserialize strings like
"-abv" or "city" or "_id" or "_score" as simple sorts
on fields, ids or scores respectively
while this is simple and compact, it can be ambiguous (for
example, if you have a field starting with "-", or a field
named "_id" already); also, this simple syntax doesn't allow us
to specify more complex options to deal with type/mode/missing
we keep support for the simple string syntax, but now also
recognize a more expressive syntax like:
{
    "by": "field",
    "field": "abv",
    "desc": true,
    "type": "string",
    "mode": "min",
    "missing": "first"
}
type, mode and missing are optional and default to
"auto", "default", and "last" respectively
in a recent commit, we changed the code to reuse
TermFrequencyRow objects instead of constantly allocating new
ones. unfortunately, one of the original methods was not coded
with this reuse in mind, and a lazy initialization caused us to
leak data from previous uses of the same object.
in particular this caused term vector information from previous
hits to still be applied to subsequent hits. eventually this
causes the highlighter to try and highlight invalid regions
of a slice.
fixes #404
previous attempt was flawed (but masked by the Reset() method)
new approach is to do this work in the Reset() method itself;
logically this is where it belongs.
but further we acknowledge that the IndexInternalID []byte lifetime
lives beyond the TermFieldDoc, so another copy is made into
the DocumentMatch. although this introduces yet another copy,
the theory being tested is that it allows each of these
structures to reuse memory without additional allocation.
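a sketch of the pattern (field names are illustrative, not the
exact upside_down types): clearing reused state belongs in
Reset(), and longer-lived values are copied rather than aliased:

    // a reused per-hit object; vectors was the lazily initialized
    // state that leaked between uses
    type termFieldDoc struct {
        id      []byte
        vectors []termVector
    }

    type termVector struct{ start, end uint64 }

    // Reset clears per-hit state so a reused object cannot carry
    // term vectors (or anything else) over from the previous hit
    func (t *termFieldDoc) Reset() {
        t.id = t.id[:0]
        t.vectors = t.vectors[:0]
    }

    // the id must be copied into the longer-lived DocumentMatch;
    // aliasing it would let the next hit overwrite it
    func copyForDocumentMatch(t *termFieldDoc) []byte {
        idCopy := make([]byte, len(t.id))
        copy(idCopy, t.id)
        return idCopy
    }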
when the term field reader is copying ID values out of the
kv store's iterator, it is already attempting to reuse the
term frequency row data structure. this change allows us
to also attempt to reuse the []byte allocated for previous
copies of the docid. we reset the slice length to zero
then copy the data into the existing slice, avoiding
new allocation and garbage collection in the cases where
there is already enough space
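in Go this is the familiar append-into-truncated-slice idiom; a
minimal sketch:

    // reuse dst's backing array when it has capacity; append only
    // allocates when the incoming id is larger than cap(dst)
    func copyDocID(dst, src []byte) []byte {
        dst = dst[:0] // reset length, keep capacity
        return append(dst, src...)
    }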
IndexInternalID is now []byte
this is still opaque, and should still work for any future
index implementations as it is a least common denominator
choice; all implementations must internally represent the
id as []byte at some point for storage to disk
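a sketch of the type (the Equals helper shown is illustrative):

    import "bytes"

    // IndexInternalID is an opaque document identifier; callers
    // treat it as raw bytes and never parse it
    type IndexInternalID []byte

    func (id IndexInternalID) Equals(other IndexInternalID) bool {
        return bytes.Equal(id, other)
    }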
index id's are now opaque (until finally returned to top-level user)
- the TermFieldDoc's returned by TermFieldReader no longer contain doc id
- instead they return an opaque IndexInternalID
- items returned are still in the "natural index order"
- but that is no longer guaranteed to be "doc id order"
- correct behavior requires that they all follow the same order
- but not any particular order
- new API FinalizeDocID which converts index internal ID's to public string ID
- APIs used internally which previously took doc id now take IndexInternalID
- that is DocumentFieldTerms() and DocumentFieldTermsForFields()
- however, APIs that are used externally do not reflect this change
- that is Document()
- DocumentIDReader follows the same changes, but this is less obvious
- behavior clarified, used to iterate doc ids, BUT NOT in doc id order
- method STILL available to iterate doc ids in range
- but again, you won't get them in any meaningful order
- new method to iterate actual doc ids from list of possible ids
- this was introduced to make the DocIDSearcher continue working
searchers now work with the new opaque index internal doc ids
- they return new DocumentMatchInternal (which does not have string ID)
scorers also work with these opaque index internal doc ids
- they return DocumentMatchInternal (which does not have string ID)
collectors now also perform a final step of converting the final result
- they STILL return traditional DocumentMatch (with string ID)
- but they now also require an IndexReader (so that they can do the conversion; see the sketch below)
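a sketch of that final conversion step (types are trimmed to the
essentials; the FinalizeDocID signature is assumed from the
description above):

    type IndexInternalID []byte

    type IndexReader interface {
        FinalizeDocID(id IndexInternalID) (string, error)
    }

    type DocumentMatchInternal struct {
        ID    IndexInternalID
        Score float64
    }

    type DocumentMatch struct {
        ID    string
        Score float64
    }

    // finalize resolves each opaque internal id to its public
    // string id via the IndexReader, producing the traditional
    // DocumentMatch results that collectors still return
    func finalize(reader IndexReader, in []*DocumentMatchInternal) ([]*DocumentMatch, error) {
        out := make([]*DocumentMatch, 0, len(in))
        for _, dm := range in {
            externalID, err := reader.FinalizeDocID(dm.ID)
            if err != nil {
                return nil, err
            }
            out = append(out, &DocumentMatch{ID: externalID, Score: dm.Score})
        }
        return out, nil
    }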