Observed problem:
Persisted index state (in root bolt) would contain index snapshots which
pointed to index files that did not exist.
Debugging this uncovered two main problems:
1. At the end of persisting a snapshot, the persister creates a new index
snapshot with the SAME epoch as the current root, only it replaces in-memory
segments with the new disk based ones. This is problematic because reference
counting an index segment triggers "eligible for deletion". And eligible for
deletion is keyed by epoch. So having two separate instances going by the same
epoch is problematic. Specifically, one of them gets to 0 before the other,
and we wrongly conclude it's eligible for deletion, when in fact the "other"
instance with same epoch is actually still in use.
To address this problem, we have modified the behavior of the persister. Now,
upon completion of persistence, ONLY if new files were actually created do we
proceed to introduce a new snapshot. AND, this new snapshot now gets its own
brand new epoch. BOTH of these are important because, since the persister now
also introduces a new epoch, it will see this epoch again in the future AND be
expected to persist it. That is OK (mostly harmless), but we cannot allow it
to form a loop. Checking that new files were actually introduced is what
short-circuits the potential loop. The new epoch introduced by the persister,
if seen again will not have any new segments that actually need persisting to
disk, and the cycle is stopped.
2. The implementation of NumSnapshotsToKeep, and related code to delete old
snapshots from the root bolt also contained problems. Specifically, the
determination of which snapshots to keep vs delete did not consider which ones
were actually persisted. So, let's say you had set NumSnapshotsToKeep to 3: if
the introducer gets 3 snapshots ahead of the persister, what can happen is that
the three snapshots we choose to keep are all in memory. We now wrongly delete
all of the snapshots from the root bolt. But it gets worse, in this instant of
time, we now have files on disk that nothing in the root bolt points to, so we
also go ahead and delete those files. Those files were still being referenced
by the in-memory snapshots. But, now even if they get persisted to disk, they
simply have references to non-existent files. Opening up one of these indexes
results in lost data (often everything).
To address this problem, we made a large change to the way this section of code
operates. First, we now start with a list of all epochs actually persisted in
the root bolt. Second, we set aside NumSnapshotsToKeep of these snapshots to
keep. Third, anything else in the eligibleForRemoval list will be deleted. I
suspect this code is slower and less elegant, but I think it is more correct.
Also, NumSnapshotsToKeep previously defaulted to 0; I have now defaulted it to
1, which feels like saner out-of-the-box behavior (it's debatable whether the
original intent was instead for "extra" snapshots to keep, but with the
variable named as it is, 1 makes more sense to me).
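The keep/delete decision described above can be sketched roughly as follows.
This is a simplified illustration, not the actual scorch code: the helper name
epochsToRemove and its parameters are hypothetical, and it assumes that a
higher epoch always means a newer snapshot.

```go
package main

import (
	"fmt"
	"sort"
)

// epochsToRemove is a hypothetical helper. persisted holds the epochs that
// are actually present in the root bolt, eligible holds the epochs whose
// ref-count dropped to 0, and numToKeep plays the role of NumSnapshotsToKeep.
func epochsToRemove(persisted, eligible []uint64, numToKeep int) []uint64 {
	// Sort a copy newest first (assumption: higher epoch == newer snapshot).
	sorted := append([]uint64(nil), persisted...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })

	// Set aside the newest numToKeep persisted epochs.
	keep := make(map[uint64]bool)
	for i := 0; i < len(sorted) && i < numToKeep; i++ {
		keep[sorted[i]] = true
	}

	// Anything else that is eligible for removal gets deleted.
	var remove []uint64
	for _, e := range eligible {
		if !keep[e] {
			remove = append(remove, e)
		}
	}
	return remove
}

func main() {
	persisted := []uint64{2, 3, 4} // newer epochs exist only in memory
	fmt.Println(epochsToRemove(persisted, persisted, 3))
	fmt.Println(epochsToRemove(persisted, persisted, 1))
}
```

Because the kept set is chosen only from persisted epochs, an introducer that
runs ahead of the persister can no longer cause every persisted snapshot to be
dropped.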
Other minor changes included in this change:
- The 'nextSnapshotEpoch', 'eligibleForRemoval', and
'ineligibleForRemoval' members of the Scorch struct were moved into the
paragraph with 'rootLock' to clarify that you must hold the lock to access them.
- TestBatchRaceBug260 was updated to properly Close() the index; failing to do
so led to occasional test failures.
+ Track memory usage at a segment level
+ Add a new scorch API: MemoryUsed()
- Aggregate the memory consumption across
segments when API is invoked.
+ TODO:
- Revisit the second iteration to see if it can be gotten
rid of, with the size accounted for during the first
run while building an in-mem segment.
- Accounting for pointer and slice overhead.
+ Adding new entries to the stats struct of scorch.
+ These stats are atomically incremented upon every segment
introduction, and upon successful persistence.
docValues are persisted along with the index,
in a columnar fashion per field with variable
sized chunking for quick look up.
- naive chunk level caching is added per field
- data part inside a chunk is snappy compressed
- metaHeader inside the chunk indexes the dv values
inside the uncompressed data part
- all the fields are docValue persisted in this iteration
Added an option to the kvconfig JSON, called "unsafe_batch" (bool).
Default is false, so Batch() calls are synchronously persisted by
default. Advanced users may want unsafe, asynchronous persistence
to trade off safety for performance (mutations are queryable sooner).
    {
      "index_type": "scorch",
      "kvconfig": { "unsafe_batch": true }
    }
This change replaces the previous kvstore=="moss" workaround.
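As a rough sketch of how such a config could be read, assuming it arrives as
raw JSON (the helper name unsafeBatch is hypothetical and this is not the
actual scorch code path):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// unsafeBatch is a hypothetical helper that extracts "unsafe_batch" from the
// "kvconfig" object. It returns false when the key is absent or not a bool,
// matching the documented default of synchronous persistence.
func unsafeBatch(raw []byte) (bool, error) {
	var cfg struct {
		IndexType string                 `json:"index_type"`
		KVConfig  map[string]interface{} `json:"kvconfig"`
	}
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return false, err
	}
	// Reading a missing key (or a nil map) yields nil, so the assertion fails
	// and v stays false.
	v, _ := cfg.KVConfig["unsafe_batch"].(bool)
	return v, nil
}

func main() {
	raw := []byte(`{"index_type": "scorch", "kvconfig": {"unsafe_batch": true}}`)
	v, err := unsafeBatch(raw)
	fmt.Println(v, err)
}
```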
With more pprof focusing (zooming in on a particular func), there were
still some memory allocations showing up with docNumberToBytes() in
micro benchmarks of bleve-query. On a dev macbook, on an index of 50K
wikipedia docs, using search of relatively common "text:date"...
400 qps - upsidedown/moss
680 qps - scorch before
775 qps - scorch after
With this change, there are no more memory allocations in the calls to
PostingsIterator.Next() in the micro benchmarks of bleve-query. On a
dev macbook, on an index of 50K wikipedia docs, using high frequency
search of "text:date"...
400 qps - upsidedown/moss
565 qps - scorch before
680 qps - scorch after
With the previous commit, there can be a scenario where batches that
had internal-updates-only can be rapidly introduced by the app, but
the persisted notifications on only the very last IndexSnapshot would
be fired. The persisted notifications on the in-between batches might
be missed.
The solution was to track the persisted notification channels at a
higher Scorch struct level, instead of tracking the persisted channels
at the IndexSnapshot and SegmentSnapshot levels.
Also, the persister double-check looping was simplified, which avoids
a race where an introducer might incorrectly not notify the persister.
This commit improves handling when an incoming batch has internal-data
updates only and no doc updates. In this case, a nil segment instead
of an empty segment instance is used in the segmentIntroduction. That is,
the segmentIntroduction might now hold internal-data
updates only.
To handle synchronous persistence, a new field that's a slice of
persisted notification channels is added to the IndexSnapshot struct,
which the persister goroutine will close as each IndexSnapshot is
persisted.
Also, as part of this change, instead of checking the unsafeBatch flag
in several places, we instead check for non-nil'ness of these
persisted chan's.
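The notification pattern described here can be sketched as below. The type and
function names are hypothetical stand-ins, not the real IndexSnapshot or
persister code; the point is only that closing a channel wakes every waiter and
that a nil slice means nobody asked for synchronous persistence.

```go
package main

import "fmt"

// snapshot is a hypothetical stand-in for IndexSnapshot.
type snapshot struct {
	epoch     uint64
	persisted []chan struct{} // nil when no caller is waiting (unsafe batch)
}

// markPersisted closes every waiter channel, as the persister goroutine would
// once the snapshot has been written to disk.
func markPersisted(s *snapshot) {
	for _, ch := range s.persisted {
		close(ch)
	}
	s.persisted = nil
}

func main() {
	ch := make(chan struct{})
	s := &snapshot{epoch: 1, persisted: []chan struct{}{ch}}

	go markPersisted(s) // stands in for the persister loop

	<-ch // a synchronous Batch() call would block here
	fmt.Println("batch persisted")
}
```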
Previously, CalcBudget() was treating
MergePlanOptions.SegmentsPerMergeTask as the growth factor while
computing the idealized staircase of segments.
This change introduces a TierGrowth option to MergePlanOptions for
more control and so that SegmentsPerMergeTask can be tweaked
independently of the tier growth factor.
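To illustrate why a separate growth factor matters, here is a simplified
sketch of an idealized staircase budget. It is not the actual CalcBudget()
code: the function signature and the maxSegmentsPerTier parameter are
assumptions made for the example.

```go
package main

import (
	"fmt"
	"math"
)

// calcBudget is a hypothetical, simplified staircase calculation. Each tier
// holds up to maxSegmentsPerTier segments of tierSize, and tierSize is
// multiplied by tierGrowth when moving to the next tier.
func calcBudget(totalSize, firstTierSize int64, maxSegmentsPerTier int, tierGrowth float64) int {
	tierSize := firstTierSize
	if tierSize < 1 {
		tierSize = 1
	}
	if maxSegmentsPerTier < 1 {
		maxSegmentsPerTier = 1
	}
	if tierGrowth < 1.0 {
		tierGrowth = 1.0
	}

	budget := 0
	for totalSize > 0 {
		segmentsInTier := float64(totalSize) / float64(tierSize)
		if segmentsInTier < float64(maxSegmentsPerTier) {
			// The remaining data fits in a partially filled tier.
			budget += int(math.Ceil(segmentsInTier))
			break
		}
		budget += maxSegmentsPerTier
		totalSize -= int64(maxSegmentsPerTier) * tierSize
		tierSize = int64(float64(tierSize) * tierGrowth)
	}
	return budget
}

func main() {
	// Same data and tier width, different growth factors.
	fmt.Println(calcBudget(1000, 10, 10, 10.0))
	fmt.Println(calcBudget(1000, 10, 10, 2.0))
}
```

A smaller growth factor yields more, shallower steps and therefore a larger
segment budget for the same amount of data.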
On a couple of micro benchmarks on a dev macbook using bleve-query on
an index of 50K wikipedia docs, scorch is now faster than
upsidedown/moss on high-freq term search "text:date"...
400 qps - upsidedown/moss
404 qps - scorch before
565 qps - scorch after
On a couple of micro benchmarks on a dev macbook using bleve-query on
an index of 50K wikipedia docs, scorch is now more in the same
neighborhood as upsidedown/moss...
high-freq term search "text:date"...
400 qps - upsidedown/moss
360 qps - scorch before
404 qps - scorch after
zero-freq term search "text:mschoch"...
100K qps - upsidedown/moss
55K qps - scorch before
99K qps - scorch after
Of note, the scorch index had ~150 *.zap files in it, which likely
made the worker goroutine overhead more costly than for a case
with few segments, where goroutine and channel related work appeared
relatively prominently in the pprof SVG's.
The cachedDocs preparation has to happen for all docs in the field,
not just on the currently requested docNum.
Also, as part of this commit, there's a loop optimization where we no
longer use bytes.Split() on the terms buffer, thus avoiding garbage
creation.
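The general shape of that loop optimization can be sketched as follows. The
function name and the separator byte are assumptions for illustration; the
real terms buffer layout may differ.

```go
package main

import (
	"bytes"
	"fmt"
)

// visitTerms calls fn for each sep-delimited term in buf. Unlike
// bytes.Split, it does not allocate a [][]byte; each term passed to fn is a
// sub-slice that aliases buf. Note one difference from bytes.Split: a
// trailing separator does not produce a final empty term.
func visitTerms(buf []byte, sep byte, fn func(term []byte)) {
	for len(buf) > 0 {
		i := bytes.IndexByte(buf, sep)
		if i < 0 {
			fn(buf) // last term, no separator left
			return
		}
		fn(buf[:i])
		buf = buf[i+1:]
	}
}

func main() {
	// 0xff as the separator is an assumption made for this example.
	buf := []byte("beer\xffale\xffstout")
	visitTerms(buf, 0xff, func(term []byte) {
		fmt.Printf("%s\n", term)
	})
}
```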
A race & solution found by Marty Schoch... consider a case when the
merger might grab a nextSegmentID, like 4, but takes a while to
complete. Meanwhile, the persister grabs the nextSegmentID of 5, but
finishes its persistence work fast, and then loops to clean up any old
files. The simple approach of checking a "highest segment ID" of 5 is
wrong now, because the deleter now thinks that segment 4's zap file is
(incorrectly) ok to delete.
The solution in this commit is to track an ephemeral map of filenames
which are ineligibleForRemoval, because they're still being written
(by the merger) and haven't been fully incorporated into the rootBolt
yet.
The merger adds to that ineligibleForRemoval map as it starts a merged
zap file, the persister cleans up entries from that map when it
persists zap filenames into the rootBolt, and the deleter (part of the
persister's loop) consults the map before performing any actual zap
file deletions.
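A minimal sketch of that bookkeeping, with hypothetical names (fileTracker and
its methods are not the real Scorch fields or functions, and the file names
are made up):

```go
package main

import (
	"fmt"
	"sync"
)

// fileTracker is a hypothetical stand-in for the relevant Scorch state.
type fileTracker struct {
	mu         sync.Mutex
	ineligible map[string]struct{} // files still being written by the merger
}

func newFileTracker() *fileTracker {
	return &fileTracker{ineligible: make(map[string]struct{})}
}

// startWriting is what the merger would call before creating a zap file.
func (t *fileTracker) startWriting(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.ineligible[name] = struct{}{}
}

// persisted is what the persister would call once the file name has been
// recorded in the root bolt.
func (t *fileTracker) persisted(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.ineligible, name)
}

// canDelete is the deleter's check: a file may go only if the root bolt does
// not reference it and nobody is still writing it.
func (t *fileTracker) canDelete(name string, inRootBolt map[string]bool) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, busy := t.ineligible[name]; busy {
		return false
	}
	return !inRootBolt[name]
}

func main() {
	ft := newFileTracker()
	ft.startWriting("4.zap") // slow merge in progress
	rootBolt := map[string]bool{"5.zap": true}
	fmt.Println(ft.canDelete("4.zap", rootBolt)) // protected while in progress
	fmt.Println(ft.canDelete("3.zap", rootBolt)) // unreferenced and idle
}
```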
A new global variable, NumSnapshotsToKeep, represents the default
number of old snapshots that each scorch instance should maintain -- 0
is the default. Apps that need rollback'ability may want to increase
this value in early initialization.
The Scorch.eligibleForRemoval field tracks epochs which are safe to
delete from the rootBolt. The eligibleForRemoval is appended to
whenever the ref-count on an IndexSnapshot drops to 0.
On startup, eligibleForRemoval is also initialized with any older
epochs found in the rootBolt.
The newly introduced Scorch.removeOldSnapshots() method is called on
every cycle of the persisterLoop(), where it maintains the
eligibleForRemoval slice to under a size defined by the
NumSnapshotsToKeep.
A future commit will remove actual storage files in order to match the
"source of truth" information found in the rootBolt.
Instead of cloning an input bitmap, the roaring.Or(x, y)
implementation fills a brand new result bitmap, which should allow
for more efficient packing and memory utilization.
the implementation of the doc id search requires that the list
of ids be sorted. however, when doing a multisearch across
many indexes at once, the list of doc ids in the query is shared.
deeper in the implementation, the search of each shard attempts
to sort this list, resulting in a data race.
this is one example of a potentially larger problem, however
it has been decided to fix this data race, even though larger
issues of data ownership may remain unresolved.
this fix makes a copy of the list of doc ids, just prior to
sorting the list. subsequently, all use of the list is on the
copy that was made, not the original.
fixes #518
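The copy-before-sort fix for the shared doc id list amounts to something like
the sketch below (sortedCopy is a hypothetical helper name, not the actual
searcher code):

```go
package main

import (
	"fmt"
	"sort"
)

// sortedCopy returns a sorted copy of ids and leaves the caller's slice,
// which may be shared across concurrent searches, untouched.
func sortedCopy(ids []string) []string {
	out := make([]string, len(ids))
	copy(out, ids)
	sort.Strings(out)
	return out
}

func main() {
	shared := []string{"c", "a", "b"} // shared by every shard's search
	mine := sortedCopy(shared)
	fmt.Println(mine, shared)
}
```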
previously we parsed/returned large sections of the documents
back index row in order to compute facet information. this
would require parsing the protobuf of the entire back index row.
unfortunately this creates considerable garbage.
this new version introduces a visitor/callback approach to
working with data inside the back index row. the benefit
of this approach is that we can let the higher-level code
see values, prior to any copies of data being made or
intermediate garbage being created. implementations of
the callback must copy any value which they would like to
retain beyond the callback.
NOTE: this approach duplicates code from the
automatically generated protobuf code
NOTE: this approach assumes that the "field" field is serialized
before the "terms" field. This is guaranteed by our currently
generated protobuf encoder, and is recommended by the protobuf
spec. But, decoders SHOULD support them occurring in any order,
which we do not.
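The visitor/callback idea can be illustrated with a toy encoding. To be clear,
the record layout below (field byte, length byte, term bytes) is invented for
this example and is not the protobuf back index row format; visitRow is a
hypothetical name.

```go
package main

import (
	"errors"
	"fmt"
)

// visitRow walks a toy row of [field][termLen][term...] records and hands
// each term to fn without copying it. The term slice aliases row, so fn must
// copy anything it wants to retain after it returns.
func visitRow(row []byte, fn func(field uint8, term []byte)) error {
	for len(row) > 0 {
		if len(row) < 2 {
			return errors.New("truncated record header")
		}
		n := int(row[1])
		if len(row) < 2+n {
			return errors.New("truncated term")
		}
		fn(row[0], row[2:2+n])
		row = row[2+n:]
	}
	return nil
}

func main() {
	row := []byte{1, 1, 'a', 1, 2, 'b', 'c', 2, 1, 'd'}
	var kept []string
	err := visitRow(row, func(field uint8, term []byte) {
		if field == 1 {
			kept = append(kept, string(term)) // string() makes the copy
		}
	})
	fmt.Println(kept, err)
}
```

Only the values the caller actually keeps are copied; everything else is
inspected in place and produces no garbage.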
Previously term entries were encoded pairwise (field/term), so
you'd have data like:
F1/T1 F1/T2 F1/T3 F2/T4 F3/T5
As you can see, even though field 1 has 3 terms, we repeat the F1
part in the encoded data. This is a bit wasteful.
In the new format we encode it as a list of terms for each field:
F1/T1,T2,T3 F2/T4 F3/T5
When fields have multiple terms, this saves space. In unit
tests there is no additional waste even in the case that a field
has only a single value.
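The regrouping from pairwise entries to per-field term lists can be sketched
like this. The types and the function name are hypothetical, and the sketch
assumes the pairs arrive already ordered by field, as in the example above.

```go
package main

import "fmt"

// fieldTerm is one entry of the old pairwise encoding (hypothetical type).
type fieldTerm struct {
	field uint16
	term  string
}

// fieldTerms is one entry of the new encoding: a field with all its terms.
type fieldTerms struct {
	field uint16
	terms []string
}

// groupByField collapses consecutive pairs that share a field into a single
// entry, so the field is stored once instead of once per term.
func groupByField(pairs []fieldTerm) []fieldTerms {
	var out []fieldTerms
	for _, p := range pairs {
		if n := len(out); n > 0 && out[n-1].field == p.field {
			out[n-1].terms = append(out[n-1].terms, p.term)
			continue
		}
		out = append(out, fieldTerms{field: p.field, terms: []string{p.term}})
	}
	return out
}

func main() {
	pairs := []fieldTerm{{1, "T1"}, {1, "T2"}, {1, "T3"}, {2, "T4"}, {3, "T5"}}
	for _, ft := range groupByField(pairs) {
		fmt.Printf("F%d/%v\n", ft.field, ft.terms)
	}
}
```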
Here are the results of an indexing test case (beer-search):
$ benchcmp indexing-before.txt indexing-after.txt
benchmark               old ns/op       new ns/op       delta
BenchmarkIndexing-4     11275835988     10745514321     -4.70%
benchmark               old allocs      new allocs      delta
BenchmarkIndexing-4     25230685        22480494        -10.90%
benchmark               old bytes       new bytes       delta
BenchmarkIndexing-4     4802816224      4741641856      -1.27%
And here are the results of a MatchAll search building a facet
on the "abv" field:
$ benchcmp facet-before.txt facet-after.txt
benchmark             old ns/op     new ns/op     delta
BenchmarkFacets-4     439762100     228064575     -48.14%
benchmark             old allocs    new allocs    delta
BenchmarkFacets-4     9460208       3723286       -60.64%
benchmark             old bytes     new bytes     delta
BenchmarkFacets-4     260784261     151746483     -41.81%
Although we expect the index to be smaller in many cases, the
beer-search index is about the same size. However,
this may be due to the underlying storage (boltdb).
Finally, the index version was bumped from 5 to 7, since smolder
also used version 6, which could lead to some confusion.
This change adds methods that provide access to the actual, underlying
mossStore instance in the bleve/index/store/moss KVStore adaptor.
This enables applications to utilize advanced, mossStore-specific
features (such as partial rollback of indexes). See also
https://issues.couchbase.com/browse/MB-17805
In this commit, I saw that there was a simple incrementBytes()
implementation elsewhere in bleve that seemed simpler than using the
big int package.
Edge case note: if the input bytes would overflow in incrementBytes(),
such as with an input of [0xff 0xff 0xff], it returns nil. moss then
treats a nil endKeyExclusive iterator param as a logical
"higher-than-topmost" key, which produces the prefix iteration
behavior that we want for this edge situation.
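A sketch of such an increment, for illustration only. This variant also trims
the bytes that wrapped around to zero after a carry, so that the result stays
a tight exclusive upper bound for the prefix; that trimming is an assumption
of this sketch and not necessarily what the bleve implementation does.

```go
package main

import "fmt"

// incrementBytes returns the smallest key that is greater than every key
// starting with the given prefix, or nil if no such key exists because all
// bytes are 0xff (or the prefix is empty). The input is not modified.
func incrementBytes(in []byte) []byte {
	rv := make([]byte, len(in))
	copy(rv, in)
	for i := len(rv) - 1; i >= 0; i-- {
		rv[i]++
		if rv[i] != 0 {
			// No overflow at this position; drop any zeroed tail.
			return rv[:i+1]
		}
	}
	return nil // overflowed: callers treat nil as "higher than topmost"
}

func main() {
	fmt.Printf("%x\n", incrementBytes([]byte{0x61, 0x62}))
	fmt.Printf("%x\n", incrementBytes([]byte{0x01, 0xff}))
	fmt.Println(incrementBytes([]byte{0xff, 0xff, 0xff}) == nil)
}
```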
Previously bleve allowed you to create a memory-only index by
simply passing "" as the path argument to the New() method.
This was not clear when reading the code, and led to some
problematic error cases as well.
Now, to create a memory-only index one should use the
NewMemOnly() method. Passing "" as the path argument
to the New() method will now return os.ErrInvalid.
Advanced users calling NewUsing() can create disk-based or
memory-only indexes, but the change here is that passing ""
as the path argument no longer defaults you into getting
a memory-only index. Instead, the KV store is selected
manually, just as it is for the disk-based solutions.
Here is an example use of the NewUsing() method to create
a memory-only index:
NewUsing("", indexMapping, Config.DefaultIndexType,
Config.DefaultMemKVStore, nil)
Config.DefaultMemKVStore is just a new default value
added to the configuration; it currently points to
gtreap.Name (which could have been used directly
instead for more control).
closes #427
On a dev laptop, bleve-query benchmark on wiki dataset using
query-string of "+text:afternoon +text:coffee" previously had
throughput of 1222qps, and with this change hits 1940qps.
This change to upside_down term-field-reader no longer moves the
underlying iterator forward preemptively. Instead, it will invoke
Next() on the underlying iterator only when the caller invokes the
term-field-reader's Next().
There's a special case to handle the situation on the first Next()
invocation after the term-field-reader is created.
This commit modifies the upside_down TermFrequencyRow parseKDoc() to
skip the ByteSeparator (0xFF) scan, as we already know the term's
length in the UpsideDownCouchTermFieldReader.
On my dev box, results from bleve-query test on high frequency terms
went from previous 107qps to 124qps.
The DumpXXX() methods were always documented as internal and
unsupported. However, now they are being removed from the
public top-level API. They are still available on the internal
IndexReader, which can be accessed using the Advanced() method.
The DocCount() and DumpXXX() methods on the internal index
have moved to the internal index reader, since they logically
operate on a snapshot of an index.
the test had incorrectly been updated to compare the internal
document ids, but these are opaque and may not be the expected
ids in some cases. the test should simply check that they
correspond to the correct external ids
This change depends on the recently introduced mossStore Stats() API
in github.com/couchbase/moss 564bdbc0 commit. So, gvt for moss has
been updated as part of this change.
Most of the change involves propagating the mossStore instance (the
statsFunc callback) so that it's accessible to the KVStore.Stats()
method.
See also: http://review.couchbase.org/#/c/67524/
previously from JSON we would just deserialize strings like
"-abv" or "city" or "_id" or "_score" as simple sorts
on fields, ids or scores respectively
while this is simple and compact, it can be ambiguous (for
example if you have a field starting with - or if you have a field
named "_id" already). also, this simple syntax doesn't allow us
to specify more complex options to deal with type/mode/missing
we keep support for the simple string syntax, but now also
recognize a more expressive syntax like:
    {
      "by": "field",
      "field": "abv",
      "desc": true,
      "type": "string",
      "mode": "min",
      "missing": "first"
    }
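accepting both syntaxes could look roughly like the sketch below. the
sortSpec type and parseSort function are hypothetical, the "id" and "score"
values for "by" are assumptions made for the example, and the defaults are
the ones stated in this message ("auto", "default", "last").

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// sortSpec is a hypothetical, simplified representation of one sort entry.
type sortSpec struct {
	By      string `json:"by"`
	Field   string `json:"field"`
	Desc    bool   `json:"desc"`
	Type    string `json:"type"`
	Mode    string `json:"mode"`
	Missing string `json:"missing"`
}

// parseSort accepts either the simple string syntax ("-abv", "_id",
// "_score") or the more expressive object syntax.
func parseSort(raw []byte) (sortSpec, error) {
	spec := sortSpec{Type: "auto", Mode: "default", Missing: "last"}

	var s string
	if err := json.Unmarshal(raw, &s); err == nil {
		if strings.HasPrefix(s, "-") {
			spec.Desc = true
			s = s[1:]
		}
		switch s {
		case "_id":
			spec.By = "id" // assumed name
		case "_score":
			spec.By = "score" // assumed name
		default:
			spec.By = "field"
			spec.Field = s
		}
		return spec, nil
	}

	// Object syntax: keys absent from the JSON keep their defaults.
	if err := json.Unmarshal(raw, &spec); err != nil {
		return sortSpec{}, err
	}
	return spec, nil
}

func main() {
	for _, in := range []string{`"-abv"`, `{"by": "field", "field": "abv", "missing": "first"}`} {
		spec, err := parseSort([]byte(in))
		fmt.Printf("%+v %v\n", spec, err)
	}
}
```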
type, mode and missing are optional and default to
"auto", "default", and "last" respectively
in a recent commit, we changed the code to reuse
TermFrequencyRow objects instead of constantly allocating new
ones. unfortunately, one of the original methods was not coded
with this reuse in mind, and a lazy initialization caused us to
leak data from previous uses of the same object.
in particular this caused term vector information from previous
hits to still be applied to subsequent hits. eventually this
causes the highlighter to try and highlight invalid regions
of a slice.
fixes #404
previous attempt was flawed (but masked by Reset() method)
new approach is to do this work in the Reset() method itself,
logically this is where it belongs.
but further we acknowledge that IndexInternalID []byte lifetime
lives beyond the TermFieldDoc, so another copy is made into
the DocumentMatch. Although this introduces yet another copy,
the theory being tested is that it allows each of these
structures to reuse memory without additional allocation.
when the term field reader is copying ID values out of the
kv store's iterator, it is already attempting to reuse the
term frequency row data structure. this change allows us
to also attempt to reuse the []byte allocated for previous
copies of the docid. we reset the slice length to zero
then copy the data into the existing slice, avoiding
new allocation and garbage collection in the cases where
there is already enough space
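the slice reuse described here is the usual Go idiom of truncating to zero
length and appending. a minimal sketch (copyInto is a hypothetical helper
name, not the actual reader code):

```go
package main

import "fmt"

// copyInto copies src into dst's existing backing array when it has enough
// capacity, and only allocates when it does not.
func copyInto(dst, src []byte) []byte {
	return append(dst[:0], src...)
}

func main() {
	var id []byte // reused across iterations
	for _, key := range []string{"doc-0001", "doc-2", "doc-03"} {
		id = copyInto(id, []byte(key))
		fmt.Printf("%s len=%d cap=%d\n", id, len(id), cap(id))
	}
}
```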
IndexInternalID is now []byte
this is still opaque, and should still work for any future
index implementations as it is a least common denominator
choice, all implementations must internally represent the
id as []byte at some point for storage to disk
index id's are now opaque (until finally returned to top-level user)
- the TermFieldDoc's returned by TermFieldReader no longer contain doc id
- instead they return an opaque IndexInternalID
- items returned are still in the "natural index order"
- but that is no longer guaranteed to be "doc id order"
- correct behavior requires that they all follow the same order
- but not any particular order
- new API FinalizeDocID which converts index internal ID's to public string ID
- APIs used internally which previously took doc id now take IndexInternalID
- that is DocumentFieldTerms() and DocumentFieldTermsForFields()
- however, APIs that are used externally do not reflect this change
- that is Document()
- DocumentIDReader follows the same changes, but this is less obvious
- behavior clarified, used to iterate doc ids, BUT NOT in doc id order
- method STILL available to iterate doc ids in range
- but again, you won't get them in any meaningful order
- new method to iterate actual doc ids from list of possible ids
- this was introduced to make the DocIDSearcher continue working
searchers now work with the new opaque index internal doc ids
- they return new DocumentMatchInternal (which does not have string ID)
scorers also work with these opaque index internal doc ids
- they return DocumentMatchInternal (which does not have string ID)
collectors now also perform a final step of converting the final result
- they STILL return traditional DocumentMatch (with string ID)
- but they now also require an IndexReader (so that they can do the conversion)
at the time you create the term field reader, you can specify
that you don't need the term freq, the norm, or the term vectors
in that case, the index implementation can choose to not return
them in its subsequently returned values
this is advisory only, some simple implementations may ignore this
and continue to return the values anyway (as the current impl of
upside_down does today)
this change will allow future index implementations the
opportunity to do less work when it isn't required
The UpsideDownCouchTermFieldReader.Next() only needs the doc ID from
the key, so this change provides a specialized parseKDoc() method for
that optimization.
Additionally, fields in various structs are more 64-bit aligned, in an
attempt to reduce the invocations of runtime.typedmemmove() and
runtime.heapBitsBulkBarrier(), which the go compiler seems to
automatically insert to transparently handle misaligned data.
Previously, the PrefixIterator() for moss was implemented by comparing
the prefix bytes on every Next().
With this optimization, the next larger endKeyExclusive is computed at
the iterator's initialization, which allows us to avoid all those
prefix comparisons.
This optimization changes the index.TermFieldReader.Next() interface
API, adding an optional, pre-allocated *TermFieldDoc parameter, which
can help prevent garbage creation.
Before this change, upside down's reader would alloc a new
TermFrequencyRow on every Next(), which would be immediately
transformed into an index.TermFieldDoc{}. This change reuses a
pre-allocated TermFrequencyRow that's a field in the reader.
From some bleve-query perf profiling, term field vectors appeared to
be alloc'ed, which was unnecessary as term field vectors are disabled
in the bleve-blast/bleve-query tests.
Currently a bleve batch is built by a user goroutine
and then read by a bleve goroutine
This is still safe when used correctly
However, Reset() will modify the map, which is now a data race
This fix is to simply make batch.Reset() alloc new maps.
This provides a data-access pattern that can be used safely.
Also, this thread argues that creating a new map may be faster
than trying to reuse an existing one:
https://groups.google.com/d/msg/golang-nuts/UvUm3LA1u8g/jGv_FobNpN0J
Separate but related, I have opted to remove the "unsafe batch"
checking that we did. This was always limited anyway, and now
users of Go 1.6 are just as likely to get a panic from the
runtime for concurrent map access anyway. So, the price paid
by us (additional mutex) is not worth it.
fixes #360 and #260
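A minimal sketch of the Reset() change, using a hypothetical stand-in for the
batch type (the field names are assumptions):

```go
package main

import "fmt"

// batch is a hypothetical stand-in for the bleve batch type.
type batch struct {
	indexOps    map[string]string
	internalOps map[string][]byte
}

func newBatch() *batch {
	b := &batch{}
	b.Reset()
	return b
}

// Reset allocates fresh maps instead of clearing the old ones in place, so a
// goroutine still reading the previous maps never observes a concurrent
// modification.
func (b *batch) Reset() {
	b.indexOps = make(map[string]string)
	b.internalOps = make(map[string][]byte)
}

func main() {
	b := newBatch()
	b.indexOps["doc1"] = "body"
	inFlight := b.indexOps // what the indexing goroutine is still reading
	b.Reset()
	fmt.Println(len(inFlight), len(b.indexOps))
}
```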
several data structures had a pointer at the start of the struct
on some 32-bit systems, this causes the remaining fields to no longer
be aligned on 64-bit boundaries
the fix identified by @pmezard is to put the counters first in the
struct, which guarantees correct alignment
fixes #359
The moss RegistryCollectionOptions allows applications to register
moss-related callback API functions and other advanced feature usage
at process initialization time.
For example, this could be used for moss's OnError(), OnEvent() and
logging callback options.