+ Account for all the overhead incurred from the data structures
within mem.Segment and zap.Segment.
- SizeOfMap = 8
- SizeOfPointer = 8
- SizeOfSlice = 24
- SizeOfString = 16
+ Include overhead from certain new fields as well.
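For reference, a minimal sketch of how such overhead constants look in
Go, assuming a 64-bit platform (names mirror the values listed above):

const (
    SizeOfMap     = 8  // a map value is a pointer to the runtime map header
    SizeOfPointer = 8
    SizeOfSlice   = 24 // pointer + len + cap, 8 bytes each
    SizeOfString  = 16 // pointer + len, 8 bytes each
)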
This change preallocates more of the backing arrays for the Locfields,
Locstarts, Locends, Locpos, Locarraypos sub-slices of a scorch mem
segment.
On small bleve-blast tests (50K wiki docs) on a dev macbook, scorch
indexing throughput seems to improve from 15MB/sec to 20MB/sec after
the recent series of preallocation changes.
The scorch mem segment build phase uses the append() idiom to populate
various slices that are keyed by postings list ids. These slices
include...
* Postings
* PostingsLocs
* Freqs
* Norms
* Locfields
* Locstarts
* Locends
* Locpos
* Locarraypos
This change introduces an initialization step that assigns postings
list ids to terms up-front and preallocates those slices, as sketched
below.
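A hedged sketch of that initialization step; the identifiers below are
illustrative of the pattern, not necessarily the exact scorch names
(roaring is github.com/RoaringBitmap/roaring):

// assign a postings list id to each term up-front
numPostingsLists := 0
for term := range dict {
    numPostingsLists++
    dict[term] = uint64(numPostingsLists)
}

// preallocate the id-keyed slices once, instead of growing
// them via append() during processDocument()
s.Postings = make([]*roaring.Bitmap, numPostingsLists)
s.PostingsLocs = make([]*roaring.Bitmap, numPostingsLists)
s.Freqs = make([][]uint64, numPostingsLists)
s.Norms = make([][]float32, numPostingsLists)
s.Locfields = make([][]uint16, numPostingsLists)
s.Locstarts = make([][]uint64, numPostingsLists)
s.Locends = make([][]uint64, numPostingsLists)
s.Locpos = make([][]uint64, numPostingsLists)
s.Locarraypos = make([][][]uint64, numPostingsLists)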
This change also simplifies the processDocument() logic, which no
longer has to worry about a first-time initialization case, removing
some duplicated code.
The first time through, startNumFields should be 0; there ought to be
room for further optimization under the assumption that later docs
have fields similar to the first doc's.
- VisitableDocValueFields API for the persisted DV field list
- making dv configs overridable at the field level
- enabling on-the-fly/runtime un-inverting of doc values
- a few unit test updates
The previous approach used the SetEventCallback method, which allowed
you to change the callback; unfortunately, that also included times
after the goroutines were started and potentially firing the callback.
Checking a lock for this would be too expensive, so instead we go for
an approach that allows callbacks to be registered by name during
process init(); then, upon opening an index, a string config key
'eventCallbackName' is used to look up the appropriate callback
function. Also, since this string config name is serializable, it fits
into the existing bleve index metadata without any new issues.
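A hedged sketch of the registration pattern; the exact exported names
are assumptions here:

var eventCallbacks = map[string]func(Event){}

// RegisterEventCallback is meant to be called during process init(),
// before any index (and its goroutines) is opened.
func RegisterEventCallback(name string, fn func(Event)) {
    eventCallbacks[name] = fn
}

// at index-open time, the serializable string config selects it
if name, ok := config["eventCallbackName"].(string); ok {
    s.onEvent = eventCallbacks[name]
}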
The zap command-line tool has been added to the main bleve
command-line tool. This required physical relocation due to the
vendoring used only on the bleve command-line tool (an unforeseen
limitation). A new scorch command-line tool has also been introduced,
and for the same reasons it is physically stored under the top-level
bleve command-line tool as well.
New APIs:
+ RollbackPoints()
- Retrieves the available list of rollback points: epoch+meta.
- The application will need to check with the meta to decide
on the rollback point.
+ Rollback()
- API requires a rollback point identified by the first API.
- Atomically & Durably rolls back the index to specified point,
provided the specified rollback point is still available.
+ Unit test: TestIndexRollback
- Writes a batch.
- Sets the rollback point.
- Writes a second batch.
- Rolls back to the previously decided point.
- Ensures that the data is as it was before the second batch.
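A hedged usage sketch of the two APIs; wantThisPoint() is a
hypothetical application-side check of a rollback point's meta:

rollbackPoints, err := s.RollbackPoints() // epoch+meta per point
if err != nil {
    return err
}
for _, rp := range rollbackPoints {
    if wantThisPoint(rp) { // application inspects rp's meta
        // atomically & durably roll back, provided the
        // point is still available
        return s.Rollback(rp)
    }
}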
Inserted into the list of dv-enabled fields - DocValueFields in the
mem segment. Moved back to the original type
`DocValueFields map[uint16]bool` for easy lookup to check whether a
fieldID is configured for dv storage.
Observed problem:
Persisted index state (in root bolt) would contain index snapshots which
pointed to index files that did not exist.
Debugging this uncovered two main problems:
1. At the end of persisting a snapshot, the persister creates a new index
snapshot with the SAME epoch as the current root, only it replaces in-memory
segments with the new disk based ones. This is problematic because reference
counting an index segment triggers "eligible for deletion". And eligible for
deletion is keyed by epoch. So having two separate instances going by the same
epoch is problematic. Specifically, one of them gets to 0 before the other,
and we wrongly conclude it's eligible for deletion, when in fact the "other"
instance with same epoch is actually still in use.
To address this problem, we have modified the behavior of the
persister. Now, upon completion of persistence, ONLY if new files were
actually created do we proceed to introduce a new snapshot. AND, this
new snapshot now gets its own brand new epoch. BOTH of these are
important: since the persister now also introduces a new epoch, it
will see this epoch again in the future AND be expected to persist it.
That is OK (mostly harmless), but we cannot allow it to form a loop.
Checking that new files were actually introduced is what
short-circuits the potential loop. The new epoch introduced by the
persister, if seen again, will not have any new segments that actually
need persisting to disk, and the cycle is stopped.
2. The implementation of NumSnapshotsToKeep, and related code to
delete old snapshots from the root bolt, also contains problems.
Specifically, the determination of which snapshots to keep vs delete
did not consider which ones were actually persisted. So, let's say you
had set NumSnapshotsToKeep to 3: if the introducer gets 3 snapshots
ahead of the persister, what can happen is that the three snapshots we
choose to keep are all in memory. We now wrongly delete all of the
snapshots from the root bolt. But it gets worse: at this instant, we
have files on disk that nothing in the root bolt points to, so we also
go ahead and delete those files. Those files were still being
referenced by the in-memory snapshots. But now, even if they get
persisted to disk, they simply have references to non-existent files.
Opening up one of these indexes results in lost data (often
everything).
To address this problem, we made a large change to the way this
section of code operates (see the sketch below). First, we now start
with a list of all epochs actually persisted in the root bolt. Second,
we set aside NumSnapshotsToKeep of these snapshots to keep. Third,
anything else in the eligibleForRemoval list will be deleted. I
suspect this code is slower and less elegant, but I think it is more
correct. Also, previously NumSnapshotsToKeep defaulted to 0; I have
now defaulted it to 1, which feels like saner out-of-the-box behavior
(it's debatable whether the original intent was perhaps for "extra"
snapshots to keep, but with the variable named as it is, 1 makes more
sense to me).
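A minimal sketch of the new determination, assuming persistedEpochs
was read from the root bolt and is ordered newest first:

keep := map[uint64]bool{}
for i, epoch := range persistedEpochs {
    if i < NumSnapshotsToKeep {
        keep[epoch] = true // set aside the newest persisted snapshots
    }
}
for _, epoch := range s.eligibleForRemoval {
    if !keep[epoch] {
        toDelete = append(toDelete, epoch) // everything else goes
    }
}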
Other minor changes included in this change:
- The 'nextSnapshotEpoch', 'eligibleForRemoval', and
'ineligibleForRemoval' members of the Scorch struct were moved into
the paragraph with 'rootLock' to clarify that you must hold the lock
to access them.
- TestBatchRaceBug260 was updated to properly Close() the index; not
doing so leads to occasional test failures.
+ Track memory usage at a segment level
+ Add a new scorch API: MemoryUsed()
- Aggregate the memory consumption across
segments when API is invoked.
+ TODO:
- Revisit whether the second iteration can be gotten rid of, with
the size instead accounted for during the first pass while
building an in-mem segment.
- Accounting for pointer and slice overhead.
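A hedged sketch of the aggregation behind MemoryUsed(); the
SizeInBytes() accessor on a segment is an assumption of this sketch:

func (s *Scorch) MemoryUsed() uint64 {
    indexSnapshot := s.currentSnapshot()
    defer indexSnapshot.DecRef()

    var memUsed uint64
    for _, segmentSnapshot := range indexSnapshot.segment {
        memUsed += segmentSnapshot.segment.SizeInBytes()
    }
    return memUsed
}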
+ Adding new entries to the stats struct of scorch.
+ These stats are atomically incremented upon every segment
introduction, and upon successful persistence.
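A minimal sketch of the update sites, using sync/atomic and
illustrative counter names:

atomic.AddUint64(&s.stats.TotIntroducedSegments, 1) // on segment introduction
atomic.AddUint64(&s.stats.TotPersistedSegments, 1)  // on successful persistence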
docValues are persisted along with the index, in a columnar fashion
per field, with variable-sized chunking for quick lookup.
- naive chunk-level caching is added per field
- the data part inside a chunk is snappy compressed
- the metaHeader inside the chunk indexes the dv values inside the
uncompressed data part
- all the fields are docValue persisted in this iteration
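A hedged sketch of one chunk's shape as described above; names and
field widths are illustrative, not the exact zap encoding:

type chunkMeta struct { // one metaHeader entry per doc in the chunk
    DocNum   uint64
    DocDvLoc uint64 // offset of this doc's dv bytes in the data part
    DocDvLen uint64
}

type chunk struct {
    meta []chunkMeta // indexes into the uncompressed data part
    data []byte      // snappy-compressed dv values
}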
Added an option to the kvconfig JSON, called "unsafe_batch" (bool).
The default is false, so Batch() calls are synchronously persisted by
default. Advanced users may want unsafe, asynchronous persistence to
trade off safety for performance (mutations are queryable sooner).
{
"index_type": "scorch",
"kvconfig": { "unsafe_batch": true }
}
This change replaces the previous kvstore=="moss" workaround.
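A hedged usage sketch via bleve's NewUsing(), passing the same config
as the JSON above:

kvconfig := map[string]interface{}{"unsafe_batch": true}
idx, err := bleve.NewUsing("example.bleve", bleve.NewIndexMapping(),
    "scorch", "scorch", kvconfig)
if err != nil {
    log.Fatal(err)
}
defer idx.Close()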
With more pprof focusing (zooming in on a particular func), there were
still some memory allocations showing up with docNumberToBytes() in
micro benchmarks of bleve-query. On a dev macbook, on an index of 50K
wikipedia docs, using a search of the relatively common "text:date"...
400 qps - upsidedown/moss
680 qps - scorch before
775 qps - scorch after
With this change, there are no more memory allocations in the calls to
PostingsIterator.Next() in the micro benchmarks of bleve-query. On a
dev macbook, on an index of 50K wikipedia docs, using a high-frequency
search of "text:date"...
400 qps - upsidedown/moss
565 qps - scorch before
680 qps - scorch after
With the previous commit, there can be a scenario where batches that
had internal-updates-only are rapidly introduced by the app, but the
persisted notifications would be fired only on the very last
IndexSnapshot. The persisted notifications on the in-between batches
might be missed.
The solution was to track the persisted notification channels at a
higher Scorch struct level, instead of tracking the persisted channels
at the IndexSnapshot and SegmentSnapshot levels.
Also, the persister double-check looping was simplified, which avoids
a race where an introducer might incorrectly not notify the persister.
This commit improves handling when an incoming batch has internal-data
updates only and no doc updates. In this case, a nil segment instead
of an empty segment instance is used in the segmentIntroduction. The
segmentIntroduction, that is, might now hold only internal-data
updates.
To handle synchronous persistence, a new field that's a slice of
persisted notification channels is added to the IndexSnapshot struct,
which the persister goroutine will close as each IndexSnapshot is
persisted.
Also, as part of this change, instead of checking the unsafeBatch flag
in several places, we now check for the non-nil'ness of these
persisted chans.
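A hedged sketch of the notification flow; the field name follows this
commit's description:

type IndexSnapshot struct {
    // ... existing fields ...
    persisted []chan error // synchronous-persistence waiters
}

// persister goroutine, after the snapshot is durably committed:
for _, ch := range snapshot.persisted {
    close(ch) // wakes any Batch() caller waiting on persistence
}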
Previously, CalcBudget() was treating
MergePlanOptions.SegmentsPerMergeTask as the growth factor while
computing the idealized staircase of segments.
This change introduces a TierGrowth option to MergePlanOptions for
more control and so that SegmentsPerMergeTask can be tweaked
independently of the tier growth factor.
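A hedged sketch of the resulting options struct; only the two fields
discussed here are shown:

type MergePlanOptions struct {
    // ...
    TierGrowth           float64 // growth factor for CalcBudget's idealized staircase
    SegmentsPerMergeTask int     // now tweakable independently of TierGrowth
}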
On a couple of micro benchmarks on a dev macbook using bleve-query on
an index of 50K wikipedia docs, scorch is now faster than
upsidedown/moss on high-freq term search "text:date"...
400 qps - upsidedown/moss
404 qps - scorch before
565 qps - scorch after
On a couple of micro benchmarks on a dev macbook using bleve-query on
an index of 50K wikipedia docs, scorch is now more in the same
neighborhood as upsidedown/moss...
high-freq term search "text:date"...
400 qps - upsidedown/moss
360 qps - scorch before
404 qps - scorch after
zero-freq term search "text:mschoch"...
100K qps - upsidedown/moss
55K qps - scorch before
99K qps - scorch after
Of note, the scorch index had ~150 *.zap files in it, which likely
made the worker goroutine overhead more costly than for a case with
fewer segments; goroutine- and channel-related work appeared
relatively prominently in the pprof SVGs.
The cachedDocs preparation has to happen for all docs in the field,
not just on the currently requested docNum.
Also, as part of this commit, there's a loop optimization where we no
longer use bytes.Split() on the terms buffer, thus avoiding garbage
creation.
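A hedged sketch of that optimization; termSeparator and visit() are
illustrative stand-ins:

// walk the terms buffer in place with bytes.IndexByte, instead of
// materializing bytes.Split()'s result slice
for start := 0; start < len(terms); {
    i := bytes.IndexByte(terms[start:], termSeparator)
    if i < 0 {
        break
    }
    visit(terms[start : start+i]) // sub-slice, no allocation
    start += i + 1
}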
A race & solution found by Marty Schoch... consider a case when the
merger might grab a nextSegmentID, like 4, but takes awhile to
complete. Meanwhile, the persister grabs the nextSegmentID of 5, but
finishes its persistence work fast, and then loops to cleanup any old
files. The simple approach of checking a "highest segment ID" of 5 is
wrong now, because the deleter now thinks that segment 4's zap file is
(incorrectly) ok to delete.
The solution in this commit is to track an ephemeral map of filenames
which are ineligibleForRemoval, because they're still being written
(by the merger) and haven't been fully incorporated into the rootBolt
yet.
The merger adds to that ineligibleForRemoval map as it starts a merged
zap file, the persister cleans up entries from that map when it
persists zap filenames into the rootBolt, and the deleter (part of the
persister's loop) consults the map before performing any actual zap
file deletions.
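A hedged sketch of the three participants; the method names are
illustrative of the pattern described above:

// merger, as it starts writing a new merged zap file:
s.markIneligibleForRemoval(filename)

// persister, once that filename is recorded in the rootBolt:
s.unmarkIneligibleForRemoval(filename)

// deleter, before removing any zap file:
if s.ineligibleForRemoval[filename] {
    continue // still being written, or not yet in the rootBolt
}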
A new global variable, NumSnapshotsToKeep, represents the default
number of old snapshots that each scorch instance should maintain -- 0
is the default. Apps that need rollback'ability may want to increase
this value in early initialization.
The Scorch.eligibleForRemoval field tracks epochs which are safe to
delete from the rootBolt. The eligibleForRemoval is appended to
whenever the ref-count on an IndexSnapshot drops to 0.
On startup, eligibleForRemoval is also initialized with any older
epochs found in the rootBolt.
The newly introduced Scorch.removeOldSnapshots() method is called on
every cycle of the persisterLoop(), where it keeps the
eligibleForRemoval slice under a size defined by NumSnapshotsToKeep.
A future commit will remove actual storage files in order to match the
"source of truth" information found in the rootBolt.
Instead of cloning an input bitmap, the roaring.Or(x, y)
implementation fills a brand new result bitmap, which should allow for
more efficient packing and memory utilization.
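A hedged before/after sketch with the roaring package:

// before: clone an input bitmap, then OR the other into it
res := x.Clone()
res.Or(y)

// after: fill a brand new, tightly packed result bitmap
res = roaring.Or(x, y)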