The slow merger was lagging behind the fast persister
to a persister notify send-loop while the persister awaits
for any new introductions from introducer totally blocking
the merger
This fix along with the deleted files eligibilty flipping
makes the file count to around 6 to 11 files per shard
for both travel and beer samples
This change turns zap.MergeToWriter() into a public func, so that it's
now directly callable from outside packages (such as from scorch's
top-level merger or persister). And, MergerToWriter() now takes input
of SegmentBases instead of Segments, so that it can now work on either
in-memory zap segments or file-based zap segments.
This is yet another stepping stone towards in-memory merging of zap
segments.
This change optimizes the scorch/mem DictionaryIterator by reusing a
DictEntry struct across multiple Next() calls. This follows the same
optimization trick and Next() semantics as upsidedown's FieldDict
implementation.
Adjusting the merge task creation loop to accommodate
the newly merged segments so that the eventual merge
results/ number of segments stay within the calculated budget.
The TestIndexRollback unit test was failing more often than ever
(perhaps raciness?), so this commit tries to remove avenues of
raciness in the test...
- The Scorch.Open() method is refactored into an Scorch.openBolt()
helper method in order to allow unit tests to control which
background goroutines are started.
- TestIndexRollback() doesn't start the merger goroutine, to simulate
a really slow merger that never gets around to merging old segments.
- TestIndexRollback() creates a long-lived reader after the first
batch, so that the first index snapshot isn't removed due to the
long-lived reader's ref-count.
- TestIndexRollback() temporarily bumps NumSnapshotsToKeep to a large
number so the persister isn't tempted to removeOldData() that we're
trying to rollback to.
The zap DictionaryIterator Next() was incorrectly returning the
postingsList offset as the term count. As part of this, refactored
out a PostingsList.read() helper method.
Also added more merge unit test scenarios, including merging a segment
for a few rounds to see if there are differences before/after merging.
Instead of sorting docNum keys from a hashmap, this change instead
iterates from docNum 0 to N and uses an array instead of hashmap.
The array is also reused across outer loop iterations.
This optimizes for when there's a lot of structural similarity between
docs, where many/most docs have the same fields. i.e., beers,
breweries. If every doc has completely different fields, then this
change might produce worse behavior compared to the previous sparse
hashmap approach.
Instead of allocating a govarint.U64Base128Encoder in the inner loop,
allocate it just once on the outside, as it appears that it's just a
thin wrapper around binary.PutUvarint().
The zap SegmentBase struct is a refactoring of the zap Segment into
the subset of fields that are needed for read-only ops, without any
persistence related info. This allows us to use zap's optimized data
encoding as scorch's in-memory segments.
The zap Segment struct now embeds a zap SegmentBase struct, and layers
on persistence. Both the zap Segment and zap SegmentBase implement
scorch's Segment interface.
+ Account for all the overhead incurred from the data structures
within mem.Segment and zap.Segment.
- SizeOfMap = 8
- SizeOfPointer = 8
- SizeOfSlice = 24
- SizeOfString = 16
+ Include overhead from certain new fields as well.
This change preallocates more of the backing arrays for Locfields,
Locstarts, Locends, Locpos, Locaaraypos sub-slices of a scorch mem
segment.
On small bleve-blast tests (50K wiki docs) on a dev macbook, scorch
indexing throughput seems to improve from 15MB/sec to 20MB/sec after
the recent series of preallocation changes.