As part of this, zap.MergeToWriter() now returns more information --
enough so that callers can now create their own SegmentBase instances.
Also, the fieldsMap maintained and returned by zap.MergeToWriter() is
now a mapping from fieldName ==> fieldID+1 (instead of the previous
mapping from fieldName ==> fieldID). This makes it similar to how
fieldsMap are handled in other parts of zap to avoid "zero value"
issues.
The theory with this change is that the dicts and itrs should be
positionally in "lock-step" with paired entries.
And, since later code also uses the same array indexing to access the
drops and newDocNums, those also need to be positionally in pair-wise
lock-step, too.
Unlike vellum's MergeIterator, the enumerator introduced in this
commit doesn't merge when there are matching keys across iterators.
Instead, the enumerator implementation provides a traversal of all the
tuples of (key, iteratorIndex, val) from the underlying vellum
iterators, ordered by key ASC, iteratorIndex ASC.
During zap segment merging, a new zap PostingsIterator was allocated
for every field X segment X term.
This change optimizes by reusing a single PostingsIterator instance
per persistMergedRest() invocation.
And, also unused fields are removed from the PostingsIterator.
Two cases in this commit...
If we're shutting down, the merger might not have handed off its
latest merged segment to the introducer yet, so the merger still owns
the segment and needs to Close() that segment itself.
In persistSnapshot(), there migth be cases where the persister might
not be able to swap in its newly persisted segments -- so, the
persistSnapshot() needs to Close() those segments itself.
Some codepaths in persistSnapshot() were saving errors into an err2
local variable, which might lead incorrectly to commit during an error
situation rather than abort.
When a scorch is just opened and is "empty", RollbackPoints() no
longer considers that an error situation.
Also, this commit makes the TestIndexRollback unit tests is a bit more
forgiving to races, as we were seeing failures sometimes in travis-CI
environments (TestIndexRollback was passing fine on my dev macbook).
The theory is the double-looping in the persisterLoop would sometimes
be racy, leading to 1 or 2 rollback points.
The optimization to byte-copy all the storedDocs for a given segment
during merging kicks in when the fields are the same across all
segments and when there are no deletions for that given segment. This
can happen, for example, during data loading or insert-only scenarios.
As part of this commit, the Segment.copyStoredDocs() method was added,
which uses a single Write() call to copy all the stored docs bytes of
a segment to a writer in one shot.
And, getDocStoredMetaAndCompressed() was refactored into a related
helper function, getDocStoredOffsets(), which provides the storedDocs
metadata (offsets & lengths) for a doc.
COMPATIBILITY NOTE: scorch zap version bumped in this commit.
The version bump is because mergeFields() now computes whether fields
are the same across segments and it relies on the previous commit
where fieldID's are assigned in field name sorted order (albeit with
_id field always having fieldID of 0).
Potential future commits might rely on this info that "fields are the
same across segments" for more optimizations, etc.
This is a stepping stone to allow easier future comparisons of field
maps and potential merge optimizations.
In bleve-blast tests on a 2015 macbook (50K wikipedia docs, 8
indexers, batch size 100, ssd), this does not seem to have a distinct
effect on indexing throughput.
The slow merger was lagging behind the fast persister
to a persister notify send-loop while the persister awaits
for any new introductions from introducer totally blocking
the merger
This fix along with the deleted files eligibilty flipping
makes the file count to around 6 to 11 files per shard
for both travel and beer samples
This change turns zap.MergeToWriter() into a public func, so that it's
now directly callable from outside packages (such as from scorch's
top-level merger or persister). And, MergerToWriter() now takes input
of SegmentBases instead of Segments, so that it can now work on either
in-memory zap segments or file-based zap segments.
This is yet another stepping stone towards in-memory merging of zap
segments.
This change optimizes the scorch/mem DictionaryIterator by reusing a
DictEntry struct across multiple Next() calls. This follows the same
optimization trick and Next() semantics as upsidedown's FieldDict
implementation.