Due to the usage rules of iterators, mem.PostingsIterator.Next() can
reuse its returned Postings instance.
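As a minimal sketch of that reuse pattern (the Posting and Iterator types below are hypothetical stand-ins, not the actual mem package types), an iterator can overwrite and return the same instance on every call, provided callers copy out what they need before calling Next() again:

```go
package main

import "fmt"

// Posting is a hypothetical stand-in for a postings entry.
type Posting struct {
	DocNum uint64
	Freq   uint64
}

// Iterator walks a slice of doc numbers and reuses one Posting across calls.
type Iterator struct {
	docNums []uint64
	pos     int
	reused  Posting // overwritten by every call to Next()
}

// Next returns the next posting, or nil when exhausted. The returned
// pointer is only valid until the following call to Next().
func (it *Iterator) Next() *Posting {
	if it.pos >= len(it.docNums) {
		return nil
	}
	it.reused = Posting{DocNum: it.docNums[it.pos], Freq: 1}
	it.pos++
	return &it.reused
}

// collectDocNums copies out the field it needs before advancing.
func collectDocNums(it *Iterator) []uint64 {
	var out []uint64
	for p := it.Next(); p != nil; p = it.Next() {
		out = append(out, p.DocNum)
	}
	return out
}

func main() {
	it := &Iterator{docNums: []uint64{3, 7, 9}}
	fmt.Println(collectDocNums(it))
}
```

The trade-off is that a caller holding on to an earlier result sees it change underneath, which is why this only works under the usual iterator usage rules.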
Also, there's a micro optimization in persistDocValues() for one fewer
access to the docTermMap in the inner-loop.
The optimizations / changes include...
- reuse of a memory buf when serializing varints.
- reuse of a govarint.U64Base128Encoder instance, as it's a thin
wrapper around an underlying chunkBuf, so a Reset() on the
chunkBuf is enough for encoder reuse.
- chunkedIntcoder.Write() method was changed to invoke w.Write() less
often by forming a larger, reused buf. Profiling and analysis
showed w.Write() was getting called a lot, often with tiny 1 or 2
byte inputs. The theory is w.Write() and its underlying memmove()
can be more efficient when provided with larger bufs.
- some repeated code removal, by reusing the Close() method.
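The idea of forming a larger, reused buf before calling w.Write() can be sketched as follows. This is an illustration using only the standard library's binary.PutUvarint(), not the chunkedIntCoder code itself; countingWriter and writeUvarints are hypothetical names:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
)

// countingWriter records how many times Write is called (for illustration).
type countingWriter struct {
	data  []byte
	calls int
}

func (c *countingWriter) Write(p []byte) (int, error) {
	c.calls++
	c.data = append(c.data, p...)
	return len(p), nil
}

// writeUvarints encodes all vals into one reused buffer and issues a
// single w.Write(), instead of one tiny Write() per value.
func writeUvarints(w io.Writer, scratch []byte, vals []uint64) ([]byte, error) {
	scratch = scratch[:0] // keep capacity, drop old contents
	var tmp [binary.MaxVarintLen64]byte
	for _, v := range vals {
		n := binary.PutUvarint(tmp[:], v)
		scratch = append(scratch, tmp[:n]...)
	}
	_, err := w.Write(scratch)
	return scratch, err // returned so the caller can reuse its capacity
}

func main() {
	w := &countingWriter{}
	var scratch []byte
	scratch, _ = writeUvarints(w, scratch, []uint64{1, 300, 70000})
	scratch, _ = writeUvarints(w, scratch, []uint64{5})
	fmt.Println(w.calls, len(w.data), cap(scratch) > 0)
}
```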
As part of this, zap.MergeToWriter() now returns more information --
enough so that callers can now create their own SegmentBase instances.
Also, the fieldsMap maintained and returned by zap.MergeToWriter() is
now a mapping from fieldName ==> fieldID+1 (instead of the previous
mapping from fieldName ==> fieldID). This makes it similar to how
fieldsMap are handled in other parts of zap to avoid "zero value"
issues.
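A small sketch of why storing fieldID+1 helps (lookupFieldID is a hypothetical helper, not part of zap): a Go map returns the zero value for a missing key, so with plain fieldIDs a missing field would be indistinguishable from fieldID 0.

```go
package main

import "fmt"

// lookupFieldID returns the fieldID for name, given a map that stores
// fieldID+1. A stored value of 0 can then only mean "absent".
func lookupFieldID(fieldsMap map[string]uint16, name string) (uint16, bool) {
	v := fieldsMap[name] // zero value when the key is missing
	if v == 0 {
		return 0, false
	}
	return v - 1, true
}

func main() {
	fieldsMap := map[string]uint16{"_id": 1, "name": 2}
	fmt.Println(lookupFieldID(fieldsMap, "_id"))
	fmt.Println(lookupFieldID(fieldsMap, "missing"))
}
```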
The theory with this change is that the dicts and itrs should be
positionally in "lock-step" with paired entries.
And, since later code also uses the same array indexing to access the
drops and newDocNums, those also need to be positionally in pair-wise
lock-step, too.
Unlike vellum's MergeIterator, the enumerator introduced in this
commit doesn't merge when there are matching keys across iterators.
Instead, the enumerator implementation provides a traversal of all the
tuples of (key, iteratorIndex, val) from the underlying vellum
iterators, ordered by key ASC, iteratorIndex ASC.
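The ordering can be illustrated with a sketch over plain sorted slices rather than vellum iterators (the entry and tuple types and the enumerate function are assumptions made for this example):

```go
package main

import "fmt"

// entry is one (key, val) pair; each input slice must be sorted by key.
type entry struct {
	key string
	val uint64
}

// tuple is what the enumeration yields.
type tuple struct {
	key string
	idx int
	val uint64
}

// enumerate visits every entry from every input, ordered by key ASC and
// then input index ASC. Matching keys are NOT merged.
func enumerate(inputs [][]entry) []tuple {
	pos := make([]int, len(inputs))
	var out []tuple
	for {
		best := -1
		for i, in := range inputs {
			if pos[i] >= len(in) {
				continue // this input is exhausted
			}
			// Strict "<" keeps the lowest index when keys tie.
			if best < 0 || in[pos[i]].key < inputs[best][pos[best]].key {
				best = i
			}
		}
		if best < 0 {
			return out
		}
		e := inputs[best][pos[best]]
		out = append(out, tuple{key: e.key, idx: best, val: e.val})
		pos[best]++
	}
}

func main() {
	a := []entry{{"a", 1}, {"c", 2}}
	b := []entry{{"a", 10}, {"b", 20}}
	for _, t := range enumerate([][]entry{a, b}) {
		fmt.Println(t.key, t.idx, t.val)
	}
}
```

A linear scan over the inputs is used here for brevity; a heap would be the usual choice with many inputs.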
During zap segment merging, a new zap PostingsIterator was allocated
for every field X segment X term.
This change optimizes by reusing a single PostingsIterator instance
per persistMergedRest() invocation.
Also, unused fields are removed from the PostingsIterator.
The optimization to byte-copy all the storedDocs for a given segment
during merging kicks in when the fields are the same across all
segments and when there are no deletions for that given segment. This
can happen, for example, during data loading or insert-only scenarios.
As part of this commit, the Segment.copyStoredDocs() method was added,
which uses a single Write() call to copy all the stored docs bytes of
a segment to a writer in one shot.
And, getDocStoredMetaAndCompressed() was refactored into a related
helper function, getDocStoredOffsets(), which provides the storedDocs
metadata (offsets & lengths) for a doc.
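A rough sketch of the byte-copy fast path, assuming a simplified layout in which the stored docs of a segment occupy one contiguous byte range (the segment struct, its fields, canByteCopy and sliceWriter are hypothetical, not the zap types):

```go
package main

import (
	"fmt"
	"io"
)

// segment is a hypothetical, simplified layout: stored docs occupy one
// contiguous byte range [storedStart, storedEnd) of data.
type segment struct {
	data        []byte
	storedStart int
	storedEnd   int
}

// copyStoredDocs writes the whole stored-docs region with a single Write().
func (s *segment) copyStoredDocs(w io.Writer) (int, error) {
	return w.Write(s.data[s.storedStart:s.storedEnd])
}

// canByteCopy mirrors the conditions from the text: the same fields
// across all segments, and no deletions in this segment.
func canByteCopy(fieldsSame bool, numDeleted int) bool {
	return fieldsSame && numDeleted == 0
}

// sliceWriter is a tiny in-memory writer that counts Write calls.
type sliceWriter struct {
	data  []byte
	calls int
}

func (w *sliceWriter) Write(p []byte) (int, error) {
	w.calls++
	w.data = append(w.data, p...)
	return len(p), nil
}

func main() {
	seg := &segment{data: []byte("HDRdoc0doc1FTR"), storedStart: 3, storedEnd: 11}
	out := &sliceWriter{}
	if canByteCopy(true, 0) {
		n, err := seg.copyStoredDocs(out)
		fmt.Println(n, err, string(out.data), out.calls)
	}
}
```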
COMPATIBILITY NOTE: scorch zap version bumped in this commit.
The version bump is because mergeFields() now computes whether fields
are the same across segments and it relies on the previous commit
where fieldIDs are assigned in field name sorted order (albeit with
_id field always having fieldID of 0).
Potential future commits might rely on this info that "fields are the
same across segments" for more optimizations, etc.
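As a sketch of why sorted fieldID assignment makes the "fields are the same" check cheap (assignFieldIDs and fieldsSame are hypothetical helpers, and this assumes every segment has an _id field): once IDs are assigned deterministically, two segments have the same fields exactly when their ordered field lists are equal.

```go
package main

import (
	"fmt"
	"sort"
)

// assignFieldIDs returns names in fieldID order: "_id" first (fieldID 0),
// then the remaining names sorted. Assumes "_id" is always a field.
func assignFieldIDs(names []string) []string {
	rest := make([]string, 0, len(names))
	for _, n := range names {
		if n != "_id" {
			rest = append(rest, n)
		}
	}
	sort.Strings(rest)
	return append([]string{"_id"}, rest...)
}

// fieldsSame reports whether every segment has the same ordered field list.
func fieldsSame(segments [][]string) bool {
	for i := 1; i < len(segments); i++ {
		if len(segments[i]) != len(segments[0]) {
			return false
		}
		for j := range segments[i] {
			if segments[i][j] != segments[0][j] {
				return false
			}
		}
	}
	return true
}

func main() {
	a := assignFieldIDs([]string{"name", "_id", "abv"})
	b := assignFieldIDs([]string{"abv", "name", "_id"})
	c := assignFieldIDs([]string{"_id", "city"})
	fmt.Println(a, fieldsSame([][]string{a, b}), fieldsSame([][]string{a, c}))
}
```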
This is a stepping stone to allow easier future comparisons of field
maps and potential merge optimizations.
In bleve-blast tests on a 2015 MacBook (50K Wikipedia docs, 8
indexers, batch size 100, SSD), this does not seem to have a
noticeable effect on indexing throughput.
This change turns zap.MergeToWriter() into a public func, so that it's
now directly callable from outside packages (such as from scorch's
top-level merger or persister). And, MergeToWriter() now takes input
of SegmentBases instead of Segments, so that it can now work on either
in-memory zap segments or file-based zap segments.
This is yet another stepping stone towards in-memory merging of zap
segments.
The zap DictionaryIterator Next() was incorrectly returning the
postingsList offset as the term count. As part of this, refactored
out a PostingsList.read() helper method.
Also added more merge unit test scenarios, including merging a segment
for a few rounds to see if there are differences before/after merging.
Instead of sorting docNum keys from a hashmap, this change
iterates from docNum 0 to N and uses an array instead of a hashmap.
The array is also reused across outer loop iterations.
This optimizes for when there's a lot of structural similarity between
docs, where many/most docs have the same fields, e.g., beers,
breweries. If every doc has completely different fields, then this
change might produce worse behavior compared to the previous sparse
hashmap approach.
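The dense-array approach might look roughly like this sketch, where docVal and orderedValues are hypothetical names and an empty string is assumed to mean "no value for this doc":

```go
package main

import "fmt"

// docVal is a hypothetical (docNum, value) pair for one field.
type docVal struct {
	docNum int
	val    string
}

// orderedValues returns, per field, the values in docNum order without
// sorting any keys. An empty string marks "no value" (an assumption).
func orderedValues(numDocs int, fields [][]docVal) [][]string {
	dense := make([]string, numDocs) // allocated once, reused per field
	out := make([][]string, 0, len(fields))
	for _, field := range fields {
		for i := range dense {
			dense[i] = "" // reset the reused array
		}
		for _, dv := range field {
			dense[dv.docNum] = dv.val
		}
		var vals []string
		for docNum := 0; docNum < numDocs; docNum++ { // already in docNum order
			if dense[docNum] != "" {
				vals = append(vals, dense[docNum])
			}
		}
		out = append(out, vals)
	}
	return out
}

func main() {
	fields := [][]docVal{
		{{2, "stout"}, {0, "ale"}},
		{{1, "denver"}},
	}
	fmt.Println(orderedValues(3, fields))
}
```

The inner scan always costs numDocs steps per field, which is the source of the worse behavior when docs share few fields.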
Instead of allocating a govarint.U64Base128Encoder in the inner loop,
allocate it just once on the outside, as it appears that it's just a
thin wrapper around binary.PutUvarint().
The zap SegmentBase struct is a refactoring of the zap Segment into
the subset of fields that are needed for read-only ops, without any
persistence related info. This allows us to use zap's optimized data
encoding as scorch's in-memory segments.
The zap Segment struct now embeds a zap SegmentBase struct, and layers
on persistence. Both the zap Segment and zap SegmentBase implement
scorch's Segment interface.
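The embedding relationship can be sketched as below. The Reader interface and the struct fields are simplified assumptions standing in for scorch's Segment interface and the real zap fields:

```go
package main

import "fmt"

// Reader is a hypothetical, tiny stand-in for scorch's Segment interface.
type Reader interface {
	Count() uint64
}

// SegmentBase holds only what read-only operations need.
type SegmentBase struct {
	mem     []byte
	numDocs uint64
}

// Count is defined once, on the base type.
func (sb *SegmentBase) Count() uint64 { return sb.numDocs }

// Segment embeds SegmentBase and layers persistence info on top.
type Segment struct {
	SegmentBase
	path string
}

// Both types satisfy the interface; Segment does so via the promoted method.
var (
	_ Reader = (*SegmentBase)(nil)
	_ Reader = (*Segment)(nil)
)

func main() {
	inMem := &SegmentBase{numDocs: 3}
	onDisk := &Segment{SegmentBase: SegmentBase{numDocs: 5}, path: "example.zap"}
	for _, r := range []Reader{inMem, onDisk} {
		fmt.Println(r.Count())
	}
}
```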
+ Account for all the overhead incurred from the data structures
within mem.Segment and zap.Segment.
- SizeOfMap = 8
- SizeOfPointer = 8
- SizeOfSlice = 24
- SizeOfString = 16
+ Include overhead from certain new fields as well.
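A sketch of how such constants might be combined for a struct, using the values listed above (which assume a 64-bit platform). The memSegment type and its sizeInBytes method are hypothetical and much smaller than the real segments:

```go
package main

import "fmt"

// Overhead constants as listed above (64-bit platform assumed).
const (
	sizeOfMap     = 8
	sizeOfPointer = 8
	sizeOfSlice   = 24
	sizeOfString  = 16
)

// memSegment is a hypothetical, heavily reduced segment.
type memSegment struct {
	fields   []string
	fieldIDs map[string]uint16
}

// sizeInBytes estimates the footprint: headers first, then contents.
func (s *memSegment) sizeInBytes() int {
	size := sizeOfSlice + sizeOfMap // headers of the two fields
	for _, f := range s.fields {
		size += sizeOfString + len(f) // string header + bytes
	}
	for k := range s.fieldIDs {
		size += sizeOfString + len(k) + 2 // key header + key bytes + uint16 value
	}
	return size
}

func main() {
	s := &memSegment{
		fields:   []string{"_id", "name"},
		fieldIDs: map[string]uint16{"_id": 1, "name": 2},
	}
	fmt.Println(s.sizeInBytes(), sizeOfPointer)
}
```

This ignores map bucket overhead, so it is an estimate rather than an exact figure.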
- VisitableDocValueFields API for the persisted DV field list
- making dv configs overridable at the field level
- enabling on-the-fly/runtime uninverting of doc values
- a few unit test updates
The zap command-line tool was added to the main bleve command-line
tool. This required physical relocation due to the vendoring used
only on the bleve command-line tool (an unforeseen limitation).
A new scorch command-line tool has also been introduced,
and for the same reasons it is physically stored under
the top-level bleve command-line tool as well.
inserted into the list of dv-enabled fields,
DocValueFields, in the mem segment.
Moved back to the original type `DocValueFields map[uint16]bool`
for easy look up to check whether the fieldID is
configured for dv storage.
+ Track memory usage at a segment level
+ Add a new scorch API: MemoryUsed()
- Aggregate the memory consumption across
segments when API is invoked.
+ TODO:
- Revisit the second iteration to see if it can be gotten
rid of, with the size accounted for during the first
run while building an in-mem segment.
- Accounting for pointer and slice overhead.
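The aggregation step can be sketched as a simple sum over per-segment sizes. The sizedSegment interface, fixedSegment type and memoryUsed function are hypothetical names for illustration, not the scorch API:

```go
package main

import "fmt"

// sizedSegment is a hypothetical interface for segments that track
// their own memory usage.
type sizedSegment interface {
	SizeInBytes() uint64
}

// fixedSegment reports a precomputed size, as if tracked at build time.
type fixedSegment uint64

func (f fixedSegment) SizeInBytes() uint64 { return uint64(f) }

// memoryUsed sums the tracked size of every segment when invoked.
func memoryUsed(segments []sizedSegment) uint64 {
	var total uint64
	for _, s := range segments {
		total += s.SizeInBytes()
	}
	return total
}

func main() {
	segs := []sizedSegment{fixedSegment(1024), fixedSegment(4096)}
	fmt.Println(memoryUsed(segs))
}
```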
docValues are persisted along with the index,
in a columnar fashion per field, with variable
sized chunking for quick lookup.
- naive chunk-level caching is added per field
- the data part inside a chunk is snappy compressed
- the metaHeader inside the chunk indexes the dv values
inside the uncompressed data part
- all the fields are docValue persisted in this iteration
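A rough sketch of how a per-chunk metaHeader could index into the uncompressed data part. The metaEntry layout (docNum, offset, length) is an assumption made for this example and is not the actual zap on-disk format; snappy decompression is omitted, so data is taken to be already uncompressed:

```go
package main

import (
	"fmt"
	"sort"
)

// metaEntry is a hypothetical metaHeader record locating one doc's
// values inside the uncompressed data part.
type metaEntry struct {
	docNum uint64
	offset uint64
	length uint64
}

// chunk holds a metaHeader (sorted by docNum) and the uncompressed data.
type chunk struct {
	meta []metaEntry
	data []byte
}

// valuesFor binary-searches the metaHeader for docNum and returns the
// matching slice of the data part, if any.
func (c *chunk) valuesFor(docNum uint64) ([]byte, bool) {
	i := sort.Search(len(c.meta), func(i int) bool {
		return c.meta[i].docNum >= docNum
	})
	if i == len(c.meta) || c.meta[i].docNum != docNum {
		return nil, false
	}
	m := c.meta[i]
	return c.data[m.offset : m.offset+m.length], true
}

func main() {
	c := &chunk{
		meta: []metaEntry{{docNum: 2, offset: 0, length: 3}, {docNum: 5, offset: 3, length: 2}},
		data: []byte("abcde"),
	}
	v, ok := c.valuesFor(5)
	fmt.Println(string(v), ok)
	_, ok = c.valuesFor(3)
	fmt.Println(ok)
}
```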
With this change, there are no more memory allocations in the calls to
PostingsIterator.Next() in the micro benchmarks of bleve-query. On a
dev macbook, on an index of 50K wikipedia docs, using high frequency
search of "text:date"...
400 qps - upsidedown/moss
565 qps - scorch before
680 qps - scorch after