The previous commit's optimization that replaced the locsBitmap
mishandled the 1-bit encoding optimization in the
postingsIterator.nextBytes() method, generating incorrect
freq-norm bytes.
Also as part of this change, more unused locsBitmaps were removed.
This is attempt #2 of the optimization that replaces the locsBitmap,
without any changes from the original commit attempt. A commit that
follows this one contains the actual fix.
See also...
- commit 621b58dd83 (the 1st attempt)
- commit 49a4ee60ba (the revert)
-------------
The original commit message body from 621b58 was...
NOTE: this is a zap file format change.
The separate "postings locations" roaring Bitmap that encoded whether
a posting has locations info is now replaced by the least significant
bit in the freq varint encoded in the freq-norm chunkedIntCoder.
encode/decodeFreqHasLocs() are added as helper functions.
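The LSB packing described above can be sketched as follows (a minimal sketch; the actual zap helpers may differ in details):

```go
// encodeFreqHasLocs packs the "has locations" flag into the least
// significant bit of the freq, so a single varint carries both.
func encodeFreqHasLocs(freq uint64, hasLocs bool) uint64 {
	rv := freq << 1
	if hasLocs {
		rv |= 0x01
	}
	return rv
}

// decodeFreqHasLocs reverses the packing.
func decodeFreqHasLocs(freqHasLocs uint64) (freq uint64, hasLocs bool) {
	return freqHasLocs >> 1, freqHasLocs&0x01 != 0
}
```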
Testing with the cbft application led to cbft process exits...
AsyncError exit()... error reading location field: EOF --
main.initBleveOptions.func1() at init_bleve.go:85
This reverts commit 621b58dd83.
By memoizing the size of index snapshots and their
constituent parts, we significantly reduce the amount
of time that the lock is held in the app_herder when
calculating the total memory used.
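A hypothetical sketch of the memoization (type and field names here are illustrative, not the actual scorch/cbft identifiers): sizes are computed once at snapshot creation, so summing under the lock is cheap.

```go
// segmentSnapshot caches its size once, at creation time.
type segmentSnapshot struct {
	cachedSize uint64 // memoized; never recomputed under the lock
}

func (s *segmentSnapshot) Size() uint64 { return s.cachedSize }

type indexSnapshot struct {
	segment []*segmentSnapshot
}

// Size only adds the precomputed values; no deep traversal
// happens while the caller holds a lock.
func (i *indexSnapshot) Size() (rv uint64) {
	for _, s := range i.segment {
		rv += s.Size()
	}
	return rv
}
```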
NOTE: this is a scorch zap file format change / bump to version 4.
In this optimization, the uint64 val stored in the vellum FST (term
dictionary) may now be either a uint64 postingsOffset (same as before
this change) or a uint64 encoding of the docNum + norm (in the case
where a term appears in just a single doc).
Before this change, writeRoaringWithLen() would leverage a reused
bytes.Buffer (#A) and invoke the roaring.WriteTo() API.
But, it turns out the roaring.WriteTo() API has a suboptimal
implementation: under the hood it converts the roaring bitmap to a
byte buffer (using roaring.ToBytes()) and then calls Write(). That
Write() turns out to be an additional memcpy into the provided
bytes.Buffer (#A).
By directly invoking roaring.ToBytes(), this change to
writeRoaringWithLen() avoids the extra memory allocation and memcpy.
This change leverages the ability for the chunkedIntCoder.Add() method
to accept multiple input param values (via the '...' param signature),
meaning there are fewer Add() invocations.
Due to the usage rules of iterators, mem.PostingsIterator.Next() can
reuse its returned Postings instance.
Also, there's a micro optimization in persistDocValues() for one fewer
access to the docTermMap in the inner-loop.
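The reuse pattern can be sketched as below (illustrative types, not the actual mem package API): Next() overwrites and returns the same instance, which the iterator's usage contract allows.

```go
type posting struct {
	docNum uint64
}

type postingsIterator struct {
	docNums []uint64
	pos     int
	reuse   posting // overwritten on every Next(), saving allocations
}

// Next returns a posting that is valid only until the next call to
// Next(), per the iterator's usage rules.
func (it *postingsIterator) Next() *posting {
	if it.pos >= len(it.docNums) {
		return nil
	}
	it.reuse.docNum = it.docNums[it.pos]
	it.pos++
	return &it.reuse
}
```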
COMPATIBILITY NOTE: the scorch zap version is bumped in this commit.
The version bump is because mergeFields() now computes whether fields
are the same across segments, and it relies on the previous commit
where fieldIDs are assigned in field-name sorted order (albeit with
the _id field always having fieldID 0).
Potential future commits might rely on this info that "fields are the
same across segments" for more optimizations, etc.
The zap SegmentBase struct is a refactoring of the zap Segment into
the subset of fields that are needed for read-only ops, without any
persistence related info. This allows us to use zap's optimized data
encoding as scorch's in-memory segments.
The zap Segment struct now embeds a zap SegmentBase struct, and layers
on persistence. Both the zap Segment and zap SegmentBase implement
scorch's Segment interface.
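The shape of the refactoring, in simplified form (field names and the one-method interface here are illustrative; scorch's actual Segment interface is larger):

```go
type segmentReader interface {
	Count() uint64
}

// SegmentBase holds only what read-only operations need, so it can
// serve as an in-memory segment using zap's encoding.
type SegmentBase struct {
	numDocs uint64
	// ... decoded index structures
}

func (sb *SegmentBase) Count() uint64 { return sb.numDocs }

// Segment embeds SegmentBase and layers on persistence state.
type Segment struct {
	SegmentBase
	path string
	// ... file handle, mmap, reference counts
}
```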
Fields are inserted into the dv-enabled fields list
(DocValueFields) in the mem segment.
Moved back to the original type `DocValueFields map[uint16]bool`
for easy lookup to check whether a fieldID is
configured for dv storage.
docValues are persisted along with the index,
in a columnar fashion per field, with variable-sized
chunking for quick lookup.
- naive chunk-level caching is added per field
- the data part inside a chunk is snappy compressed
- the metaHeader inside the chunk indexes the dv values
  within the uncompressed data part
- all fields are docValue persisted in this iteration
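A simplified sketch of the per-chunk lookup (names illustrative; real zap snappy-compresses the data part and would decompress before slicing, a step omitted here):

```go
// chunkMeta is one metaHeader entry: which doc, and where its
// values end inside the (here uncompressed) data part.
type chunkMeta struct {
	docNum uint64
	dvEnd  uint64
}

// lookupDocValues scans the metaHeader to slice out one doc's
// values from the chunk's data part; nil means the doc has no
// values in this chunk.
func lookupDocValues(metas []chunkMeta, data []byte, docNum uint64) []byte {
	var start uint64
	for _, m := range metas {
		if m.docNum == docNum {
			return data[start:m.dvEnd]
		}
		start = m.dvEnd
	}
	return nil
}
```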