In this optimization, the zap PostingsIterator skips the parsing of
freq/norm/locs chunks based on the includeFreq|Norm|Locs flags.
In bleve-query microbenchmark on dev macbookpro, with 50K en-wiki
docs, on a medium frequency term search that does not ask for term
vectors, throughput was ~750 q/sec before the change and
went to ~1400 q/sec after the change.
This commit adds boolean flag params to the scorch
PostingsList.Iterator() method, so that the caller can specify whether
freq/norm/locs information is needed or not.
Future changes can leverage these params for optimizations.
The previous code would inefficiently throw away the nextLocs and
would also throw away the []segment.Location slice if there were no
locations, such as if it was a 1-hit postings list.
This change tries to reuse the nextLocs/nextSegmentLocs for all cases.
The snappy Encode/Decode() API's accept an optional destination buffer
param where their encoded/decoded output results will be placed, but
they only check that the buffer has enough len() rather than enough
capacity before deciding to allocate a new buffer.
The previous commit's optimization that replaced the locsBitmap was
incorrectly handling the case when there was a 1-bit encoding
optimization in the postingsIterator.nextBytes() method,
incorrectly generating the freq-norm bytes.
Also as part of this change, more unused locsBitmap's were removed.
This is attempt #2 of the optimization that replaces the locsBitmap,
without any changes from the original commit attempt. A commit that
follows this one contains the actual fix.
See also...
- commit 621b58dd83 (the 1st attempt)
- commit 49a4ee60ba (the revert)
-------------
The original commit message body from 621b58 was...
NOTE: this is a zap file format change.
The separate "postings locations" roaring Bitmap that encoded whether
a posting has locations info is now replaced by the least significant
bit in the freq varint encoded in the freq-norm chunkedIntCoder.
encode/decodeFreqHasLocs() are added as helper functions.
Testing with the cbft application led to cbft process exits...
AsyncError exit()... error reading location field: EOF --
main.initBleveOptions.func1() at init_bleve.go:85
This reverts commit 621b58dd83.
NOTE: this is a zap file format change.
The separate "postings locations" roaring Bitmap that encoded whether
a posting has locations info is now replaced by the least significant
bit in the freq varint encoded in the freq-norm chunkedIntCoder.
encode/decodeFreqHasLocs() are added as helper functions.
When the index is closed, do not fire an AsyncError (fatal) from either
the merger or the persister that is actively working. This is quite a
probable situation, so exit the loop within the goroutine.
Use of sync.Pool to reuse the interm structure relied on resetting
the fieldsInv slice. However, actual segments continued to use
this same fieldsInv slice after returning it to the pool. Simple
fix is to nil out fieldsInv slice in reset method and let the
newly built segment keep the one from the interim struct.
mergeFields depends on the fields from the various segments being
sorted for the fieldsSame comparison to work.
Of note, the 'fieldi > 1' guard skips the 0th field, which should
always be the '_id' field.
correctly handle/print additional loc bitmap address
this fixes bitmap length that is output
instantiate roaring bitmap and print it out
removed some unnecessary debug logging
updated dict command to print 1-hit encoded vals
this makes dict command usable for seeing which
doc ids are in a segment and their corresponding doc number
The optimization recently introduced in commit 530a3d24cf,
("scorch zap optimize merge by byte copying freq/norm/loc's") was to
byte-copy freq/norm/loc data directly during merging. But, it was
incorrect if the fields were different across segments.
This change now performs that byte-copying merging optimization only
when the fields are the same across segments, and if not, leverages
the old approach of deserializing & re-serializing the freq/norm/loc
information, which has the important step of remapping fieldID's.
See also: https://issues.couchbase.com/browse/MB-28781