The zap DictionaryIterator Next() was incorrectly returning the
postingsList offset as the term count. As part of this fix, refactored
out a PostingsList.read() helper method.
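For illustration, a hedged sketch of the shape of the fix, with
heavily simplified stand-ins for the zap types (the real
DictionaryIterator walks a vellum FST, and PostingsList.read() decodes
much more than a count):

    package zap

    import (
        "encoding/binary"
        "fmt"
    )

    type PostingsList struct {
        count uint64
    }

    type DictEntry struct {
        Term  string
        Count uint64
    }

    // read decodes the postings list at postingsOffset; we assume,
    // purely for illustration, a varint count at that offset.
    func (p *PostingsList) read(data []byte, postingsOffset uint64) error {
        count, n := binary.Uvarint(data[postingsOffset:])
        if n <= 0 {
            return fmt.Errorf("postings read failed at offset %d",
                postingsOffset)
        }
        p.count = count
        return nil
    }

    // next shows the corrected flow: Count comes from the decoded
    // postings list, not from the postingsOffset stored for the term.
    func next(term string, postingsOffset uint64, data []byte) (*DictEntry, error) {
        var pl PostingsList
        if err := pl.read(data, postingsOffset); err != nil {
            return nil, err
        }
        return &DictEntry{Term: term, Count: pl.count}, nil
    }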
Also added more merge unit test scenarios, including merging a segment
with itself for several rounds to check that the contents don't differ
before and after each merge.
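The round-trip scenario, sketched (buildTestSegment, mergeToOne, and
assertSegmentsEqual are hypothetical stand-ins for the zap test
helpers):

    package zap

    import "testing"

    // Merging a segment with only itself, repeatedly, should be a
    // fixed point: each round's output must match its input.
    func TestMergeRounds(t *testing.T) {
        cur := buildTestSegment(t)
        for round := 0; round < 3; round++ {
            merged := mergeToOne(t, []*Segment{cur})
            assertSegmentsEqual(t, cur, merged) // no differences expected
            cur = merged
        }
    }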
Instead of sorting the docNum keys of a hashmap, this change iterates
from docNum 0 to N and uses an array instead of a hashmap. The array
is also reused across outer loop iterations.
This optimizes for when there's a lot of structural similarity between
docs, where many/most docs have the same fields (e.g., beers,
breweries). If every doc has completely different fields, then this
change might behave worse than the previous sparse hashmap approach.
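A minimal sketch of the pattern (docTermMap and emit are illustrative
names, not the actual merge code):

    package zap

    // process shows the shape of the swap: a dense slice indexed by
    // docNum replaces a sparse map whose keys needed sorting, and the
    // same backing slice is reused across the outer loop.
    func process(numDocs, numFields int, emit func(docNum int, terms []byte)) {
        docTermMap := make([][]byte, numDocs) // allocated once, up-front
        for field := 0; field < numFields; field++ {
            for i := range docTermMap {
                docTermMap[i] = docTermMap[i][:0] // reset in place, keep capacity
            }
            // ... walk the field, appending bytes to docTermMap[docNum] ...
            for docNum := 0; docNum < numDocs; docNum++ {
                emit(docNum, docTermMap[docNum]) // docNum order; no sort needed
            }
        }
    }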
Instead of allocating a govarint.U64Base128Encoder in the inner loop,
allocate it just once outside the loop, since it appears to be just a
thin wrapper around binary.PutUvarint().
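A hedged sketch of the hoist (encodeGroups is an illustrative name):

    package zap

    import (
        "bytes"

        "github.com/Smerity/govarint"
    )

    // encodeGroups varint-encodes every value; the encoder is created
    // once, outside the loops, instead of once per inner iteration.
    func encodeGroups(groups [][]uint64) []byte {
        var buf bytes.Buffer
        encoder := govarint.NewU64Base128Encoder(&buf) // hoisted
        for _, vals := range groups {
            for _, v := range vals {
                encoder.PutU64(v)
            }
        }
        encoder.Close()
        return buf.Bytes()
    }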
The zap SegmentBase struct is a refactoring of the zap Segment into
the subset of fields that are needed for read-only ops, without any
persistence-related info. This allows zap's optimized data encoding
to be used for scorch's in-memory segments.
The zap Segment struct now embeds a zap SegmentBase struct, and layers
on persistence. Both the zap Segment and zap SegmentBase implement
scorch's Segment interface.
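A rough sketch of the split (field names are illustrative, not the
exact zap layout):

    package zap

    import "os"

    // SegmentBase carries only what read-only operations need, so an
    // in-memory scorch segment can be a SegmentBase with no file
    // behind it.
    type SegmentBase struct {
        mem       []byte            // zap-encoded bytes, in-heap or mmap'd
        numDocs   uint64
        fieldsMap map[string]uint16 // field name -> field id
    }

    // Read-only methods hang off *SegmentBase and are promoted to
    // Segment through embedding, which is how both types end up
    // satisfying scorch's Segment interface.
    func (sb *SegmentBase) Count() uint64 { return sb.numDocs }

    // Segment layers persistence state onto an embedded SegmentBase.
    type Segment struct {
        SegmentBase
        f    *os.File // backing file
        path string
        refs int64
    }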
+ Account for all the overhead incurred by the data structures
within mem.Segment and zap.Segment.
- SizeOfMap = 8
- SizeOfPointer = 8
- SizeOfSlice = 24
- SizeOfString = 16
+ Include overhead from certain new fields as well.
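For example (the helper is hypothetical; the constants mirror the list
above):

    package mem

    const (
        SizeOfMap     = 8
        SizeOfPointer = 8
        SizeOfSlice   = 24
        SizeOfString  = 16
    )

    // sizeOfStringToUint64Map shows how the constants combine: map
    // overhead, then each entry's string header, string bytes, and
    // fixed-width value.
    func sizeOfStringToUint64Map(m map[string]uint64) uint64 {
        size := uint64(SizeOfMap)
        for k := range m {
            size += SizeOfString + uint64(len(k)) // key header + bytes
            size += 8                             // the uint64 value
        }
        return size
    }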
This change preallocates more of the backing arrays for the Locfields,
Locstarts, Locends, Locpos, and Locarraypos sub-slices of a scorch mem
segment.
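One way to picture it (a hedged sketch; the real sizing pass differs):

    package mem

    // preallocLocs gives each postings list id's location sub-slices
    // their expected capacity up-front, so the build loop's append()
    // calls rarely need to reallocate.
    func preallocLocs(numLocsPerPID []int,
        locfields [][]uint16,
        locstarts, locends, locpos [][]uint64,
        locarraypos [][][]uint64) {
        for pid, n := range numLocsPerPID {
            locfields[pid] = make([]uint16, 0, n)
            locstarts[pid] = make([]uint64, 0, n)
            locends[pid] = make([]uint64, 0, n)
            locpos[pid] = make([]uint64, 0, n)
            locarraypos[pid] = make([][]uint64, 0, n)
        }
    }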
On small bleve-blast tests (50K wiki docs) on a dev MacBook, scorch
indexing throughput seems to improve from 15MB/sec to 20MB/sec after
the recent series of preallocation changes.
The scorch mem segment build phase uses the append() idiom to populate
various slices that are keyed by postings list ids. These slices
include...
* Postings
* PostingsLocs
* Freqs
* Norms
* Locfields
* Locstarts
* Locends
* Locpos
* Locarraypos
This change introduces an initialization step that assigns postings
list ids to terms up-front, so those slices can be preallocated to
their final sizes (see the sketch below).
This also simplifies the processDocument() logic, which no longer has
to handle a first-time initialization case, removing some
duplicate-ish code.
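A hedged sketch of the init step, with simplified types standing in
for the real mem segment fields:

    package mem

    type segment struct {
        dicts []map[string]uint64 // per field: term -> pid+1 (0 = unseen)
        Freqs [][]uint64
        Norms [][]float32
        // ... Postings, PostingsLocs, Locfields, Locstarts, Locends,
        // Locpos, and Locarraypos are sized the same way ...
    }

    // initPostingsIDs assigns each unique term a postings list id in
    // one pass, then allocates the pid-keyed slices at their final
    // lengths so processDocument() never hits a first-time append case.
    func (s *segment) initPostingsIDs(fieldTerms [][]string) {
        var numPids uint64
        s.dicts = make([]map[string]uint64, len(fieldTerms))
        for field, terms := range fieldTerms {
            s.dicts[field] = make(map[string]uint64, len(terms))
            for _, term := range terms {
                if _, seen := s.dicts[field][term]; !seen {
                    numPids++
                    s.dicts[field][term] = numPids
                }
            }
        }
        s.Freqs = make([][]uint64, numPids)
        s.Norms = make([][]float32, numPids)
    }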
The first time through, startNumFields should be 0; there ought to be
room for further optimization here, assuming later docs have fields
similar to the first doc's.