This API (unexported) will estimate the amount of memory needed to execute
a search query over an index before the collector begins data collection.
Sample estimates for certain queries:
{Size: 10, BenchmarkUpsidedownSearchOverhead}
ESTIMATE BENCHMEM
TermQuery 4616 4796
MatchQuery 5210 5405
DisjunctionQuery (Match queries) 7700 8447
DisjunctionQuery (Term queries) 6514 6591
ConjunctionQuery (Match queries) 7524 8175
Nested disjunction query (disjunction of disjunctions) 10306 10708
…
Due to the usage rules of iterators, mem.PostingsIterator.Next() can
reuse its returned Postings instance.
Also, there's a micro optimization in persistDocValues() for one fewer
access to the docTermMap in the inner-loop.
This is a stepping stone to allow easier future comparisons of field
maps and potential merge optimizations.
In bleve-blast tests on a 2015 macbook (50K wikipedia docs, 8
indexers, batch size 100, ssd), this does not seem to have a distinct
effect on indexing throughput.
This change optimizes the scorch/mem DictionaryIterator by reusing a
DictEntry struct across multiple Next() calls. This follows the same
optimization trick and Next() semantics as upsidedown's FieldDict
implementation.
The zap SegmentBase struct is a refactoring of the zap Segment into
the subset of fields that are needed for read-only ops, without any
persistence related info. This allows us to use zap's optimized data
encoding as scorch's in-memory segments.
The zap Segment struct now embeds a zap SegmentBase struct, and layers
on persistence. Both the zap Segment and zap SegmentBase implement
scorch's Segment interface.
+ Account for all the overhead incurred from the data structures
within mem.Segment and zap.Segment.
- SizeOfMap = 8
- SizeOfPointer = 8
- SizeOfSlice = 24
- SizeOfString = 16
+ Include overhead from certain new fields as well.
This change preallocates more of the backing arrays for Locfields,
Locstarts, Locends, Locpos, Locaaraypos sub-slices of a scorch mem
segment.
On small bleve-blast tests (50K wiki docs) on a dev macbook, scorch
indexing throughput seems to improve from 15MB/sec to 20MB/sec after
the recent series of preallocation changes.
The scorch mem segment build phase uses the append() idiom to populate
various slices that are keyed by postings list id's. These slices
include...
* Postings
* PostingsLocs
* Freqs
* Norms
* Locfields
* Locstarts
* Locends
* Locpos
* Locarraypos
This change introduces an initialization step that preallocates those
slices up-front, by assigning postings list id's to terms up-front.
This change also has an additional effect of simplifying the
processDocument() logic to no longer have to worry about a first-time
initialization case, removing some duplicate'ish code.
The first time through, startNumFields should be 0, where there ought
to be more optimization assuming later docs have similar fields as the
first doc.
-VisitableDocValueFields API for persisted DV field list
-making dv configs overridable at field level
-enabling on the fly/runtime un inverting of doc values
-few UT updates
inserted to the list of dv enabled fields list -
DocValueFields in mem segment.
Moved back to the original type `DocValueFields map[uint16]bool`
for easy look up to check whether the fieldID is
configured for dv storage.
+ Track memory usage at a segment level
+ Add a new scorch API: MemoryUsed()
- Aggregate the memory consumption across
segments when API is invoked.
+ TODO:
- Revisit the second iteration if it can be gotten
rid off, and the size accounted for during the first
run while building an in-mem segment.
- Accounting for pointer and slice overhead.
docValues are persisted along with the index,
in a columnar fashion per field with variable
sized chunking for quick look up.
-naive chunk level caching is added per field
-data part inside a chunk is snappy compressed
-metaHeader inside the chunk index the dv values
inside the uncompressed data part
-all the fields are docValue persisted in this iteration