0
0
Fork 0
Commit Graph

136 Commits

Author SHA1 Message Date
abhinavdangeti 7e36109b3c MB-28162: Provide API to estimate memory needed to run a search query
This API (unexported) will estimate the amount of memory needed to execute
a search query over an index before the collector begins data collection.

Sample estimates for certain queries:
{Size: 10, BenchmarkUpsidedownSearchOverhead}
                                                           ESTIMATE    BENCHMEM
TermQuery                                                  4616        4796
MatchQuery                                                 5210        5405
DisjunctionQuery (Match queries)                           7700        8447
DisjunctionQuery (Term queries)                            6514        6591
ConjunctionQuery (Match queries)                           7524        8175
Nested disjunction query (disjunction of disjunctions)     10306       10708
…
2018-03-06 13:53:42 -08:00
Steve Yen 5b86da85f3 scorch zap optimize postings itr with tf/loc reader/decoder reuse 2018-03-06 13:30:59 -08:00
Steve Yen 530a3d24cf scorch zap optimize merge by byte copying freq/norm/loc's
This change adds a zap PostingsIterator.nextBytes() method, which is
similar to Next(), but instead of returning a Posting instance,
nextBytes() returns the encoded freq/norm and location byte slices.

The zap merge code then provides those byte slices directly to the
intCoder's via a new method, intCoder.AddBytes(), thereby avoiding
having to encode many uvarint's.
2018-03-06 13:30:59 -08:00
Steve Yen 655268bec8 scorch zap postings iterator nextDocNum() helper method
Refactored out a nextDocNum() helper method from Next() that future
optimizations can use.
2018-03-06 07:55:26 -08:00
Steve Yen 502e64c256 scorch zap Posting doesn't use iterator field 2018-03-05 16:33:13 -08:00
Steve Yen 8f8fd511b7 scorch zap access freqs[offset] outside loop 2018-03-05 12:02:33 -08:00
Steve Yen a338386a03 scorch build optimize freq/loc slice capacity 2018-03-05 12:02:33 -08:00
Steve Yen 856778ad7b scorch zap build prealloc docNumbers capacity 2018-03-05 12:02:33 -08:00
Steve Yen 8c0881eab2 scorch zap build reuses mem postingsList/Iterator structs 2018-03-05 12:02:33 -08:00
Steve Yen 85761c6a57 go fmt 2018-03-05 12:02:33 -08:00
Steve Yen 884da6f93a scorch optimize mem processDocument() norm calculation
This change moves the norm calculation outside of the inner loop.
2018-03-03 11:58:30 -08:00
Steve Yen 6ae799052a scorch mem optimize processDocument() stored field 2018-03-03 11:52:33 -08:00
Steve Yen b7cfef81c9 scorch optimize mem processDocument() dict access
This change moves the dict lookup to outside of the loop.
2018-03-03 11:43:25 -08:00
Steve Yen 88c740095b scorch optimizations for mem.PostingsIterator.Next() & docTermMap
Due to the usage rules of iterators, mem.PostingsIterator.Next() can
reuse its returned Postings instance.

Also, there's a micro optimization in persistDocValues() for one fewer
access to the docTermMap in the inner-loop.
2018-03-03 11:31:18 -08:00
Marty Schoch 0363b24dd4 update to use new vellum Reset API 2018-03-01 09:37:39 -08:00
Steve Yen 7d46d2c7ae scorch zap intcoder encoder is never nil 2018-02-28 10:09:21 -08:00
Steve Yen dd7d93ee5e scorch zap loadChunk reuses Location slices 2018-02-27 18:01:48 -08:00
Steve Yen 4dbb4b1495 scorch zap posting reuses freqNorm & loc reader and decoder 2018-02-27 18:01:48 -08:00
Steve Yen 3f1dcb6078 scorch zap merge optimize drops lookup to outside of loop 2018-02-27 09:23:29 -08:00
Steve Yen 99ed127176 scorch zap merge optimize newDocNums lookup to outside of loop
And, also a "go fmt".
2018-02-26 14:23:55 -08:00
Steve Yen 98d5d7bd81 scorch zap chunkedIntCoder optimizations
The optimizations / changes include...

- reuse of a memory buf when serializing varint's.

- reuse of a govarint.U64Base128Encoder instance, as it's a thin,
  wrapper around an underlying chunkBuf, so Reset()'s on the
  chunkBuf is enough for encoder reuse.

- chunkedIntcoder.Write() method was changed to invoke w.Write() less
  often by forming a larger, reused buf.  Profiling and analysis
  showed w.Write() was getting called a lot, often with tiny 1 or 2
  byte inputs.  The theory is w.Write() and its underlying memmove()
  can be more efficient when provided with larger bufs.

- some repeated code removal, by reusing the Close() method.
2018-02-26 14:17:09 -08:00
Steve Yen ce2332e111 scorch zap merge reuses tf/locEncoder across terms
The finishTerm() helper func that's invoked on every outer loop resets
the tf/locEncoders so they can be safely reused.
2018-02-26 11:37:11 -08:00
Steve Yen a0b7508da7 scorch zap mergeSegmentBases() func
As part of this, zap.MergeToWriter() now returns more information --
enough so that callers can now create their own SegmentBase instances.

Also, the fieldsMap maintained and returned by zap.MergeToWriter() is
now a mapping from fieldName ==> fieldID+1 (instead of the previous
mapping from fieldName ==> fieldID).  This makes it similar to how
fieldsMap are handled in other parts of zap to avoid "zero value"
issues.
2018-02-19 14:13:31 -08:00
Steve Yen 720010783e scorch zap InitSegmentBase() helper func
Refactored out a zap.InitSegmentBase() func so that non-zap packages
can create SegmentBase instances.
2018-02-19 14:13:31 -08:00
Steve Yen fe544f3352 scorch zap merge uses enumerator for vellum.Iterator's 2018-02-12 21:28:46 -08:00
Steve Yen a073424e5a scorch zap dict.postingsListFromOffset() method
A helper method that can create a PostingsList if the caller already
knows the postingsOffset.
2018-02-12 20:54:07 -08:00
Steve Yen 2158e06c40 scorch zap merge collects dicts & itrs in lock-step
The theory with this change is that the dicts and itrs should be
positionally in "lock-step" with paired entries.

And, since later code also uses the same array indexing to access the
drops and newDocNums, those also need to be positionally in pair-wise
lock-step, too.
2018-02-12 20:54:07 -08:00
Steve Yen 95a4f37e5c scorch zap enumerator impl that joins multiple vellum iterators
Unlike vellum's MergeIterator, the enumerator introduced in this
commit doesn't merge when there are matching keys across iterators.

Instead, the enumerator implementation provides a traversal of all the
tuples of (key, iteratorIndex, val) from the underlying vellum
iterators, ordered by key ASC, iteratorIndex ASC.
2018-02-12 20:54:06 -08:00
Steve Yen e37c563c56 scorch zap merge move fieldDvLocsOffset var declaration
Move the var declaration to nearer where its used.
2018-02-08 18:03:09 -08:00
Steve Yen f177f07613 scorch zap segment merging reuses prealloc'ed PostingsIterator
During zap segment merging, a new zap PostingsIterator was allocated
for every field X segment X term.

This change optimizes by reusing a single PostingsIterator instance
per persistMergedRest() invocation.

And, also unused fields are removed from the PostingsIterator.
2018-02-08 17:24:30 -08:00
Steve Yen ed4826b189 scorch zap merge optimization to byte-copy storedDocs
The optimization to byte-copy all the storedDocs for a given segment
during merging kicks in when the fields are the same across all
segments and when there are no deletions for that given segment.  This
can happen, for example, during data loading or insert-only scenarios.

As part of this commit, the Segment.copyStoredDocs() method was added,
which uses a single Write() call to copy all the stored docs bytes of
a segment to a writer in one shot.

And, getDocStoredMetaAndCompressed() was refactored into a related
helper function, getDocStoredOffsets(), which provides the storedDocs
metadata (offsets & lengths) for a doc.
2018-02-08 09:08:35 -08:00
Steve Yen 0b50a20cac scorch zap move docDropped const to earlier in file 2018-02-08 09:06:31 -08:00
Steve Yen 822457542e scorch zap VERSION bump: check whether fields are the same at merge
COMPATIBILITY NOTE: scorch zap version bumped in this commit.

The version bump is because mergeFields() now computes whether fields
are the same across segments and it relies on the previous commit
where fieldID's are assigned in field name sorted order (albeit with
_id field always having fieldID of 0).

Potential future commits might rely on this info that "fields are the
same across segments" for more optimizations, etc.
2018-02-08 09:06:30 -08:00
Steve Yen ffdeb8055e scorch sorts fields by name to assign fieldID's
This is a stepping stone to allow easier future comparisons of field
maps and potential merge optimizations.

In bleve-blast tests on a 2015 macbook (50K wikipedia docs, 8
indexers, batch size 100, ssd), this does not seem to have a distinct
effect on indexing throughput.
2018-02-08 09:06:30 -08:00
Steve Yen a83ee0f364 scorch zap.MergeToWriter() takes SegmentBases instead of Segments
This change turns zap.MergeToWriter() into a public func, so that it's
now directly callable from outside packages (such as from scorch's
top-level merger or persister).  And, MergerToWriter() now takes input
of SegmentBases instead of Segments, so that it can now work on either
in-memory zap segments or file-based zap segments.

This is yet another stepping stone towards in-memory merging of zap
segments.
2018-02-07 14:38:13 -08:00
Steve Yen 8c2520d55c scorch zap optimize via postingsList reuse
pprof graphs were showing many postingsList allocations during
merging, so this change optimizes by reusing postingList memory in the
merging loops.
2018-02-07 14:33:20 -08:00
Steve Yen 03c8b2b7ec scorch mem segment optimizes DictEntry's across Next() calls
This change optimizes the scorch/mem DictionaryIterator by reusing a
DictEntry struct across multiple Next() calls.  This follows the same
optimization trick and Next() semantics as upsidedown's FieldDict
implementation.
2018-02-07 14:17:48 -08:00
Steve Yen 0dfd73d6cc scorch zap mergeStoredAndRemap loop optimization
This change avoids an array/slice access in a loop body.
2018-02-06 17:10:44 -08:00
Steve Yen c09e2a08ca scorch zap chunkedContentCoder reuses chunk metadata slice memory
And, renamed the chunk MetaData.DocID field to DocNum for naming
correctness, where much of this commit is the mechanical effect of
that rename.
2018-02-05 07:39:16 -08:00
Steve Yen 6578655758 scorch zap refactored out mergeToWriter() func
This is a step towards supporting in-memory zap segment merging.
2018-02-05 07:39:16 -08:00
Steve Yen eb21bf8315 scorch zap merge & build share persistStoredFieldValues()
Refactored out a helper func, persistStoredFieldValues(), that both
the persistence and merge codepaths now share.
2018-02-05 07:38:55 -08:00
Steve Yen 714f5321e0 scorch zap merge storedFieldVals inner loop optimization 2018-02-01 16:28:15 -08:00
Steve Yen 93b037cdbb scorch zap TestMergeWithUpdates() 2018-01-31 11:44:41 -08:00
Steve Yen 4dd64b68fa scorch zap TestMergeWithEmptySegment(s) 2018-01-30 22:27:40 -08:00
Steve Yen 684ee3c0e7 scorch zap DictIterator term count fixed and more merge unit tests
The zap DictionaryIterator Next() was incorrectly returning the
postingsList offset as the term count.  As part of this, refactored
out a PostingsList.read() helper method.

Also added more merge unit test scenarios, including merging a segment
for a few rounds to see if there are differences before/after merging.
2018-01-30 21:22:06 -08:00
Steve Yen 634cfa0560 scorch zap chunkedIntCoder optimization to prealloc some final buf 2018-01-29 11:03:53 -08:00
Steve Yen a444c25ddf scorch zap merge uses array for docTermMap with no sorting
Instead of sorting docNum keys from a hashmap, this change instead
iterates from docNum 0 to N and uses an array instead of hashmap.
The array is also reused across outer loop iterations.

This optimizes for when there's a lot of structural similarity between
docs, where many/most docs have the same fields.  i.e., beers,
breweries.  If every doc has completely different fields, then this
change might produce worse behavior compared to the previous sparse
hashmap approach.
2018-01-29 10:47:08 -08:00
Steve Yen 745575a6c1 scorch zap mergeStoredAndRemap uses array indexing, not append()
Since we have right array size preallocated, we don't need the extra
capacity checking of append().
2018-01-27 11:35:10 -08:00
Steve Yen 8dd17a3b20 scorch zap mergeStoredAndRemap uses continue for less indentation 2018-01-27 11:35:10 -08:00
Steve Yen 0041664bc4 scorch zap merge computeNewDocCount() optimize 1 variable 2018-01-27 11:35:10 -08:00