0
0
Fork 0
Commit Graph

65 Commits

Author SHA1 Message Date
abhinavdangeti 844845b5d2 Revert "scorch zap panic if mergeFields() sees unsorted fields"
This reverts commit 2f4d3d8587.
2018-03-20 14:51:25 -07:00
Steve Yen 2f4d3d8587 scorch zap panic if mergeFields() sees unsorted fields
mergeFields depends on the fields from the various segments being
sorted for the fieldsSame comparison to work.

Of note, the 'fieldi > 1' guard skips the 0th field, which should
always be the '_id' field.
2018-03-20 11:17:46 -07:00
Steve Yen f65ba5c0f4 MB-28781 - scorch zap merge freq/loc copying only when fieldsSame
The optimization recently introduced in commit 530a3d24cf,
("scorch zap optimize merge by byte copying freq/norm/loc's") was to
byte-copy freq/norm/loc data directly during merging.  But, it was
incorrect if the fields were different across segments.

This change now performs that byte-copying merging optimization only
when the fields are the same across segments, and if not, leverages
the old approach of deserializing & re-serializing the freq/norm/loc
information, which has the important step of remapping fieldID's.

See also: https://issues.couchbase.com/browse/MB-28781
2018-03-19 11:26:51 -07:00
Steve Yen c881146270 scorch zap mergeTermFreqNormLocsByCopying() helper func 2018-03-19 10:36:23 -07:00
Steve Yen 5df53c8e1f scorch zap file merger uses 1MB buffered writer
pprof of bleve-blast was showing file merging was in syscall/write a
lot.  The bufio.NewWriter() provides a default buffer size of 4K,
which is too small, and using bufio.NewWriterSize(1MB buffer size)
leads to syscall/write dropping out of the file merging flame graphs.
2018-03-16 11:49:53 -07:00
Steve Yen e52eb84e37 scorch zap optimize merge when deletion bitmap is empty
This change detects whether a deletion bitmap is empty, and treats
that as a nil bitmap, which allows further postings iterator codepaths
to avoid roaring bitmap operations (like, AndNot(docNums, drops)).
2018-03-16 11:22:50 -07:00
Steve Yen cad88096ca scorch zap reuse roaring Bitmap during merge 2018-03-12 09:17:37 -07:00
Steve Yen 3884cf4d12 scorch zap writePostings() helper func refactored out 2018-03-09 13:29:28 -08:00
Steve Yen eac9808990 scorch zap optimize FST val encoding for terms with 1 hit
NOTE: this is a scorch zap file format change / bump to version 4.

In this optimization, the uint64 val stored in the vellum FST (term
dictionary) now may either be a uint64 postingsOffset (same as before
this change) or a uint64 encoding of the docNum + norm (in the case
where a term appears in just a single doc).
2018-03-08 09:19:54 -08:00
Steve Yen 15242af465
Merge pull request #805 from steveyen/optimize-scorch-mem-processField
Optimize scorch processField() inner loop and writeRoaringWithLen()
2018-03-07 09:09:57 -08:00
Sreekanth Sivasankaran e0369a3553
Merge branch 'master' into compaction_bytes_stats 2018-03-07 14:47:33 +05:30
Sreekanth Sivasankaran 2a9739ee1b naming change, interface removal 2018-03-07 14:43:33 +05:30
Steve Yen dde6c2e01b scorch zap optimize writeRoaringWithLen()
Before this change, writeRoaringWithLen() would leverage a reused
bytes.Buffer (#A) and invoke the roaring.WriteTo() API.

But, it turns out the roaring.WriteTo() API has a suboptimal
implementation, in that underneath-the-hood it converts the roaring
bitmap to a byte buffer (using roaring.ToBytes()), and then calls
Write().  But, that Write() turns out to be an additional memcpy into
the provided bytes.Buffer (#A).

By directly invoking roaring.ToBytes(), this change to
writeRoaringWithLen() avoids the extra memory allocation and memcpy.
2018-03-06 14:59:20 -08:00
Steve Yen 530a3d24cf scorch zap optimize merge by byte copying freq/norm/loc's
This change adds a zap PostingsIterator.nextBytes() method, which is
similar to Next(), but instead of returning a Posting instance,
nextBytes() returns the encoded freq/norm and location byte slices.

The zap merge code then provides those byte slices directly to the
intCoder's via a new method, intCoder.AddBytes(), thereby avoiding
having to encode many uvarint's.
2018-03-06 13:30:59 -08:00
Sreekanth Sivasankaran 395b0a312d adding UTs 2018-03-05 17:02:58 +05:30
Sreekanth Sivasankaran dec265c481 adding compaction_written_bytes/sec stats to scorch 2018-03-05 16:32:57 +05:30
Marty Schoch 0363b24dd4 update to use new vellum Reset API 2018-03-01 09:37:39 -08:00
Steve Yen 3f1dcb6078 scorch zap merge optimize drops lookup to outside of loop 2018-02-27 09:23:29 -08:00
Steve Yen 99ed127176 scorch zap merge optimize newDocNums lookup to outside of loop
And, also a "go fmt".
2018-02-26 14:23:55 -08:00
Steve Yen ce2332e111 scorch zap merge reuses tf/locEncoder across terms
The finishTerm() helper func that's invoked on every outer loop resets
the tf/locEncoders so they can be safely reused.
2018-02-26 11:37:11 -08:00
Steve Yen a0b7508da7 scorch zap mergeSegmentBases() func
As part of this, zap.MergeToWriter() now returns more information --
enough so that callers can now create their own SegmentBase instances.

Also, the fieldsMap maintained and returned by zap.MergeToWriter() is
now a mapping from fieldName ==> fieldID+1 (instead of the previous
mapping from fieldName ==> fieldID).  This makes it similar to how
fieldsMap are handled in other parts of zap to avoid "zero value"
issues.
2018-02-19 14:13:31 -08:00
Steve Yen fe544f3352 scorch zap merge uses enumerator for vellum.Iterator's 2018-02-12 21:28:46 -08:00
Steve Yen 2158e06c40 scorch zap merge collects dicts & itrs in lock-step
The theory with this change is that the dicts and itrs should be
positionally in "lock-step" with paired entries.

And, since later code also uses the same array indexing to access the
drops and newDocNums, those also need to be positionally in pair-wise
lock-step, too.
2018-02-12 20:54:07 -08:00
Steve Yen e37c563c56 scorch zap merge move fieldDvLocsOffset var declaration
Move the var declaration to nearer where its used.
2018-02-08 18:03:09 -08:00
Steve Yen f177f07613 scorch zap segment merging reuses prealloc'ed PostingsIterator
During zap segment merging, a new zap PostingsIterator was allocated
for every field X segment X term.

This change optimizes by reusing a single PostingsIterator instance
per persistMergedRest() invocation.

And, also unused fields are removed from the PostingsIterator.
2018-02-08 17:24:30 -08:00
Steve Yen ed4826b189 scorch zap merge optimization to byte-copy storedDocs
The optimization to byte-copy all the storedDocs for a given segment
during merging kicks in when the fields are the same across all
segments and when there are no deletions for that given segment.  This
can happen, for example, during data loading or insert-only scenarios.

As part of this commit, the Segment.copyStoredDocs() method was added,
which uses a single Write() call to copy all the stored docs bytes of
a segment to a writer in one shot.

And, getDocStoredMetaAndCompressed() was refactored into a related
helper function, getDocStoredOffsets(), which provides the storedDocs
metadata (offsets & lengths) for a doc.
2018-02-08 09:08:35 -08:00
Steve Yen 0b50a20cac scorch zap move docDropped const to earlier in file 2018-02-08 09:06:31 -08:00
Steve Yen 822457542e scorch zap VERSION bump: check whether fields are the same at merge
COMPATIBILITY NOTE: scorch zap version bumped in this commit.

The version bump is because mergeFields() now computes whether fields
are the same across segments and it relies on the previous commit
where fieldID's are assigned in field name sorted order (albeit with
_id field always having fieldID of 0).

Potential future commits might rely on this info that "fields are the
same across segments" for more optimizations, etc.
2018-02-08 09:06:30 -08:00
Steve Yen ffdeb8055e scorch sorts fields by name to assign fieldID's
This is a stepping stone to allow easier future comparisons of field
maps and potential merge optimizations.

In bleve-blast tests on a 2015 macbook (50K wikipedia docs, 8
indexers, batch size 100, ssd), this does not seem to have a distinct
effect on indexing throughput.
2018-02-08 09:06:30 -08:00
Steve Yen a83ee0f364 scorch zap.MergeToWriter() takes SegmentBases instead of Segments
This change turns zap.MergeToWriter() into a public func, so that it's
now directly callable from outside packages (such as from scorch's
top-level merger or persister).  And, MergerToWriter() now takes input
of SegmentBases instead of Segments, so that it can now work on either
in-memory zap segments or file-based zap segments.

This is yet another stepping stone towards in-memory merging of zap
segments.
2018-02-07 14:38:13 -08:00
Steve Yen 8c2520d55c scorch zap optimize via postingsList reuse
pprof graphs were showing many postingsList allocations during
merging, so this change optimizes by reusing postingList memory in the
merging loops.
2018-02-07 14:33:20 -08:00
Steve Yen 0dfd73d6cc scorch zap mergeStoredAndRemap loop optimization
This change avoids an array/slice access in a loop body.
2018-02-06 17:10:44 -08:00
Steve Yen 6578655758 scorch zap refactored out mergeToWriter() func
This is a step towards supporting in-memory zap segment merging.
2018-02-05 07:39:16 -08:00
Steve Yen eb21bf8315 scorch zap merge & build share persistStoredFieldValues()
Refactored out a helper func, persistStoredFieldValues(), that both
the persistence and merge codepaths now share.
2018-02-05 07:38:55 -08:00
Steve Yen 714f5321e0 scorch zap merge storedFieldVals inner loop optimization 2018-02-01 16:28:15 -08:00
Steve Yen 634cfa0560 scorch zap chunkedIntCoder optimization to prealloc some final buf 2018-01-29 11:03:53 -08:00
Steve Yen a444c25ddf scorch zap merge uses array for docTermMap with no sorting
Instead of sorting docNum keys from a hashmap, this change instead
iterates from docNum 0 to N and uses an array instead of hashmap.
The array is also reused across outer loop iterations.

This optimizes for when there's a lot of structural similarity between
docs, where many/most docs have the same fields.  i.e., beers,
breweries.  If every doc has completely different fields, then this
change might produce worse behavior compared to the previous sparse
hashmap approach.
2018-01-29 10:47:08 -08:00
Steve Yen 745575a6c1 scorch zap mergeStoredAndRemap uses array indexing, not append()
Since we have right array size preallocated, we don't need the extra
capacity checking of append().
2018-01-27 11:35:10 -08:00
Steve Yen 8dd17a3b20 scorch zap mergeStoredAndRemap uses continue for less indentation 2018-01-27 11:35:10 -08:00
Steve Yen 0041664bc4 scorch zap merge computeNewDocCount() optimize 1 variable 2018-01-27 11:35:10 -08:00
Steve Yen 6985db13a0 scorch zap merge reuses docNumbers array 2018-01-27 11:35:10 -08:00
Steve Yen 916bbf4125 scorch zap merge prealloc's docTermMap capacity 2018-01-27 11:35:10 -08:00
Steve Yen 56cdb68f35 scorch zap merge checks err2 not err
Also, optimize the appending of the termSeparator so that the
docTermMap is accessed and updated just once.
2018-01-27 11:35:10 -08:00
Steve Yen 3030d4edb5 scorch zap merge preallocs segNewDocNums capacity 2018-01-27 11:35:10 -08:00
Steve Yen 9038d75c98 scorch zap allocate govarint.U64Base128Encoder just once
Instead of allocating a govarint.U64Base128Encoder in the inner loop,
allocate it just once on the outside, as it appears that it's just a
thin wrapper around binary.PutUvarint().
2018-01-27 11:35:10 -08:00
Steve Yen 10dd5489c2 scorch zap Dict.postingsList() takes []byte for more mem control
This allows callers that already have a []byte term to avoid
string'ification garbage.
2018-01-27 11:35:10 -08:00
Steve Yen 6a17ff48c7 scorch zap removed uneeded []byte cast of term 2018-01-27 11:35:10 -08:00
Steve Yen d389e2bb40 scorch zap merge file cleanup on error, and some minor prealloc's 2018-01-27 11:35:10 -08:00
Steve Yen 37121c3b49 scorch zap writeRoaringWithLen optimized with reused bufs 2018-01-27 11:35:10 -08:00
Steve Yen 5a035dc9aa scorch zap in-memory segment representation (SegmentBase)
The zap SegmentBase struct is a refactoring of the zap Segment into
the subset of fields that are needed for read-only ops, without any
persistence related info.  This allows us to use zap's optimized data
encoding as scorch's in-memory segments.

The zap Segment struct now embeds a zap SegmentBase struct, and layers
on persistence.  Both the zap Segment and zap SegmentBase implement
scorch's Segment interface.
2018-01-27 11:35:10 -08:00