0
0
Commit Graph

71 Commits

Author SHA1 Message Date
Steve Yen
192621f402 scorch includeFreq/Norm/Locs params for postingsList.Iterator API
This commit adds boolean flag params to the scorch
PostingsList.Iterator() method, so that the caller can specify whether
freq/norm/locs information is needed or not.

Future changes can leverage these params for optimizations.
2018-03-26 09:49:44 -07:00
Steve Yen
6540b197d4 scorch zap provide full buffer capacity to snappy Encode/Decode()
The snappy Encode/Decode() API's accept an optional destination buffer
param where their encoded/decoded output results will be placed, but
they only check that the buffer has enough len() rather than enough
capacity before deciding to allocate a new buffer.
2018-03-26 09:49:44 -07:00
Steve Yen
ba644f3893 scorch zap fix postingsIter.nextBytes() when 1-bit encoded
The previous commit's optimization that replaced the locsBitmap was
incorrectly handling the case when there was a 1-bit encoding
optimization in the postingsIterator.nextBytes() method,
incorrectly generating the freq-norm bytes.

Also as part of this change, more unused locsBitmap's were removed.
2018-03-26 09:19:00 -07:00
Steve Yen
7a19e6fd7e scorch zap replace locsBitmap w/ 1 bit from freq-norm varint encoding
This is attempt #2 of the optimization that replaces the locsBitmap,
without any changes from the original commit attempt.  A commit that
follows this one contains the actual fix.

See also...
- commit 621b58dd83 (the 1st attempt)
- commit 49a4ee60ba (the revert)

-------------
The original commit message body from 621b58 was...

NOTE: this is a zap file format change.

The separate "postings locations" roaring Bitmap that encoded whether
a posting has locations info is now replaced by the least significant
bit in the freq varint encoded in the freq-norm chunkedIntCoder.

encode/decodeFreqHasLocs() are added as helper functions.
2018-03-23 12:50:24 -07:00
Steve Yen
49a4ee60ba Revert "scorch zap replace locsBitmap w/ 1 bit from freq-norm varint encoding"
Testing with the cbft application led to cbft process exits...

  AsyncError exit()... error reading location field: EOF --
  main.initBleveOptions.func1() at init_bleve.go:85

This reverts commit 621b58dd83.
2018-03-23 10:01:30 -07:00
Steve Yen
621b58dd83 scorch zap replace locsBitmap w/ 1 bit from freq-norm varint encoding
NOTE: this is a zap file format change.

The separate "postings locations" roaring Bitmap that encoded whether
a posting has locations info is now replaced by the least significant
bit in the freq varint encoded in the freq-norm chunkedIntCoder.

encode/decodeFreqHasLocs() are added as helper functions.
2018-03-22 17:43:07 -07:00
abhinavdangeti
844845b5d2 Revert "scorch zap panic if mergeFields() sees unsorted fields"
This reverts commit 2f4d3d8587.
2018-03-20 14:51:25 -07:00
Steve Yen
2f4d3d8587 scorch zap panic if mergeFields() sees unsorted fields
mergeFields depends on the fields from the various segments being
sorted for the fieldsSame comparison to work.

Of note, the 'fieldi > 1' guard skips the 0th field, which should
always be the '_id' field.
2018-03-20 11:17:46 -07:00
Steve Yen
f65ba5c0f4 MB-28781 - scorch zap merge freq/loc copying only when fieldsSame
The optimization recently introduced in commit 530a3d24cf,
("scorch zap optimize merge by byte copying freq/norm/loc's") was to
byte-copy freq/norm/loc data directly during merging.  But, it was
incorrect if the fields were different across segments.

This change now performs that byte-copying merging optimization only
when the fields are the same across segments, and if not, leverages
the old approach of deserializing & re-serializing the freq/norm/loc
information, which has the important step of remapping fieldID's.

See also: https://issues.couchbase.com/browse/MB-28781
2018-03-19 11:26:51 -07:00
Steve Yen
c881146270 scorch zap mergeTermFreqNormLocsByCopying() helper func 2018-03-19 10:36:23 -07:00
Steve Yen
5df53c8e1f scorch zap file merger uses 1MB buffered writer
pprof of bleve-blast was showing file merging was in syscall/write a
lot.  The bufio.NewWriter() provides a default buffer size of 4K,
which is too small, and using bufio.NewWriterSize(1MB buffer size)
leads to syscall/write dropping out of the file merging flame graphs.
2018-03-16 11:49:53 -07:00
Steve Yen
e52eb84e37 scorch zap optimize merge when deletion bitmap is empty
This change detects whether a deletion bitmap is empty, and treats
that as a nil bitmap, which allows further postings iterator codepaths
to avoid roaring bitmap operations (like, AndNot(docNums, drops)).
2018-03-16 11:22:50 -07:00
Steve Yen
cad88096ca scorch zap reuse roaring Bitmap during merge 2018-03-12 09:17:37 -07:00
Steve Yen
3884cf4d12 scorch zap writePostings() helper func refactored out 2018-03-09 13:29:28 -08:00
Steve Yen
eac9808990 scorch zap optimize FST val encoding for terms with 1 hit
NOTE: this is a scorch zap file format change / bump to version 4.

In this optimization, the uint64 val stored in the vellum FST (term
dictionary) now may either be a uint64 postingsOffset (same as before
this change) or a uint64 encoding of the docNum + norm (in the case
where a term appears in just a single doc).
2018-03-08 09:19:54 -08:00
Steve Yen
15242af465
Merge pull request #805 from steveyen/optimize-scorch-mem-processField
Optimize scorch processField() inner loop and writeRoaringWithLen()
2018-03-07 09:09:57 -08:00
Sreekanth Sivasankaran
e0369a3553
Merge branch 'master' into compaction_bytes_stats 2018-03-07 14:47:33 +05:30
Sreekanth Sivasankaran
2a9739ee1b naming change, interface removal 2018-03-07 14:43:33 +05:30
Steve Yen
dde6c2e01b scorch zap optimize writeRoaringWithLen()
Before this change, writeRoaringWithLen() would leverage a reused
bytes.Buffer (#A) and invoke the roaring.WriteTo() API.

But, it turns out the roaring.WriteTo() API has a suboptimal
implementation, in that underneath-the-hood it converts the roaring
bitmap to a byte buffer (using roaring.ToBytes()), and then calls
Write().  But, that Write() turns out to be an additional memcpy into
the provided bytes.Buffer (#A).

By directly invoking roaring.ToBytes(), this change to
writeRoaringWithLen() avoids the extra memory allocation and memcpy.
2018-03-06 14:59:20 -08:00
Steve Yen
530a3d24cf scorch zap optimize merge by byte copying freq/norm/loc's
This change adds a zap PostingsIterator.nextBytes() method, which is
similar to Next(), but instead of returning a Posting instance,
nextBytes() returns the encoded freq/norm and location byte slices.

The zap merge code then provides those byte slices directly to the
intCoder's via a new method, intCoder.AddBytes(), thereby avoiding
having to encode many uvarint's.
2018-03-06 13:30:59 -08:00
Sreekanth Sivasankaran
395b0a312d adding UTs 2018-03-05 17:02:58 +05:30
Sreekanth Sivasankaran
dec265c481 adding compaction_written_bytes/sec stats to scorch 2018-03-05 16:32:57 +05:30
Marty Schoch
0363b24dd4 update to use new vellum Reset API 2018-03-01 09:37:39 -08:00
Steve Yen
3f1dcb6078 scorch zap merge optimize drops lookup to outside of loop 2018-02-27 09:23:29 -08:00
Steve Yen
99ed127176 scorch zap merge optimize newDocNums lookup to outside of loop
And, also a "go fmt".
2018-02-26 14:23:55 -08:00
Steve Yen
ce2332e111 scorch zap merge reuses tf/locEncoder across terms
The finishTerm() helper func that's invoked on every outer loop resets
the tf/locEncoders so they can be safely reused.
2018-02-26 11:37:11 -08:00
Steve Yen
a0b7508da7 scorch zap mergeSegmentBases() func
As part of this, zap.MergeToWriter() now returns more information --
enough so that callers can now create their own SegmentBase instances.

Also, the fieldsMap maintained and returned by zap.MergeToWriter() is
now a mapping from fieldName ==> fieldID+1 (instead of the previous
mapping from fieldName ==> fieldID).  This makes it similar to how
fieldsMap are handled in other parts of zap to avoid "zero value"
issues.
2018-02-19 14:13:31 -08:00
Steve Yen
fe544f3352 scorch zap merge uses enumerator for vellum.Iterator's 2018-02-12 21:28:46 -08:00
Steve Yen
2158e06c40 scorch zap merge collects dicts & itrs in lock-step
The theory with this change is that the dicts and itrs should be
positionally in "lock-step" with paired entries.

And, since later code also uses the same array indexing to access the
drops and newDocNums, those also need to be positionally in pair-wise
lock-step, too.
2018-02-12 20:54:07 -08:00
Steve Yen
e37c563c56 scorch zap merge move fieldDvLocsOffset var declaration
Move the var declaration to nearer where its used.
2018-02-08 18:03:09 -08:00
Steve Yen
f177f07613 scorch zap segment merging reuses prealloc'ed PostingsIterator
During zap segment merging, a new zap PostingsIterator was allocated
for every field X segment X term.

This change optimizes by reusing a single PostingsIterator instance
per persistMergedRest() invocation.

And, also unused fields are removed from the PostingsIterator.
2018-02-08 17:24:30 -08:00
Steve Yen
ed4826b189 scorch zap merge optimization to byte-copy storedDocs
The optimization to byte-copy all the storedDocs for a given segment
during merging kicks in when the fields are the same across all
segments and when there are no deletions for that given segment.  This
can happen, for example, during data loading or insert-only scenarios.

As part of this commit, the Segment.copyStoredDocs() method was added,
which uses a single Write() call to copy all the stored docs bytes of
a segment to a writer in one shot.

And, getDocStoredMetaAndCompressed() was refactored into a related
helper function, getDocStoredOffsets(), which provides the storedDocs
metadata (offsets & lengths) for a doc.
2018-02-08 09:08:35 -08:00
Steve Yen
0b50a20cac scorch zap move docDropped const to earlier in file 2018-02-08 09:06:31 -08:00
Steve Yen
822457542e scorch zap VERSION bump: check whether fields are the same at merge
COMPATIBILITY NOTE: scorch zap version bumped in this commit.

The version bump is because mergeFields() now computes whether fields
are the same across segments and it relies on the previous commit
where fieldID's are assigned in field name sorted order (albeit with
_id field always having fieldID of 0).

Potential future commits might rely on this info that "fields are the
same across segments" for more optimizations, etc.
2018-02-08 09:06:30 -08:00
Steve Yen
ffdeb8055e scorch sorts fields by name to assign fieldID's
This is a stepping stone to allow easier future comparisons of field
maps and potential merge optimizations.

In bleve-blast tests on a 2015 macbook (50K wikipedia docs, 8
indexers, batch size 100, ssd), this does not seem to have a distinct
effect on indexing throughput.
2018-02-08 09:06:30 -08:00
Steve Yen
a83ee0f364 scorch zap.MergeToWriter() takes SegmentBases instead of Segments
This change turns zap.MergeToWriter() into a public func, so that it's
now directly callable from outside packages (such as from scorch's
top-level merger or persister).  And, MergerToWriter() now takes input
of SegmentBases instead of Segments, so that it can now work on either
in-memory zap segments or file-based zap segments.

This is yet another stepping stone towards in-memory merging of zap
segments.
2018-02-07 14:38:13 -08:00
Steve Yen
8c2520d55c scorch zap optimize via postingsList reuse
pprof graphs were showing many postingsList allocations during
merging, so this change optimizes by reusing postingList memory in the
merging loops.
2018-02-07 14:33:20 -08:00
Steve Yen
0dfd73d6cc scorch zap mergeStoredAndRemap loop optimization
This change avoids an array/slice access in a loop body.
2018-02-06 17:10:44 -08:00
Steve Yen
6578655758 scorch zap refactored out mergeToWriter() func
This is a step towards supporting in-memory zap segment merging.
2018-02-05 07:39:16 -08:00
Steve Yen
eb21bf8315 scorch zap merge & build share persistStoredFieldValues()
Refactored out a helper func, persistStoredFieldValues(), that both
the persistence and merge codepaths now share.
2018-02-05 07:38:55 -08:00
Steve Yen
714f5321e0 scorch zap merge storedFieldVals inner loop optimization 2018-02-01 16:28:15 -08:00
Steve Yen
634cfa0560 scorch zap chunkedIntCoder optimization to prealloc some final buf 2018-01-29 11:03:53 -08:00
Steve Yen
a444c25ddf scorch zap merge uses array for docTermMap with no sorting
Instead of sorting docNum keys from a hashmap, this change instead
iterates from docNum 0 to N and uses an array instead of hashmap.
The array is also reused across outer loop iterations.

This optimizes for when there's a lot of structural similarity between
docs, where many/most docs have the same fields.  i.e., beers,
breweries.  If every doc has completely different fields, then this
change might produce worse behavior compared to the previous sparse
hashmap approach.
2018-01-29 10:47:08 -08:00
Steve Yen
745575a6c1 scorch zap mergeStoredAndRemap uses array indexing, not append()
Since we have right array size preallocated, we don't need the extra
capacity checking of append().
2018-01-27 11:35:10 -08:00
Steve Yen
8dd17a3b20 scorch zap mergeStoredAndRemap uses continue for less indentation 2018-01-27 11:35:10 -08:00
Steve Yen
0041664bc4 scorch zap merge computeNewDocCount() optimize 1 variable 2018-01-27 11:35:10 -08:00
Steve Yen
6985db13a0 scorch zap merge reuses docNumbers array 2018-01-27 11:35:10 -08:00
Steve Yen
916bbf4125 scorch zap merge prealloc's docTermMap capacity 2018-01-27 11:35:10 -08:00
Steve Yen
56cdb68f35 scorch zap merge checks err2 not err
Also, optimize the appending of the termSeparator so that the
docTermMap is accessed and updated just once.
2018-01-27 11:35:10 -08:00
Steve Yen
3030d4edb5 scorch zap merge preallocs segNewDocNums capacity 2018-01-27 11:35:10 -08:00