0
0
Commit Graph

167 Commits

Author SHA1 Message Date
Steve Yen
99ed127176 scorch zap merge optimize newDocNums lookup to outside of loop
And, also a "go fmt".
2018-02-26 14:23:55 -08:00
Steve Yen
98d5d7bd81 scorch zap chunkedIntCoder optimizations
The optimizations / changes include...

- reuse of a memory buf when serializing varint's.

- reuse of a govarint.U64Base128Encoder instance, as it's a thin,
  wrapper around an underlying chunkBuf, so Reset()'s on the
  chunkBuf is enough for encoder reuse.

- chunkedIntcoder.Write() method was changed to invoke w.Write() less
  often by forming a larger, reused buf.  Profiling and analysis
  showed w.Write() was getting called a lot, often with tiny 1 or 2
  byte inputs.  The theory is w.Write() and its underlying memmove()
  can be more efficient when provided with larger bufs.

- some repeated code removal, by reusing the Close() method.
2018-02-26 14:17:09 -08:00
Steve Yen
ce2332e111 scorch zap merge reuses tf/locEncoder across terms
The finishTerm() helper func that's invoked on every outer loop resets
the tf/locEncoders so they can be safely reused.
2018-02-26 11:37:11 -08:00
Steve Yen
a0b7508da7 scorch zap mergeSegmentBases() func
As part of this, zap.MergeToWriter() now returns more information --
enough so that callers can now create their own SegmentBase instances.

Also, the fieldsMap maintained and returned by zap.MergeToWriter() is
now a mapping from fieldName ==> fieldID+1 (instead of the previous
mapping from fieldName ==> fieldID).  This makes it similar to how
fieldsMap are handled in other parts of zap to avoid "zero value"
issues.
2018-02-19 14:13:31 -08:00
Steve Yen
720010783e scorch zap InitSegmentBase() helper func
Refactored out a zap.InitSegmentBase() func so that non-zap packages
can create SegmentBase instances.
2018-02-19 14:13:31 -08:00
Steve Yen
fe544f3352 scorch zap merge uses enumerator for vellum.Iterator's 2018-02-12 21:28:46 -08:00
Steve Yen
a073424e5a scorch zap dict.postingsListFromOffset() method
A helper method that can create a PostingsList if the caller already
knows the postingsOffset.
2018-02-12 20:54:07 -08:00
Steve Yen
2158e06c40 scorch zap merge collects dicts & itrs in lock-step
The theory with this change is that the dicts and itrs should be
positionally in "lock-step" with paired entries.

And, since later code also uses the same array indexing to access the
drops and newDocNums, those also need to be positionally in pair-wise
lock-step, too.
2018-02-12 20:54:07 -08:00
Steve Yen
95a4f37e5c scorch zap enumerator impl that joins multiple vellum iterators
Unlike vellum's MergeIterator, the enumerator introduced in this
commit doesn't merge when there are matching keys across iterators.

Instead, the enumerator implementation provides a traversal of all the
tuples of (key, iteratorIndex, val) from the underlying vellum
iterators, ordered by key ASC, iteratorIndex ASC.
2018-02-12 20:54:06 -08:00
Steve Yen
e37c563c56 scorch zap merge move fieldDvLocsOffset var declaration
Move the var declaration to nearer where its used.
2018-02-08 18:03:09 -08:00
Steve Yen
f177f07613 scorch zap segment merging reuses prealloc'ed PostingsIterator
During zap segment merging, a new zap PostingsIterator was allocated
for every field X segment X term.

This change optimizes by reusing a single PostingsIterator instance
per persistMergedRest() invocation.

And, also unused fields are removed from the PostingsIterator.
2018-02-08 17:24:30 -08:00
Steve Yen
ed4826b189 scorch zap merge optimization to byte-copy storedDocs
The optimization to byte-copy all the storedDocs for a given segment
during merging kicks in when the fields are the same across all
segments and when there are no deletions for that given segment.  This
can happen, for example, during data loading or insert-only scenarios.

As part of this commit, the Segment.copyStoredDocs() method was added,
which uses a single Write() call to copy all the stored docs bytes of
a segment to a writer in one shot.

And, getDocStoredMetaAndCompressed() was refactored into a related
helper function, getDocStoredOffsets(), which provides the storedDocs
metadata (offsets & lengths) for a doc.
2018-02-08 09:08:35 -08:00
Steve Yen
0b50a20cac scorch zap move docDropped const to earlier in file 2018-02-08 09:06:31 -08:00
Steve Yen
822457542e scorch zap VERSION bump: check whether fields are the same at merge
COMPATIBILITY NOTE: scorch zap version bumped in this commit.

The version bump is because mergeFields() now computes whether fields
are the same across segments and it relies on the previous commit
where fieldID's are assigned in field name sorted order (albeit with
_id field always having fieldID of 0).

Potential future commits might rely on this info that "fields are the
same across segments" for more optimizations, etc.
2018-02-08 09:06:30 -08:00
Steve Yen
ffdeb8055e scorch sorts fields by name to assign fieldID's
This is a stepping stone to allow easier future comparisons of field
maps and potential merge optimizations.

In bleve-blast tests on a 2015 macbook (50K wikipedia docs, 8
indexers, batch size 100, ssd), this does not seem to have a distinct
effect on indexing throughput.
2018-02-08 09:06:30 -08:00
Steve Yen
a83ee0f364 scorch zap.MergeToWriter() takes SegmentBases instead of Segments
This change turns zap.MergeToWriter() into a public func, so that it's
now directly callable from outside packages (such as from scorch's
top-level merger or persister).  And, MergerToWriter() now takes input
of SegmentBases instead of Segments, so that it can now work on either
in-memory zap segments or file-based zap segments.

This is yet another stepping stone towards in-memory merging of zap
segments.
2018-02-07 14:38:13 -08:00
Steve Yen
8c2520d55c scorch zap optimize via postingsList reuse
pprof graphs were showing many postingsList allocations during
merging, so this change optimizes by reusing postingList memory in the
merging loops.
2018-02-07 14:33:20 -08:00
Steve Yen
03c8b2b7ec scorch mem segment optimizes DictEntry's across Next() calls
This change optimizes the scorch/mem DictionaryIterator by reusing a
DictEntry struct across multiple Next() calls.  This follows the same
optimization trick and Next() semantics as upsidedown's FieldDict
implementation.
2018-02-07 14:17:48 -08:00
Steve Yen
0dfd73d6cc scorch zap mergeStoredAndRemap loop optimization
This change avoids an array/slice access in a loop body.
2018-02-06 17:10:44 -08:00
Steve Yen
c09e2a08ca scorch zap chunkedContentCoder reuses chunk metadata slice memory
And, renamed the chunk MetaData.DocID field to DocNum for naming
correctness, where much of this commit is the mechanical effect of
that rename.
2018-02-05 07:39:16 -08:00
Steve Yen
6578655758 scorch zap refactored out mergeToWriter() func
This is a step towards supporting in-memory zap segment merging.
2018-02-05 07:39:16 -08:00
Steve Yen
eb21bf8315 scorch zap merge & build share persistStoredFieldValues()
Refactored out a helper func, persistStoredFieldValues(), that both
the persistence and merge codepaths now share.
2018-02-05 07:38:55 -08:00
Steve Yen
714f5321e0 scorch zap merge storedFieldVals inner loop optimization 2018-02-01 16:28:15 -08:00
Steve Yen
93b037cdbb scorch zap TestMergeWithUpdates() 2018-01-31 11:44:41 -08:00
Steve Yen
4dd64b68fa scorch zap TestMergeWithEmptySegment(s) 2018-01-30 22:27:40 -08:00
Steve Yen
684ee3c0e7 scorch zap DictIterator term count fixed and more merge unit tests
The zap DictionaryIterator Next() was incorrectly returning the
postingsList offset as the term count.  As part of this, refactored
out a PostingsList.read() helper method.

Also added more merge unit test scenarios, including merging a segment
for a few rounds to see if there are differences before/after merging.
2018-01-30 21:22:06 -08:00
Steve Yen
634cfa0560 scorch zap chunkedIntCoder optimization to prealloc some final buf 2018-01-29 11:03:53 -08:00
Steve Yen
a444c25ddf scorch zap merge uses array for docTermMap with no sorting
Instead of sorting docNum keys from a hashmap, this change instead
iterates from docNum 0 to N and uses an array instead of hashmap.
The array is also reused across outer loop iterations.

This optimizes for when there's a lot of structural similarity between
docs, where many/most docs have the same fields.  i.e., beers,
breweries.  If every doc has completely different fields, then this
change might produce worse behavior compared to the previous sparse
hashmap approach.
2018-01-29 10:47:08 -08:00
Steve Yen
745575a6c1 scorch zap mergeStoredAndRemap uses array indexing, not append()
Since we have right array size preallocated, we don't need the extra
capacity checking of append().
2018-01-27 11:35:10 -08:00
Steve Yen
8dd17a3b20 scorch zap mergeStoredAndRemap uses continue for less indentation 2018-01-27 11:35:10 -08:00
Steve Yen
0041664bc4 scorch zap merge computeNewDocCount() optimize 1 variable 2018-01-27 11:35:10 -08:00
Steve Yen
6985db13a0 scorch zap merge reuses docNumbers array 2018-01-27 11:35:10 -08:00
Steve Yen
916bbf4125 scorch zap merge prealloc's docTermMap capacity 2018-01-27 11:35:10 -08:00
Steve Yen
56cdb68f35 scorch zap merge checks err2 not err
Also, optimize the appending of the termSeparator so that the
docTermMap is accessed and updated just once.
2018-01-27 11:35:10 -08:00
Steve Yen
3030d4edb5 scorch zap merge preallocs segNewDocNums capacity 2018-01-27 11:35:10 -08:00
Steve Yen
9038d75c98 scorch zap allocate govarint.U64Base128Encoder just once
Instead of allocating a govarint.U64Base128Encoder in the inner loop,
allocate it just once on the outside, as it appears that it's just a
thin wrapper around binary.PutUvarint().
2018-01-27 11:35:10 -08:00
Steve Yen
10dd5489c2 scorch zap Dict.postingsList() takes []byte for more mem control
This allows callers that already have a []byte term to avoid
string'ification garbage.
2018-01-27 11:35:10 -08:00
Steve Yen
6a17ff48c7 scorch zap removed uneeded []byte cast of term 2018-01-27 11:35:10 -08:00
Steve Yen
d389e2bb40 scorch zap merge file cleanup on error, and some minor prealloc's 2018-01-27 11:35:10 -08:00
Steve Yen
37121c3b49 scorch zap writeRoaringWithLen optimized with reused bufs 2018-01-27 11:35:10 -08:00
Steve Yen
5a035dc9aa scorch zap in-memory segment representation (SegmentBase)
The zap SegmentBase struct is a refactoring of the zap Segment into
the subset of fields that are needed for read-only ops, without any
persistence related info.  This allows us to use zap's optimized data
encoding as scorch's in-memory segments.

The zap Segment struct now embeds a zap SegmentBase struct, and layers
on persistence.  Both the zap Segment and zap SegmentBase implement
scorch's Segment interface.
2018-01-27 11:35:10 -08:00
Steve Yen
dc62324e02 scorch zap miscellaneous typos 2018-01-27 11:35:10 -08:00
abhinavdangeti
1176c73a9c Include overhead from data structures in segment's SizeInBytes
+ Account for all the overhead incurred from the data structures
  within mem.Segment and zap.Segment.
    - SizeOfMap = 8
    - SizeOfPointer = 8
    - SizeOfSlice = 24
    - SizeOfString = 16
+ Include overhead from certain new fields as well.
2018-01-17 11:11:44 -08:00
Steve Yen
71d6d1691b scorch zap optimizations of inner loops and easy preallocs 2018-01-15 23:04:23 -08:00
Steve Yen
d682c85a7b scorch mem segments uses backing array trick even more
This change invokes make() only once per distinct type to allocate the
large, contiguous backing arrays for the mem segment.
2018-01-15 19:17:39 -08:00
Steve Yen
0f19b542a3 scorch mem segment prealloc's Locfields/starts/ends/pos/arraypos
This change preallocates more of the backing arrays for Locfields,
Locstarts, Locends, Locpos, Locaaraypos sub-slices of a scorch mem
segment.

On small bleve-blast tests (50K wiki docs) on a dev macbook, scorch
indexing throughput seems to improve from 15MB/sec to 20MB/sec after
the recent series of preallocation changes.
2018-01-15 18:40:28 -08:00
Steve Yen
a84bd122d2 scorch mem segment preallocates sub-slices via # terms
This change tracks the number of terms per posting list to
preallocate the sub-slices for the Freqs & Norms.
2018-01-15 18:20:43 -08:00
Steve Yen
a4110d325c scorch mem segment preallocates slices that are key'ed by postingId
The scorch mem segment build phase uses the append() idiom to populate
various slices that are keyed by postings list id's.  These slices
include...

* Postings
* PostingsLocs
* Freqs
* Norms
* Locfields
* Locstarts
* Locends
* Locpos
* Locarraypos

This change introduces an initialization step that preallocates those
slices up-front, by assigning postings list id's to terms up-front.

This change also has an additional effect of simplifying the
processDocument() logic to no longer have to worry about a first-time
initialization case, removing some duplicate'ish code.
2018-01-15 16:53:39 -08:00
Steve Yen
917c470791 scorch mem segment VisitDocument() accesses StoredTypes/Pos outside of loop 2018-01-15 11:54:46 -08:00
Steve Yen
e7bd6026eb scorch mem segment preallocs docMap/fieldLens with capacity
The first time through, startNumFields should be 0, where there ought
to be more optimization assuming later docs have similar fields as the
first doc.
2018-01-15 11:52:20 -08:00
Steve Yen
d777d7c365 scorch mem segment comments consistency 2018-01-15 11:08:21 -08:00
Marty Schoch
4e82a8a0ca
Merge pull request #726 from sreekanth-cb/docValue_configs
DocValue Config, new API Changes
2018-01-10 18:11:18 -05:00
Sreekanth Sivasankaran
53aef2104e fixing err handling in UTs, name changes 2018-01-10 22:00:26 +05:30
abhinavdangeti
43bfcc00c9 Do not account mmap'ed part of zap segments in MemoryUsed
This API is designed to only emit the dirty "unpersisted"
bytes only. This does not included the mmap'ed part in the
zap segments (disk).
2018-01-09 09:43:53 -08:00
Sreekanth Sivasankaran
4c256f5669 DocValue Config, new API Changes
-VisitableDocValueFields API for persisted DV field list
-making dv configs overridable at field level
-enabling on the fly/runtime un inverting of doc values
-few UT updates
2018-01-08 10:58:33 +05:30
Marty Schoch
c691cd2bb5 refactor scorch/zap command-line tools under bleve
zap command-line tool added to main bleve command-line tool
this required physical relocation due to the vendoring used
only on the bleve command-line tool (unforseen limitation)

a new scorch command-line tool has also been introduced
and for the same reasons it is physically store under
the top-level bleve command-line tool as well
2018-01-05 10:17:18 -05:00
Sreekanth Sivasankaran
71a726bbf6 perf issue was due to duplicate fieldIDs getting
inserted to the list of dv enabled fields list -
DocValueFields in mem segment.
Moved back to the original type `DocValueFields map[uint16]bool`
for easy look up to check whether the fieldID is
configured for dv storage.
2018-01-04 15:34:55 +05:30
Sreekanth Sivasankaran
f42ecb0ac7 docvalue "zap-path" cmd to print out the dv disk sizes 2018-01-04 13:58:51 +05:30
Sreekanth Sivasankaran
448201243a removed redundant buf writer, and checks 2017-12-30 16:54:06 +05:30
Sreekanth Sivasankaran
61ba81e964 Merge branch 'scorch', remote-tracking branch 'origin' into docValue_persisted 2017-12-30 16:52:51 +05:30
abhinavdangeti
5c26f5a86d Tracking memory consumption for a scorch index
+ Track memory usage at a segment level
+ Add a new scorch API: MemoryUsed()
    - Aggregate the memory consumption across
      segments when API is invoked.

+ TODO:
    - Revisit the second iteration if it can be gotten
      rid off, and the size accounted for during the first
      run while building an in-mem segment.
    - Accounting for pointer and slice overhead.
2017-12-29 10:20:11 -07:00
Sreekanth Sivasankaran
c8df014c0c Updated readme, zap version, added new docvalue cmd,
fixed the footer and fields cmd,
interface name updated
2017-12-29 21:39:29 +05:30
Sreekanth Sivasankaran
8abac42796 errCheck fixes 2017-12-28 13:23:57 +05:30
Sreekanth Sivasankaran
0272451093 adding checks for robustness 2017-12-28 13:05:25 +05:30
Sreekanth Sivasankaran
76f827f469 docValue persist changes
docValues are persisted along with the index,
in a columnar fashion per field with variable
sized chunking for quick look up.
-naive chunk level caching is added per field
-data part inside a chunk is snappy compressed
-metaHeader inside the chunk index the dv values
 inside the uncompressed data part
-all the fields are docValue persisted in this iteration
2017-12-28 12:05:33 +05:30
Steve Yen
67e0e5973b scorch mergeStoredAndRemap() memory reuse
In mergeStoredAndRemap(), instead of allocating new hashmaps for each
document, this commit reuses some arrays that are indexed by fieldId.
2017-12-20 15:18:22 -08:00
Steve Yen
c155255506 scorch optimize zap.Merge() to reuse some buffers 2017-12-20 14:59:53 -08:00
Steve Yen
1abbfadf0d scorch simplify err check after vellum load 2017-12-19 22:34:39 -08:00
Steve Yen
8f8333e01b scorch optimize zap Count()
This proposed approach avoids building a temporary AndNot() bitmap,
following the same kind of optimization used by mem segments.
2017-12-19 18:02:27 -08:00
Steve Yen
d0e4f85026 scorch avoid extra clone by using roaring.AndNot(x, y) 2017-12-19 13:37:04 -08:00
Steve Yen
f6b506134b import couchbase/vellum instead of couchbaselabs/vellum
Also, scrubbed an old couchbaselabs/moss reference in comments.

Also, go fmt.
2017-12-19 10:49:57 -08:00
Steve Yen
730d906a50 scorch reuses Posting instance in PostingsIterator.Next()
With this change, there are no more memory allocations in the calls to
PostingsIterator.Next() in the micro benchmarks of bleve-query.  On a
dev macbook, on an index of 50K wikipedia docs, using high frequency
search of "text:date"...

   400 qps - upsidedown/moss
   565 qps - scorch before
   680 qps - scorch after
2017-12-18 16:15:38 -08:00
Marty Schoch
b5aa4ed22b return err not panic 2017-12-14 17:41:02 -05:00
Marty Schoch
e1b0c61e2a fix bug in handling iterator-done 2017-12-13 22:08:06 -05:00
Steve Yen
c13ff85aaf scorch ref-counting
Future commits will provide actual cleanup when ref-counts reach 0.
2017-12-13 14:48:07 -08:00
Marty Schoch
a0e12b2640 add license to a few files missing it 2017-12-13 16:12:29 -05:00
Marty Schoch
85e15628ee major refactoring of posting details 2017-12-13 16:10:06 -05:00
Marty Schoch
6e2207c445 additional refactoring of build/merge 2017-12-13 15:22:13 -05:00
Marty Schoch
50441e5065 refactor to reuse shared code 2017-12-13 14:41:20 -05:00
Marty Schoch
289dc398bd more refacotring of build/merge 2017-12-13 14:26:11 -05:00
Marty Schoch
1cd3fd7fbe extrac common functionality between build/merge 2017-12-13 14:06:54 -05:00
Marty Schoch
f83c9f2a20 initial cut of merger that actually introduces changes 2017-12-13 13:41:03 -05:00
Marty Schoch
c15c3c11cd extra protection if dict address is 0 (empty segment) 2017-12-13 13:31:18 -05:00
Marty Schoch
57121e40a8 fix issues identified by errcheck 2017-12-12 11:41:14 -05:00
Marty Schoch
665c3c80ff initial cut of zap segment merging 2017-12-12 11:21:55 -05:00
Marty Schoch
927216df8c fix postings list count impl 2017-12-12 08:42:13 -05:00
Marty Schoch
58ef21a88a fix golint issue 2017-12-11 16:24:46 -05:00
Marty Schoch
f246e0e4c0 update README for zap file format changes 2017-12-11 16:22:29 -05:00
Marty Schoch
74b2eeb14d refactor where we do some work so we can return error 2017-12-11 15:59:36 -05:00
Marty Schoch
f13b786609 fix up issues to get all bleve unit tests passing for scorch
make scorch default
2017-12-11 15:47:41 -05:00
Marty Schoch
d7eb223e14 remove bolt segment format
upcomning breaking changes and no desire to maintain
2017-12-11 10:20:26 -05:00
Marty Schoch
8280859bb8 handle read-only and in-mem only cases 2017-12-11 09:07:01 -05:00
Marty Schoch
e8cc7ac0bf add new fields command to zap cmd-line util 2017-12-11 09:05:50 -05:00
Marty Schoch
dc0adc8827 add fsync 2017-12-09 20:52:01 -05:00
Marty Schoch
e0d9828cd0 add more detail to the readme 2017-12-09 14:42:36 -05:00
Marty Schoch
9781d9b089 add initial version of zap file format 2017-12-09 14:28:33 -05:00
Marty Schoch
ff2e6b98e4 added empty segment 2017-12-09 12:43:02 -05:00
Marty Schoch
adac4f41db initial version of scorch which persists index to disk 2017-12-06 18:33:47 -05:00
Marty Schoch
b1346b4c8a add readme describing our use of bolt as a segment format 2017-12-05 16:09:00 -05:00
Marty Schoch
898a6b1e85 fix errcheck issues 2017-12-05 13:32:57 -05:00