0
0
Commit Graph

606 Commits

Author SHA1 Message Date
Marty Schoch
f531a248e7
Merge pull request #749 from sreekanth-cb/zapfile_cleanup_fix
unblock the files for clean up, esp for merged new segment files
2018-02-08 10:53:41 -05:00
Steve Yen
a83ee0f364 scorch zap.MergeToWriter() takes SegmentBases instead of Segments
This change turns zap.MergeToWriter() into a public func, so that it's
now directly callable from outside packages (such as from scorch's
top-level merger or persister).  And, MergerToWriter() now takes input
of SegmentBases instead of Segments, so that it can now work on either
in-memory zap segments or file-based zap segments.

This is yet another stepping stone towards in-memory merging of zap
segments.
2018-02-07 14:38:13 -08:00
Steve Yen
8c2520d55c scorch zap optimize via postingsList reuse
pprof graphs were showing many postingsList allocations during
merging, so this change optimizes by reusing postingList memory in the
merging loops.
2018-02-07 14:33:20 -08:00
Steve Yen
03c8b2b7ec scorch mem segment optimizes DictEntry's across Next() calls
This change optimizes the scorch/mem DictionaryIterator by reusing a
DictEntry struct across multiple Next() calls.  This follows the same
optimization trick and Next() semantics as upsidedown's FieldDict
implementation.
2018-02-07 14:17:48 -08:00
Steve Yen
0dfd73d6cc scorch zap mergeStoredAndRemap loop optimization
This change avoids an array/slice access in a loop body.
2018-02-06 17:10:44 -08:00
Steve Yen
eb1d269521
Merge pull request #748 from steveyen/master
scorch zap merge related refactorings / optimizations
2018-02-06 07:52:17 -08:00
Steve Yen
fdb240f5f9 more zap merge-planner CalcBudget tests at larger sizes
Helps provide a sense of how # of segments grows as # of documents
grows.  Ex: 1B docs => budget of 54 segments.
2018-02-05 10:02:47 -08:00
Steve Yen
c09e2a08ca scorch zap chunkedContentCoder reuses chunk metadata slice memory
And, renamed the chunk MetaData.DocID field to DocNum for naming
correctness, where much of this commit is the mechanical effect of
that rename.
2018-02-05 07:39:16 -08:00
Steve Yen
3da191852d scorch zap tighten up prepareSegment()'s lock area 2018-02-05 07:39:16 -08:00
Steve Yen
6578655758 scorch zap refactored out mergeToWriter() func
This is a step towards supporting in-memory zap segment merging.
2018-02-05 07:39:16 -08:00
Steve Yen
eb21bf8315 scorch zap merge & build share persistStoredFieldValues()
Refactored out a helper func, persistStoredFieldValues(), that both
the persistence and merge codepaths now share.
2018-02-05 07:38:55 -08:00
Sreekanth Sivasankaran
9636209ae5
Update persister.go
comment updated
2018-02-05 20:49:30 +05:30
Sreekanth Sivasankaran
678c412157 unblock the files for clean up, esp for merged new segment files 2018-02-02 14:44:02 +05:30
Steve Yen
714f5321e0 scorch zap merge storedFieldVals inner loop optimization 2018-02-01 16:28:15 -08:00
Steve Yen
175f80403a
Merge pull request #747 from steveyen/master
scorch zap DictIterator term count fixed and more merge unit tests
2018-02-01 10:13:18 -08:00
Abhinav Dangeti
c24f8944c4
Merge pull request #738 from abhinavdangeti/scorch-stats
Add support for certain disk stats
2018-02-01 08:35:59 -08:00
Steve Yen
93b037cdbb scorch zap TestMergeWithUpdates() 2018-01-31 11:44:41 -08:00
Steve Yen
4dd64b68fa scorch zap TestMergeWithEmptySegment(s) 2018-01-30 22:27:40 -08:00
Steve Yen
684ee3c0e7 scorch zap DictIterator term count fixed and more merge unit tests
The zap DictionaryIterator Next() was incorrectly returning the
postingsList offset as the term count.  As part of this, refactored
out a PostingsList.read() helper method.

Also added more merge unit test scenarios, including merging a segment
for a few rounds to see if there are differences before/after merging.
2018-01-30 21:22:06 -08:00
Steve Yen
634cfa0560 scorch zap chunkedIntCoder optimization to prealloc some final buf 2018-01-29 11:03:53 -08:00
Steve Yen
a444c25ddf scorch zap merge uses array for docTermMap with no sorting
Instead of sorting docNum keys from a hashmap, this change instead
iterates from docNum 0 to N and uses an array instead of hashmap.
The array is also reused across outer loop iterations.

This optimizes for when there's a lot of structural similarity between
docs, where many/most docs have the same fields.  i.e., beers,
breweries.  If every doc has completely different fields, then this
change might produce worse behavior compared to the previous sparse
hashmap approach.
2018-01-29 10:47:08 -08:00
Steve Yen
745575a6c1 scorch zap mergeStoredAndRemap uses array indexing, not append()
Since we have right array size preallocated, we don't need the extra
capacity checking of append().
2018-01-27 11:35:10 -08:00
Steve Yen
8dd17a3b20 scorch zap mergeStoredAndRemap uses continue for less indentation 2018-01-27 11:35:10 -08:00
Steve Yen
0041664bc4 scorch zap merge computeNewDocCount() optimize 1 variable 2018-01-27 11:35:10 -08:00
Steve Yen
6985db13a0 scorch zap merge reuses docNumbers array 2018-01-27 11:35:10 -08:00
Steve Yen
916bbf4125 scorch zap merge prealloc's docTermMap capacity 2018-01-27 11:35:10 -08:00
Steve Yen
56cdb68f35 scorch zap merge checks err2 not err
Also, optimize the appending of the termSeparator so that the
docTermMap is accessed and updated just once.
2018-01-27 11:35:10 -08:00
Steve Yen
3030d4edb5 scorch zap merge preallocs segNewDocNums capacity 2018-01-27 11:35:10 -08:00
Steve Yen
9038d75c98 scorch zap allocate govarint.U64Base128Encoder just once
Instead of allocating a govarint.U64Base128Encoder in the inner loop,
allocate it just once on the outside, as it appears that it's just a
thin wrapper around binary.PutUvarint().
2018-01-27 11:35:10 -08:00
Steve Yen
10dd5489c2 scorch zap Dict.postingsList() takes []byte for more mem control
This allows callers that already have a []byte term to avoid
string'ification garbage.
2018-01-27 11:35:10 -08:00
Steve Yen
6a17ff48c7 scorch zap removed uneeded []byte cast of term 2018-01-27 11:35:10 -08:00
Steve Yen
d389e2bb40 scorch zap merge file cleanup on error, and some minor prealloc's 2018-01-27 11:35:10 -08:00
Steve Yen
29d526a7c2 scorch zap merge uses DefaultChunkFactor 2018-01-27 11:35:10 -08:00
Steve Yen
603425c2c5 scorch zap mergerLoop missing fireAsyncError case 2018-01-27 11:35:10 -08:00
Steve Yen
37121c3b49 scorch zap writeRoaringWithLen optimized with reused bufs 2018-01-27 11:35:10 -08:00
Steve Yen
5a035dc9aa scorch zap in-memory segment representation (SegmentBase)
The zap SegmentBase struct is a refactoring of the zap Segment into
the subset of fields that are needed for read-only ops, without any
persistence related info.  This allows us to use zap's optimized data
encoding as scorch's in-memory segments.

The zap Segment struct now embeds a zap SegmentBase struct, and layers
on persistence.  Both the zap Segment and zap SegmentBase implement
scorch's Segment interface.
2018-01-27 11:35:10 -08:00
Steve Yen
dc62324e02 scorch zap miscellaneous typos 2018-01-27 11:35:10 -08:00
abhinavdangeti
567d756c27 Add support for certain disk stats
+ num_bytes_used_disk
+ num_files_on_disk
2018-01-24 14:10:14 -08:00
Steve Yen
34fd77709f scorch unlocks in introduceSegment's DocNumbers() error codepath 2018-01-20 17:17:16 -08:00
abhinavdangeti
1176c73a9c Include overhead from data structures in segment's SizeInBytes
+ Account for all the overhead incurred from the data structures
  within mem.Segment and zap.Segment.
    - SizeOfMap = 8
    - SizeOfPointer = 8
    - SizeOfSlice = 24
    - SizeOfString = 16
+ Include overhead from certain new fields as well.
2018-01-17 11:11:44 -08:00
Steve Yen
71d6d1691b scorch zap optimizations of inner loops and easy preallocs 2018-01-15 23:04:23 -08:00
Steve Yen
d682c85a7b scorch mem segments uses backing array trick even more
This change invokes make() only once per distinct type to allocate the
large, contiguous backing arrays for the mem segment.
2018-01-15 19:17:39 -08:00
Steve Yen
0f19b542a3 scorch mem segment prealloc's Locfields/starts/ends/pos/arraypos
This change preallocates more of the backing arrays for Locfields,
Locstarts, Locends, Locpos, Locaaraypos sub-slices of a scorch mem
segment.

On small bleve-blast tests (50K wiki docs) on a dev macbook, scorch
indexing throughput seems to improve from 15MB/sec to 20MB/sec after
the recent series of preallocation changes.
2018-01-15 18:40:28 -08:00
Steve Yen
a84bd122d2 scorch mem segment preallocates sub-slices via # terms
This change tracks the number of terms per posting list to
preallocate the sub-slices for the Freqs & Norms.
2018-01-15 18:20:43 -08:00
Steve Yen
a4110d325c scorch mem segment preallocates slices that are key'ed by postingId
The scorch mem segment build phase uses the append() idiom to populate
various slices that are keyed by postings list id's.  These slices
include...

* Postings
* PostingsLocs
* Freqs
* Norms
* Locfields
* Locstarts
* Locends
* Locpos
* Locarraypos

This change introduces an initialization step that preallocates those
slices up-front, by assigning postings list id's to terms up-front.

This change also has an additional effect of simplifying the
processDocument() logic to no longer have to worry about a first-time
initialization case, removing some duplicate'ish code.
2018-01-15 16:53:39 -08:00
Steve Yen
917c470791 scorch mem segment VisitDocument() accesses StoredTypes/Pos outside of loop 2018-01-15 11:54:46 -08:00
Steve Yen
e7bd6026eb scorch mem segment preallocs docMap/fieldLens with capacity
The first time through, startNumFields should be 0, where there ought
to be more optimization assuming later docs have similar fields as the
first doc.
2018-01-15 11:52:20 -08:00
Steve Yen
d777d7c365 scorch mem segment comments consistency 2018-01-15 11:08:21 -08:00
Marty Schoch
4e82a8a0ca
Merge pull request #726 from sreekanth-cb/docValue_configs
DocValue Config, new API Changes
2018-01-10 18:11:18 -05:00
Sreekanth Sivasankaran
53aef2104e fixing err handling in UTs, name changes 2018-01-10 22:00:26 +05:30