1887 Commits

Author SHA1 Message Date
Sreekanth Sivasankaran 678c412157 unblock the files for clean up, esp for merged new segment files 2018-02-02 14:44:02 +05:30
Steve Yen 714f5321e0 scorch zap merge storedFieldVals inner loop optimization 2018-02-01 16:28:15 -08:00
Abhinav Dangeti ff210fbc6d
Merge pull request #744 from abhinavdangeti/geopoint-fix
MB-26396: Handling docs with geopoints in slice format
2018-02-01 10:20:06 -08:00
Steve Yen 175f80403a
Merge pull request #747 from steveyen/master
scorch zap DictIterator term count fixed and more merge unit tests
2018-02-01 10:13:18 -08:00
Abhinav Dangeti c24f8944c4
Merge pull request #738 from abhinavdangeti/scorch-stats
Add support for certain disk stats
2018-02-01 08:35:59 -08:00
Steve Yen 93b037cdbb scorch zap TestMergeWithUpdates() 2018-01-31 11:44:41 -08:00
Steve Yen 4dd64b68fa scorch zap TestMergeWithEmptySegment(s) 2018-01-30 22:27:40 -08:00
Steve Yen 684ee3c0e7 scorch zap DictIterator term count fixed and more merge unit tests
The zap DictionaryIterator Next() was incorrectly returning the
postingsList offset as the term count.  As part of this, refactored
out a PostingsList.read() helper method.

Also added more merge unit test scenarios, including merging a segment
for a few rounds to see if there are differences before/after merging.
2018-01-30 21:22:06 -08:00
abhinavdangeti 6451c8c37f MB-26396: Handling documents with geopoints in slice format
+ The issue lies with parsing documents containing a geopoint
  in slice format - which wasn't handled.
+ Unit test that verifies the fix.
2018-01-29 18:31:56 -08:00
Steve Yen a3b125508b
Merge pull request #746 from steveyen/master
more scorch zap optimizations (array for docTermMap, etc)
2018-01-29 15:50:04 -08:00
Steve Yen 634cfa0560 scorch zap chunkedIntCoder optimization to prealloc some final buf 2018-01-29 11:03:53 -08:00
Steve Yen a444c25ddf scorch zap merge uses array for docTermMap with no sorting
Instead of sorting docNum keys from a hashmap, this change instead
iterates from docNum 0 to N and uses an array instead of hashmap.
The array is also reused across outer loop iterations.

This optimizes for when there's a lot of structural similarity between
docs, where many/most docs have the same fields (e.g., beers,
breweries).  If every doc has completely different fields, then this
change might produce worse behavior compared to the previous sparse
hashmap approach.
2018-01-29 10:47:08 -08:00
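A minimal sketch of the trade-off described in the commit above, not the actual zap merge code. The function names and the [][]byte layout are assumptions made for illustration: visitSparse collects and sorts docNum keys from a map, while visitDense walks a docNum-indexed slice from 0 to N and needs no sort.

```go
package main

import "sort"

// visitSparse mirrors the older idea: docNums live as map keys, so they
// must be collected and sorted before they can be visited in order.
func visitSparse(docTerms map[uint64][]byte) []uint64 {
	keys := make([]uint64, 0, len(docTerms))
	for docNum := range docTerms {
		keys = append(keys, docNum)
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })
	return keys
}

// visitDense mirrors the newer idea: the slice index is the docNum, so
// iterating 0..N is already ordered. Empty entries are skipped.
func visitDense(docTerms [][]byte) []uint64 {
	order := make([]uint64, 0, len(docTerms))
	for docNum, terms := range docTerms {
		if len(terms) == 0 {
			continue
		}
		order = append(order, uint64(docNum))
	}
	return order
}

// resetDense truncates every entry so the same slice can be reused
// across outer loop iterations without reallocating.
func resetDense(docTerms [][]byte) {
	for i := range docTerms {
		docTerms[i] = docTerms[i][:0]
	}
}
```

The dense form pays for every docNum slot, including empty ones, which is why it is expected to do worse when few docs share a given field.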
Steve Yen 5d1a2b0ad7
Merge pull request #743 from steveyen/master
zap-based in-memory segment impl & various merge optimizations
2018-01-29 09:22:12 -08:00
Steve Yen 745575a6c1 scorch zap mergeStoredAndRemap uses array indexing, not append()
Since we have the right array size preallocated, we don't need the extra
capacity checking of append().
2018-01-27 11:35:10 -08:00
Steve Yen 8dd17a3b20 scorch zap mergeStoredAndRemap uses continue for less indentation 2018-01-27 11:35:10 -08:00
Steve Yen 0041664bc4 scorch zap merge computeNewDocCount() optimize 1 variable 2018-01-27 11:35:10 -08:00
Steve Yen 6985db13a0 scorch zap merge reuses docNumbers array 2018-01-27 11:35:10 -08:00
Steve Yen 916bbf4125 scorch zap merge prealloc's docTermMap capacity 2018-01-27 11:35:10 -08:00
Steve Yen 56cdb68f35 scorch zap merge checks err2 not err
Also, optimize the appending of the termSeparator so that the
docTermMap is accessed and updated just once.
2018-01-27 11:35:10 -08:00
Steve Yen 3030d4edb5 scorch zap merge preallocs segNewDocNums capacity 2018-01-27 11:35:10 -08:00
Steve Yen 9038d75c98 scorch zap allocate govarint.U64Base128Encoder just once
Instead of allocating a govarint.U64Base128Encoder in the inner loop,
allocate it just once on the outside, as it appears that it's just a
thin wrapper around binary.PutUvarint().
2018-01-27 11:35:10 -08:00
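A small sketch of the same hoisting pattern using only the standard library. It does not use govarint; encodeAll and its scratch buffer are hypothetical names chosen for illustration. The buffer is allocated once outside the loop and reused for every binary.PutUvarint call.

```go
package main

import "encoding/binary"

// encodeAll appends the unsigned varint encoding of each value to one
// output slice. The scratch buffer is allocated once, outside the loop,
// instead of once per value.
func encodeAll(vals []uint64) []byte {
	scratch := make([]byte, binary.MaxVarintLen64)
	out := make([]byte, 0, len(vals)*2)
	for _, v := range vals {
		n := binary.PutUvarint(scratch, v)
		out = append(out, scratch[:n]...)
	}
	return out
}
```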
Steve Yen 10dd5489c2 scorch zap Dict.postingsList() takes []byte for more mem control
This allows callers that already have a []byte term to avoid
string'ification garbage.
2018-01-27 11:35:10 -08:00
Steve Yen 6a17ff48c7 scorch zap removed unneeded []byte cast of term 2018-01-27 11:35:10 -08:00
Steve Yen d389e2bb40 scorch zap merge file cleanup on error, and some minor prealloc's 2018-01-27 11:35:10 -08:00
Steve Yen 29d526a7c2 scorch zap merge uses DefaultChunkFactor 2018-01-27 11:35:10 -08:00
Steve Yen 603425c2c5 scorch zap mergerLoop missing fireAsyncError case 2018-01-27 11:35:10 -08:00
Steve Yen 37121c3b49 scorch zap writeRoaringWithLen optimized with reused bufs 2018-01-27 11:35:10 -08:00
Steve Yen 5a035dc9aa scorch zap in-memory segment representation (SegmentBase)
The zap SegmentBase struct is a refactoring of the zap Segment into
the subset of fields that are needed for read-only ops, without any
persistence related info.  This allows us to use zap's optimized data
encoding as scorch's in-memory segments.

The zap Segment struct now embeds a zap SegmentBase struct, and layers
on persistence.  Both the zap Segment and zap SegmentBase implement
scorch's Segment interface.
2018-01-27 11:35:10 -08:00
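The embedding arrangement described above can be sketched roughly as follows. All type names, fields, and the single Count method are stand-ins chosen for illustration, not the real scorch or zap definitions.

```go
package main

// segmentAPI is a hypothetical stand-in for scorch's Segment interface.
type segmentAPI interface {
	Count() uint64
}

// segmentBase holds only what read-only operations need (assumed field).
type segmentBase struct {
	numDocs uint64
}

// Count is a read-only operation answered entirely from in-memory data.
func (sb *segmentBase) Count() uint64 { return sb.numDocs }

// persistedSegment embeds segmentBase and layers persistence details on
// top; the promoted Count method makes it satisfy segmentAPI as well.
type persistedSegment struct {
	segmentBase
	path string // assumed persistence-related field
}

// Compile-time checks that both types implement the interface.
var (
	_ segmentAPI = (*segmentBase)(nil)
	_ segmentAPI = (*persistedSegment)(nil)
)
```

Because the read-only methods live on the embedded struct, an in-memory segment and a persisted one share one implementation of them.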
Steve Yen dc62324e02 scorch zap miscellaneous typos 2018-01-27 11:35:10 -08:00
abhinavdangeti 567d756c27 Add support for certain disk stats
+ num_bytes_used_disk
+ num_files_on_disk
2018-01-24 14:10:14 -08:00
Marty Schoch 0fc9b4b74a
Merge pull request #742 from steveyen/scorch-unlock-needed
scorch unlocks in introduceSegment's DocNumbers() error codepath
2018-01-23 12:09:23 -05:00
Steve Yen 34fd77709f scorch unlocks in introduceSegment's DocNumbers() error codepath 2018-01-20 17:17:16 -08:00
Marty Schoch cb6391e75e
Merge pull request #733 from abhinavdangeti/scorch-segment-sizeinbytes
Include overhead from data structures in segment's SizeInBytes
2018-01-19 09:10:03 -05:00
Marty Schoch 5a812ee9ce
Merge pull request #732 from sreekanth-cb/facet_merge
MB-27498 - date range facet query panics
2018-01-19 09:02:57 -05:00
Sreekanth Sivasankaran 47f1c66889 adding UT 2018-01-19 11:47:28 +05:30
abhinavdangeti 1176c73a9c Include overhead from data structures in segment's SizeInBytes
+ Account for all the overhead incurred from the data structures
  within mem.Segment and zap.Segment.
    - SizeOfMap = 8
    - SizeOfPointer = 8
    - SizeOfSlice = 24
    - SizeOfString = 16
+ Include overhead from certain new fields as well.
2018-01-17 11:11:44 -08:00
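A rough sketch of how such fixed overheads might be folded into a size estimate. The constants are the values listed in the commit message (they match 64-bit Go header sizes); toySegment and its fields are hypothetical and far simpler than mem.Segment or zap.Segment.

```go
package main

// Overhead constants as listed in the commit message.
const (
	sizeOfMap     = 8
	sizeOfPointer = 8
	sizeOfSlice   = 24
	sizeOfString  = 16
)

// toySegment is a hypothetical structure used only for this estimate.
type toySegment struct {
	fields map[string]uint16
	names  []string
	path   string
	parent *toySegment
}

// sizeInBytes adds the header overhead of each field to the payload it
// references. It is an estimate, not an exact memory measurement.
func (s *toySegment) sizeInBytes() int {
	total := sizeOfMap + sizeOfSlice + sizeOfString + sizeOfPointer
	for k := range s.fields {
		total += sizeOfString + len(k) + 2 // key header, key bytes, uint16 value
	}
	for _, n := range s.names {
		total += sizeOfString + len(n)
	}
	total += len(s.path)
	return total
}
```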
Marty Schoch 44c371582a
Merge pull request #739 from ethantkoenig/unique_token_filter
Add UniqueTerm token filter
2018-01-17 13:10:10 -05:00
Ethan Koenig 012d436dd7 Add UniqueTerm token filter 2018-01-16 22:24:51 -08:00
Steve Yen f4c3f984a4
Merge pull request #734 from steveyen/master
scorch mem segment optimizations
2018-01-16 08:57:02 -08:00
Marty Schoch 423d7dc4e4
Merge pull request #736 from ethantkoenig/readme
Fix coverage badge in README
2018-01-16 08:01:46 -05:00
Steve Yen 71d6d1691b scorch zap optimizations of inner loops and easy preallocs 2018-01-15 23:04:23 -08:00
Ethan Koenig d14b290235 Fix coverage badge in README 2018-01-15 22:23:41 -08:00
Steve Yen d682c85a7b scorch mem segments uses backing array trick even more
This change invokes make() only once per distinct type to allocate the
large, contiguous backing arrays for the mem segment.
2018-01-15 19:17:39 -08:00
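One way to picture the backing array trick is a single make() call that is then carved into sub-slices. The carve helper below is a hypothetical illustration under that assumption, not code from the mem segment.

```go
package main

// carve allocates one contiguous backing array and splits it into
// sub-slices of the requested lengths. The three-index slice expression
// caps each sub-slice so an append cannot overwrite its neighbor.
func carve(lens []int) [][]uint64 {
	total := 0
	for _, n := range lens {
		total += n
	}
	backing := make([]uint64, total) // the only large allocation
	out := make([][]uint64, len(lens))
	off := 0
	for i, n := range lens {
		out[i] = backing[off : off+n : off+n]
		off += n
	}
	return out
}
```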
Steve Yen 0f19b542a3 scorch mem segment prealloc's Locfields/starts/ends/pos/arraypos
This change preallocates more of the backing arrays for Locfields,
Locstarts, Locends, Locpos, Locarraypos sub-slices of a scorch mem
segment.

On small bleve-blast tests (50K wiki docs) on a dev macbook, scorch
indexing throughput seems to improve from 15MB/sec to 20MB/sec after
the recent series of preallocation changes.
2018-01-15 18:40:28 -08:00
Steve Yen a84bd122d2 scorch mem segment preallocates sub-slices via # terms
This change tracks the number of terms per posting list to
preallocate the sub-slices for the Freqs & Norms.
2018-01-15 18:20:43 -08:00
Steve Yen a4110d325c scorch mem segment preallocates slices that are key'ed by postingId
The scorch mem segment build phase uses the append() idiom to populate
various slices that are keyed by postings list id's.  These slices
include...

* Postings
* PostingsLocs
* Freqs
* Norms
* Locfields
* Locstarts
* Locends
* Locpos
* Locarraypos

This change introduces an initialization step that preallocates those
slices up-front, by assigning postings list id's to terms up-front.

This change also has an additional effect of simplifying the
processDocument() logic to no longer have to worry about a first-time
initialization case, removing some duplicate'ish code.
2018-01-15 16:53:39 -08:00
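A simplified sketch of the two-pass idea described above: assign postings list ids to terms first, count how large each list will be, then preallocate before filling. buildPostings and its input shape are assumptions for illustration, covering only a single Postings-like slice.

```go
package main

// buildPostings assigns an id to each distinct term in a first pass,
// counting how many entries each postings list will hold. It then
// preallocates every list at its final capacity, so the fill pass never
// regrows a slice and needs no first-time initialization branch.
func buildPostings(docs [][]string) (map[string]int, [][]uint64) {
	ids := make(map[string]int)
	var counts []int
	for _, doc := range docs {
		for _, term := range doc {
			id, seen := ids[term]
			if !seen {
				id = len(counts)
				ids[term] = id
				counts = append(counts, 0)
			}
			counts[id]++
		}
	}

	postings := make([][]uint64, len(counts))
	for id, n := range counts {
		postings[id] = make([]uint64, 0, n)
	}

	for docNum, doc := range docs {
		for _, term := range doc {
			id := ids[term]
			postings[id] = append(postings[id], uint64(docNum))
		}
	}
	return ids, postings
}
```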
Steve Yen 917c470791 scorch mem segment VisitDocument() accesses StoredTypes/Pos outside of loop 2018-01-15 11:54:46 -08:00
Steve Yen e7bd6026eb scorch mem segment preallocs docMap/fieldLens with capacity
The first time through, startNumFields should be 0, where there ought
to be more optimization assuming later docs have similar fields as the
first doc.
2018-01-15 11:52:20 -08:00
Steve Yen d777d7c365 scorch mem segment comments consistency 2018-01-15 11:08:21 -08:00
Marty Schoch 4d71e901e8 make new analyzers available to consumers of the config pkg
many tools and applications using bleve use the config pkg to
include support for many languages out of the box by forcing
import of optional packages.
2018-01-11 11:01:35 -05:00