bleve

Author	SHA1	Message	Date
Steve Yen	95a4f37e5c	scorch zap enumerator impl that joins multiple vellum iterators Unlike vellum's MergeIterator, the enumerator introduced in this commit doesn't merge when there are matching keys across iterators. Instead, the enumerator implementation provides a traversal of all the tuples of (key, iteratorIndex, val) from the underlying vellum iterators, ordered by key ASC, iteratorIndex ASC.	2018-02-12 20:54:06 -08:00
Steve Yen	e37c563c56	scorch zap merge move fieldDvLocsOffset var declaration Move the var declaration nearer to where it's used.	2018-02-08 18:03:09 -08:00
Steve Yen	f177f07613	scorch zap segment merging reuses prealloc'ed PostingsIterator During zap segment merging, a new zap PostingsIterator was allocated for every field X segment X term. This change optimizes by reusing a single PostingsIterator instance per persistMergedRest() invocation. And, also unused fields are removed from the PostingsIterator.	2018-02-08 17:24:30 -08:00
Steve Yen	ed4826b189	scorch zap merge optimization to byte-copy storedDocs The optimization to byte-copy all the storedDocs for a given segment during merging kicks in when the fields are the same across all segments and when there are no deletions for that given segment. This can happen, for example, during data loading or insert-only scenarios. As part of this commit, the Segment.copyStoredDocs() method was added, which uses a single Write() call to copy all the stored docs bytes of a segment to a writer in one shot. And, getDocStoredMetaAndCompressed() was refactored into a related helper function, getDocStoredOffsets(), which provides the storedDocs metadata (offsets & lengths) for a doc.	2018-02-08 09:08:35 -08:00
Steve Yen	0b50a20cac	scorch zap move docDropped const to earlier in file	2018-02-08 09:06:31 -08:00
Steve Yen	822457542e	scorch zap VERSION bump: check whether fields are the same at merge COMPATIBILITY NOTE: scorch zap version bumped in this commit. The version bump is because mergeFields() now computes whether fields are the same across segments and it relies on the previous commit where fieldID's are assigned in field name sorted order (albeit with _id field always having fieldID of 0). Potential future commits might rely on this info that "fields are the same across segments" for more optimizations, etc.	2018-02-08 09:06:30 -08:00
Steve Yen	ffdeb8055e	scorch sorts fields by name to assign fieldID's This is a stepping stone to allow easier future comparisons of field maps and potential merge optimizations. In bleve-blast tests on a 2015 macbook (50K wikipedia docs, 8 indexers, batch size 100, ssd), this does not seem to have a distinct effect on indexing throughput.	2018-02-08 09:06:30 -08:00
Steve Yen	a83ee0f364	scorch zap.MergeToWriter() takes SegmentBases instead of Segments This change turns zap.MergeToWriter() into a public func, so that it's now directly callable from outside packages (such as from scorch's top-level merger or persister). And, MergeToWriter() now takes input of SegmentBases instead of Segments, so that it can now work on either in-memory zap segments or file-based zap segments. This is yet another stepping stone towards in-memory merging of zap segments.	2018-02-07 14:38:13 -08:00
Steve Yen	8c2520d55c	scorch zap optimize via postingsList reuse pprof graphs were showing many postingsList allocations during merging, so this change optimizes by reusing postingsList memory in the merging loops.	2018-02-07 14:33:20 -08:00
Steve Yen	0dfd73d6cc	scorch zap mergeStoredAndRemap loop optimization This change avoids an array/slice access in a loop body.	2018-02-06 17:10:44 -08:00
Steve Yen	c09e2a08ca	scorch zap chunkedContentCoder reuses chunk metadata slice memory And, renamed the chunk MetaData.DocID field to DocNum for naming correctness, where much of this commit is the mechanical effect of that rename.	2018-02-05 07:39:16 -08:00
Steve Yen	6578655758	scorch zap refactored out mergeToWriter() func This is a step towards supporting in-memory zap segment merging.	2018-02-05 07:39:16 -08:00
Steve Yen	eb21bf8315	scorch zap merge & build share persistStoredFieldValues() Refactored out a helper func, persistStoredFieldValues(), that both the persistence and merge codepaths now share.	2018-02-05 07:38:55 -08:00
Steve Yen	714f5321e0	scorch zap merge storedFieldVals inner loop optimization	2018-02-01 16:28:15 -08:00
Steve Yen	93b037cdbb	scorch zap TestMergeWithUpdates()	2018-01-31 11:44:41 -08:00
Steve Yen	4dd64b68fa	scorch zap TestMergeWithEmptySegment(s)	2018-01-30 22:27:40 -08:00
Steve Yen	684ee3c0e7	scorch zap DictIterator term count fixed and more merge unit tests The zap DictionaryIterator Next() was incorrectly returning the postingsList offset as the term count. As part of this, refactored out a PostingsList.read() helper method. Also added more merge unit test scenarios, including merging a segment for a few rounds to see if there are differences before/after merging.	2018-01-30 21:22:06 -08:00
Steve Yen	634cfa0560	scorch zap chunkedIntCoder optimization to prealloc some final buf	2018-01-29 11:03:53 -08:00
Steve Yen	a444c25ddf	scorch zap merge uses array for docTermMap with no sorting Instead of sorting docNum keys from a hashmap, this change instead iterates from docNum 0 to N and uses an array instead of hashmap. The array is also reused across outer loop iterations. This optimizes for when there's a lot of structural similarity between docs, where many/most docs have the same fields. i.e., beers, breweries. If every doc has completely different fields, then this change might produce worse behavior compared to the previous sparse hashmap approach.	2018-01-29 10:47:08 -08:00
Steve Yen	745575a6c1	scorch zap mergeStoredAndRemap uses array indexing, not append() Since we have right array size preallocated, we don't need the extra capacity checking of append().	2018-01-27 11:35:10 -08:00
Steve Yen	8dd17a3b20	scorch zap mergeStoredAndRemap uses continue for less indentation	2018-01-27 11:35:10 -08:00
Steve Yen	0041664bc4	scorch zap merge computeNewDocCount() optimize 1 variable	2018-01-27 11:35:10 -08:00
Steve Yen	6985db13a0	scorch zap merge reuses docNumbers array	2018-01-27 11:35:10 -08:00
Steve Yen	916bbf4125	scorch zap merge prealloc's docTermMap capacity	2018-01-27 11:35:10 -08:00
Steve Yen	56cdb68f35	scorch zap merge checks err2 not err Also, optimize the appending of the termSeparator so that the docTermMap is accessed and updated just once.	2018-01-27 11:35:10 -08:00
Steve Yen	3030d4edb5	scorch zap merge preallocs segNewDocNums capacity	2018-01-27 11:35:10 -08:00
Steve Yen	9038d75c98	scorch zap allocate govarint.U64Base128Encoder just once Instead of allocating a govarint.U64Base128Encoder in the inner loop, allocate it just once on the outside, as it appears that it's just a thin wrapper around binary.PutUvarint().	2018-01-27 11:35:10 -08:00
Steve Yen	10dd5489c2	scorch zap Dict.postingsList() takes []byte for more mem control This allows callers that already have a []byte term to avoid string'ification garbage.	2018-01-27 11:35:10 -08:00
Steve Yen	6a17ff48c7	scorch zap removed unneeded []byte cast of term	2018-01-27 11:35:10 -08:00
Steve Yen	d389e2bb40	scorch zap merge file cleanup on error, and some minor prealloc's	2018-01-27 11:35:10 -08:00
Steve Yen	37121c3b49	scorch zap writeRoaringWithLen optimized with reused bufs	2018-01-27 11:35:10 -08:00
Steve Yen	5a035dc9aa	scorch zap in-memory segment representation (SegmentBase) The zap SegmentBase struct is a refactoring of the zap Segment into the subset of fields that are needed for read-only ops, without any persistence related info. This allows us to use zap's optimized data encoding as scorch's in-memory segments. The zap Segment struct now embeds a zap SegmentBase struct, and layers on persistence. Both the zap Segment and zap SegmentBase implement scorch's Segment interface.	2018-01-27 11:35:10 -08:00
Steve Yen	dc62324e02	scorch zap miscellaneous typos	2018-01-27 11:35:10 -08:00
abhinavdangeti	1176c73a9c	Include overhead from data structures in segment's SizeInBytes + Account for all the overhead incurred from the data structures within mem.Segment and zap.Segment. - SizeOfMap = 8 - SizeOfPointer = 8 - SizeOfSlice = 24 - SizeOfString = 16 + Include overhead from certain new fields as well.	2018-01-17 11:11:44 -08:00
Steve Yen	71d6d1691b	scorch zap optimizations of inner loops and easy preallocs	2018-01-15 23:04:23 -08:00
Marty Schoch	4e82a8a0ca	Merge pull request #726 from sreekanth-cb/docValue_configs DocValue Config, new API Changes	2018-01-10 18:11:18 -05:00
Sreekanth Sivasankaran	53aef2104e	fixing err handling in UTs, name changes	2018-01-10 22:00:26 +05:30
abhinavdangeti	43bfcc00c9	Do not account mmap'ed part of zap segments in MemoryUsed This API is designed to emit only the dirty "unpersisted" bytes. This does not include the mmap'ed part in the zap segments (disk).	2018-01-09 09:43:53 -08:00
Sreekanth Sivasankaran	4c256f5669	DocValue Config, new API Changes -VisitableDocValueFields API for persisted DV field list -making dv configs overridable at field level -enabling on the fly/runtime un inverting of doc values -few UT updates	2018-01-08 10:58:33 +05:30
Marty Schoch	c691cd2bb5	refactor scorch/zap command-line tools under bleve The zap command-line tool was added to the main bleve command-line tool; this required physical relocation due to the vendoring used only on the bleve command-line tool (unforeseen limitation). A new scorch command-line tool has also been introduced, and for the same reasons it is physically stored under the top-level bleve command-line tool as well.	2018-01-05 10:17:18 -05:00
Sreekanth Sivasankaran	71a726bbf6	perf issue was due to duplicate fieldIDs getting inserted to the list of dv enabled fields list - DocValueFields in mem segment. Moved back to the original type `DocValueFields map[uint16]bool` for easy look up to check whether the fieldID is configured for dv storage.	2018-01-04 15:34:55 +05:30
Sreekanth Sivasankaran	f42ecb0ac7	docvalue "zap-path" cmd to print out the dv disk sizes	2018-01-04 13:58:51 +05:30
Sreekanth Sivasankaran	448201243a	removed redundant buf writer, and checks	2017-12-30 16:54:06 +05:30
Sreekanth Sivasankaran	61ba81e964	Merge branch 'scorch', remote-tracking branch 'origin' into docValue_persisted	2017-12-30 16:52:51 +05:30
abhinavdangeti	5c26f5a86d	Tracking memory consumption for a scorch index + Track memory usage at a segment level + Add a new scorch API: MemoryUsed() - Aggregate the memory consumption across segments when API is invoked. + TODO: - Revisit the second iteration if it can be gotten rid of, and the size accounted for during the first run while building an in-mem segment. - Accounting for pointer and slice overhead.	2017-12-29 10:20:11 -07:00
Sreekanth Sivasankaran	c8df014c0c	Updated readme, zap version, added new docvalue cmd, fixed the footer and fields cmd, interface name updated	2017-12-29 21:39:29 +05:30
Sreekanth Sivasankaran	8abac42796	errCheck fixes	2017-12-28 13:23:57 +05:30
Sreekanth Sivasankaran	0272451093	adding checks for robustness	2017-12-28 13:05:25 +05:30
Sreekanth Sivasankaran	76f827f469	docValue persist changes docValues are persisted along with the index, in a columnar fashion per field with variable sized chunking for quick look up. -naive chunk level caching is added per field -data part inside a chunk is snappy compressed -metaHeader inside the chunk indexes the dv values inside the uncompressed data part -all the fields are docValue persisted in this iteration	2017-12-28 12:05:33 +05:30
Steve Yen	67e0e5973b	scorch mergeStoredAndRemap() memory reuse In mergeStoredAndRemap(), instead of allocating new hashmaps for each document, this commit reuses some arrays that are indexed by fieldId.	2017-12-20 15:18:22 -08:00
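
The enumerator commit at the top of this log (95a4f37e5c) describes a traversal of (key, iteratorIndex, val) tuples across several iterators, ordered by key ASC and then iteratorIndex ASC, without merging matching keys. The following is a minimal sketch of that ordering only, not the code from the commit: the `entry`, `tuple`, and `enumerate` names are hypothetical, plain sorted slices stand in for vellum iterators, and keys are assumed to be unique within each source.

```go
package main

import (
	"bytes"
	"fmt"
)

// entry is one key/value pair from a key-sorted source.
// It is a hypothetical stand-in for the current position of a vellum iterator.
type entry struct {
	key []byte
	val uint64
}

// tuple is one element of the enumeration.
type tuple struct {
	key      string
	itrIndex int
	val      uint64
}

// enumerate walks several key-sorted sources and returns every
// (key, itrIndex, val) tuple, ordered by key ASC, then itrIndex ASC.
// Matching keys across sources are NOT merged; each source's value is kept.
func enumerate(sources [][]entry) []tuple {
	pos := make([]int, len(sources)) // current position in each source
	var out []tuple
	for {
		// Find the lowest key among the current positions.
		var low []byte
		found := false
		for i, src := range sources {
			if pos[i] >= len(src) {
				continue // this source is exhausted
			}
			k := src[pos[i]].key
			if !found || bytes.Compare(k, low) < 0 {
				low, found = k, true
			}
		}
		if !found {
			return out // every source is exhausted
		}
		// Emit one tuple per source positioned at the lowest key,
		// in ascending source index order, and advance those sources.
		for i, src := range sources {
			if pos[i] < len(src) && bytes.Equal(src[pos[i]].key, low) {
				out = append(out, tuple{string(low), i, src[pos[i]].val})
				pos[i]++
			}
		}
	}
}

func main() {
	sources := [][]entry{
		{{[]byte("a"), 1}, {[]byte("c"), 3}},
		{{[]byte("a"), 10}, {[]byte("b"), 20}},
	}
	for _, t := range enumerate(sources) {
		fmt.Printf("%s %d %d\n", t.key, t.itrIndex, t.val)
	}
}
```

With the two sample sources in `main`, the key "a" appears once per source rather than being combined into a single merged value, which is the difference from a merging iterator that the commit message points out.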

79 Commits