0
0
Fork 0
Commit Graph

1598 Commits

Author SHA1 Message Date
Ethan Koenig 012d436dd7 Add UniqueTerm token filter 2018-01-16 22:24:51 -08:00
Steve Yen f4c3f984a4
Merge pull request #734 from steveyen/master
scorch mem segment optimizations
2018-01-16 08:57:02 -08:00
Marty Schoch 423d7dc4e4
Merge pull request #736 from ethantkoenig/readme
Fix coverage badge in README
2018-01-16 08:01:46 -05:00
Steve Yen 71d6d1691b scorch zap optimizations of inner loops and easy preallocs 2018-01-15 23:04:23 -08:00
Ethan Koenig d14b290235 Fix coverage badge in README 2018-01-15 22:23:41 -08:00
Steve Yen d682c85a7b scorch mem segments uses backing array trick even more
This change invokes make() only once per distinct type to allocate the
large, contiguous backing arrays for the mem segment.
2018-01-15 19:17:39 -08:00
Steve Yen 0f19b542a3 scorch mem segment prealloc's Locfields/starts/ends/pos/arraypos
This change preallocates more of the backing arrays for Locfields,
Locstarts, Locends, Locpos, Locaaraypos sub-slices of a scorch mem
segment.

On small bleve-blast tests (50K wiki docs) on a dev macbook, scorch
indexing throughput seems to improve from 15MB/sec to 20MB/sec after
the recent series of preallocation changes.
2018-01-15 18:40:28 -08:00
Steve Yen a84bd122d2 scorch mem segment preallocates sub-slices via # terms
This change tracks the number of terms per posting list to
preallocate the sub-slices for the Freqs & Norms.
2018-01-15 18:20:43 -08:00
Steve Yen a4110d325c scorch mem segment preallocates slices that are key'ed by postingId
The scorch mem segment build phase uses the append() idiom to populate
various slices that are keyed by postings list id's.  These slices
include...

* Postings
* PostingsLocs
* Freqs
* Norms
* Locfields
* Locstarts
* Locends
* Locpos
* Locarraypos

This change introduces an initialization step that preallocates those
slices up-front, by assigning postings list id's to terms up-front.

This change also has an additional effect of simplifying the
processDocument() logic to no longer have to worry about a first-time
initialization case, removing some duplicate'ish code.
2018-01-15 16:53:39 -08:00
Steve Yen 917c470791 scorch mem segment VisitDocument() accesses StoredTypes/Pos outside of loop 2018-01-15 11:54:46 -08:00
Steve Yen e7bd6026eb scorch mem segment preallocs docMap/fieldLens with capacity
The first time through, startNumFields should be 0, where there ought
to be more optimization assuming later docs have similar fields as the
first doc.
2018-01-15 11:52:20 -08:00
Steve Yen d777d7c365 scorch mem segment comments consistency 2018-01-15 11:08:21 -08:00
Marty Schoch 4d71e901e8 make new analyzers available to consumers of the config pkg
many tools and applications using bleve use the config pkg to
include support for many languages out of the box by forcing
import of optional packages.
2018-01-11 11:01:35 -05:00
Marty Schoch 4e82a8a0ca
Merge pull request #726 from sreekanth-cb/docValue_configs
DocValue Config, new API Changes
2018-01-10 18:11:18 -05:00
Marty Schoch 09a61a7a38 add analyzers for several languages
Having pure Go snowball stemmers allows us to add support for
many languages into the core of bleve.  Specifically we just
added: Russian, Danish, Finnish, Hungarian, Dutch, Norwegian,
Romanian, Swedish, Turkish
2018-01-10 16:00:29 -05:00
Marty Schoch e68b70aa82 Merge branch 'sokolovstas-ru_analyzer' 2018-01-10 15:16:45 -05:00
Marty Schoch a9532e510a refactor slightly to use our new hosted snowball stemmers
rather than having each package include it directly inside of
bleve, we have decide to host them all in one repo

https://github.com/blevesearch/snowballstem

this makes the easier for the rest of the community to use
outside of bleve contexts
2018-01-10 15:15:31 -05:00
Sreekanth Sivasankaran 53aef2104e fixing err handling in UTs, name changes 2018-01-10 22:00:26 +05:30
Marty Schoch af198c833f Merge branch 'ru_analyzer' of https://github.com/sokolovstas/bleve into sokolovstas-ru_analyzer 2018-01-10 10:29:15 -05:00
Marty Schoch b1a079fe57
Merge pull request #725 from tomkralidis/patch-1
fix minor typo
2018-01-10 09:56:02 -05:00
Abhinav Dangeti 9c73d37987
Merge pull request #730 from abhinavdangeti/scorch-master
Do not account mmap'ed part of zap segments in MemoryUsed
2018-01-09 09:56:24 -08:00
abhinavdangeti 43bfcc00c9 Do not account mmap'ed part of zap segments in MemoryUsed
This API is designed to only emit the dirty "unpersisted"
bytes only. This does not included the mmap'ed part in the
zap segments (disk).
2018-01-09 09:43:53 -08:00
Sreekanth Sivasankaran 4c256f5669 DocValue Config, new API Changes
-VisitableDocValueFields API for persisted DV field list
-making dv configs overridable at field level
-enabling on the fly/runtime un inverting of doc values
-few UT updates
2018-01-08 10:58:33 +05:30
Marty Schoch 1788a03803 remove junk from end of scorch readme 2018-01-06 21:09:53 -05:00
Tom Kralidis 637fad78a5
fix minor typo 2018-01-06 21:04:03 -05:00
Marty Schoch 0456569b62
Merge pull request #724 from blevesearch/scorch
Scorch
2018-01-05 17:24:46 -05:00
Marty Schoch 94b0367e47 switch back to upsidedown as default index before merge to master 2018-01-05 16:53:16 -05:00
Marty Schoch d3106493de
Merge pull request #723 from mschoch/add-async-error
add initial support for async error callback
2018-01-05 16:52:54 -05:00
Marty Schoch e756c7acf0 add initial support for async error callback 2018-01-05 16:43:16 -05:00
Marty Schoch 038880571c
Merge pull request #722 from mschoch/fix-event-race
fix race condition in setting up event callbacks
2018-01-05 13:55:18 -05:00
Marty Schoch 6237479605 fix race condition in setting up event callbacks
previous approach used SetEventCallback method which allowed
you to change the callback, unfotunately that also included
times after the goroutines were started and potentially firing
the callback.

checking lock on this would be too expensive, so instead we go
for an approach that allows callbacks to be registered by name
during process init(), then upon opening up an index a string
config key 'eventCallbackName' is used to look up the
appropriate callback function.  also, since this string config
name is serializable, it fits into the existing bleve index
metadata without any new issues.
2018-01-05 13:46:03 -05:00
Marty Schoch 57a075afdb improving command-line tool for scorch 2018-01-05 11:50:07 -05:00
Marty Schoch c691cd2bb5 refactor scorch/zap command-line tools under bleve
zap command-line tool added to main bleve command-line tool
this required physical relocation due to the vendoring used
only on the bleve command-line tool (unforseen limitation)

a new scorch command-line tool has also been introduced
and for the same reasons it is physically store under
the top-level bleve command-line tool as well
2018-01-05 10:17:18 -05:00
Abhinav Dangeti dee1dd9bc8
Merge pull request #720 from abhinavdangeti/scorch
Updated Rollback APIs
2018-01-04 14:51:33 -08:00
abhinavdangeti 111f0d0721 Updated Rollback APIs
New APIs:
+ RollbackPoints()
    - Retrieves the available list of rollback points: epoch+meta.
    - The application will need to check with the meta to decide
    on the rollback point.
+ Rollback()
    - API requires a rollback point identified by the first API.
    - Atomically & Durably rolls back the index to specified point,
    provided the specified rollback point is still available.
+ Unit test: TestIndexRollback
    - Writes a batch.
    - Sets the rollback point.
    - Writes second batch.
    - Rollback to previously decided point.
    - Ensure that data is as is before the second batch.
2018-01-04 13:21:58 -08:00
Marty Schoch 53a7575530
Merge pull request #621 from xomaczar/patch-1
Typo index.go
2018-01-04 10:52:38 -05:00
Marty Schoch 715454d94c
Merge pull request #643 from schwarmco/patch-1
typo in documentation
2018-01-04 10:51:55 -05:00
Marty Schoch 71cdac785d
Merge pull request #703 from sreekanth-cb/docValue_persisted
docValue persist changes
2018-01-04 10:34:58 -05:00
Sreekanth Sivasankaran 71a726bbf6 perf issue was due to duplicate fieldIDs getting
inserted to the list of dv enabled fields list -
DocValueFields in mem segment.
Moved back to the original type `DocValueFields map[uint16]bool`
for easy look up to check whether the fieldID is
configured for dv storage.
2018-01-04 15:34:55 +05:30
Sreekanth Sivasankaran f42ecb0ac7 docvalue "zap-path" cmd to print out the dv disk sizes 2018-01-04 13:58:51 +05:30
Marty Schoch 6ad679bbb5
Merge pull request #717 from mschoch/scorch-fix-refcounting
attempt to fix core reference counting issues
2018-01-03 09:32:40 -08:00
Marty Schoch 1a59a1bb99 attempt to fix core reference counting issues
Observed problem:

Persisted index state (in root bolt) would contain index snapshots which
pointed to index files that did not exist.

Debugging this uncovered two main problems:

1.  At the end of persisting a snapshot, the persister creates a new index
snapshot with the SAME epoch as the current root, only it replaces in-memory
segments with the new disk based ones.  This is problematic because reference
counting an index segment triggers "eligible for deletion".  And eligible for
deletion is keyed by epoch.  So having two separate instances going by the same
epoch is problematic.  Specifically, one of them gets to 0 before the other,
and we wrongly conclude it's eligible for deletion, when in fact the "other"
instance with same epoch is actually still in use.

To address this problem, we have modified the behavior of the persister.  Now,
upon completion of persistence, ONLY if new files were actually created do we
proceed to introduce a new snapshot.  AND, this new snapshot now gets it's own
brand new epoch.  BOTH of these are important because since the persister now
also introduces a new epoch, it will see this epoch again in the future AND be
expected to persist it.  That is OK (mostly harmless), but we cannot allow it
to form a loop.  Checking that new files were actually introduced is what
short-circuits the potential loop.  The new epoch introduced by the persister,
if seen again will not have any new segments that actually need persisting to
disk, and the cycle is stopped.

2.  The implementation of NumSnapshotsToKeep, and related code to deleted old
snapshots from the root bolt also contains problems.  Specifically, the
determination of which snapshots to keep vs delete did not consider which ones
were actually persisted.  So, lets say you had set NumSnapshotsToKeep to 3, if
the introducer gets 3 snapshots ahead of the persister, what can happen is that
the three snapshots we choose to keep are all in memory.  We now wrongly delete
all of the snapshots from the root bolt.  But it gets worse, in this instant of
time, we now have files on disk that nothing in the root bolt points to, so we
also go ahead and delete those files.  Those files were still being referenced
by the in-memory snapshots.  But, now even if they get persisted to disk, they
simply have references to non-existent files.  Opening up one of these indexes
results in lost data (often everything).

To address this problem, we made large change to the way this section of code
operates.  First, we now start with a list of all epochs actually persisted in
the root bolt.  Second, we set aside NumSnapshotsToKeep of these snapshots to
keep.  Third, anything else in the eligibleForRemoval list will be deleted.  I
suspect this code is slower and less elegant, but I think it is more correct.
Also, previously NumSnapshotsToKeep defaulted to 0, I have now defaulted it to
1, which feels like saner out-of-the-box behavior (though it's debatable if the
original intent was perhaps instead for "extra" snapshots to keep, but with the
variable named as it is, 1 makes more sense to me)

Other minor changes included in this change:

- Location of 'nextSnapshotEpoch', 'eligibleForRemoval', and
'ineligibleForRemoval' members of Scorch struct were moved into the
paragraph with 'rootLock' to clarify that you must hold the lock to access it.

- TestBatchRaceBug260 was updated to properly Close() the index, which leads to
occasional test failures.
2018-01-03 12:05:00 -05:00
Sreekanth Sivasankaran 448201243a removed redundant buf writer, and checks 2017-12-30 16:54:06 +05:30
Sreekanth Sivasankaran 61ba81e964 Merge branch 'scorch', remote-tracking branch 'origin' into docValue_persisted 2017-12-30 16:52:51 +05:30
Marty Schoch 29b63cfe43
Merge pull request #711 from abhinavdangeti/scorch3
Tracking memory consumption for a scorch index
2017-12-29 12:52:32 -08:00
Marty Schoch 780c3e9c43
Merge pull request #710 from abhinavdangeti/scorch2
Adding onEvent callback support for scorch
2017-12-29 09:29:24 -08:00
abhinavdangeti 5c26f5a86d Tracking memory consumption for a scorch index
+ Track memory usage at a segment level
+ Add a new scorch API: MemoryUsed()
    - Aggregate the memory consumption across
      segments when API is invoked.

+ TODO:
    - Revisit the second iteration if it can be gotten
      rid off, and the size accounted for during the first
      run while building an in-mem segment.
    - Accounting for pointer and slice overhead.
2017-12-29 10:20:11 -07:00
abhinavdangeti 055d3e12df Adding onEvent callback support for scorch
Event types:
- EventKindCloseStart
- EventKindClose
- EventKindMergerProgress
- EventKindPersisterProgress
- EventKindBatchIntroductionStart
- EventKindBatchIntroduction
2017-12-29 09:47:25 -07:00
Sreekanth Sivasankaran c8df014c0c Updated readme, zap version, added new docvalue cmd,
fixed the footer and fields cmd,
interface name updated
2017-12-29 21:39:29 +05:30
Marty Schoch a475ee886d
Merge pull request #705 from abhinavdangeti/scorch
Scorch specific stats
2017-12-28 13:57:30 -08:00