Commit Graph

290 Commits

Author SHA1 Message Date
Steve Yen
7ae696d661 firestorm lookuper notified via batch
Previously, the firestorm.Batch() would notify the lookuper goroutine
on a document by document basis.  If the lookuper input channel became
full, then that would block the firestorm.Batch() operation.

With this change, lookuper is notified once, with a "batch" that is an
[]*InFlightItem.

This change also reuses that same []*InFlightItem to invoke the
compensator.MutateBatch().

This also has the advantage of only converting the docIDs from string
to []byte just once, outside of the lock that's used by the
compensator.

Micro-benchmark of this change with null-firestorm bleve-blast does
not show large impact, neither degradation nor improvement.
2016-01-02 12:21:24 -08:00
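
The batch notification described in the commit above can be sketched as follows. This is a minimal illustration, not the firestorm code: only the InFlightItem name comes from the commit message, while its fields and the lookuper and notifyBatch functions are assumptions made for the example.

```go
package main

import "fmt"

// InFlightItem is named in the commit message; the fields here are
// assumptions chosen for this sketch.
type InFlightItem struct {
	docID  []byte
	docNum uint64
}

// lookuper is a hypothetical consumer: it receives whole batches and
// reports how many items it saw once the channel is closed.
func lookuper(in <-chan []*InFlightItem, done chan<- int) {
	seen := 0
	for batch := range in {
		seen += len(batch)
	}
	done <- seen
}

// notifyBatch converts each docID from string to []byte exactly once and
// sends a single message for the whole batch, so the sender can block at
// most once per batch instead of once per document.
func notifyBatch(in chan<- []*InFlightItem, docIDs []string, firstDocNum uint64) []*InFlightItem {
	items := make([]*InFlightItem, 0, len(docIDs))
	for i, id := range docIDs {
		items = append(items, &InFlightItem{
			docID:  []byte(id),
			docNum: firstDocNum + uint64(i),
		})
	}
	in <- items
	return items // the same slice can be handed to later stages
}

func main() {
	in := make(chan []*InFlightItem, 1)
	done := make(chan int)
	go lookuper(in, done)

	items := notifyBatch(in, []string{"a", "b", "c"}, 1)
	close(in)
	fmt.Println(len(items), <-done)
}
```

Returning the slice from notifyBatch mirrors the commit's point that the same []*InFlightItem is reused for a later step.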
Steve Yen
38d50ed8b5 renamed var to docsUpdated to match docsDeleted naming 2016-01-02 12:21:24 -08:00
Steve Yen
3feeb14b7d firestorm.batchRows reuses buf for all IndexRows 2016-01-02 12:21:24 -08:00
Steve Yen
0a7f7e3df8 firestorm.Analyze() converts docID to bytes only once 2016-01-02 12:21:24 -08:00
Steve Yen
fd81d0364c firestorm.indexField() uses capacity of len(tokenFreqs) 2016-01-02 12:21:24 -08:00
Steve Yen
ee5ccda112 use KeyTo/ValueTo in firestorm.batchRows
After this change, with null kvstore micro-benchmark...

  GOMAXPROCS=8 ./bleve-blast -source=../../tmp/enwiki.txt \
    -count=100000 -numAnalyzers=8 -numIndexers=8 \
    -config=../../configs/null-firestorm.json -batch=100

Then TermFreqRow key and value methods disappear as large boxes from
the cpu profile graphs.
2016-01-01 09:57:59 -08:00
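
The idea behind writing a row key into a caller-supplied buffer, rather than allocating a new slice per row, might look like the sketch below. The termRow type, its key layout, and the exact KeySize/KeyTo signatures are assumptions for illustration and are not taken from the bleve source.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// termRow is a hypothetical row type; the key layout (one prefix byte,
// a 2-byte field id, then the term) is an assumption for this sketch.
type termRow struct {
	field uint16
	term  []byte
}

// KeySize reports how many bytes KeyTo will write.
func (r *termRow) KeySize() int {
	return 1 + 2 + len(r.term)
}

// KeyTo serializes the key into buf and returns the number of bytes
// written. No allocation happens here, so one buffer can serve many rows.
func (r *termRow) KeyTo(buf []byte) (int, error) {
	if len(buf) < r.KeySize() {
		return 0, errors.New("buffer too small for key")
	}
	buf[0] = 't'
	binary.LittleEndian.PutUint16(buf[1:3], r.field)
	n := 3 + copy(buf[3:], r.term)
	return n, nil
}

func main() {
	buf := make([]byte, 64) // reused for every row in the loop
	rows := []*termRow{
		{field: 1, term: []byte("bleve")},
		{field: 2, term: []byte("firestorm")},
	}
	for _, r := range rows {
		n, err := r.KeyTo(buf)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%d bytes, term=%s\n", n, buf[3:n])
	}
}
```

The size check matters because a reused buffer has a fixed capacity; a row larger than the buffer has to be reported rather than silently truncated.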
Steve Yen
fd287bdfa4 firestorm.md markdown fixes 2016-01-01 09:57:59 -08:00
Steve Yen
b605224106 use shorter go idiom 2015-12-29 22:14:45 -08:00
Antoine Grondin
6806343677 firestore: fix #296 for division by zero on GC 2015-12-25 11:34:19 +07:00
Antoine Grondin
a6f7abdfa3 firestore: reproducer for division by zero on GC 2015-12-25 11:33:46 +07:00
Marty Schoch
8efbd556a3 fix indexing bug with data coming from arrays
fixes #295
2015-12-21 14:59:32 -05:00
Marty Schoch
cf67fe2cbc fix major synchronization issue in the field_cache
The field cache is expected to be the authority on which field
names are identified by which identifier.  This code was
optimized for the most common case in which fields already
exist.  However, if we determine the field is missing with
the read lock (shared), we incorrectly immediately proceed
to create a new row with the write lock (exclusive).  The
problem is that multiple goroutines might have come to
the same conclusion, and they all proceed to add rows.  The two
choices were to do the whole operation with the write lock, or
recheck the value again with the write lock.  We have chosen
to repeat the check inside the write-lock, as this optimizes
for what we believe to be the most common case, in which most
fields will already exist.
2015-12-15 16:39:38 -05:00
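
The "recheck inside the write lock" approach described above can be sketched with sync.RWMutex. The FieldCache type and its method names below are hypothetical stand-ins, not the actual bleve field_cache code.

```go
package main

import (
	"fmt"
	"sync"
)

// FieldCache is a hypothetical stand-in for the field cache: it maps
// field names to small numeric identifiers.
type FieldCache struct {
	mu     sync.RWMutex
	byName map[string]uint16
	nextID uint16
}

func NewFieldCache() *FieldCache {
	return &FieldCache{byName: make(map[string]uint16)}
}

// FieldIndex returns the identifier for name, assigning one if needed.
func (c *FieldCache) FieldIndex(name string) uint16 {
	// Fast path: most fields already exist, so a shared lock suffices.
	c.mu.RLock()
	id, ok := c.byName[name]
	c.mu.RUnlock()
	if ok {
		return id
	}

	// Slow path: take the exclusive lock and check again, because
	// another goroutine may have added the field after the read lock
	// was released.
	c.mu.Lock()
	defer c.mu.Unlock()
	if id, ok := c.byName[name]; ok {
		return id
	}
	id = c.nextID
	c.nextID++
	c.byName[name] = id
	return id
}

func main() {
	c := NewFieldCache()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.FieldIndex("title")
		}()
	}
	wg.Wait()
	// Only one identifier is assigned, however the goroutines interleave.
	fmt.Println(len(c.byName), c.nextID)
}
```

Without the second lookup under the write lock, several goroutines could each conclude the field is missing and each assign it a different identifier.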
Marty Schoch
a73a178923 fix incorrect prefix search behavior
avoids double incrementing of end term when reading term dict
fixes #293
2015-12-04 14:07:16 -05:00
Marty Schoch
699c86073a make existing integration tests work with firestorm 2015-12-01 12:29:56 -05:00
Marty Schoch
6d851cfcc2 fix bug in warmup which led to docs being deleted 2015-11-30 10:18:14 -05:00
Marty Schoch
aa8d98f5fa include space after prefix in log output 2015-11-30 10:17:48 -05:00
Marty Schoch
68d8742826 correctly prefix internal rows with 'i' and print them in debug 2015-11-30 10:17:15 -05:00
Marty Schoch
c93de9734e fix issues identified by errcheck 2015-11-24 14:32:33 -05:00
Marty Schoch
bbef1980d8 Merge branch 'master' into firestorm 2015-11-24 13:04:36 -05:00
Marty Schoch
ff11f83842 properly handle errors inside metrics kvstore reporting 2015-11-24 12:52:03 -05:00
Marty Schoch
a707d44e0b Merge branch 'master' into firestorm 2015-11-24 09:44:47 -05:00
Patrick Mezard
e85c9c542e row: expose TermFrequencyRow term and freq fields
Rows content is an implementation detail of bleve index and may change
in the future. That said, they also contain information valuable to
assess the quality of the index or understand its performance. So, as
long as we agree that type asserting rows should only be done if you
know what you are doing and are ready to deal with future changes, I see
no reason to hide the row fields from external packages.

Fix #268
2015-11-17 17:21:26 +01:00
Kosov Eugene
45e670b99b BoltDB wrapper nano optimization which makes code a bit prettier too 2015-11-05 00:27:28 +03:00
Marty Schoch
4791625b9b Merge pull request #262 from pmezard/index-and-tokenizer-doc-and-fix
Index and tokenizer doc and fix
2015-11-02 11:51:21 -05:00
Marty Schoch
30651065e9 fix panic on insufficiently sized buffer
adds test case to reproduce original problem
fixes #264
2015-10-30 18:25:38 -04:00
Marty Schoch
2bd3ef4080 copy relevant k/v pairs before advancing underlying iterator 2015-10-28 12:23:54 -04:00
Marty Schoch
d1b07f4909 fix dump methods to properly copy keys and values 2015-10-28 12:06:44 -04:00
Marty Schoch
01526e971f Merge branch 'master' into firestorm 2015-10-28 11:26:01 -04:00
Patrick Mezard
f2b3d5698e index: document TermFieldReader interface 2015-10-27 18:53:03 +01:00
Patrick Mezard
3df789d258 index: document empty strings behaviour when calling DocIDReader() 2015-10-27 18:53:03 +01:00
Marty Schoch
1a978a4591 fix go vet issues and cleanup reader/iterator 2015-10-26 16:41:58 -04:00
Marty Schoch
f0d282f5f8 add test case for seeing prefix iterators outside of range
similar to #256 except for prefix iterators
includes fix for boltdb and gtreap which had incorrect behavior
2015-10-26 16:14:29 -04:00
Patrick Mezard
5100e00f20 doc: DocIDReader.Advance() is no longer implementation dependent 2015-10-20 20:32:23 +02:00
Patrick Mezard
2fa334fc27 doc: talk about "documents" not "indexed or stored documents" 2015-10-20 20:24:24 +02:00
Patrick Mezard
b174c137fd doc: document DocIDReader, and some Index bits 2015-10-20 20:24:24 +02:00
Patrick Mezard
da72d0c2b9 store_test: deduplicate store initialization 2015-10-20 19:21:01 +02:00
Patrick Mezard
873f483804 gtreap: RangeIterator.Seek should not move before start 2015-10-20 19:12:30 +02:00
Patrick Mezard
5d7628ba3b boltdb: fix RangeIterator outside of range seeks
Two issues:
- Seeking before i.start and iterating returned keys before i.start
- Seeking after the store last key did not invalidate the iterator and
  could cause infinite loops.
2015-10-20 19:09:51 +02:00
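
The two seek problems listed above can be illustrated on a sorted in-memory key list. This sketch is not the boltdb wrapper: the rangeIterator type is hypothetical, and treating the end bound as exclusive is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// rangeIterator is a hypothetical iterator over sorted keys restricted
// to the half-open range [start, end).
type rangeIterator struct {
	keys       []string // must be sorted
	start, end string
	pos        int
}

// Seek positions the iterator at the first key >= key, but never before
// start, so iteration cannot return keys below the range.
func (it *rangeIterator) Seek(key string) {
	if key < it.start {
		key = it.start
	}
	it.pos = sort.SearchStrings(it.keys, key)
}

// Valid is false once the position runs past the last key or reaches
// end, so a seek beyond the last key ends iteration instead of looping.
func (it *rangeIterator) Valid() bool {
	return it.pos < len(it.keys) && it.keys[it.pos] < it.end
}

func (it *rangeIterator) Next()       { it.pos++ }
func (it *rangeIterator) Key() string { return it.keys[it.pos] }

// collectFrom seeks to key and returns every key visited.
func collectFrom(it *rangeIterator, key string) []string {
	var out []string
	for it.Seek(key); it.Valid(); it.Next() {
		out = append(out, it.Key())
	}
	return out
}

func main() {
	it := &rangeIterator{keys: []string{"a", "b", "c", "d", "e"}, start: "b", end: "d"}
	fmt.Println(collectFrom(it, "a")) // seek before start
	fmt.Println(collectFrom(it, "z")) // seek after the last key
}
```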
Patrick Mezard
aada2e7333 store_test: test RangeIterator.Seek on goleveldb 2015-10-20 19:09:38 +02:00
Marty Schoch
6cc21346dc fix errcheck issues 2015-10-19 14:27:03 -04:00
Marty Schoch
817c317c90 Merge branch 'master' into newkvstore 2015-10-19 12:04:07 -04:00
Marty Schoch
faceecf87b make row buffer size constant/configurable
also handle case where it is insufficiently sized
2015-10-19 12:03:38 -04:00
Marty Schoch
f0ee9a3c66 removed commented code and unused functions 2015-10-19 11:13:03 -04:00
Marty Schoch
c9471d5739 Merge pull request #244 from kevgs/master
reducing allocation count
2015-10-16 15:51:30 -04:00
Marty Schoch
e6d0fc8d95 Merge pull request #247 from pmezard/remove-update-goroutine
upside_down: no need for a goroutine to enqueue AnalysisWork
2015-10-16 10:15:55 -04:00
Marty Schoch
4c6bc23043 rewrite to keep using same buffer when possible 2015-10-13 14:04:56 -07:00
Marty Schoch
8de860bf12 2 more places that used old Key() 2015-10-13 12:35:08 -07:00
Marty Schoch
5f594d1acc Merge branch 'master' into newkvstore 2015-10-12 18:07:04 -07:00
Marty Schoch
08572e4925 move literals outside loop for more predictable test results 2015-10-12 18:06:38 -07:00
Patrick Mezard
8c928539ee upside_down: no need for a goroutine to enqueue AnalysisWork
It boils down to:
1. client sends some work and a notification channel to a single worker,
   then waits.
2. worker processes the work
3. worker sends the result to the client using the notification channel

I do not see any problem with this, even with unbuffered channels.
2015-10-12 10:42:14 +02:00
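
The three steps above can be sketched with unbuffered channels. Only the AnalysisWork name comes from the commit title; its fields and the worker and analyze functions are assumptions for this example.

```go
package main

import "fmt"

// AnalysisWork carries the input and a notification channel for the
// result. The fields are assumptions made for this sketch.
type AnalysisWork struct {
	text string
	rc   chan int
}

// worker processes items one at a time and replies on each item's own
// notification channel.
func worker(queue <-chan *AnalysisWork) {
	for w := range queue {
		w.rc <- len(w.text) // stand-in for real analysis
	}
}

// analyze enqueues the work directly, with no helper goroutine, then
// waits for the result. The client is already receiving on rc by the
// time the worker sends, so unbuffered channels do not deadlock.
func analyze(queue chan<- *AnalysisWork, text string) int {
	w := &AnalysisWork{text: text, rc: make(chan int)}
	queue <- w
	return <-w.rc
}

func main() {
	queue := make(chan *AnalysisWork)
	go worker(queue)
	fmt.Println(analyze(queue, "hello"))
	close(queue)
}
```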
Marty Schoch
95e06538f3 fix benchmarks for the x kvstores 2015-10-09 11:09:42 -04:00
Marty Schoch
0f05d1d3ca Merge branch 'master' into newkvstore 2015-10-09 10:33:41 -04:00
Patrick Mezard
aee82f8b49 upside_down: simplify return code in batchRows() 2015-10-09 09:57:12 +02:00
Marty Schoch
e28eb749d7 bump up buffer size 2015-10-06 16:45:38 -04:00
Marty Schoch
71cbb13e07 modify code to reuse buffer for kv generation 2015-10-05 17:49:50 -04:00
Kosov Eugene
a61c350888 reducing allocation count 2015-10-05 22:57:10 +03:00
Patrick Mezard
9d5407be13 boltdb: add "nosync" option to force boltdb.DB.NoSync=true
Use this option when rebuilding indexes from scratch. In my small case
(~17000 json documents), it reduces indexing from 520s to 250s.

I did not add any test: short of forced indexing termination, it only
has performance effects, which are hard to test. Also, unknown options
are currently ignored.

Issue #240
2015-10-03 14:26:48 +02:00
Marty Schoch
d06b526cbf more refactoring 2015-09-28 16:50:27 -04:00
Marty Schoch
66aa1b020a Merge branch 'master' into firestorm 2015-09-23 11:32:25 -07:00
Marty Schoch
900f1b4a67 major kvstore interface and impl overhaul
clarified the interface contract
2015-09-23 11:25:47 -07:00
Marty Schoch
f81b2be334 major refactor of bleve configuration
see #221 for full details
2015-09-16 17:10:59 -04:00
Marty Schoch
c308f611cf skip unnecessary map before slice
benchmark            old ns/op     new ns/op     delta
BenchmarkBatch-4     16950972      16377194      -3.38%

benchmark            old allocs     new allocs     delta
BenchmarkBatch-4     136164         136161         -0.00%

benchmark            old bytes     new bytes     delta
BenchmarkBatch-4     7168872       7109691       -0.83%
2015-09-10 08:21:26 -04:00
Marty Schoch
f6f1628b15 avoid doing unnecessary work:
benchmark            old ns/op     new ns/op     delta
BenchmarkBatch-4     20738739      17047158      -17.80%

benchmark            old allocs     new allocs     delta
BenchmarkBatch-4     136423         136160         -0.19%

benchmark            old bytes     new bytes     delta
BenchmarkBatch-4     20277781      7168772       -64.65%
2015-09-10 08:19:05 -04:00
Marty Schoch
c8538c835f Merge branch 'master' into firestorm 2015-09-10 08:14:14 -04:00
Marty Schoch
17c64d37c7 add similar benchmarks from firestorm 2015-09-10 08:13:52 -04:00
Marty Schoch
1e4d637761 adding more benchmarks 2015-09-10 08:01:11 -04:00
Marty Schoch
f74ed6a9ae Merge remote-tracking branch 'origin' into firestorm
catching up with changes from master
2015-09-02 13:29:03 -04:00
Marty Schoch
dbb93b75a4 refactoring to allow pluggable index encodings
this lays the foundation for supporting the new firestorm
indexing scheme.  i'm merging these changes ahead of
the rest of the firestorm branch so i can continue
to make changes to the analysis pipeline in parallel
2015-09-02 13:12:08 -04:00
Marty Schoch
7ad7659ce5 add support for using null kvstore outside of bleve internals 2015-09-02 11:50:06 -04:00
Marty Schoch
07d37ca38a add important rocksdb config options 2015-09-02 11:49:42 -04:00
Marty Schoch
18151862b5 fix go vet issues 2015-08-25 15:13:13 -04:00
Marty Schoch
84811cf5a0 made index type configurable + first version of firestorm 2015-08-25 14:52:42 -04:00
Marty Schoch
3e60ca24ec support using end key on forestdb iterator for term freq lookup
also additional forestdb configs
2015-08-18 16:22:02 -04:00
Marty Schoch
ae19d77b04 updated protobuf defs to be valid 2015-08-17 15:37:13 -04:00
Marty Schoch
1187436e46 changed Stored row Values to also use protobuf 2015-08-17 09:48:40 -04:00
Marty Schoch
8d8a05a842 fix more issues 2015-08-14 16:27:00 -04:00
Marty Schoch
e0802a2b39 fixed the worst of the formatting 2015-08-14 16:17:48 -04:00
Marty Schoch
f4df56eb7c add first draft of firestorm proposal 2015-08-14 16:09:19 -04:00
Marty Schoch
d3dda3d0ea fixup config parsing and add new options 2015-08-12 13:18:23 -04:00
Marty Schoch
01667dfff3 faster protobufs with gogo 2015-08-12 13:18:23 -04:00
Marty Schoch
7df66b4857 fix broken benchmark caused by index row encoding change 2015-08-06 14:48:04 -04:00
Marty Schoch
9db850a53e Merge branch 'fix/MaxVarintLen64' of https://github.com/tukdesk/bleve into tukdesk-fix/MaxVarintLen64 2015-07-31 15:16:16 -04:00
Marty Schoch
3682c25467 update to correctly work with composite fields
also updated search results to return array positions
2015-07-31 11:16:11 -04:00
Marty Schoch
c1c4941dde Merge branch 'feature/term_vector' of https://github.com/tukdesk/bleve into tukdesk-feature/term_vector 2015-07-29 14:31:15 -04:00
Marty Schoch
bf8dcae76b removing build tags 2015-07-28 18:59:10 -04:00
Marty Schoch
1b28f6218b additional row validation 2015-07-13 15:22:54 -04:00
Marty Schoch
17ef48f82a switching back to the canonical goleveldb repo 2015-07-08 12:21:17 -06:00
Marty Schoch
bf80f4628e fix bug in current goleveldb (must copy during iteration)
also changed over to mschoch fork of goleveldb (temporary)

the change to my fork is pending some read-only issues described
here:  https://github.com/syndtr/goleveldb/issues/111

hopefully we can find a path forward, and get that addressed upstream
2015-07-06 18:00:05 -04:00
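
Why keys must be copied during iteration can be shown with a toy iterator that reuses its internal buffer. The reusingIterator type is hypothetical and does not model goleveldb's actual implementation; it only reproduces the aliasing hazard.

```go
package main

import "fmt"

// reusingIterator is a hypothetical iterator that hands out a slice
// backed by an internal buffer which it overwrites on every Next().
type reusingIterator struct {
	keys []string
	pos  int
	buf  []byte
}

func (it *reusingIterator) Next() bool {
	if it.pos >= len(it.keys) {
		return false
	}
	it.buf = append(it.buf[:0], it.keys[it.pos]...)
	it.pos++
	return true
}

// Key returns the current key; it is only valid until the next Next().
func (it *reusingIterator) Key() []byte { return it.buf }

// collect retains every key seen, optionally copying it first.
func collect(it *reusingIterator, copyKeys bool) []string {
	var kept [][]byte
	for it.Next() {
		k := it.Key()
		if copyKeys {
			k = append([]byte(nil), k...) // private copy, safe to retain
		}
		kept = append(kept, k)
	}
	out := make([]string, len(kept))
	for i, k := range kept {
		out[i] = string(k)
	}
	return out
}

func main() {
	keys := []string{"aa", "bb", "cc"}
	fmt.Println(collect(&reusingIterator{keys: keys}, false)) // aliased
	fmt.Println(collect(&reusingIterator{keys: keys}, true))  // copied
}
```

In the uncopied run every retained slice points at the same buffer, so all of them end up showing the last key.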
Marty Schoch
7be7ecdf8e fix batch indexing bug, incremented docCount before commit
fixes #211
2015-06-08 14:14:05 -04:00
Marty Schoch
2768c2da3c fix previous sloppy fix which hadn't been adequately tested 2015-05-27 19:15:55 -07:00
Marty Schoch
201fb91171 fix up to correctly trim off separator
even though it should never be present
2015-05-27 19:10:12 -07:00
Marty Schoch
a58592ceff fix case where NewBackIndexRowKV returns nil, nil
the logic for reading the docID from the keys
in this row relies on the keys NEVER containing
the byte separator character (0xff), this is OK
as we require that all keys be valid utf-8
however, it turns out that in the case where this
rule was violated, we would panic, because we
return nil, nil and later try to print the doc id
2015-05-27 19:04:57 -07:00
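
The reasoning that valid UTF-8 keys can never contain the separator byte can be checked with the standard library. The checkDocID and containsSeparator helpers below are hypothetical and are not the fix made in this commit; they only illustrate returning an error instead of nil, nil when the rule is violated.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"unicode/utf8"
)

// byteSeparator is the separator value named in the commit message.
const byteSeparator byte = 0xff

// containsSeparator reports whether b includes the separator byte.
func containsSeparator(b []byte) bool {
	return bytes.IndexByte(b, byteSeparator) >= 0
}

// checkDocID is a hypothetical validation step: the byte 0xff never
// occurs in valid UTF-8, so a valid docID cannot collide with the
// separator. An invalid docID yields an explicit error.
func checkDocID(docID []byte) error {
	if !utf8.Valid(docID) {
		return errors.New("docID is not valid UTF-8")
	}
	return nil
}

func main() {
	good := []byte("héllo-✓")
	bad := []byte{'d', 'o', 'c', byteSeparator, '1'}

	fmt.Println(checkDocID(good), containsSeparator(good))
	fmt.Println(checkDocID(bad), containsSeparator(bad))
}
```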
dtynn
59c97ae577 use binary.MaxVarintLen64 2015-05-26 15:35:31 +08:00
Marty Schoch
e0887f9113 fix tests which deadlock boltdb due to deferred cleanup
fixes #209
2015-05-21 12:29:31 -04:00
Marty Schoch
a52d3b5c07 put in hack to allow boltdb reader isolation test to pass
in boltdb, long readers *MAY* block a writer.  in particular if
the write requires additional allocation, it must acquire a lock
already held by the reader.  in general this is not a problem
for bleve (though it can affect performance in some cases), but
it is a problem for the reader isolation test.  this commit
adds a hack to try and avoid the need for additional allocation
closes #208
2015-05-21 11:39:59 -04:00
dtynn
b4f7496031 update the index format version number 2015-05-18 15:16:35 +08:00
dtynn
89dc2c22bc update TermVector 2015-05-17 13:07:14 +08:00
Marty Schoch
8f70def63b properly use the stored array positions when loading a document
fixes #205
2015-05-15 15:47:54 -04:00
Marty Schoch
328bc73ed0 clarify Batch is not threadsafe in docs
in some limited cases we can detect unsafe usage
in these cases, do not trip over ourselves and panic
instead return a strongly typed error upside_down.UnsafeBatchUseDetected
also, introduced Batch.Reset() to allow batch reuse
this is currently still experimental
closes #195
2015-05-15 15:04:52 -04:00
Marty Schoch
57cd67fa88 fix data race on index metadata (docCount)
closes #198
2015-05-08 08:07:20 -04:00