After this change, with null kvstore micro-benchmark...
GOMAXPROCS=8 ./bleve-blast -source=../../tmp/enwiki.txt \
-count=100000 -numAnalyzers=8 -numIndexers=8 \
-config=../../configs/null-firestorm.json -batch=100
Then the TermFreqRow key and value methods disappear as large boxes from
the CPU profile graphs.
The field cache is expected to be the authority on which field
names are identified by which identifier. This code was
optimized for the most common case in which fields already
exist. However, when we determined the field was missing under
the read lock (shared), we incorrectly proceeded immediately
to create a new row under the write lock (exclusive). The
problem is that multiple goroutines might have come to
the same conclusion, and they all proceed to add rows. The two
choices were to do the whole operation with the write lock, or
recheck the value again with the write lock. We have chosen
to repeat the check inside the write-lock, as this optimizes
for what we believe to be the most common case, in which most
fields will already exist.
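
A minimal sketch of the check-then-recheck pattern described above. The
fieldCache type, its fields and the countCreations helper are hypothetical
names made up for illustration; this is not the actual bleve code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// fieldCache is a hypothetical stand-in for the index's field cache.
type fieldCache struct {
	mu     sync.RWMutex
	ids    map[string]uint16
	nextID uint16
}

func newFieldCache() *fieldCache {
	return &fieldCache{ids: make(map[string]uint16)}
}

// fieldID returns the identifier for name and reports whether this call
// had to create it.
func (c *fieldCache) fieldID(name string) (uint16, bool) {
	// Common case: the field already exists, so the shared lock is enough.
	c.mu.RLock()
	id, ok := c.ids[name]
	c.mu.RUnlock()
	if ok {
		return id, false
	}

	c.mu.Lock()
	defer c.mu.Unlock()
	// Re-check: another goroutine may have added the field between the
	// RUnlock above and the Lock here.
	if id, ok := c.ids[name]; ok {
		return id, false
	}
	id = c.nextID
	c.nextID++
	c.ids[name] = id
	return id, true
}

// countCreations looks up the same name from n goroutines and returns how
// many of them ended up creating an entry.
func countCreations(c *fieldCache, name string, n int) int {
	var created int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, isNew := c.fieldID(name); isNew {
				atomic.AddInt32(&created, 1)
			}
		}()
	}
	wg.Wait()
	return int(atomic.LoadInt32(&created))
}

func main() {
	fmt.Println(countCreations(newFieldCache(), "title", 32)) // prints 1
}
```

Without the second lookup under the write lock, several goroutines could
each assign a fresh identifier to the same field name.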
Row contents are an implementation detail of the bleve index and may change
in the future. That said, they also contain information valuable for
assessing the quality of the index or understanding its performance. So, as
long as we agree that type asserting rows should only be done if you
know what you are doing and are ready to deal with future changes, I see
no reason to hide the row fields from external packages.
Fix #268
Two issues:
- Seeking before i.start and iterating returned keys before i.start
- Seeking after the store's last key did not invalidate the iterator and
  could cause infinite loops.
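
A rough sketch of the two rules, using a hypothetical in-memory
rangeIterator over a sorted key slice (the real iterator and its field
names are not shown here, so everything below is an assumption): Seek
clamps to the start bound, and any position past the last key marks the
iterator invalid so that callers' loops terminate.

```go
package main

import (
	"fmt"
	"sort"
)

// rangeIterator is a hypothetical iterator over sorted keys with a lower bound.
type rangeIterator struct {
	keys  []string // sorted store keys
	start string   // lower bound, plays the role of i.start
	pos   int
	valid bool
}

// Seek positions the iterator at the first key >= key, never before start.
func (it *rangeIterator) Seek(key string) {
	if key < it.start {
		key = it.start // do not expose keys before the range start
	}
	it.pos = sort.SearchStrings(it.keys, key)
	it.valid = it.pos < len(it.keys) // past the last key: invalidate
}

func (it *rangeIterator) Next() {
	it.pos++
	it.valid = it.pos < len(it.keys)
}

func (it *rangeIterator) Valid() bool { return it.valid }
func (it *rangeIterator) Key() string { return it.keys[it.pos] }

// collect seeks and then walks the iterator until it becomes invalid.
func collect(keys []string, start, seek string) []string {
	it := &rangeIterator{keys: keys, start: start}
	var out []string
	for it.Seek(seek); it.Valid(); it.Next() {
		out = append(out, it.Key())
	}
	return out
}

func main() {
	keys := []string{"a", "b", "c", "d"}
	fmt.Println(collect(keys, "b", "a")) // prints [b c d]
	fmt.Println(collect(keys, "b", "z")) // prints []
}
```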
It boils down to:
1. client sends some work and a notification channel to a single worker,
then waits.
2. worker processes the work
3. worker sends the result to the client using the notification channel
I do not see any problem with this, even with unbuffered channels.
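
The three steps can be sketched as follows, with both channels unbuffered.
The request type and the doubling "work" are placeholders chosen for the
example, not the actual analysis code.

```go
package main

import "fmt"

// request pairs a unit of work with the channel the result is sent back on.
type request struct {
	input  int
	notify chan int
}

// worker processes requests one at a time (steps 2 and 3).
func worker(requests <-chan request) {
	for req := range requests {
		result := req.input * 2 // placeholder for the real work
		req.notify <- result    // blocks until the client receives
	}
}

// submit sends the work plus a notification channel, then waits (step 1).
func submit(requests chan<- request, input int) int {
	notify := make(chan int) // unbuffered
	requests <- request{input: input, notify: notify}
	return <-notify
}

// roundTrip starts a single worker and pushes one request through it.
func roundTrip(input int) int {
	requests := make(chan request) // unbuffered
	go worker(requests)
	defer close(requests)
	return submit(requests, input)
}

func main() {
	fmt.Println(roundTrip(21)) // prints 42
}
```

The client is always waiting on its notification channel by the time the
worker sends, so neither side can block forever.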
Use this option when rebuilding indexes from scratch. In my small case
(~17000 json documents), it reduces indexing from 520s to 250s.
I did not add any tests: short of forced indexing termination, it only
has performance effects, which are hard to test. Also, unknown options are
currently ignored.
Issue #240
benchmark            old ns/op      new ns/op      delta
BenchmarkBatch-4     16950972       16377194       -3.38%
benchmark            old allocs     new allocs     delta
BenchmarkBatch-4     136164         136161         -0.00%
benchmark            old bytes      new bytes      delta
BenchmarkBatch-4     7168872        7109691        -0.83%
benchmark            old ns/op      new ns/op      delta
BenchmarkBatch-4     20738739       17047158       -17.80%
benchmark            old allocs     new allocs     delta
BenchmarkBatch-4     136423         136160         -0.19%
benchmark            old bytes      new bytes      delta
BenchmarkBatch-4     20277781       7168772        -64.65%
this lays the foundation for supporting the new firestorm
indexing scheme. i'm merging these changes ahead of
the rest of the firestorm branch so i can continue
to make changes to the analysis pipeline in parallel
also changed over to mschoch fork of goleveldb (temporary)
the change to my fork is pending some read-only issues described
here: https://github.com/syndtr/goleveldb/issues/111
hopefully we can find a path forward, and get that addressed upstream
the logic for reading the docID from the keys
in this row relies on the keys NEVER containing
the byte separator character (0xff), this is OK
as we require that all keys be valid utf-8
however, it turns out that in the case where this
rule was violated, we would panic, because we
return nil, nil and later try to print the doc id
in boltdb, long readers *MAY* block a writer. in particular if
the write requires additional allocation, it must acquire a lock
already held by the reader. in general this is not a problem
for bleve (though it can affect performance in some cases), but
it is a problem for the reader isolation test. this commit
adds a hack to try and avoid the need for additional allocation
closes #208
in some limited cases we can detect unsafe usage
in these cases, do not trip over ourselves and panic
instead return a strongly typed error upside_down.UnsafeBatchUseDetected
also, introduced Batch.Reset() to allow batch reuse
this is currently still experimental
closes #195
trying to be clever, we reused the memory allocated for the left
operand when doing partial merges
this had been tested to be safe, in general. however, the
implementation was then written such that we always reused
globally defined operands, this meant that we mutated
the operands which were intended to always represent
+1/-1
this then cascades quickly to making increment/decrement
values much larger/smaller than they should be
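
to illustrate the failure mode, here is a small self-contained sketch.
the encoding, function names and the applyIncrements helper are all
assumptions for the example and do not mirror the real merge operator.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func encode(v int64) []byte {
	buf := make([]byte, 8)
	binary.LittleEndian.PutUint64(buf, uint64(v))
	return buf
}

func decode(buf []byte) int64 {
	return int64(binary.LittleEndian.Uint64(buf))
}

// partialMergeInPlace reuses the left operand's memory for the result.
// This is only safe if nobody else holds on to left.
func partialMergeInPlace(left, right []byte) []byte {
	binary.LittleEndian.PutUint64(left, uint64(decode(left)+decode(right)))
	return left
}

// partialMergeCopy allocates a fresh result and leaves both operands intact.
func partialMergeCopy(left, right []byte) []byte {
	return encode(decode(left) + decode(right))
}

// applyIncrements starts from a shared "+1" operand and merges the same
// operand in n more times, the way a reused global operand would be used.
func applyIncrements(n int, merge func(left, right []byte) []byte) int64 {
	plusOne := encode(1) // meant to always represent +1
	acc := plusOne
	for i := 0; i < n; i++ {
		acc = merge(acc, plusOne)
	}
	return decode(acc)
}

func main() {
	// In-place merging mutates the shared operand: 1 -> 2 -> 4 -> 8.
	fmt.Println(applyIncrements(3, partialMergeInPlace)) // prints 8
	// Copying keeps the operand at +1: 1 -> 2 -> 3 -> 4.
	fmt.Println(applyIncrements(3, partialMergeCopy)) // prints 4
}
```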
related to #197
refactor to share code in emulated batch
refactor to share code in emulated merge
refactor index kvstore benchmarks to share more code
refactor index kvstore benchmarks to be more repeatable
This reverts commit cb8c1741289a0f00b30733e0d52d9d81d1199603.
This commit is no longer desired. The KV store API has been changed to
better address this issue.
For more details, see the google group conversation thread at:
https://groups.google.com/forum/#!topic/bleve/aHZ8gmihLiY
- this change keeps the method behavior consistent with the
levigo/leveldb implementation.
- the leveldb store_test.go and goleveldb store_test.go are now
identical.
improvements uncovered some issues with how k/v data was copied
or not. to address this, kv abstraction layer now lets impl
specify if the bytes returned are safe to use after a reader
(or writer since writers are also readers) is closed
See index/store/KVReader - BytesSafeAfterClose() bool
false is the safe value if you're not sure
it will cause index impls to copy the data
Some kv impls already have created a copy at the C-api barrier
in which case they can safely return true.
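
a sketch of how an index impl might honor this flag. only the
BytesSafeAfterClose() bool method comes from the description above; the
reader interface, its Get/Close signatures, unsafeReader and fetch are
hypothetical and simplified for the example.

```go
package main

import "fmt"

// reader is a simplified, hypothetical view of a KV reader.
type reader interface {
	Get(key []byte) ([]byte, error)
	BytesSafeAfterClose() bool
	Close() error
}

// unsafeReader hands out slices that it invalidates on Close.
type unsafeReader struct {
	data map[string][]byte
}

func (r *unsafeReader) Get(key []byte) ([]byte, error) {
	return r.data[string(key)], nil
}

func (r *unsafeReader) BytesSafeAfterClose() bool { return false }

func (r *unsafeReader) Close() error {
	// Simulate the store reclaiming its buffers.
	for _, v := range r.data {
		for i := range v {
			v[i] = 0
		}
	}
	return nil
}

// fetch copies the value unless the reader promises it stays valid.
func fetch(r reader, key []byte) ([]byte, error) {
	val, err := r.Get(key)
	if err != nil || val == nil || r.BytesSafeAfterClose() {
		return val, err
	}
	out := make([]byte, len(val))
	copy(out, val)
	return out, nil
}

// fetchThenClose reads a value, closes the reader, then uses the value.
func fetchThenClose() (string, error) {
	r := &unsafeReader{data: map[string][]byte{"k": []byte("v")}}
	val, err := fetch(r, []byte("k"))
	if err != nil {
		return "", err
	}
	if err := r.Close(); err != nil {
		return "", err
	}
	return string(val), nil
}

func main() {
	fmt.Println(fetchThenClose()) // prints v <nil>
}
```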
Overall this yields ~25% speedup for searches with leveldb.
It yields ~10% speedup for boltdb.
Returning stored fields is now slower with boltdb, as previously
we were returning unsafe bytes.
this introduces disk format v4
now the summary rows for a term are stored in their own
"dictionary row" format, previously the same information
was stored in special term frequency rows
this now allows us to easily iterate all the terms for a field
in sorted order (useful for many other fuzzy data structures)
at the top-level of bleve you can now browse terms within a field
using the following api on the Index interface:
FieldDict(field string) (index.FieldDict, error)
FieldDictRange(field string, startTerm []byte, endTerm []byte) (index.FieldDict, error)
FieldDictPrefix(field string, termPrefix []byte) (index.FieldDict, error)
fixes #127
this is due to forestdb auto-compaction using the provided
path as just the prefix, so if we're not careful we end
up with many stray files lying around
here, we create a sub-directory first, and just nuke the
whole subdir when we're done
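
a sketch of the sub-directory approach using only the standard library.
fakeCompactingStore merely imitates a store that treats the path as a
prefix; it and the other helper names are made up for this example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// fakeCompactingStore imitates a store that uses path as a prefix and
// writes several files next to it (path.0, path.1, ...).
func fakeCompactingStore(path string) error {
	for i := 0; i < 3; i++ {
		name := fmt.Sprintf("%s.%d", path, i)
		if err := os.WriteFile(name, []byte("data"), 0o600); err != nil {
			return err
		}
	}
	return nil
}

// withScratchDir creates a sub-directory under parent, hands a path inside
// it to use, and removes the whole sub-directory afterwards.
func withScratchDir(parent string, use func(path string) error) error {
	sub, err := os.MkdirTemp(parent, "store-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(sub)
	return use(filepath.Join(sub, "test"))
}

// strayFiles reports how many entries are left in the parent directory
// after the store has run and the sub-directory has been removed.
func strayFiles() (int, error) {
	parent, err := os.MkdirTemp("", "parent-")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(parent)
	if err := withScratchDir(parent, fakeCompactingStore); err != nil {
		return 0, err
	}
	entries, err := os.ReadDir(parent)
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	n, err := strayFiles()
	fmt.Println(n, err) // prints 0 <nil> when the temp dir is writable
}
```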
more things can return error now
in a couple of places we had to swallow errors because they didn't
fit the existing API. in these cases and proactively in a few
others we now return error as well.
also the batch API has been updated to allow performing
set/delete internal operations within the batch
1. text analysis is now done before the write lock is acquired
2. there is now a pool of analysis workers
3. the size of this pool is configurable
4. this allows for documents in a batch to be analyzed concurrently
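
the four points above could look roughly like this. the tokenizer, the
index type and all names are placeholders invented for the sketch, not
the actual analysis queue.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// analyze is a placeholder for real text analysis.
func analyze(doc string) []string {
	return strings.Fields(strings.ToLower(doc))
}

// analyzeBatch analyzes docs concurrently using a pool of numWorkers.
func analyzeBatch(docs []string, numWorkers int) [][]string {
	if numWorkers < 1 {
		numWorkers = 1
	}
	results := make([][]string, len(docs))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				results[i] = analyze(docs[i]) // each worker writes its own slot
			}
		}()
	}
	for i := range docs {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

// index is a toy index guarded by a write lock.
type index struct {
	mu       sync.Mutex
	postings map[string]int
}

// batch analyzes first, and only then takes the write lock.
func (ix *index) batch(docs []string, numWorkers int) {
	analyzed := analyzeBatch(docs, numWorkers) // no lock held here
	ix.mu.Lock()
	defer ix.mu.Unlock()
	for _, tokens := range analyzed {
		for _, t := range tokens {
			ix.postings[t]++
		}
	}
}

func main() {
	ix := &index{postings: map[string]int{}}
	ix.batch([]string{"Hello world", "hello bleve"}, 4)
	fmt.Println(ix.postings["hello"]) // prints 2
}
```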
as a part of benchmarking these changes i've also introduced a new
null storage implementation. this should never be used, as it
does not actually build an index. it does however let us go
through all the normal indexing machinery, without incurring
any indexing I/O. this is very helpful in measuring improvements
made to the text analysis pipeline, which are often overshadowed
by indexing times in benchmarks actually building an index.
In the index/store package
  introduce KVReader
    creates snapshot
    all read operations consistent from this snapshot
    must close to release
  introduce KVWriter
    only one writer active
    access to all operations
    allows for consistent read-modify-write
    must close to release
  introduce AssociativeMerge operation on batch
    allows efficient read-modify-write
    for associative operations
    used to consolidate updates to the term summary rows
    saves 1 set and 1 get op per shared instance of term in field
In the index package
  introduced an IndexReader
    exposes a consistent snapshot of the index for searching
At top level
  All searches now operate on a consistent snapshot of the index
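
A toy illustration of why an associative merge on the batch saves
operations. The real AssociativeMerge signature is not shown here, so the
batch and store types below are hypothetical and use plain integer
counters.

```go
package main

import "fmt"

// batch collects merge operations; deltas for the same key are combined
// up front, which is valid because addition is associative.
type batch struct {
	merges map[string]int64
}

func newBatch() *batch {
	return &batch{merges: make(map[string]int64)}
}

func (b *batch) merge(key string, delta int64) {
	b.merges[key] += delta
}

// store counts how many get and set operations it performs.
type store struct {
	data map[string]int64
	gets int
	sets int
}

// execute applies each consolidated delta with one get and one set.
func (s *store) execute(b *batch) {
	for key, delta := range b.merges {
		s.gets++
		current := s.data[key]
		s.sets++
		s.data[key] = current + delta
	}
}

// demo merges three increments for the same term summary key.
func demo() (value int64, gets, sets int) {
	s := &store{data: make(map[string]int64)}
	b := newBatch()
	for i := 0; i < 3; i++ {
		b.merge("term:beer", 1)
	}
	s.execute(b)
	return s.data["term:beer"], s.gets, s.sets
}

func main() {
	fmt.Println(demo()) // prints 3 1 1
}
```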
by default we now use the pure go boltdb kv store
it is less tested at this point but appears to work
tests pass, and this moves us closer to the goal of being
able to just "go get" bleve
New is now used to create new indexes
Open is used to open existing indexes
calls to Open no longer specify a mapping because the mapping
is serialized and stored along with the index
now can track array positions for field values
stored fields now include this in the key
and the back index now uses protobufs to simplify serialization
closes #73
ultimately this is to make it more convenient for us to wire up
different elements of the analysis pipeline, without having to
preload everything into memory before we need it
separately the index layer now has a mechanism for storing
internal key/value pairs. this is expected to be used to
store the mapping, and possibly other pieces of data by the
top layer, but not exposed to the user at the top.
this change was then exposed at the higher levels
also the beer-sample app was upgraded to index in batches of 100
by default. this yielded an indexing speed up from 27s to 16s.
closes #57
previously we used the format:
't' <utf-8 term> <byte separator> <16-bit field id> <utf-8 docID> <byte separator>
now we have moved the field before the term, resulting in:
't' <16-bit field id> <utf-8 term> <byte separator> <utf-8 docID> <byte separator>
this means now instead of all fields with the same term being grouped together
all terms within the same field are grouped together
this allows us to enumerate the terms used with a field
this allows us to implement prefix search, and possibly improve numeric range queries
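
a sketch of the new key layout and of the prefix scan it enables. the
little-endian encoding of the field id and all helper names are
assumptions made for the example.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

const byteSeparator = 0xff

// termFreqKey builds: 't' <16-bit field id> <term> <sep> <docID> <sep>.
// The field id byte order is an assumption; any fixed order groups keys.
func termFreqKey(field uint16, term, docID string) []byte {
	key := []byte{'t', byte(field), byte(field >> 8)}
	key = append(key, term...)
	key = append(key, byteSeparator)
	key = append(key, docID...)
	key = append(key, byteSeparator)
	return key
}

// termPrefix is the key prefix shared by all terms in field that start
// with prefix.
func termPrefix(field uint16, prefix string) []byte {
	return append([]byte{'t', byte(field), byte(field >> 8)}, prefix...)
}

// scanPrefix walks sorted keys from the first match and stops at the first
// key without the prefix; matching keys are contiguous in sorted order.
func scanPrefix(sorted [][]byte, prefix []byte) [][]byte {
	start := sort.Search(len(sorted), func(i int) bool {
		return bytes.Compare(sorted[i], prefix) >= 0
	})
	var out [][]byte
	for _, k := range sorted[start:] {
		if !bytes.HasPrefix(k, prefix) {
			break
		}
		out = append(out, k)
	}
	return out
}

// countMatches builds a small sample key set and counts prefix matches.
func countMatches(field uint16, prefix string) int {
	keys := [][]byte{
		termFreqKey(2, "bee", "d4"),
		termFreqKey(1, "cat", "d3"),
		termFreqKey(1, "beer", "d1"),
		termFreqKey(1, "bear", "d2"),
	}
	sort.Slice(keys, func(i, j int) bool {
		return bytes.Compare(keys[i], keys[j]) < 0
	})
	return len(scanPrefix(keys, termPrefix(field, prefix)))
}

func main() {
	fmt.Println(countMatches(1, "be")) // prints 2
	fmt.Println(countMatches(1, ""))   // prints 3
}
```

with the old layout (term before field) the same scan would have mixed
terms from every field.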
removed analyzers (these are now built as needed through config)
removed html character filter (now built as needed through config)
added missing license header
changed constructor signature of filters that cannot return errors
filter constructors that can return errors now have a Must variant which panics
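
for illustration, the Must convention looks roughly like this. the
lengthFilter type and both constructors are hypothetical, not one of the
real filters.

```go
package main

import (
	"errors"
	"fmt"
)

// lengthFilter is a hypothetical filter whose construction can fail.
type lengthFilter struct {
	lo, hi int
}

// newLengthFilter returns an error for invalid bounds.
func newLengthFilter(lo, hi int) (*lengthFilter, error) {
	if lo > hi {
		return nil, errors.New("length filter: lower bound exceeds upper bound")
	}
	return &lengthFilter{lo: lo, hi: hi}, nil
}

// mustNewLengthFilter is the Must variant: it panics instead of returning
// an error, which suits package-level setup with known-good arguments.
func mustNewLengthFilter(lo, hi int) *lengthFilter {
	f, err := newLengthFilter(lo, hi)
	if err != nil {
		panic(err)
	}
	return f
}

// panics reports whether fn panicked.
func panics(fn func()) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	fn()
	return false
}

func main() {
	f := mustNewLengthFilter(1, 5)
	fmt.Println(f.lo, f.hi)                                   // prints 1 5
	fmt.Println(panics(func() { mustNewLengthFilter(5, 1) })) // prints true
}
```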
change cdl2 tokenizer into filter (should only see lower-case input)
new top level index api, closes #5
refactored index tests to not rely directly on analyzers
moved query objects to top-level
new top level search api, closes #12
top score collector allows skipping results
index mapping supports _all by default, closes #3 and closes #6
index mapping supports disabled sections, closes #7
new http sub package with reusable http.Handler's, closes #22