This is somewhat unlikely, but if a term is (incredibly) popular, its
uvarint count value representation might go beyond 8 bytes.
Some KVStore implementations (like forestdb) provide a BatchEx cgo
optimization that depends on proper preallocated counting, so this
change provides a proper worst-case estimate based on the max-unvarint
of 10 bytes instead of the previously incorrect 8 bytes.
Performance optimization. Before this change, by using Merge()
instead of AllocMerge(), moss's internal batch buf's would be
wastefully, dramatically grown during append()'s to a mis-sized buf.
With this change, the upside_down batchRows() and firestorm
batchRows() now use the new KVWriter.NewBatchEx() API, which can
improve performance by reducing the number of cgo hops.
In order to spend less time in append(), this change in upside_down
(similar to another recent performance change in firestorm) builds up
an array of arrays as the eventual input to batchRows().
Previously, the code would gather all the backIndexRows before
processing them. This change instead merges the backIndexRows
concurrently on the theory that we might as well make progress on
compute & processing tasks while waiting for the rest of the back
index rows to be fetched from the KVStore.
Start backindex reading concurrently with analysi to try to utilize
more I/O bandwidth.
The analysis time vs indexing time stats tracking are also now "off",
since there's now concurrency between those actiivties.
One tradeoff is that the lock area in upside_down Batch() is increased
as part of this change.
Taking another optimization from firestorm, upside_down's
storeField()/indexField() funcs now also append() to passed-in arrays
rather than always allocating their own arrays.
Rather than append() all received rows into a flat []IndexRow during
the result gathering loop, this change instead collects the analysis
result rows into a [][]IndexRow, which avoids extra copying.
As part of this, firestorm batchRows() now takes the [][]IndexRow as
its input.
The new analyzeField() helper func is used for both regular fields and
for composite fields.
With this change, all analysis is done up front, for both regular
fields and composite fields.
After analysis, this change counts up all the row capacity needed and
extends the AnalysisResult.Rows in one shot, as opposed to the
previous approach of dynamically growing the array as needed during
append()'s.
Also, in this change, the TermFreqRow for _id is added first, which
seems more correct.
This uses the "backing array" technique to allocate many TermFreqRow's
at the front of firestorm.indexField(), instead of the previous
one-by-one, as-needed TermFreqRow allocation approach.
Results from micro-benchmark, null-firestorm, bleve-blast has this
change producing a ~half MB/sec improvement.
Previously, the firestorm.Batch() would notify the lookuper goroutine
on a document by document basis. If the lookuper input channel became
full, then that would block the firestorm.Batch() operation.
With this change, lookuper is notified once, with a "batch" that is an
[]*InFlightItem.
This change also reuses that same []*InFlightItem to invoke the
compensator.MutateBatch().
This also has the advantage of only converting the docID's from string
to []byte just once, outside of the lock that's used by the
compensator.
Micro-benchmark of this change with null-firestorm bleve-blast does
not show large impact, neither degradation or improvement.
After this change, with null kvstore micro-benchmark...
GOMAXPROCS=8 ./bleve-blast -source=../../tmp/enwiki.txt \
-count=100000 -numAnalyzers=8 -numIndexers=8 \
-config=../../configs/null-firestorm.json -batch=100
Then TermFreqRow key and value methods dissapear as large boxes from
the cpu profile graphs.
The field cache is expected to be the authority on which field
names are identified by which identifier. This code was
optimized for the most common case in which fields already
exist. However, if we deterimine the field is missing with
the read lock (shared), we incorrectly immediately proceed
to create a new row with the write lock (exclusive). The
problem is that multiple goroutines might have come to
the same conclusion, and they all proceed to add rows. The two
choices were to do the whole operation with the write lock, or
recheck the value again with the write lock. We have chosen
to repeat the check inside the write-lock, as this optimizes
for what we believe to be the most common case, in which most
fields will already exist.
Rows content is an implementation detail of bleve index and may change
in the future. That said, they also contains information valuable to
assess the quality of the index or understand its performances. So, as
long as we agree that type asserting rows should only be done if you
know what you are doing and are ready to deal with future changes, I see
no reason to hide the row fields from external packages.
Fix#268
Two issues:
- Seeking before i.start and iterating returned keys before i.start
- Seeking after the store last key did not invalidate the iterator and
could cause infinite loops.
It boils down to:
1. client sends some work and a notification channel to a single worker,
then waits.
2. worker processes the work
3. worker sends the result to the client using the notification channel
I do not see any problem with this, even with unbuffered channels.
Use this option when rebuilding indexes from scratch. In my small case
(~17000 json documents), it reduces indexing from 520s to 250s.
I did not add any test, short of forced indexing termination it only
has performance effects, which are hard to test. And unknown options are
currently ignored.
Issue #240
benchmark old ns/op new ns/op delta
BenchmarkBatch-4 16950972 16377194 -3.38%
benchmark old allocs new allocs delta
BenchmarkBatch-4 136164 136161 -0.00%
benchmark old bytes new bytes delta
BenchmarkBatch-4 7168872 7109691 -0.83%
benchmark old ns/op new ns/op delta
BenchmarkBatch-4 20738739 17047158 -17.80%
benchmark old allocs new allocs delta
BenchmarkBatch-4 136423 136160 -0.19%
benchmark old bytes new bytes delta
BenchmarkBatch-4 20277781 7168772 -64.65%
this lays the foundation for supporting the new firestorm
indexing scheme. i'm merging these changes ahead of
the rest of the firestorm branch so i can continue
to make changes to the analysis pipeline in parallel
also changed over to mschoch fork of goleveldb (temporary)
the change to my fork is pending some read-only issues described
here: https://github.com/syndtr/goleveldb/issues/111
hopefully we can find a path forward, and get that addressed upstream
the logic for reading the docID from the keys
in this row relies on the keys NEVER containing
the byte separator character (0xff), this is OK
as we require that all keys be valid utf-8
however, it turns out that in the case where this
rule was violated, we would panic, because we
return nil, nil and later try to print the doc id