Previously, unicode.Tokenize() allocated each Token individually, on
an as-needed basis.
This change allocates a "backing array" of Tokens, so that it goes to
the runtime object allocator much less often. It takes a heuristic
guess as to the backing array size by using the average token
(segment) length seen so far.
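As a rough sketch of the pattern (the names and the fixed guess below
are illustrative, not the actual bleve code, and a whitespace
segmenter stands in for the real unicode segmenter):

package tokenizer

// Token is a simplified stand-in for bleve's analysis token.
type Token struct {
    Term  []byte
    Start int
    End   int
}

// tokenize carves Tokens out of preallocated backing arrays instead
// of allocating each Token individually. The backing array size is a
// heuristic guess based on the remaining input and an average
// segment length.
func tokenize(input []byte) []*Token {
    rv := make([]*Token, 0, 64)
    var backing []Token // current backing array
    used := 0           // next free slot in backing

    const avgSegmentLen = 6 // fixed guess; the real code tracks the average seen so far
    pos := 0
    for pos < len(input) {
        if input[pos] == ' ' {
            pos++
            continue
        }
        end := pos
        for end < len(input) && input[end] != ' ' {
            end++
        }
        if used >= len(backing) {
            // One allocation covers many upcoming Tokens.
            guess := (len(input)-pos)/avgSegmentLen + 1
            backing = make([]Token, guess)
            used = 0
        }
        token := &backing[used]
        used++
        token.Term = input[pos:end]
        token.Start = pos
        token.End = end
        rv = append(rv, token)
        pos = end
    }
    return rv
}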
Results from micro-benchmarks (null-firestorm, bleve-blast) suggest a
small throughput improvement, of less than ~0.5 MB/second.
This uses the "backing array" technique to allocate many TermFreqRows
at the start of firestorm.indexField(), instead of the previous
one-by-one, as-needed TermFreqRow allocation approach.
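A minimal sketch of the same idea applied to rows, reusing the Token
type from the sketch above (the TermFreqRow fields here are
approximations):

// TermFreqRow is a simplified stand-in for firestorm's row type.
type TermFreqRow struct {
    term  []byte
    field uint16
    docID []byte
    freq  uint64
}

// indexField allocates all of a field's rows in one backing array up
// front, then hands out pointers into it, one per token.
func indexField(docID []byte, field uint16, tokens []*Token) []*TermFreqRow {
    backing := make([]TermFreqRow, len(tokens)) // one allocation
    rows := make([]*TermFreqRow, 0, len(tokens))
    for i, t := range tokens {
        row := &backing[i]
        row.term = t.Term
        row.field = field
        row.docID = docID
        row.freq = 1
        rows = append(rows, row)
    }
    return rows
}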
Results from the micro-benchmark (null-firestorm, bleve-blast) show
this change producing a ~0.5 MB/sec throughput improvement.
Previously, the firestorm.Batch() operation would notify the lookuper
goroutine on a document-by-document basis. If the lookuper's input
channel became full, that would block firestorm.Batch().

With this change, the lookuper is notified once per batch, with a
single []*InFlightItem.
This change also reuses that same []*InFlightItem to invoke the
compensator.MutateBatch().
This also has the advantage of converting the docIDs from string to
[]byte just once, outside of the lock used by the compensator.
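A sketch of the shape of this change (the InFlightItem fields,
channel, and function signatures are approximations, with the
compensator abstracted as a callback):

// InFlightItem pairs a docID, already converted to []byte, with its rows.
type InFlightItem struct {
    docID []byte
    rows  []*TermFreqRow // nil indicates a deletion
}

type lookuper struct {
    workChan chan []*InFlightItem
}

// NotifyBatch performs one channel send per Batch() call, rather
// than one send per document.
func (l *lookuper) NotifyBatch(items []*InFlightItem) {
    l.workChan <- items
}

func (l *lookuper) run() {
    for items := range l.workChan {
        for _, item := range items {
            _ = item // look up and compensate each in-flight item here
        }
    }
}

// batch converts each docID from string to []byte exactly once, then
// reuses the same slice for both the compensator and the lookuper.
func batch(docIDs []string, mutateBatch func([]*InFlightItem), l *lookuper) {
    items := make([]*InFlightItem, 0, len(docIDs))
    for _, id := range docIDs {
        items = append(items, &InFlightItem{docID: []byte(id)})
    }
    mutateBatch(items)   // e.g. compensator.MutateBatch, under its own lock
    l.NotifyBatch(items) // single notification for the whole batch
}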
Micro-benchmarks of this change with null-firestorm bleve-blast do
not show a large impact: neither degradation nor improvement.
After this change, with the null kvstore micro-benchmark...
GOMAXPROCS=8 ./bleve-blast -source=../../tmp/enwiki.txt \
-count=100000 -numAnalyzers=8 -numIndexers=8 \
-config=../../configs/null-firestorm.json -batch=100
...the TermFreqRow key and value methods disappear as large boxes
from the CPU profile graphs.
The field cache is expected to be the authority on which field names
are identified by which identifier. This code was optimized for the
most common case, in which fields already exist. However, if we
determine that the field is missing while holding the read lock
(shared), the code incorrectly proceeded immediately to create a new
row under the write lock (exclusive). The problem is that multiple
goroutines might have come to the same conclusion, and they would all
proceed to add rows. The two choices were to do the whole operation
under the write lock, or to recheck the value under the write lock.
We have chosen to repeat the check inside the write lock, as this
optimizes for what we believe to be the most common case, in which
most fields already exist.
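The result is the classic recheck-under-write-lock ("double-checked")
idiom with sync.RWMutex. A minimal sketch, with names that only
approximate the actual field cache:

package index

import "sync"

type FieldCache struct {
    mutex        sync.RWMutex
    fieldIndexes map[string]uint16
    lastIndex    uint16
}

func NewFieldCache() *FieldCache {
    return &FieldCache{fieldIndexes: make(map[string]uint16)}
}

// FieldIndex returns the identifier for a field name, creating one
// if needed. The common case touches only the shared read lock.
func (f *FieldCache) FieldIndex(field string) uint16 {
    f.mutex.RLock()
    index, exists := f.fieldIndexes[field]
    f.mutex.RUnlock()
    if exists {
        return index
    }

    f.mutex.Lock()
    defer f.mutex.Unlock()
    // Recheck: another goroutine may have added this field between
    // our RUnlock and Lock; creating a row unconditionally here
    // would duplicate it.
    if index, exists := f.fieldIndexes[field]; exists {
        return index
    }
    index = f.lastIndex
    f.lastIndex++
    f.fieldIndexes[field] = index
    // This is also the point where the new field row would be created.
    return index
}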
The goal of the "web" tokenizer is to recognize web things like
- email addresses
- URLs
- twitter @handles and #hashtags
This implementation uses regexp exceptions. There will most likely be
endless debate about the regular expressions; these were chosen as
"good enough for now".
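To illustrate the exceptions approach only (the simplified regexps
below are placeholders, not the patterns under debate): spans
matching the web patterns are emitted intact as single tokens, and
the text between matches falls through to an ordinary tokenizer,
with strings.Fields standing in for the real unicode fallback:

package main

import (
    "fmt"
    "regexp"
    "strings"
)

// Simplified patterns for email addresses, URLs, and twitter
// @handles/#hashtags; the real expressions are more involved.
var webExceptions = regexp.MustCompile(
    `[\w.+-]+@[\w-]+\.[\w.]+` + // email address
        `|https?://\S+` + // URL
        `|[@#]\w+`) // @handle or #hashtag

// tokenizeWeb emits each matched "web thing" intact as one token and
// sends the text between matches through a fallback tokenizer.
func tokenizeWeb(input string) []string {
    var tokens []string
    last := 0
    for _, m := range webExceptions.FindAllStringIndex(input, -1) {
        tokens = append(tokens, strings.Fields(input[last:m[0]])...)
        tokens = append(tokens, input[m[0]:m[1]])
        last = m[1]
    }
    return append(tokens, strings.Fields(input[last:])...)
}

func main() {
    fmt.Println(tokenizeWeb("email me@example.com about #bleve"))
    // Output: [email me@example.com about #bleve]
}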
There is also a "web" analyzer. This is just the "standard"
analyzer, but using the "web" tokenizer instead of the "unicode"
one. NOTE: after processing the exceptions, it still falls back
to the standard "unicode" one.
For many users, you can simply set your mapping's default analyzer
to be "web".
closes #269