1. the porter stemmer offers a method to NOT do lowercasing, but
to use it we must first convert to runes ourselves, so we now do that
2. now we invoke the version that skips lowercasing, since we
already lowercase through a separate filter before stemming
because the stemmer modifies the runes in place,
we have no way to know whether anything changed, so we must
always encode back into the term byte slice
added a unit test which catches the problem found
NOTE: this uses analysis.BuildTermFromRunes, so the perf gain is
only visible once the other PR is also merged
future gains are possible if we update the stemmer to let us
know whether changes were made, thus skipping re-encoding to
[]byte when no changes were actually made
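a rough sketch of the resulting filter logic; StemWithoutLowerCasing is an assumed name for the stemmer's no-lowercasing entry point, and the exact signature may differ:

```go
package porter

import (
	"bytes"

	"github.com/blevesearch/bleve/analysis"
	"github.com/blevesearch/go-porterstemmer"
)

type PorterStemmer struct{}

func (s *PorterStemmer) Filter(input analysis.TokenStream) analysis.TokenStream {
	for _, token := range input {
		// convert to runes ourselves, since the no-lowercasing
		// variant operates on []rune
		runes := bytes.Runes(token.Term)
		// lowercasing already happened in a separate filter, so skip
		// it; the stemmer mutates the runes in place, so we cannot
		// tell whether anything changed
		stemmed := porterstemmer.StemWithoutLowerCasing(runes)
		// therefore always encode back into the term byte slice
		token.Term = analysis.BuildTermFromRunes(stemmed)
	}
	return input
}
```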
avoid allocating an unnecessary intermediate buffer
also introduce a new method that lets a user optimistically
try to encode back into an existing buffer; if it isn't
large enough, it silently allocates a new one and returns it
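a sketch of what this could look like; the name BuildTermFromRunesOptimistic and the exact fallback behavior are assumptions based on the description above:

```go
package analysis

import "unicode/utf8"

// BuildTermFromRunes encodes the runes directly into one buffer,
// avoiding the per-rune intermediate buffer used previously.
func BuildTermFromRunes(runes []rune) []byte {
	rv := make([]byte, len(runes)*utf8.UTFMax)
	used := 0
	for _, r := range runes {
		used += utf8.EncodeRune(rv[used:], r)
	}
	return rv[:used]
}

// BuildTermFromRunesOptimistic tries to encode into the buffer it is
// given; if the runes do not fit, it silently allocates a new buffer
// and returns that instead.
func BuildTermFromRunesOptimistic(buf []byte, runes []rune) []byte {
	rv := buf[:0]
	used := 0
	for _, r := range runes {
		nextLen := utf8.RuneLen(r)
		if nextLen < 0 || used+nextLen > cap(rv) {
			// not enough room (or invalid rune), fall back
			return BuildTermFromRunes(runes)
		}
		rv = rv[:used+nextLen]
		utf8.EncodeRune(rv[used:], r)
		used += nextLen
	}
	return rv
}
```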
the previous impl always did a full utf8 decode of the token into runes
if we assume most tokens are not possessive, this is unnecessary,
and even if they are, we only need to chop off the last two runes
so now we only decode the last rune of the token, and if it looks like
s/S we proceed to decode the second-to-last rune; then,
only if that looks like any form of apostrophe, do we make any
changes to the token, again by just reslicing the original to chop
off the possessive suffix
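roughly, the new logic looks like this; the specific apostrophe code points checked (U+0027, U+2019, U+FF07) are assumptions about which forms the filter accepts:

```go
package en

import (
	"unicode/utf8"

	"github.com/blevesearch/bleve/analysis"
)

const (
	apostrophe          = '\''     // U+0027
	rightSingleQuote    = '\u2019' // U+2019
	fullWidthApostrophe = '\uff07' // U+FF07
)

// stripPossessive chops a trailing possessive suffix off the token by
// reslicing the original term, decoding at most two runes from the end.
func stripPossessive(token *analysis.Token) {
	lastRune, lastSize := utf8.DecodeLastRune(token.Term)
	if lastRune != 's' && lastRune != 'S' {
		return // most tokens bail out here without any further decoding
	}
	secondRune, secondSize := utf8.DecodeLastRune(token.Term[:len(token.Term)-lastSize])
	if secondRune == apostrophe || secondRune == rightSingleQuote || secondRune == fullWidthApostrophe {
		token.Term = token.Term[:len(token.Term)-lastSize-secondSize]
	}
}
```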
the token stream resulting from the removal of stop words must
be shorter than or the same length as the original, so we just
reuse it and truncate it at the end.
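a minimal sketch of the reuse-and-truncate approach (the stop word set representation here is illustrative):

```go
package stop

import "github.com/blevesearch/bleve/analysis"

type StopTokensFilter struct {
	stopWords map[string]struct{} // illustrative stop word set
}

func (f *StopTokensFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
	j := 0
	for _, token := range input {
		if _, isStop := f.stopWords[string(token.Term)]; !isStop {
			// write survivors back into the same slice
			input[j] = token
			j++
		}
	}
	// the output can never be longer than the input, so just truncate
	return input[:j]
}
```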
change the cjk bigram filter to work with multi-rune terms
add a cjk width filter, which replaces full unicode normalization
these changes make the cjk analyzer behave more like elasticsearch
they also remove the dependency on the whitespace tokenizer,
which is now free to also behave more like lucene/es
fixes #33
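for reference, the width folding involved is roughly the following (an illustration, not the actual filter; halfwidth katakana folding is omitted):

```go
package cjk

// foldWidth maps fullwidth ASCII variants (U+FF01..U+FF5E) back to
// their basic latin counterparts; the real filter also folds
// halfwidth katakana forms, which is omitted here.
func foldWidth(r rune) rune {
	if r >= '\uff01' && r <= '\uff5e' {
		return r - '\uff01' + '!'
	}
	return r
}
```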
this change improves compatibility with the simple analyzer
defined by Lucene. this has important implications for
some perf tests as well, since they often use the simple
analyzer.
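for context, Lucene's simple analyzer is essentially a letter tokenizer plus lowercasing; a standalone illustration of that behavior (not bleve's implementation):

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// simpleTokenize splits on any non-letter rune and lowercases,
// mirroring the letter-tokenizer + lowercase combination.
func simpleTokenize(input string) []string {
	return strings.FieldsFunc(strings.ToLower(input), func(r rune) bool {
		return !unicode.IsLetter(r)
	})
}

func main() {
	fmt.Println(simpleTokenize("Bleve 1.0, released TODAY!"))
	// prints: [bleve released today]
}
```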
Previously, unicode.Tokenize() would allocate Tokens one at a time,
on an as-needed basis.
This change allocates a "backing array" of Tokens, so that it goes to
the runtime object allocator much less often. It takes a heuristic
guess as to the backing array size by using the average token
(segment) length seen so far.
Results from micro-benchmarks (null-firestorm, bleve-blast) suggest
a throughput improvement of perhaps less than ~0.5 MB/second.
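A sketch of the batch-allocation idea, with illustrative names (the real tokenizer derives the batch size from the average segment length seen so far, as described above):

```go
package unicode

import "github.com/blevesearch/bleve/analysis"

// tokenBatch hands out *analysis.Token values from a pre-allocated
// backing array, hitting the runtime allocator once per batch instead
// of once per token.
type tokenBatch struct {
	backing []analysis.Token
}

func (b *tokenBatch) next(guessRemaining int) *analysis.Token {
	if len(b.backing) == 0 {
		if guessRemaining < 1 {
			guessRemaining = 1
		}
		b.backing = make([]analysis.Token, guessRemaining)
	}
	t := &b.backing[0]
	b.backing = b.backing[1:]
	return t
}

// estimateRemaining guesses how many tokens are left by dividing the
// unprocessed byte count by the average segment length seen so far.
func estimateRemaining(bytesLeft, segmentsSeen, segmentBytesSeen int) int {
	avg := 8 // arbitrary fallback before any segments have been seen
	if segmentsSeen > 0 {
		avg = segmentBytesSeen / segmentsSeen
		if avg < 1 {
			avg = 1
		}
	}
	return bytesLeft/avg + 1
}
```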
The goal of the "web" tokenizer is to recognize web things like
- email addresses
- URLs
- twitter @handles and #hashtags
This implementation uses regexp exceptions. There will most
likely be endless debate about the regular expressions. These
were chosen as "good enough for now".
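For illustration, the exception patterns could look something like this (sketches for discussion, not the exact expressions shipped):

```go
package web

import "regexp"

var (
	emailPattern   = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)
	urlPattern     = regexp.MustCompile(`https?://\S+`)
	twitterPattern = regexp.MustCompile(`[@#][A-Za-z0-9_]+`)
)

// anything matching an exception is emitted as a single token; the
// remaining text is handed to the fallback ("unicode") tokenizer
var exceptions = regexp.MustCompile(
	emailPattern.String() + `|` + urlPattern.String() + `|` + twitterPattern.String(),
)
```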
There is also a "web" analyzer. This is just the "standard"
analyzer, but using the "web" tokenizer instead of the "unicode"
one. NOTE: after processing the exceptions, it still falls back
to the standard "unicode" one.
For many users, you can simply set your mapping's default analyzer
to be "web".
closes #269