This change improves compatibility with the simple analyzer
defined by Lucene. This also has important implications for
some perf tests, as they often use the simple analyzer.
Previously, unicode.Tokenize() would allocate Tokens one at a time,
on an as-needed basis.
This change allocates a "backing array" of Tokens, so that it goes to
the runtime object allocator much less often. It takes a heuristic
guess as to the backing array size by using the average token
(segment) length seen so far.
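For illustration, a minimal sketch of the batch-allocation idea in Go;
the type and names here are invented for this example and are not
bleve's actual code:

    // Token stands in for bleve's analysis.Token in this sketch.
    type Token struct {
        Term  []byte
        Start int
        End   int
    }

    // tokenBatch hands out Tokens from a pre-allocated backing array,
    // guessing each batch size from the average segment length so far.
    type tokenBatch struct {
        backing  []Token
        next     int
        totalLen int // sum of segment lengths seen
        count    int // number of segments seen
    }

    func (b *tokenBatch) alloc(segLen, remainingBytes int) *Token {
        b.totalLen += segLen
        b.count++
        if b.next >= len(b.backing) {
            avg := b.totalLen / b.count
            if avg < 1 {
                avg = 1
            }
            // heuristic: enough Tokens for the rest of the input,
            // assuming segments stay near the average length
            b.backing = make([]Token, remainingBytes/avg+1)
            b.next = 0
        }
        t := &b.backing[b.next]
        b.next++
        return t
    }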
Results from micro-benchmarks (null-firestorm, bleve-blast) suggest
a modest throughput improvement, perhaps just under ~0.5 MB/second.
The goal of the "web" tokenizer is to recognize web things like
- email addresses
- URLs
- Twitter @handles and #hashtags
This implementation uses regexp exceptions. There will most
likely be endless debate about the regular expressions; these
were chosen as "good enough for now".
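For a sense of what these look like, some illustrative patterns in
the spirit described above; the expressions actually shipped with the
"web" tokenizer may differ:

    // illustrative regexp exceptions only, not the shipped set
    var webExceptions = []string{
        `[hH][tT][tT][pP][sS]?://(\S)*`, // URLs
        `\S+@\S+`,                       // email addresses
        `@[a-zA-Z0-9_]+`,                // Twitter @handles
        `#[a-zA-Z0-9_]+`,                // #hashtags
    }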
There is also a "web" analyzer. This is just the "standard"
analyzer, but using the "web" tokenizer instead of the "unicode"
one. NOTE: after processing the exceptions, the "web" tokenizer
still falls back to the standard "unicode" one.
For many users, you can simply set your mapping's default analyzer
to be "web".
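For example, a minimal sketch assuming the bleve import path of this
era:

    package main

    import "github.com/blevesearch/bleve"

    func main() {
        // use the "web" analyzer for all text fields by default
        mapping := bleve.NewIndexMapping()
        mapping.DefaultAnalyzer = "web"

        index, err := bleve.New("example.bleve", mapping)
        if err != nil {
            panic(err)
        }
        defer index.Close()
    }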
closes #269
benchmark              old ns/op     new ns/op     delta
BenchmarkAnalysis-4    1599218       1540991       -3.64%

benchmark              old allocs    new allocs    delta
BenchmarkAnalysis-4    5353          5318          -0.65%

benchmark              old bytes     new bytes     delta
BenchmarkAnalysis-4    370495        362983        -2.03%
this affects all languages using the elision filter (which strips
elided leading articles, e.g. French l'avion -> avion)
languages fr and it are updated now
languages ca and ga are still missing other components and
do not yet have an analyzer, but they should follow this lead
once they are ready
fixes #218
name "exception"
configure with list of regexp string "exceptions"
these exceptions regexps that match sequences you want treated
as a single token. these sequences are NOT sent to the
underlying tokenizer
configure "tokenizer" is the named tokenizer that should be
used for processing all text regions not matching exceptions
An example configuration with simple patterns to match URLs and
email addresses:
    map[string]interface{}{
        "type":      "exception",
        "tokenizer": "unicode",
        "exceptions": []interface{}{
            `[hH][tT][tT][pP][sS]?://(\S)*`,
            `[fF][iI][lL][eE]://(\S)*`,
            `[fF][tT][pP]://(\S)*`,
            `\S+@\S+`,
        },
    }
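A usage sketch, assuming bleve's AddCustomTokenizer/AddCustomAnalyzer
mapping API; the names "url_email" and "url_email_analyzer" are
illustrative only:

    mapping := bleve.NewIndexMapping()

    // register the exception tokenizer under a custom name
    err := mapping.AddCustomTokenizer("url_email", map[string]interface{}{
        "type":      "exception",
        "tokenizer": "unicode",
        "exceptions": []interface{}{
            `[hH][tT][tT][pP][sS]?://(\S)*`,
            `\S+@\S+`,
        },
    })
    if err != nil {
        panic(err)
    }

    // build a custom analyzer on top of it
    err = mapping.AddCustomAnalyzer("url_email_analyzer", map[string]interface{}{
        "type":          "custom",
        "tokenizer":     "url_email",
        "token_filters": []interface{}{"to_lower"},
    })
    if err != nil {
        panic(err)
    }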