bleve/analysis/tokenizers
Marty Schoch e472b3e807 add support for a "web" tokenizer/analyzer
The goal of the "web" tokenizer is to recognize web things like
- email addresses
- URLs
- twitter @handles and #hashtags

This implementation uses regexp exceptions.  There will most
likely be endless debate about the regular expressions. These
were chosen as "good enough for now".

There is also a "web" analyzer.  This is just the "standard"
analyzer, but using the "web" tokenizer instead of the "unicode"
one.  NOTE: after processing the exceptions, the "web" tokenizer
still falls back to the standard "unicode" tokenizer for the
remaining text.

Many users can simply set their mapping's default analyzer
to "web".

closes #269
2015-11-30 14:27:18 -05:00
exception exception: fail if pattern is empty, name tokenizer in error 2015-10-27 18:53:03 +01:00
regexp_tokenizer changed whitespace tokenizer to work better on cjk input 2014-09-07 14:11:01 -04:00
single_token add newline between license and package 2014-09-02 10:54:50 -04:00
unicode change unicode tokenizer to use direct segmenter api 2015-01-12 17:57:45 -05:00
web add support for a "web" tokenizer/analyzer 2015-11-30 14:27:18 -05:00
whitespace_tokenizer added benchmark for tokenizing English text 2014-10-17 18:07:01 -04:00