bleve/analysis/tokenizers
Marty Schoch 0f16eccd6b new tokenizer that allows you to pre-identify tokens with regexp
name "exception"
configure with a list of regexp strings under "exceptions";
these exceptions are regexps that match sequences you want
treated as a single token.  These sequences are NOT sent to the
underlying tokenizer.
configure "tokenizer" with the name of the tokenizer that should
be used for processing all text regions not matching exceptions.

An example configuration with simple patterns to match URLs and
email addresses:

map[string]interface{}{
	"type":      "exception",
	"tokenizer": "unicode",
	"exceptions": []interface{}{
		`[hH][tT][tT][pP][sS]?://(\S)*`,
		`[fF][iI][lL][eE]://(\S)*`,
		`[fF][tT][pP]://(\S)*`,
		`\S+@\S+`,
	},
}
2015-04-08 15:31:58 -04:00
exception new tokenizer that allows you to pre-identify tokens with regexp 2015-04-08 15:31:58 -04:00
icu clean up logging to use package level *log.Logger 2014-12-28 12:14:48 -08:00
regexp_tokenizer changed whitespace tokenizer to work better on cjk input 2014-09-07 14:11:01 -04:00
single_token add newline between license and package 2014-09-02 10:54:50 -04:00
unicode change unicode tokenizer to use direct segmenter api 2015-01-12 17:57:45 -05:00
whitespace_tokenizer added benchmark for tokenizing English text 2014-10-17 18:07:01 -04:00