bleve/analysis/tokenizers
Marty Schoch 0f16eccd6b new tokenizer that allows you to pre-identify tokens with regexp
name "exception"
configure with a list of regexp strings under "exceptions";
these exceptions are regexps that match sequences you want
treated as a single token.  These sequences are NOT sent to the
underlying tokenizer.
configure "tokenizer" with the name of the tokenizer that should
be used for processing all text regions not matching exceptions.

An example configuration with simple patterns to match URLs and
email addresses:

map[string]interface{}{
	"type":      "exception",
	"tokenizer": "unicode",
	"exceptions": []interface{}{
		`[hH][tT][tT][pP][sS]?://(\S)*`,
		`[fF][iI][lL][eE]://(\S)*`,
		`[fF][tT][pP]://(\S)*`,
		`\S+@\S+`,
	},
}
2015-04-08 15:31:58 -04:00
exception new tokenizer that allows you to pre-identify tokens with regexp 2015-04-08 15:31:58 -04:00
icu clean up logging to use package level *log.Logger 2014-12-28 12:14:48 -08:00
regexp_tokenizer changed whitespace tokenizer to work better on cjk input 2014-09-07 14:11:01 -04:00
single_token add newline between license and package 2014-09-02 10:54:50 -04:00
unicode change unicode tokenizer to use direct segmenter api 2015-01-12 17:57:45 -05:00
whitespace_tokenizer added benchmark for tokenizing English text 2014-10-17 18:07:01 -04:00