bleve

Author	SHA1	Message	Date
Marty Schoch	fedb46269e	updated whitespace to behave more like lucene/es	2016-06-10 15:30:43 -04:00
Marty Schoch	0b171c85da	change "simple" analyzer to use "letter" tokenizer this change improves compatibility with the simple analyzer defined by Lucene. this has important implications for some perf tests as well as they often use the simple analyzer.	2016-03-31 15:13:17 -04:00
Steve Yen	325a616993	unicode.Tokenize() avoids array growth via array of arrays	2016-01-02 12:21:25 -08:00
Steve Yen	918732f3d8	unicode.Tokenize() allocs backing array of Tokens Previously, unicode.Tokenize() would allocate a Token one-by-one, on an as-needed basis. This change allocates a "backing array" of Tokens, so that it goes to the runtime object allocator much less often. It takes a heuristic guess as to the backing array size by using the average token (segment) length seen so far. Results from micro-benchmark (null-firestorm, bleve-blast) seem to give perhaps less than ~0.5 MB/second throughput improvement.	2016-01-02 12:21:25 -08:00
Marty Schoch	e472b3e807	add support for a "web" tokenizer/analyzer The goal of the "web" tokenizer is to recognize web things like - email addresses - URLs - twitter @handles and #hashtags This implementation uses regexp exceptions. There will most likely be endless debate about the regular expressions. These were chosen as "good enough for now". There is also a "web" analyzer. This is just the "standard" analyzer, but using the "web" tokenizer instead of the "unicode" one. NOTE: after processing the exceptions, it still falls back to the standard "unicode" one. For many users, you can simply set your mapping's default analyzer to be "web". closes #269	2015-11-30 14:27:18 -05:00
Patrick Mezard	f95f1d29a0	exception: fail if pattern is empty, name tokenizer in error	2015-10-27 18:53:03 +01:00
Patrick Mezard	8b17787a65	analysis: document "exception" tokenizer, and Tokenizer interface	2015-10-27 18:53:03 +01:00
Marty Schoch	f81b2be334	major refactor of bleve configuration see #221 for full details	2015-09-16 17:10:59 -04:00
Marty Schoch	0f16eccd6b	new tokenizer that allows you to pre-identify tokens with regexp name "exception" configure with list of regexp string "exceptions" these exception regexps match sequences you want treated as a single token. these sequences are NOT sent to the underlying tokenizer configure "tokenizer" is the named tokenizer that should be used for processing all text regions not matching exceptions An example configuration with simple patterns to match URLs and email addresses: map[string]interface{}{ "type": "exception", "tokenizer": "unicode", "exceptions": []interface{}{ `[hH][tT][tT][pP][sS]?://(\S)*`, `[fF][iI][lL][eE]://(\S)*`, `[fF][tT][pP]://(\S)*`, `\S+@\S+`, } }	2015-04-08 15:31:58 -04:00
Marty Schoch	0a4844f9d0	change unicode tokenizer to use direct segmenter api	2015-01-12 17:57:45 -05:00
Marty Schoch	0ddfa774ec	clean up logging to use package level *log.Logger by default messages go to ioutil.Discard	2014-12-28 12:14:48 -08:00
Marty Schoch	fcab645f96	add test to cover kana/ideographic case	2014-11-26 08:42:40 -05:00
Marty Schoch	cf3643f292	added pure go tokenizer to do unicode word boundary segmentation	2014-10-17 18:07:48 -04:00
Marty Schoch	dcb90ad176	added benchmark for tokenizing English text	2014-10-17 18:07:01 -04:00
Marty Schoch	febb8d2df1	renamed unicode_word_boundary package to icu this is in preparation of alternative unicode word boundary impls	2014-10-17 15:15:13 -04:00
Marty Schoch	55c0e84665	relocated kagome tokenizer and introduced ja analyzer	2014-09-16 11:21:29 -04:00
Silvan Jegen	29bdc094a9	Use byte positions instead of character positions	2014-09-14 13:19:30 +02:00
Silvan Jegen	a8ec7f7af2	Add tests for the Kagome tokenizer	2014-09-13 17:45:22 +02:00
Silvan Jegen	ebf100c097	Add the Kagome tokenizer for Japanese	2014-09-13 17:45:19 +02:00
Marty Schoch	cb5ccd2b1d	fix whitespace tokenizer previously would fail to split ascii running into ideographic	2014-09-11 10:38:02 -04:00
Marty Schoch	8debf26cb7	changed many components to not have defaults many of these defaults were arbitrary, and not having defaults lets us more easily flag them for configuration added a shingle filter introduce new token type for shingles	2014-09-09 18:15:14 -04:00
Marty Schoch	6b4c86b35a	changed whitespace tokenizer to work better on cjk input now it will return each cjk character as a separate token this will pair well with a cjk bigram filter for indexing	2014-09-07 14:11:01 -04:00
Marty Schoch	9e78643bad	icu tokenizer uses brk status to set token type part of #34	2014-09-07 10:24:02 -04:00
Marty Schoch	7a7eb2e94c	add newline between license and package this avoids cluttering godocs with the license	2014-09-02 10:54:50 -04:00
Marty Schoch	1161361bea	rename imports from couchbaselabs to blevesearch	2014-08-28 15:38:57 -04:00
Marty Schoch	e8959d03ae	added build tag 'icu' to enable functionality dependent on it	2014-08-25 12:22:01 -04:00
Marty Schoch	b48dc87afa	added test case clarifying whitespace tokenizer on empty input	2014-08-19 10:43:52 -04:00
Marty Schoch	6a951b9372	added analyzer test for english	2014-08-14 14:28:24 -04:00
Marty Schoch	c526a38369	major refactor of analysis files, now wired up to registry ultimately this is to make it more convenient for us to wire up different elements of the analysis pipeline, without having to preload everything into memory before we need it separately the index layer now has a mechanism for storing internal key/value pairs. this is expected to be used to store the mapping, and possibly other pieces of data by the top layer, but not exposed to the user at the top.	2014-08-13 21:14:47 -04:00
Marty Schoch	964b87f76e	added rune tokenizer not used directly right now, but basis for other simple tokenizers	2014-08-07 22:14:26 -04:00
Marty Schoch	25540c736a	introduced token type	2014-07-31 13:54:12 -04:00
Marty Schoch	2968d3538a	major refactor, apologies for the large commit removed analyzers (these are now built as needed through config) removed html character filter (now built as needed through config) added missing license header changed constructor signature of filters that cannot return errors filter constructors that can have errors, now have Must variant which panics change cld2 tokenizer into filter (should only see lower-case input) new top level index api, closes #5 refactored index tests to not rely directly on analyzers moved query objects to top-level new top level search api, closes #12 top score collector allows skipping results index mapping supports _all by default, closes #3 and closes #6 index mapping supports disabled sections, closes #7 new http sub package with reusable http.Handler's, closes #22	2014-07-30 12:30:38 -04:00
Marty Schoch	d7341524aa	trying to fix compilation on drone	2014-07-21 18:00:59 -04:00
Marty Schoch	737dcb6118	fixing c++ issues on drone.io	2014-07-21 17:49:53 -04:00
Marty Schoch	b629636424	new tokenizer which uses cld2 to guess the field's language	2014-07-21 17:21:31 -04:00
Marty Schoch	900b54e240	changed to not use pkg-config, brittle on some platforms	2014-04-18 11:50:14 -04:00
Marty Schoch	3d842dfaf2	initial commit	2014-04-17 16:55:53 -04:00

37 Commits