0
0
Commit Graph

37 Commits

Author SHA1 Message Date
Marty Schoch
fedb46269e updated whtitepsace to behave more like lucene/es 2016-06-10 15:30:43 -04:00
Marty Schoch
0b171c85da change "simple" analyzer to use "letter" tokenizer
this change improves compatibility with the simple analyzer
defined by Lucene.  this has important implications for
some perf tests as well as they often use the simple
analyzer.
2016-03-31 15:13:17 -04:00
Steve Yen
325a616993 unicode.Tokenize() avoids array growth via array of arrays 2016-01-02 12:21:25 -08:00
Steve Yen
918732f3d8 unicode.Tokenize() allocs backing array of Tokens
Previously, unicode.Tokenize() would allocate a Token one-by-one, on
an as-needed basis.

This change allocates a "backing array" of Tokens, so that it goes to
the runtime object allocator much less often.  It takes a heuristic
guess as to the backing array size by using the average token
(segment) length seen so far.

Results from micro-benchmark (null-firestorm, bleve-blast) seem to
give perhaps less than ~0.5 MB/second throughput improvement.
2016-01-02 12:21:25 -08:00
Marty Schoch
e472b3e807 add support for a "web" tokenizer/analyzer
The goal of the "web" tokenizer is to recognize web things like
- email addresses
- URLs
- twitter @handles and #hashtags

This implementation uses regexp exceptions.  There will most
likely be endless debate about the regular expressions. These
were chosein as "good enough for now".

There is also a "web" analyzer.  This is just the "standard"
analyzer, but using the "web" tokenizer instead of the "unicode"
one.  NOTE: after processing the exceptions, it still falls back
to the standard "unicode" one.

For many users, you can simply set your mapping's default analyzer
to be "web".

closes #269
2015-11-30 14:27:18 -05:00
Patrick Mezard
f95f1d29a0 exception: fail if pattern is empty, name tokenizer in error 2015-10-27 18:53:03 +01:00
Patrick Mezard
8b17787a65 analysis: document "exception" tokenizer, and Tokenizer interface 2015-10-27 18:53:03 +01:00
Marty Schoch
f81b2be334 major refactor of bleve configuration
see #221 for full details
2015-09-16 17:10:59 -04:00
Marty Schoch
0f16eccd6b new tokenizer that allows you to pre-identify tokens with regexp
name "exception"
configure with list of regexp string "exceptions"
these exceptions regexps that match sequences you want treated
as a single token.  these sequences are NOT sent to the
underlying tokenizer
configure "tokenizer" is the named tokenizer that should be
used for processing all text regions not matching exceptions

An example configuration with simple patterns to match URLs and
email addresses:

map[string]interface{}{
	"type":      "exception",
	"tokenizer": "unicode",
	"exceptions": []interface{}{
		`[hH][tT][tT][pP][sS]?://(\S)*`,
		`[fF][iI][lL][eE]://(\S)*`,
		`[fF][tT][pP]://(\S)*`,
		`\S+@\S+`,
  }
}
2015-04-08 15:31:58 -04:00
Marty Schoch
0a4844f9d0 change unicode tokenizer to use direct segmenter api 2015-01-12 17:57:45 -05:00
Marty Schoch
0ddfa774ec clean up logging to use package level *log.Logger
by default messages go to ioutil.Discard
2014-12-28 12:14:48 -08:00
Marty Schoch
fcab645f96 add test to cover kana/ideographic case 2014-11-26 08:42:40 -05:00
Marty Schoch
cf3643f292 added pure go tokenizer to do unicode word boundary segmentation 2014-10-17 18:07:48 -04:00
Marty Schoch
dcb90ad176 added benchmark for tokenizing English text 2014-10-17 18:07:01 -04:00
Marty Schoch
febb8d2df1 renamed unicode_word_boundary package to icu
this is in preparation of alternative unicode word boundary impls
2014-10-17 15:15:13 -04:00
Marty Schoch
55c0e84665 relocated kagome tokenizer and introduced ja analyzer 2014-09-16 11:21:29 -04:00
Silvan Jegen
29bdc094a9 Use byte positions instead of character positions 2014-09-14 13:19:30 +02:00
Silvan Jegen
a8ec7f7af2 Add tests for the Kagome tokenizer 2014-09-13 17:45:22 +02:00
Silvan Jegen
ebf100c097 Add the Kagome tokenizer for Japanese 2014-09-13 17:45:19 +02:00
Marty Schoch
cb5ccd2b1d fix whitespace tokenizer
previously would fail to split ascii running into ideographic
2014-09-11 10:38:02 -04:00
Marty Schoch
8debf26cb7 changed many components to not have defaults
many of these defaults were arbitrary, and not having
defaults lets us more easily flag them for configuration
added a shingle filter
introduce new toke type for shingles
2014-09-09 18:15:14 -04:00
Marty Schoch
6b4c86b35a changed whitespace tokenizer to work better on cjk input
now it will return each cjk character as a separate token
this will pair well with a cjk bigram filter for indexing
2014-09-07 14:11:01 -04:00
Marty Schoch
9e78643bad icu tokenier uses brk status to set token type
part of #34
2014-09-07 10:24:02 -04:00
Marty Schoch
7a7eb2e94c add newline between license and package
this avoids cluttering godocs with the license
2014-09-02 10:54:50 -04:00
Marty Schoch
1161361bea rename imports from couchbaselabs to blevesearch 2014-08-28 15:38:57 -04:00
Marty Schoch
e8959d03ae added build tag 'icu' to enable functionality dependent on it 2014-08-25 12:22:01 -04:00
Marty Schoch
b48dc87afa added test case clarifying whitespace tokenizer on empty input 2014-08-19 10:43:52 -04:00
Marty Schoch
6a951b9372 added analyzer test for english 2014-08-14 14:28:24 -04:00
Marty Schoch
c526a38369 major refactor of analysis files, now wired up to registry
ultimately this is make it more convenient for us to wire up
different elements of the analysis pipeline, without having to
preload everything into memory before we need it

separately the index layer now has a mechanism for storing
internal key/value pairs.  this is expected to be used to
store the mapping, and possibly other pieces of data by the
top layer, but not exposed to the user at the top.
2014-08-13 21:14:47 -04:00
Marty Schoch
964b87f76e added rune tokenizer
not used directly right now, but basis for other simple tokenizers
2014-08-07 22:14:26 -04:00
Marty Schoch
25540c736a introduced token type 2014-07-31 13:54:12 -04:00
Marty Schoch
2968d3538a major refactor, apologies for the large commit
removed analyzers (these are now built as needed through config)
removed html chacter filter (now built as needed through config)
added missing license header
changed constructor signature of filters that cannot return errors
filter constructors that can have errors, now have Must variant which panics
change cdl2 tokenizer into filter (should only see lower-case input)
new top level index api, closes #5
refactored index tests to not rely directly on analyzers
moved query objects to top-level
new top level search api, closes #12
top score collector allows skipping results
index mapping supports _all by default, closes #3 and closes #6
index mapping supports disabled sections, closes #7
new http sub package with reusable http.Handler's, closes #22
2014-07-30 12:30:38 -04:00
Marty Schoch
d7341524aa trying to fix compilation on drone 2014-07-21 18:00:59 -04:00
Marty Schoch
737dcb6118 fixing c++ issues on drone.io 2014-07-21 17:49:53 -04:00
Marty Schoch
b629636424 new tokenizer which uses cld2 to guess the field's language 2014-07-21 17:21:31 -04:00
Marty Schoch
900b54e240 changed to not use pkg-config, brittle on some platforms 2014-04-18 11:50:14 -04:00
Marty Schoch
3d842dfaf2 initial commit 2014-04-17 16:55:53 -04:00