
185 Commits

Author SHA1 Message Date
slavikm 680be52f87 Implemented boolean field support 2016-01-11 17:18:03 -08:00
Steve Yen 89d17f01ef analyze locations only if includeTermVectors enabled
With this change, TermLocations are computed and maintained only if
includeTermVectors is enabled, for higher performance.
2016-01-05 12:46:46 -08:00
Steve Yen 325a616993 unicode.Tokenize() avoids array growth via array of arrays 2016-01-02 12:21:25 -08:00
Steve Yen 918732f3d8 unicode.Tokenize() allocs backing array of Tokens
Previously, unicode.Tokenize() would allocate a Token one-by-one, on
an as-needed basis.

This change allocates a "backing array" of Tokens, so that it goes to
the runtime object allocator much less often.  It takes a heuristic
guess as to the backing array size by using the average token
(segment) length seen so far.

Results from micro-benchmark (null-firestorm, bleve-blast) seem to
give perhaps less than ~0.5 MB/second throughput improvement.
2016-01-02 12:21:25 -08:00
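The backing-array idea described in 918732f3d8 can be sketched roughly as below. This is a minimal illustration, not the code from the commit: the `Token` struct, `tokenPool`, `next`, `guessSize`, and the fallback size of 16 are hypothetical names and values chosen for the example.

```go
package main

import "fmt"

// Token is a simplified stand-in for an analysis token (hypothetical).
type Token struct {
	Term  []byte
	Start int
	End   int
}

// tokenPool hands out *Token values carved from a shared backing array,
// so the allocator is hit once per batch instead of once per token.
type tokenPool struct {
	backing []Token
}

// next returns a pointer to an unused Token. When the current backing
// array is exhausted it allocates a new one with n elements.
func (p *tokenPool) next(n int) *Token {
	if len(p.backing) == 0 {
		if n < 1 {
			n = 1
		}
		p.backing = make([]Token, n)
	}
	t := &p.backing[0]
	p.backing = p.backing[1:]
	return t
}

// guessSize estimates how many tokens are still to come, using the
// average token length seen so far (assumed heuristic).
func guessSize(remainingBytes, tokensSoFar, bytesSoFar int) int {
	if tokensSoFar == 0 || bytesSoFar == 0 {
		return 16 // arbitrary starting guess
	}
	avg := bytesSoFar / tokensSoFar
	if avg < 1 {
		avg = 1
	}
	return remainingBytes/avg + 1
}

func main() {
	var pool tokenPool
	// Pretend 10 tokens covering 50 bytes were seen, with 100 bytes left.
	n := guessSize(100, 10, 50)
	a := pool.next(n)
	b := pool.next(n)
	a.Term, b.Term = []byte("hello"), []byte("world")
	fmt.Println(n, string(a.Term), string(b.Term), len(pool.backing))
}
```

The trade-off is that a single live Token keeps its whole backing array reachable, which is acceptable when all tokens of one analysis pass share the same lifetime.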
Steve Yen a345e7951e TokenFrequency() alloc's all TokenLocations up front 2016-01-02 12:21:17 -08:00
Marty Schoch 9777846206 Merge branch 'master' into firestorm 2015-11-30 15:02:46 -05:00
Marty Schoch e472b3e807 add support for a "web" tokenizer/analyzer
The goal of the "web" tokenizer is to recognize web things like
- email addresses
- URLs
- twitter @handles and #hashtags

This implementation uses regexp exceptions.  There will most
likely be endless debate about the regular expressions. These
were chosen as "good enough for now".

There is also a "web" analyzer.  This is just the "standard"
analyzer, but using the "web" tokenizer instead of the "unicode"
one.  NOTE: after processing the exceptions, it still falls back
to the standard "unicode" one.

For many users, you can simply set your mapping's default analyzer
to be "web".

closes #269
2015-11-30 14:27:18 -05:00
Marty Schoch a707d44e0b Merge branch 'master' into firestorm 2015-11-24 09:44:47 -05:00
Ben Campbell 994f4b4d11 added some godoc documentation for the en analyzer 2015-11-18 15:28:57 +13:00
Patrick Mezard ff03874f19 token_map: document it along with stop_token_filter 2015-11-05 14:07:54 +01:00
Patrick Mezard eb26402924 elision_filter: correctly strip multi-bytes quotation marks 2015-11-04 10:59:10 +01:00
Patrick Mezard bae2079eb2 token_filters: fix typo in right single quotation mark name 2015-11-04 10:29:56 +01:00
Marty Schoch 01526e971f Merge branch 'master' into firestorm 2015-10-28 11:26:01 -04:00
Patrick Mezard f95f1d29a0 exception: fail if pattern is empty, name tokenizer in error 2015-10-27 18:53:03 +01:00
Patrick Mezard 8b17787a65 analysis: document "exception" tokenizer, and Tokenizer interface 2015-10-27 18:53:03 +01:00
Patrick Mezard e2fa3d6351 doc: document Token, TokenFrequencies and Field structs
It helps understanding what is going on in indexing code.
ArrayPositions() was particularly puzzling.
2015-10-09 12:32:44 +02:00
Marty Schoch c3a4fab911 Merge pull request #238 from ikawaha/ja-morph-analyzer
fix compilation with the latest changes to kagome
2015-09-28 17:05:46 -04:00
ikawaha 89af7978a9 fix compilation with the latest changes to kagome 2015-09-28 15:53:08 +09:00
Marty Schoch 66aa1b020a Merge branch 'master' into firestorm 2015-09-23 11:32:25 -07:00
Marty Schoch f81b2be334 major refactor of bleve configuration
see #221 for full details
2015-09-16 17:10:59 -04:00
Marty Schoch 37aa5cb027 Merge branch 'master' into firestorm 2015-09-09 09:03:42 -04:00
Marty Schoch d00bc91dc9 minor speed up in token frequency calculations
benchmark               old ns/op     new ns/op     delta
BenchmarkAnalysis-4     1599218       1540991       -3.64%

benchmark               old allocs     new allocs     delta
BenchmarkAnalysis-4     5353           5318           -0.65%

benchmark               old bytes     new bytes     delta
BenchmarkAnalysis-4     370495        362983        -2.03%
2015-09-04 18:57:39 -04:00
Marty Schoch 84811cf5a0 made index type configurable + first version of firestorm 2015-08-25 14:52:42 -04:00
Donald Huang 767831d87c move custom_analyzer to custom_analyzer package 2015-08-11 21:22:03 +00:00
Marty Schoch 3682c25467 update to correctly work with composite fields
also updated search results to return array positions
2015-07-31 11:16:11 -04:00
Marty Schoch 1f4ef3da8b move elision filter after lowercase filter
this affects all languages using the elision filter
languages fr and it are updated now
languages ca and ga are still missing other components and
do not yet have an analyzer, but they should follow this lead
once they are ready

fixes #218
2015-07-21 10:43:53 -04:00
Marty Schoch 65556f45c7 added additional tests for bug #214 2015-07-06 18:00:05 -04:00
Marty Schoch 0f16eccd6b new tokenizer that allows you to pre-identify tokens with regexp
name "exception"
configure with list of regexp string "exceptions"
these exceptions are regexps that match sequences you want treated
as a single token.  these sequences are NOT sent to the
underlying tokenizer
configure "tokenizer" is the named tokenizer that should be
used for processing all text regions not matching exceptions

An example configuration with simple patterns to match URLs and
email addresses:

map[string]interface{}{
	"type":      "exception",
	"tokenizer": "unicode",
	"exceptions": []interface{}{
		`[hH][tT][tT][pP][sS]?://(\S)*`,
		`[fF][iI][lL][eE]://(\S)*`,
		`[fF][tT][pP]://(\S)*`,
		`\S+@\S+`,
	},
}
2015-04-08 15:31:58 -04:00
Marty Schoch 93e01a803e fix issues identified by errcheck
part of #169
2015-04-07 14:52:00 -04:00
Marty Schoch 50bd082257 fixed issues with portuguese analyzer
fixes #70
2015-03-11 14:22:11 -04:00
Marty Schoch 7970f42c29 fix issues with italian analyzer
switch it to not require icu/libstemmer
fixes #69
2015-03-11 11:48:13 -04:00
Marty Schoch eeaf514848 switch fr to not require icu/libstemmer
also corrected copy/paste bug in test
2015-03-11 11:46:33 -04:00
Marty Schoch 8ae30fb6f0 fix issues with lucene stemmer
fixes issue #68
2015-03-11 11:14:29 -04:00
Marty Schoch 300ec79c96 first pass at checking errors that were ignored
part of #169
2015-03-06 14:46:29 -05:00
Salmān Aljammāz 9444af9366 arabic: add unicode normalization to analyzer 2015-02-06 19:50:58 +03:00
Salmān Aljammāz 91a8d5da9f arabic: check minimum length before stemming
This involves converting tokens to a rune slice in the filter, but
at least we're now compatible with Lucene's stemmer.
2015-02-06 19:50:58 +03:00
Salmān Aljammāz 0470f93955 arabic: add more stemmer tests
These came from org.apache.lucene.analysis.ar.
2015-02-06 19:49:30 +03:00
Salmān Aljammāz e461fed92a arabic stemmer: strip multiple suffixes
updates #150
2015-02-05 16:07:58 +03:00
Marty Schoch 4be974f489 added first implementation of arabic analyzer
one test case is not passing and is commented out temporarily
updates #150
2015-02-05 07:44:55 -05:00
Marty Schoch b9c22fe50d Merge pull request #154 from saljam/arabic
add arabic light stemmer
2015-02-05 07:09:54 -05:00
Salmān Aljammāz 945ef8158f add arabic light stemmer
fixes #28
updates #150
2015-02-05 13:24:30 +03:00
Marty Schoch dd1cd189a7 added initial implementation of hindi analyzer
closes #66
2015-02-04 15:12:08 -05:00
Marty Schoch a9f153bac7 fix typo in unicode normalization form constant
also adjusted incorrect tests
fixes #149
2015-01-26 14:09:20 -05:00
Marty Schoch 530613a239 rewrite map access to take advantage of optimization 2015-01-14 12:57:34 -05:00
Marty Schoch 890b1abfe6 new version of lower case filter which tries to avoid copying bytes 2015-01-14 11:34:30 -05:00
Marty Schoch 7cc544adf2 switched to bytes.ToLower for minor speedup 2015-01-14 09:28:57 -05:00
Marty Schoch f000092201 added benchmark for lowercase filter 2015-01-14 09:28:57 -05:00
Steve Yen db82eae3f4 go fmt 2015-01-13 11:04:45 -08:00
Marty Schoch ed06dd0581 switching to unicode tokenizer now that it's faster than regexp 2015-01-12 18:04:34 -05:00
Marty Schoch 0a4844f9d0 change unicode tokenizer to use direct segmenter api 2015-01-12 17:57:45 -05:00
Sacheendra Talluri 4b3967a68e rewrite custom analyzer without using reflect 2015-01-08 00:25:16 +05:30
Sacheendra Talluri 4abf2a638e adds handling of []string type attributes to custom analyzer 2015-01-08 00:08:20 +05:30
Marty Schoch 0ddfa774ec clean up logging to use package level *log.Logger
by default messages go to ioutil.Discard
2014-12-28 12:14:48 -08:00
Silvan Jegen ef18dfe4cd Fix typos in comments and strings 2014-12-18 18:43:12 +01:00
Sergey Avseyev 570109a983 Update "code.google.com" import paths
https://github.com/couchbase/sync_gateway/issues/492
2014-12-10 01:17:49 +03:00
Silvan Jegen 412049d63c Remove unneeded import statements 2014-11-29 14:25:24 +01:00
Marty Schoch fcab645f96 add test to cover kana/ideographic case 2014-11-26 08:42:40 -05:00
Marty Schoch d452b2a10e add support for dictionary based compound word filter
partially addresses #115
2014-11-18 15:18:42 -05:00
Marty Schoch 40a8154bab changed en analyzer to use pure go components
behavior should be similar with unicode segmentation
and a porter stemmer
2014-10-21 16:38:58 -04:00
Marty Schoch c4d1782689 new pure go porter stemmer integrated
renamed original libstemmer porter to "stemmer_porter_classic"
new pure go stemmer is "stemmer_porter"
2014-10-20 16:55:24 -04:00
Marty Schoch cf3643f292 added pure go tokenizer to do unicode word boundary segmentation 2014-10-17 18:07:48 -04:00
Marty Schoch dcb90ad176 added benchmark for tokenizing English text 2014-10-17 18:07:01 -04:00
Marty Schoch febb8d2df1 renamed unicode_word_boundary package to icu
this is in preparation of alternative unicode word boundary impls
2014-10-17 15:15:13 -04:00
Marty Schoch 19d45dfdb6 fix compilation with the latest changes to kagome 2014-10-10 19:59:24 -07:00
Marty Schoch 1dc466a800 modified token filters to avoid creating new token stream
often the result stream was the same length, so can reuse the
existing token stream
also, in cases where a new stream was required, set capacity to
the length of the input stream.  most output streams are at least
as long as the input, so this may avoid some subsequent resizing
2014-09-23 18:41:32 -04:00
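The two cases described in 1dc466a800 can be illustrated as below. This is a sketch rather than the filters touched by the commit: `Token`, `TokenStream`, `lowerInPlace`, and `splitHyphen` are simplified, hypothetical definitions.

```go
package main

import (
	"bytes"
	"fmt"
)

// Token and TokenStream are simplified stand-ins (hypothetical).
type Token struct{ Term []byte }
type TokenStream []*Token

// lowerInPlace produces one output token per input token, so it
// rewrites the tokens in place and returns the stream it was given.
func lowerInPlace(in TokenStream) TokenStream {
	for _, t := range in {
		t.Term = bytes.ToLower(t.Term)
	}
	return in
}

// splitHyphen may emit more tokens than it receives, so it needs a
// new stream. The capacity starts at len(in) to avoid early resizing.
func splitHyphen(in TokenStream) TokenStream {
	out := make(TokenStream, 0, len(in))
	for _, t := range in {
		for _, part := range bytes.Split(t.Term, []byte("-")) {
			if len(part) > 0 {
				out = append(out, &Token{Term: part})
			}
		}
	}
	return out
}

func main() {
	in := TokenStream{
		&Token{Term: []byte("Full-Text")},
		&Token{Term: []byte("Search")},
	}
	for _, t := range splitHyphen(lowerInPlace(in)) {
		fmt.Println(string(t.Term))
	}
}
```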
Marty Schoch 95e6e37e67 added build tag to fix running tests without tag 2014-09-16 11:28:44 -04:00
Marty Schoch 55c0e84665 relocated kagome tokenizer and introduced ja analyzer 2014-09-16 11:21:29 -04:00
Silvan Jegen 29bdc094a9 Use byte positions instead of character positions 2014-09-14 13:19:30 +02:00
Silvan Jegen a8ec7f7af2 Add tests for the Kagome tokenizer 2014-09-13 17:45:22 +02:00
Silvan Jegen ebf100c097 Add the Kagome tokenizer for Japanese 2014-09-13 17:45:19 +02:00
Marty Schoch 1a1cf32a86 introducing cjk_bigram filter and cjk analyzer
closes #34
2014-09-11 10:39:05 -04:00
Marty Schoch cb5ccd2b1d fix whitespace tokenizer
previously would fail to split ascii running into ideographic
2014-09-11 10:38:02 -04:00
Marty Schoch 8debf26cb7 changed many components to not have defaults
many of these defaults were arbitrary, and not having
defaults lets us more easily flag them for configuration
added a shingle filter
introduce new token type for shingles
2014-09-09 18:15:14 -04:00
Marty Schoch 6b4c86b35a changed whitespace tokenizer to work better on cjk input
now it will return each cjk character as a separate token
this will pair well with a cjk bigram filter for indexing
2014-09-07 14:11:01 -04:00
Marty Schoch 933d99c576 rename the configurable token map from standard to custom
this makes it consistent with the "custom" analyzer
which operates similarly
also, added it to the config.go so it's registered and
available for use
2014-09-07 14:09:38 -04:00
Marty Schoch 9e78643bad icu tokenizer uses brk status to set token type
part of #34
2014-09-07 10:24:02 -04:00
Marty Schoch 377ae090d0 additional golint issues resolved 2014-09-03 18:17:26 -04:00
Marty Schoch d534b0836b converted ALL_CAPS constants to CamelCase 2014-09-03 17:48:40 -04:00
Marty Schoch 7a7eb2e94c add newline between license and package
this avoids cluttering godocs with the license
2014-09-02 10:54:50 -04:00
Marty Schoch 1dcd06e412 add ability to define custom analysis as part of index mapping
now, as part of your index mapping you can create custom
analysis components.  these custom analysis components
are serialized as part of the mapping, and reused
as you would expect on subsequent accesses.
2014-09-01 13:55:23 -04:00
Marty Schoch 7bfad18d40 moved byte array converts into the analysis package 2014-08-29 19:23:21 -04:00
Marty Schoch 1161361bea rename imports from couchbaselabs to blevesearch 2014-08-28 15:38:57 -04:00
Marty Schoch e8959d03ae added build tag 'icu' to enable functionality dependent on it 2014-08-25 12:22:01 -04:00
Marty Schoch 21ef6e9878 added build tag for things depending on libstemmer 2014-08-25 12:06:10 -04:00
Marty Schoch 08db2eae42 added alternate build tag 'full' which will be an alias to enable all 2014-08-25 11:40:58 -04:00
Marty Schoch f37bb77794 added build tag to enable cld2 2014-08-25 11:24:20 -04:00
Marty Schoch 092e30a38e tried to word the instructions for static and dynamic linking 2014-08-25 10:54:15 -04:00
deoxxa 22b7b3bc24 compile libcld2 statically 2014-08-24 03:44:57 +10:00
Marty Schoch b48dc87afa added test case clarifying whitespace tokenizer on empty input 2014-08-19 10:43:52 -04:00
Marty Schoch 5dcd39ade7 added turkish analyzer test 2014-08-14 16:42:41 -04:00
Marty Schoch 21408e49eb added thai analyzer test 2014-08-14 16:39:37 -04:00
Marty Schoch 599ef6edce added swedish analyzer test 2014-08-14 16:12:48 -04:00
Marty Schoch 64255e3eb9 added russian analyzer test 2014-08-14 16:11:23 -04:00
Marty Schoch 8896de2039 added romanian analyzer test 2014-08-14 16:06:17 -04:00
Marty Schoch c2937b4b81 added portuguese analyzer test
discrepancies found, logged in #70
failing tests commented out for now
2014-08-14 16:04:29 -04:00
Marty Schoch 81a9d325a2 added norwegian analyzer test 2014-08-14 16:01:03 -04:00
Marty Schoch a3a97a09d3 added dutch analyzer test 2014-08-14 15:59:39 -04:00
Marty Schoch 6714d5d765 added italian analyzer test
discrepancies found between us and lucene, documented in #69
failing tests commented out for now
2014-08-14 15:56:47 -04:00
Marty Schoch b9c0477762 added hungarian analyzer test 2014-08-14 15:51:55 -04:00
Marty Schoch 6a9f8e85ae added french analyzer test
many discrepancies noted, opened issue #68 to track this
failing tests commented out for now
2014-08-14 15:48:32 -04:00