Commit Graph

116 Commits

Author SHA1 Message Date
Marty Schoch
c3a4fab911 Merge pull request #238 from ikawaha/ja-morph-analyzer
fix compilation with the latest changes to kagome
2015-09-28 17:05:46 -04:00
ikawaha
89af7978a9 fix compilation with the latest changes to kagome 2015-09-28 15:53:08 +09:00
Marty Schoch
f81b2be334 major refactor of bleve configuration
see #221 for full details
2015-09-16 17:10:59 -04:00
Marty Schoch
d00bc91dc9 minor speed up in token frequency calculations
benchmark               old ns/op     new ns/op     delta
BenchmarkAnalysis-4     1599218       1540991       -3.64%

benchmark               old allocs     new allocs     delta
BenchmarkAnalysis-4     5353           5318           -0.65%

benchmark               old bytes     new bytes     delta
BenchmarkAnalysis-4     370495        362983        -2.03%
2015-09-04 18:57:39 -04:00
Donald Huang
767831d87c move custom_analyzer to custom_analyzer package 2015-08-11 21:22:03 +00:00
Marty Schoch
3682c25467 update to correctly work with composite fields
also updated search results to return array positions
2015-07-31 11:16:11 -04:00
Marty Schoch
1f4ef3da8b move elision filter after lowercase filter
this affects all languages using the elision filter
languages fr and it are updated now
languages ca and ga are still missing other components and
do not yet have an analyzer, but they should follow this lead
once they are ready

fixes #218
2015-07-21 10:43:53 -04:00
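
To make the ordering above concrete, here is a minimal sketch of registering a custom analyzer whose token filter chain runs the lowercase filter before the elision filter. The component names "unicode", "to_lower", and "elision_fr" are assumptions about the registered names at the time, and AddCustomAnalyzer is assumed to be the index-mapping hook for this.

package main

import "github.com/blevesearch/bleve"

func main() {
	mapping := bleve.NewIndexMapping()
	// lowercase first, then elision, per the ordering this commit establishes
	err := mapping.AddCustomAnalyzer("fr_like", map[string]interface{}{
		"type":      "custom",
		"tokenizer": "unicode",
		"token_filters": []interface{}{
			"to_lower",
			"elision_fr",
		},
	})
	if err != nil {
		panic(err)
	}
}
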
Marty Schoch
65556f45c7 added additional tests for bug #214 2015-07-06 18:00:05 -04:00
Marty Schoch
0f16eccd6b new tokenizer that allows you to pre-identify tokens with regexps
name "exception"
configure with a list of regexp strings under "exceptions"
these exception regexps match sequences you want treated
as a single token.  these sequences are NOT sent to the
underlying tokenizer
the "tokenizer" setting names the tokenizer that should be
used for processing all text regions not matching exceptions

An example configuration with simple patterns to match URLs and
email addresses:

map[string]interface{}{
	"type":      "exception",
	"tokenizer": "unicode",
	"exceptions": []interface{}{
		`[hH][tT][tT][pP][sS]?://(\S)*`,
		`[fF][iI][lL][eE]://(\S)*`,
		`[fF][tT][pP]://(\S)*`,
		`\S+@\S+`,
	},
}
2015-04-08 15:31:58 -04:00
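
A hedged sketch of wiring a configuration like the one above into an index mapping. The tokenizer and analyzer names here ("url_email", "url_email_analyzer") are hypothetical, and AddCustomTokenizer/AddCustomAnalyzer are assumed to be the registration hooks available at this point in bleve's history.

package main

import "github.com/blevesearch/bleve"

func main() {
	mapping := bleve.NewIndexMapping()
	err := mapping.AddCustomTokenizer("url_email", map[string]interface{}{
		"type":      "exception",
		"tokenizer": "unicode",
		"exceptions": []interface{}{
			`[hH][tT][tT][pP][sS]?://(\S)*`,
			`\S+@\S+`,
		},
	})
	if err != nil {
		panic(err)
	}
	// use the new tokenizer inside a custom analyzer
	err = mapping.AddCustomAnalyzer("url_email_analyzer", map[string]interface{}{
		"type":          "custom",
		"tokenizer":     "url_email",
		"token_filters": []interface{}{"to_lower"},
	})
	if err != nil {
		panic(err)
	}
}
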
Marty Schoch
93e01a803e fix issues identified by errcheck
part of #169
2015-04-07 14:52:00 -04:00
Marty Schoch
50bd082257 fixed issues with portuguese analyzer
fixes #70
2015-03-11 14:22:11 -04:00
Marty Schoch
7970f42c29 fix issues with italian analyzer
switch it to not require icu/libstemmer
fixes #69
2015-03-11 11:48:13 -04:00
Marty Schoch
eeaf514848 switch fr to not require icu/libstemmer
also corrected copy/paste bug in test
2015-03-11 11:46:33 -04:00
Marty Schoch
8ae30fb6f0 fix issues with lucene stemmer
fixes issue #68
2015-03-11 11:14:29 -04:00
Marty Schoch
300ec79c96 first pass at checking errors that were ignored
part of #169
2015-03-06 14:46:29 -05:00
Salmān Aljammāz
9444af9366 arabic: add unicode normalization to analyzer 2015-02-06 19:50:58 +03:00
Salmān Aljammāz
91a8d5da9f arabic: check minimum length before stemming
This involves converting tokens to a rune slice in the filter, but
at least we're now compatible with Lucene's stemmer.
2015-02-06 19:50:58 +03:00
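
A minimal illustration (not the actual filter code) of the check described above: measure the token in runes rather than bytes before stemming. The threshold and helper names are hypothetical.

package stemmer

const minLength = 3 // hypothetical threshold, for illustration only

// maybeStem skips terms that are too short when counted in runes;
// Arabic text is multi-byte in UTF-8, so len(term) in bytes would overcount.
func maybeStem(term []byte) []byte {
	runes := []rune(string(term))
	if len(runes) < minLength {
		return term
	}
	return stem(runes)
}

// stem stands in for the real prefix/suffix stripping.
func stem(runes []rune) []byte {
	return []byte(string(runes))
}
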
Salmān Aljammāz
0470f93955 arabic: add more stemmer tests
These came from org.apache.lucene.analysis.ar.
2015-02-06 19:49:30 +03:00
Salmān Aljammāz
e461fed92a arabic stemmer: strip multiple suffixes
updates #150
2015-02-05 16:07:58 +03:00
Marty Schoch
4be974f489 added first implementation of arabic analyzer
one test case is not passing and is commented out temporarily
updates #150
2015-02-05 07:44:55 -05:00
Marty Schoch
b9c22fe50d Merge pull request #154 from saljam/arabic
add arabic light stemmer
2015-02-05 07:09:54 -05:00
Salmān Aljammāz
945ef8158f add arabic light stemmer
fixes #28
updates #150
2015-02-05 13:24:30 +03:00
Marty Schoch
dd1cd189a7 added initial implementation of hindi analyzer
closes #66
2015-02-04 15:12:08 -05:00
Marty Schoch
a9f153bac7 fix typo in unicode normalization form constant
also adjusted incorrect tests
fixes #149
2015-01-26 14:09:20 -05:00
Marty Schoch
530613a239 rewrite map access to take advantage of optimization 2015-01-14 12:57:34 -05:00
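
The commit title above does not say which optimization it means; one plausible reading (an assumption) is the Go compiler's special handling of map lookups keyed directly by string(byteSlice), which avoids allocating the intermediate string. A sketch:

package freq

// termCount looks up a []byte term in a string-keyed map without an extra
// allocation: the compiler recognizes the m[string(b)] pattern in a read.
func termCount(freqs map[string]int, term []byte) int {
	return freqs[string(term)]
}
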
Marty Schoch
890b1abfe6 new version of lower case filter which tries to avoid copying bytes 2015-01-14 11:34:30 -05:00
Marty Schoch
7cc544adf2 switched to bytes.ToLower for minor speedup 2015-01-14 09:28:57 -05:00
Marty Schoch
f000092201 added benchmark for lowercase filter 2015-01-14 09:28:57 -05:00
Steve Yen
db82eae3f4 go fmt 2015-01-13 11:04:45 -08:00
Marty Schoch
ed06dd0581 switching to unicode tokenizer now that it's faster than regexp 2015-01-12 18:04:34 -05:00
Marty Schoch
0a4844f9d0 change unicode tokenizer to use direct segmenter api 2015-01-12 17:57:45 -05:00
Sacheendra Talluri
4b3967a68e rewrite custom analyzer without using reflect 2015-01-08 00:25:16 +05:30
Sacheendra Talluri
4abf2a638e adds handling of []string type attributes to custom analyzer 2015-01-08 00:08:20 +05:30
Marty Schoch
0ddfa774ec clean up logging to use package level *log.Logger
by default messages go to ioutil.Discard
2014-12-28 12:14:48 -08:00
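
A minimal sketch of the pattern described above: a package-level *log.Logger that discards messages by default, plus a setter to direct them elsewhere. The name SetLog follows bleve's convention, but treat the exact placement and signature as an assumption.

package bleve

import (
	"io/ioutil"
	"log"
)

// logger discards everything until a caller installs a real destination.
var logger = log.New(ioutil.Discard, "bleve", log.LstdFlags)

// SetLog installs a logger for the package; pass nil to silence it again.
func SetLog(l *log.Logger) {
	if l == nil {
		logger = log.New(ioutil.Discard, "bleve", log.LstdFlags)
		return
	}
	logger = l
}
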
Silvan Jegen
ef18dfe4cd Fix typos in comments and strings 2014-12-18 18:43:12 +01:00
Sergey Avseyev
570109a983 Update "code.google.com" import paths
https://github.com/couchbase/sync_gateway/issues/492
2014-12-10 01:17:49 +03:00
Silvan Jegen
412049d63c Remove unneeded import statements 2014-11-29 14:25:24 +01:00
Marty Schoch
fcab645f96 add test to cover kana/ideographic case 2014-11-26 08:42:40 -05:00
Marty Schoch
d452b2a10e add support for dictionary based compound word filter
partially addresses #115
2014-11-18 15:18:42 -05:00
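
A language-agnostic sketch of the dictionary-based decompounding idea behind the filter above; this is not the bleve filter's code, and the greedy longest-match strategy and names here are purely illustrative.

package compound

// decompound greedily splits word into the longest dictionary entries it can
// find; if any region fails to match, it falls back to the original word.
func decompound(word string, dict map[string]bool) []string {
	var parts []string
	for start := 0; start < len(word); {
		longest := ""
		for end := start + 1; end <= len(word); end++ {
			if dict[word[start:end]] {
				longest = word[start:end]
			}
		}
		if longest == "" {
			return []string{word}
		}
		parts = append(parts, longest)
		start += len(longest)
	}
	return parts
}
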
Marty Schoch
40a8154bab changed en analyzer to use pure go components
behavior should be similar with unicode segmentation
and a porter stemmer
2014-10-21 16:38:58 -04:00
Marty Schoch
c4d1782689 new pure go porter stemmer integrated
renamed original libstemmer porter to "stemmer_porter_classic"
new pure go stemmer is "stemmer_porter"
2014-10-20 16:55:24 -04:00
Marty Schoch
cf3643f292 added pure go tokenizer to do unicode word boundary segmentation 2014-10-17 18:07:48 -04:00
Marty Schoch
dcb90ad176 added benchmark for tokenizing English text 2014-10-17 18:07:01 -04:00
Marty Schoch
febb8d2df1 renamed unicode_word_boundary package to icu
this is in preparation for alternative unicode word boundary impls
2014-10-17 15:15:13 -04:00
Marty Schoch
19d45dfdb6 fix compilation with the latest changes to kagome 2014-10-10 19:59:24 -07:00
Marty Schoch
1dc466a800 modified token filters to avoid creating a new token stream
often the result stream is the same length, so the existing
token stream can be reused
also, in cases where a new stream is required, set its capacity to
the length of the input stream.  most output streams are at least
as long as the input, so this may avoid some subsequent resizing
2014-09-23 18:41:32 -04:00
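
A hedged sketch (not the actual bleve filters) of the two patterns described above, using analysis.TokenStream from the bleve analysis package: rewrite terms in place and return the input stream when its length is unchanged, and pre-size a new stream to the input length when tokens may be dropped.

package filters

import (
	"bytes"

	"github.com/blevesearch/bleve/analysis"
)

// inPlaceFilter rewrites each term in place and returns the input stream
// itself, avoiding a new slice allocation entirely.
type inPlaceFilter struct{}

func (f *inPlaceFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
	for _, tok := range input {
		tok.Term = bytes.ToLower(tok.Term) // placeholder transform
	}
	return input
}

// droppingFilter may remove tokens, so it builds a new stream, but with
// capacity set to the input length to avoid most resizing during append.
type droppingFilter struct{}

func (f *droppingFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
	rv := make(analysis.TokenStream, 0, len(input))
	for _, tok := range input {
		if len(tok.Term) > 0 {
			rv = append(rv, tok)
		}
	}
	return rv
}
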
Marty Schoch
95e6e37e67 added build tag to fix running tests without tag 2014-09-16 11:28:44 -04:00
Marty Schoch
55c0e84665 relocated kagome tokenizer and introduced ja analyzer 2014-09-16 11:21:29 -04:00
Silvan Jegen
29bdc094a9 Use byte positions instead of character positions 2014-09-14 13:19:30 +02:00
Silvan Jegen
a8ec7f7af2 Add tests for the Kagome tokenizer 2014-09-13 17:45:22 +02:00