Marty Schoch
d00bc91dc9
minor speed up in token frequency calculations
...
benchmark               old ns/op     new ns/op     delta
BenchmarkAnalysis-4     1599218       1540991       -3.64%

benchmark               old allocs     new allocs     delta
BenchmarkAnalysis-4     5353           5318           -0.65%

benchmark               old bytes     new bytes     delta
BenchmarkAnalysis-4     370495        362983        -2.03%
2015-09-04 18:57:39 -04:00
Marty Schoch
84811cf5a0
made index type configurable + first version of firestorm
2015-08-25 14:52:42 -04:00
Donald Huang
767831d87c
move custom_analyzer to custom_analyzer package
2015-08-11 21:22:03 +00:00
Marty Schoch
3682c25467
update to correctly work with composite fields
...
also updated search results to return array positions
2015-07-31 11:16:11 -04:00
Marty Schoch
1f4ef3da8b
move elision filter after lowercase filter
...
this affects all languages using the elision filter
languages fr and it are updated now
languages ca and ga are still missing other components and
do not yet have an analyzer, but they should follow this lead
once they are ready
fixes #218
2015-07-21 10:43:53 -04:00
Marty Schoch
65556f45c7
added additional tests for bug #214
2015-07-06 18:00:05 -04:00
Marty Schoch
0f16eccd6b
new tokenizer that allows you to pre-identify tokens with regexp
...
name "exception"
configure with a list of regexp strings, "exceptions"
these exception regexps match sequences you want treated
as a single token. these sequences are NOT sent to the
underlying tokenizer
configure "tokenizer" with the name of the tokenizer that should be
used for processing all text regions not matching exceptions
An example configuration with simple patterns to match URLs and
email addresses:
map[string]interface{}{
    "type":      "exception",
    "tokenizer": "unicode",
    // each pattern matches a sequence to keep as a single token
    "exceptions": []interface{}{
        `[hH][tT][tT][pP][sS]?://(\S)*`,
        `[fF][iI][lL][eE]://(\S)*`,
        `[fF][tT][pP]://(\S)*`,
        `\S+@\S+`,
    },
}
2015-04-08 15:31:58 -04:00
Marty Schoch
93e01a803e
fix issues identified by errcheck
...
part of #169
2015-04-07 14:52:00 -04:00
Marty Schoch
50bd082257
fixed issues with portuguese analyzer
...
fixes #70
2015-03-11 14:22:11 -04:00
Marty Schoch
7970f42c29
fix issues with italian analyzer
...
switch it to not require icu/libstemmer
fixes #69
2015-03-11 11:48:13 -04:00
Marty Schoch
eeaf514848
switch fr to not require icu/libstemmer
...
also corrected copy/paste bug in test
2015-03-11 11:46:33 -04:00
Marty Schoch
8ae30fb6f0
fix issues with lucene stemmer
...
fixes issue #68
2015-03-11 11:14:29 -04:00
Marty Schoch
300ec79c96
first pass at checking errors that were ignored
...
part of #169
2015-03-06 14:46:29 -05:00
Salmān Aljammāz
9444af9366
arabic: add unicode normalization to analyzer
2015-02-06 19:50:58 +03:00
Salmān Aljammāz
91a8d5da9f
arabic: check minimum length before stemming
...
This involves converting tokens to a rune slice in the filter, but
at least we're now compatible with Lucene's stemmer.
2015-02-06 19:50:58 +03:00
Salmān Aljammāz
0470f93955
arabic: add more stemmer tests
...
These came from org.apache.lucene.analysis.ar.
2015-02-06 19:49:30 +03:00
Salmān Aljammāz
e461fed92a
arabic stemmer: strip multiple suffixes
...
updates #150
2015-02-05 16:07:58 +03:00
Marty Schoch
4be974f489
added first implementation of arabic analyzer
...
one test case is not passing and is commented out temporarily
updates #150
2015-02-05 07:44:55 -05:00
Marty Schoch
b9c22fe50d
Merge pull request #154 from saljam/arabic
...
add arabic light stemmer
2015-02-05 07:09:54 -05:00
Salmān Aljammāz
945ef8158f
add arabic light stemmer
...
fixes #28
updates #150
2015-02-05 13:24:30 +03:00
Marty Schoch
dd1cd189a7
added initial implementation of hindi analyzer
...
closes #66
2015-02-04 15:12:08 -05:00
Marty Schoch
a9f153bac7
fix typo in unicode normalization form constant
...
also adjusted incorrect tests
fixes #149
2015-01-26 14:09:20 -05:00
Marty Schoch
530613a239
rewrite map access to take advantage of optimization
2015-01-14 12:57:34 -05:00
Marty Schoch
890b1abfe6
new version of lower case filter which tries to avoid copying bytes
2015-01-14 11:34:30 -05:00
Marty Schoch
7cc544adf2
switched to bytes.ToLower for minor speedup
2015-01-14 09:28:57 -05:00
Marty Schoch
f000092201
added benchmark for lowercase filter
2015-01-14 09:28:57 -05:00
Steve Yen
db82eae3f4
go fmt
2015-01-13 11:04:45 -08:00
Marty Schoch
ed06dd0581
switching to unicode tokenizer now that it's faster than regexp
2015-01-12 18:04:34 -05:00
Marty Schoch
0a4844f9d0
change unicode tokenizer to use direct segmenter api
2015-01-12 17:57:45 -05:00
Sacheendra Talluri
4b3967a68e
rewrite custom analyzer without using reflect
2015-01-08 00:25:16 +05:30
Sacheendra Talluri
4abf2a638e
adds handling of []string type attributes to custom analyzer
2015-01-08 00:08:20 +05:30
Marty Schoch
0ddfa774ec
clean up logging to use package level *log.Logger
...
by default messages go to ioutil.Discard
2014-12-28 12:14:48 -08:00
Silvan Jegen
ef18dfe4cd
Fix typos in comments and strings
2014-12-18 18:43:12 +01:00
Sergey Avseyev
570109a983
Update "code.google.com" import paths
...
https://github.com/couchbase/sync_gateway/issues/492
2014-12-10 01:17:49 +03:00
Silvan Jegen
412049d63c
Remove unneeded import statements
2014-11-29 14:25:24 +01:00
Marty Schoch
fcab645f96
add test to cover kana/ideographic case
2014-11-26 08:42:40 -05:00
Marty Schoch
d452b2a10e
add support for dictionary based compound word filter
...
partially addresses #115
2014-11-18 15:18:42 -05:00
Marty Schoch
40a8154bab
changed en analyzer to use pure go components
...
behavior should be similar with unicode segmentation
and a porter stemmer
2014-10-21 16:38:58 -04:00
Marty Schoch
c4d1782689
new pure go porter stemmer integrated
...
renamed original libstemmer porter to "stemmer_porter_classic"
new pure go stemmer is "stemmer_porter"
2014-10-20 16:55:24 -04:00
Marty Schoch
cf3643f292
added pure go tokenizer to do unicode word boundary segmentation
2014-10-17 18:07:48 -04:00
Marty Schoch
dcb90ad176
added benchmark for tokenizing English text
2014-10-17 18:07:01 -04:00
Marty Schoch
febb8d2df1
renamed unicode_word_boundary package to icu
...
this is in preparation for alternative unicode word boundary impls
2014-10-17 15:15:13 -04:00
Marty Schoch
19d45dfdb6
fix compilation with the latest changes to kagome
2014-10-10 19:59:24 -07:00
Marty Schoch
1dc466a800
modified token filters to avoid creating new token stream
...
often the result stream was the same length, so can reuse the
existing token stream
also, in cases where a new stream was required, set capacity to
the length of the input stream. most output streams are at least
as long as the input, so this may avoid some subsequent resizing
2014-09-23 18:41:32 -04:00
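A rough sketch of the two cases described in the commit above. The `Token` and `TokenStream` types and both filters are simplified, hypothetical stand-ins rather than the library's own definitions.

```go
package main

import (
	"bytes"
	"fmt"
)

// Token and TokenStream are simplified stand-ins for illustration.
type Token struct {
	Term     []byte
	Position int
}

type TokenStream []*Token

// lowerInPlace produces exactly one output token per input token,
// so it rewrites the terms and returns the input stream itself.
func lowerInPlace(input TokenStream) TokenStream {
	for _, t := range input {
		t.Term = bytes.ToLower(t.Term)
	}
	return input
}

// splitOnHyphen may produce more tokens than it receives, so it needs
// a new stream; the capacity starts at len(input) because the output
// is at least that long.
func splitOnHyphen(input TokenStream) TokenStream {
	rv := make(TokenStream, 0, len(input))
	for _, t := range input {
		for _, part := range bytes.Split(t.Term, []byte("-")) {
			rv = append(rv, &Token{Term: part, Position: len(rv) + 1})
		}
	}
	return rv
}

func main() {
	in := TokenStream{{Term: []byte("Hello"), Position: 1}, {Term: []byte("Wi-Fi"), Position: 2}}
	out := splitOnHyphen(lowerInPlace(in))
	for _, t := range out {
		fmt.Printf("%d %s\n", t.Position, t.Term)
	}
}
```

Pre-sizing does not remove all growth (the second filter still grows here), but it skips the early reallocations that an empty slice would go through.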
Marty Schoch
95e6e37e67
added build tag to fix running tests without tag
2014-09-16 11:28:44 -04:00
Marty Schoch
55c0e84665
relocated kagome tokenizer and introduced ja analyzer
2014-09-16 11:21:29 -04:00
Silvan Jegen
29bdc094a9
Use byte positions instead of character positions
2014-09-14 13:19:30 +02:00
Silvan Jegen
a8ec7f7af2
Add tests for the Kagome tokenizer
2014-09-13 17:45:22 +02:00
Silvan Jegen
ebf100c097
Add the Kagome tokenizer for Japanese
2014-09-13 17:45:19 +02:00
Marty Schoch
1a1cf32a86
introducing cjk_bigram filter and cjk analyzer
...
closes #34
2014-09-11 10:39:05 -04:00
Marty Schoch
cb5ccd2b1d
fix whitespace tokenizer
...
previously would fail to split ascii running into ideographic
2014-09-11 10:38:02 -04:00
Marty Schoch
8debf26cb7
changed many components to not have defaults
...
many of these defaults were arbitrary, and not having
defaults lets us more easily flag them for configuration
added a shingle filter
introduce new token type for shingles
2014-09-09 18:15:14 -04:00
Marty Schoch
6b4c86b35a
changed whitespace tokenizer to work better on cjk input
...
now it will return each cjk character as a separate token
this will pair well with a cjk bigram filter for indexing
2014-09-07 14:11:01 -04:00
Marty Schoch
933d99c576
rename the configurable token map from standard to custom
...
this makes it consistent with the "custom" analyzer
which operates similarly
also, added it to config.go so it's registered and
available for use
2014-09-07 14:09:38 -04:00
Marty Schoch
9e78643bad
icu tokenizer uses brk status to set token type
...
part of #34
2014-09-07 10:24:02 -04:00
Marty Schoch
377ae090d0
additional golint issues resolved
2014-09-03 18:17:26 -04:00
Marty Schoch
d534b0836b
converted ALL_CAPS constants to CamelCase
2014-09-03 17:48:40 -04:00
Marty Schoch
7a7eb2e94c
add newline between license and package
...
this avoids cluttering godocs with the license
2014-09-02 10:54:50 -04:00
Marty Schoch
1dcd06e412
add ability to define custom analysis as part of index mapping
...
now, as part of your index mapping you can create custom
analysis components. these custom analysis components
are serialized as part of the mapping, and reused
as you would expect on subsequent accesses.
2014-09-01 13:55:23 -04:00
Marty Schoch
7bfad18d40
moved byte array converts into the analysis package
2014-08-29 19:23:21 -04:00
Marty Schoch
1161361bea
rename imports from couchbaselabs to blevesearch
2014-08-28 15:38:57 -04:00
Marty Schoch
e8959d03ae
added build tag 'icu' to enable functionality dependent on it
2014-08-25 12:22:01 -04:00
Marty Schoch
21ef6e9878
added build tag for things depending on libstemmer
2014-08-25 12:06:10 -04:00
Marty Schoch
08db2eae42
added alternate build tag 'full' which will be an alias to enable all
2014-08-25 11:40:58 -04:00
Marty Schoch
f37bb77794
added build tag to enable cld2
2014-08-25 11:24:20 -04:00
Marty Schoch
092e30a38e
tried to word the instructions for static and dynamic linking
2014-08-25 10:54:15 -04:00
deoxxa
22b7b3bc24
compile libcld2 statically
2014-08-24 03:44:57 +10:00
Marty Schoch
b48dc87afa
added test case clarifying whitespace tokenizer on empty input
2014-08-19 10:43:52 -04:00
Marty Schoch
5dcd39ade7
added turkish analyzer test
2014-08-14 16:42:41 -04:00
Marty Schoch
21408e49eb
added thai analyzer test
2014-08-14 16:39:37 -04:00
Marty Schoch
599ef6edce
added swedish analyzer test
2014-08-14 16:12:48 -04:00
Marty Schoch
64255e3eb9
added russian analyzer test
2014-08-14 16:11:23 -04:00
Marty Schoch
8896de2039
added romanian analyzer test
2014-08-14 16:06:17 -04:00
Marty Schoch
c2937b4b81
added portuguese analyzer test
...
discrepancies found, logged in #70
failing tests commented out for now
2014-08-14 16:04:29 -04:00
Marty Schoch
81a9d325a2
added norwegian analyzer test
2014-08-14 16:01:03 -04:00
Marty Schoch
a3a97a09d3
added dutch analyzer test
2014-08-14 15:59:39 -04:00
Marty Schoch
6714d5d765
added italian analyzer test
...
discrepancies found between us and lucene, documented in #69
failing tests commented out for now
2014-08-14 15:56:47 -04:00
Marty Schoch
b9c0477762
added hungarian analyzer test
2014-08-14 15:51:55 -04:00
Marty Schoch
6a9f8e85ae
added french analyzer test
...
many discrepancies noted, opened issue #68 to track this
failing tests commented out for now
2014-08-14 15:48:32 -04:00
Marty Schoch
f6f17c7a9e
added finnish analyzer test
2014-08-14 15:27:45 -04:00
Marty Schoch
80d7c4f870
added persian analyzer test
2014-08-14 15:24:42 -04:00
Marty Schoch
2ef7c80c92
added spanish analyzer test
2014-08-14 14:44:46 -04:00
Marty Schoch
4398aab723
added sorani analyzer test
2014-08-14 14:42:36 -04:00
Marty Schoch
b22941ee37
added test for danish analyzer
2014-08-14 14:36:24 -04:00
Marty Schoch
8c9997f1e2
added test for german analyzer
2014-08-14 14:33:30 -04:00
Marty Schoch
6a951b9372
added analyzer test for english
2014-08-14 14:28:24 -04:00
Marty Schoch
c526a38369
major refactor of analysis files, now wired up to registry
...
ultimately this is to make it more convenient for us to wire up
different elements of the analysis pipeline, without having to
preload everything into memory before we need it
separately the index layer now has a mechanism for storing
internal key/value pairs. this is expected to be used to
store the mapping, and possibly other pieces of data by the
top layer, but not exposed to the user at the top.
2014-08-13 21:14:47 -04:00
Marty Schoch
3481ec9cef
added hindi stemmer
...
closes #40
2014-08-11 22:29:47 -04:00
Marty Schoch
c65f7415ff
added hindi normalizer
...
closes #64
2014-08-11 19:51:47 -04:00
Marty Schoch
cd0e3fd85b
added german normalizer
...
updated german analyzer to use this normalizer
closes #65
2014-08-11 19:25:37 -04:00
Marty Schoch
a4707ebb4e
configured zero width non joiner char filter, and persian analyzer
2014-08-11 18:57:04 -04:00
Marty Schoch
4ccd69ed45
added arabic normalizer
...
closes #63
2014-08-11 18:35:35 -04:00
Marty Schoch
73b252f6a6
added persian normalizer
...
closes #67
2014-08-11 18:15:41 -04:00
Marty Schoch
e21b7f4436
added sorani normalizer and stemmer, now have analyzer
...
closes #43
2014-08-08 09:38:28 -04:00
Marty Schoch
ef35ea1985
added czech stop word list
...
closes #36
2014-08-07 22:32:49 -04:00
Marty Schoch
964b87f76e
added rune tokenizer
...
not used directly right now, but basis for other simple tokenizers
2014-08-07 22:14:26 -04:00
Marty Schoch
0e54fbd8da
added keyword marker filter
...
updated stemmer filter to not stem tokens marked as keyword
closes #48
2014-08-07 08:13:00 -04:00
Marty Schoch
c19270108c
added ngram and edge ngram token filters
...
closes #46 and closes #47
2014-08-06 22:11:42 -04:00
Marty Schoch
9a777aaa80
added token truncate filter
...
closes #49
2014-08-06 20:39:42 -04:00
Marty Schoch
d84187fd24
added apostrophe filter to improve turkish analyzer
...
closes #27
2014-08-06 08:50:00 -04:00