185 Commits

Author SHA1 Message Date
Sacheendra Talluri 4b3967a68e rewrite custom analyzer without using reflect 2015-01-08 00:25:16 +05:30
Sacheendra Talluri 4abf2a638e adds handling of []string type attributes to custom analyzer 2015-01-08 00:08:20 +05:30
Marty Schoch 0ddfa774ec clean up logging to use package level *log.Logger
by default messages go to ioutil.Discard
2014-12-28 12:14:48 -08:00
Silvan Jegen ef18dfe4cd Fix typos in comments and strings 2014-12-18 18:43:12 +01:00
Sergey Avseyev 570109a983 Update "code.google.com" import paths
https://github.com/couchbase/sync_gateway/issues/492
2014-12-10 01:17:49 +03:00
Silvan Jegen 412049d63c Remove unneeded import statements 2014-11-29 14:25:24 +01:00
Marty Schoch fcab645f96 add test to cover kana/ideographic case 2014-11-26 08:42:40 -05:00
Marty Schoch d452b2a10e add support for dictionary based compound word filter
partially addresses #115
2014-11-18 15:18:42 -05:00
Marty Schoch 40a8154bab changed en analyzer to use pure go components
behavior should be similar with unicode segmentation
and a porter stemmer
2014-10-21 16:38:58 -04:00
Marty Schoch c4d1782689 new pure go porter stemmer integrated
renamed original libstemmer porter to "stemmer_porter_classic"
new pure go stemmer is "stemmer_porter"
2014-10-20 16:55:24 -04:00
Marty Schoch cf3643f292 added pure go tokenizer to do unicode word boundary segmentation 2014-10-17 18:07:48 -04:00
Marty Schoch dcb90ad176 added benchmark for tokenizing English text 2014-10-17 18:07:01 -04:00
Marty Schoch febb8d2df1 renamed unicode_word_boundary package to icu
this is in preparation of alternative unicode word boundary impls
2014-10-17 15:15:13 -04:00
Marty Schoch 19d45dfdb6 fix compilation with the latest changes to kagome 2014-10-10 19:59:24 -07:00
Marty Schoch 1dc466a800 modified token filters to avoid creating new token stream
often the result stream was the same length, so can reuse the
existing token stream
also, in cases where a new stream was required, set capacity to
the length of the input stream.  most output streams are at least
as long as the input, so this may avoid some subsequent resizing
2014-09-23 18:41:32 -04:00
Marty Schoch 95e6e37e67 added build tag to fix running tests without tag 2014-09-16 11:28:44 -04:00
Marty Schoch 55c0e84665 relocated kagome tokenizer and introduced ja analyzer 2014-09-16 11:21:29 -04:00
Silvan Jegen 29bdc094a9 Use byte positions instead of character positions 2014-09-14 13:19:30 +02:00
Silvan Jegen a8ec7f7af2 Add tests for the Kagome tokenizer 2014-09-13 17:45:22 +02:00
Silvan Jegen ebf100c097 Add the Kagome tokenizer for Japanese 2014-09-13 17:45:19 +02:00
Marty Schoch 1a1cf32a86 introducing cjk_bigram filter and cjk analyzer
closes #34
2014-09-11 10:39:05 -04:00
Marty Schoch cb5ccd2b1d fix whitespace tokenizer
previously would fail to split ascii running into ideographic
2014-09-11 10:38:02 -04:00
Marty Schoch 8debf26cb7 changed many components to not have defaults
many of these defaults were arbitrary, and not having
defaults lets us more easily flag them for configuration
added a shingle filter
introduce new token type for shingles
2014-09-09 18:15:14 -04:00
Marty Schoch 6b4c86b35a changed whitespace tokenizer to work better on cjk input
now it will return each cjk character as a separate token
this will pair well with a cjk bigram filter for indexing
2014-09-07 14:11:01 -04:00
Marty Schoch 933d99c576 rename the configurable token map from standard to custom
this makes it consistent with the "custom" analyzer
which operates similarly
also, added it to the config.go so it's registered and
available for use
2014-09-07 14:09:38 -04:00
Marty Schoch 9e78643bad icu tokenizer uses brk status to set token type
part of #34
2014-09-07 10:24:02 -04:00
Marty Schoch 377ae090d0 additional golint issues resolved 2014-09-03 18:17:26 -04:00
Marty Schoch d534b0836b converted ALL_CAPS constants to CamelCase 2014-09-03 17:48:40 -04:00
Marty Schoch 7a7eb2e94c add newline between license and package
this avoids cluttering godocs with the license
2014-09-02 10:54:50 -04:00
Marty Schoch 1dcd06e412 add ability to define custom analysis as part of index mapping
now, as part of your index mapping you can create custom
analysis components.  these custom analysis components
are serialized as part of the mapping, and reused
as you would expect on subsequent accesses.
2014-09-01 13:55:23 -04:00
Marty Schoch 7bfad18d40 moved byte array converts into the analysis package 2014-08-29 19:23:21 -04:00
Marty Schoch 1161361bea rename imports from couchbaselabs to blevesearch 2014-08-28 15:38:57 -04:00
Marty Schoch e8959d03ae added build tag 'icu' to enable functionality dependent on it 2014-08-25 12:22:01 -04:00
Marty Schoch 21ef6e9878 added build tag for things depending on libstemmer 2014-08-25 12:06:10 -04:00
Marty Schoch 08db2eae42 added alternate build tag 'full' which will be an alias to enable all 2014-08-25 11:40:58 -04:00
Marty Schoch f37bb77794 added build tag to enable cld2 2014-08-25 11:24:20 -04:00
Marty Schoch 092e30a38e tried to word the instructions for static and dynamic linking 2014-08-25 10:54:15 -04:00
deoxxa 22b7b3bc24 compile libcld2 statically 2014-08-24 03:44:57 +10:00
Marty Schoch b48dc87afa added test case clarifying whitespace tokenizer on empty input 2014-08-19 10:43:52 -04:00
Marty Schoch 5dcd39ade7 added turkish analyzer test 2014-08-14 16:42:41 -04:00
Marty Schoch 21408e49eb added thai analyzer test 2014-08-14 16:39:37 -04:00
Marty Schoch 599ef6edce added swedish analyzer test 2014-08-14 16:12:48 -04:00
Marty Schoch 64255e3eb9 added russian analyzer test 2014-08-14 16:11:23 -04:00
Marty Schoch 8896de2039 added romanian analyzer test 2014-08-14 16:06:17 -04:00
Marty Schoch c2937b4b81 added portuguese analyzer test
discrepancies found, logged in #70
failing tests commented out for now
2014-08-14 16:04:29 -04:00
Marty Schoch 81a9d325a2 added norwegian analyzer test 2014-08-14 16:01:03 -04:00
Marty Schoch a3a97a09d3 added dutch analyzer test 2014-08-14 15:59:39 -04:00
Marty Schoch 6714d5d765 added italian analyzer test
discrepancies found between us and lucene, documented in #69
failing tests commented out for now
2014-08-14 15:56:47 -04:00
Marty Schoch b9c0477762 added hungarian analyzer test 2014-08-14 15:51:55 -04:00
Marty Schoch 6a9f8e85ae added french analyzer test
many discrepancies noted, opened issue #68 to track this
failing tests commented out for now
2014-08-14 15:48:32 -04:00
Marty Schoch f6f17c7a9e added finnish analyzer test 2014-08-14 15:27:45 -04:00
Marty Schoch 80d7c4f870 added persian analyzer test 2014-08-14 15:24:42 -04:00
Marty Schoch 2ef7c80c92 added spanish analyzer test 2014-08-14 14:44:46 -04:00
Marty Schoch 4398aab723 added sorani analyzer test 2014-08-14 14:42:36 -04:00
Marty Schoch b22941ee37 added test for danish analyzer 2014-08-14 14:36:24 -04:00
Marty Schoch 8c9997f1e2 added test for german analyzer 2014-08-14 14:33:30 -04:00
Marty Schoch 6a951b9372 added analyzer test for english 2014-08-14 14:28:24 -04:00
Marty Schoch c526a38369 major refactor of analysis files, now wired up to registry
ultimately this is to make it more convenient for us to wire up
different elements of the analysis pipeline, without having to
preload everything into memory before we need it

separately the index layer now has a mechanism for storing
internal key/value pairs.  this is expected to be used to
store the mapping, and possibly other pieces of data by the
top layer, but not exposed to the user at the top.
2014-08-13 21:14:47 -04:00
Marty Schoch 3481ec9cef added hindi stemmer
closes #40
2014-08-11 22:29:47 -04:00
Marty Schoch c65f7415ff added hindi normalizer
closes #64
2014-08-11 19:51:47 -04:00
Marty Schoch cd0e3fd85b added german normalizer
updated german analyzer to use this normalizer
closes #65
2014-08-11 19:25:37 -04:00
Marty Schoch a4707ebb4e configured zero width non joiner char filter, and persian analyzer 2014-08-11 18:57:04 -04:00
Marty Schoch 4ccd69ed45 added arabic normalizer
closes #63
2014-08-11 18:35:35 -04:00
Marty Schoch 73b252f6a6 added persian normalizer
closes #67
2014-08-11 18:15:41 -04:00
Marty Schoch e21b7f4436 added sorani normalizer and stemmer, now have analyzer
closes #43
2014-08-08 09:38:28 -04:00
Marty Schoch ef35ea1985 added czech stop word list
closes #36
2014-08-07 22:32:49 -04:00
Marty Schoch 964b87f76e added rune tokenizer
not used directly right now, but basis for other simple tokenizers
2014-08-07 22:14:26 -04:00
Marty Schoch 0e54fbd8da added keyword marker filter
updated stemmer filter to not stem tokens marked as keyword
closes #48
2014-08-07 08:13:00 -04:00
Marty Schoch c19270108c added ngram and edge ngram token filters
closes #46 and closes #47
2014-08-06 22:11:42 -04:00
Marty Schoch 9a777aaa80 added token truncate filter
closes #49
2014-08-06 20:39:42 -04:00
Marty Schoch d84187fd24 added apostrophe filter to improve turkish analyzer
closes #27
2014-08-06 08:50:00 -04:00
Marty Schoch 79ab2b9b3d added unicode normalization filter 2014-08-04 21:59:57 -04:00
Marty Schoch 2c0bf23fac added elision filter
defined article word maps for french, italian, irish and catalan
defined elision filters for these same languages
updated analyzers for french and italian to use this new filter
irish and catalan still depend on other missing pieces
closes #25
2014-08-03 19:17:35 -04:00
Marty Schoch 0960cab0ae refactored StopWordsMap into WordMap so it can be reused
the ElisionFilter will need a word list of articles and plan to reuse this
2014-08-03 17:46:35 -04:00
Marty Schoch 00d6f9700b added support for date range fields and queries
closes #9 and closes #11
2014-08-03 17:19:04 -04:00
Marty Schoch 25540c736a introduced token type 2014-07-31 13:54:12 -04:00
Marty Schoch 3eb63a887b improved stop word support and related config
stop words can be loaded from files/bytes, closes #19
stop words loaded for large list of languages, closes #20
defined language specific analyzers for as much as possible right now, closes #21
opened new issues for some of the remaining gaps
2014-07-30 19:29:52 -04:00
Marty Schoch 2968d3538a major refactor, apologies for the large commit
removed analyzers (these are now built as needed through config)
removed html character filter (now built as needed through config)
added missing license header
changed constructor signature of filters that cannot return errors
filter constructors that can have errors, now have Must variant which panics
change cld2 tokenizer into filter (should only see lower-case input)
new top level index api, closes #5
refactored index tests to not rely directly on analyzers
moved query objects to top-level
new top level search api, closes #12
top score collector allows skipping results
index mapping supports _all by default, closes #3 and closes #6
index mapping supports disabled sections, closes #7
new http sub package with reusable http.Handler's, closes #22
2014-07-30 12:30:38 -04:00
Marty Schoch d7341524aa trying to fix compilation on drone 2014-07-21 18:00:59 -04:00
Marty Schoch 737dcb6118 fixing c++ issues on drone.io 2014-07-21 17:49:53 -04:00
Marty Schoch b629636424 new tokenizer which uses cld2 to guess the field's language 2014-07-21 17:21:31 -04:00
Marty Schoch 70a8b03bed added support for composite fields 2014-07-21 17:05:55 -04:00
Marty Schoch 900b54e240 changed to not use pkg-config, brittle on some platforms 2014-04-18 11:50:14 -04:00
Marty Schoch 9058db20ec fix commit of old file 2014-04-18 11:09:36 -04:00
Marty Schoch 3d842dfaf2 initial commit 2014-04-17 16:55:53 -04:00