0
0
Commit Graph

50 Commits

Author SHA1 Message Date
a-little-srdjan
efe573bc10 removing duplicate code by reusing util.go in analysis 2016-06-09 15:13:30 -04:00
Marty Schoch
5722d7b1d1 Merge pull request #384 from a-little-srdjan/ngram_int_bounds
Preventing panic on ngram initialization, and extending the type conversion
2016-06-08 23:45:32 -04:00
a-little-srdjan
9341cc835e Preventing panic on ngram initialization, and extending the type conversion. 2016-06-06 15:54:18 -04:00
a-little-srdjan
3f2701a97c init. simple camel case parser. 2016-06-03 11:04:21 -04:00
Ben Campbell
994f4b4d11 added some godoc documentation for the en analyzer 2015-11-18 15:28:57 +13:00
Patrick Mezard
ff03874f19 token_map: document it along with stop_token_filter 2015-11-05 14:07:54 +01:00
Patrick Mezard
eb26402924 elision_filter: correctly strip multi-bytes quotation marks 2015-11-04 10:59:10 +01:00
Patrick Mezard
bae2079eb2 token_filters: fix typo in right single quotation mark name 2015-11-04 10:29:56 +01:00
Marty Schoch
f81b2be334 major refactor of bleve configuration
see #221 for full details
2015-09-16 17:10:59 -04:00
Marty Schoch
a9f153bac7 fix typo in unicode normalization form constant
also adjusted incorrect tests
fixes #149
2015-01-26 14:09:20 -05:00
Marty Schoch
530613a239 rewrite map access to take advantage of optimization 2015-01-14 12:57:34 -05:00
Marty Schoch
890b1abfe6 new version of lower case filter which tries to avoid copying bytes 2015-01-14 11:34:30 -05:00
Marty Schoch
7cc544adf2 switched to bytes.ToLower for minor speedup 2015-01-14 09:28:57 -05:00
Marty Schoch
f000092201 added benchmark for lowercase filter 2015-01-14 09:28:57 -05:00
Steve Yen
db82eae3f4 go fmt 2015-01-13 11:04:45 -08:00
Silvan Jegen
ef18dfe4cd Fix typos in comments and strings 2014-12-18 18:43:12 +01:00
Sergey Avseyev
570109a983
Update "code.google.com" import paths
https://github.com/couchbase/sync_gateway/issues/492
2014-12-10 01:17:49 +03:00
Marty Schoch
d452b2a10e add support for dictionary based compound word filter
partially addresses #115
2014-11-18 15:18:42 -05:00
Marty Schoch
c4d1782689 new pure go porter stemmer integrated
renamed original libstemmer porter to "stemmer_porter_classic"
new pure go stemmer is "stemmer_porter"
2014-10-20 16:55:24 -04:00
Marty Schoch
1dc466a800 modified token filters to avoid creating new token stream
often the result stream was the same length, so can reuse the
existing token stream
also, in cases where a new stream was required, set capacity to
the length of the input stream.  most output stream are at least
as long as the input, so this may avoid some subsequent resizing
2014-09-23 18:41:32 -04:00
Marty Schoch
8debf26cb7 changed many components to not have defaults
many of these defaults were arbitrary, and not having
defaults lets us more easily flag them for configuration
added a shingle filter
introduce new toke type for shingles
2014-09-09 18:15:14 -04:00
Marty Schoch
d534b0836b converted ALL_CAPS constants to CamelCase 2014-09-03 17:48:40 -04:00
Marty Schoch
7a7eb2e94c add newline between license and package
this avoids cluttering godocs with the license
2014-09-02 10:54:50 -04:00
Marty Schoch
1dcd06e412 add ability to define custom analysis as part of index mapping
now, as part of your index mapping you can create custom
analysis components.  these custome analysis components
are serialized as part of the mapping, and reused
as you would expect on subsequent accesses.
2014-09-01 13:55:23 -04:00
Marty Schoch
1161361bea rename imports from couchbaselabs to blevesearch 2014-08-28 15:38:57 -04:00
Marty Schoch
21ef6e9878 added build tag for things depending on libstemmer 2014-08-25 12:06:10 -04:00
Marty Schoch
08db2eae42 added alternate build tag 'full' which will be an alias to enable all 2014-08-25 11:40:58 -04:00
Marty Schoch
f37bb77794 added build tag to enable cld2 2014-08-25 11:24:20 -04:00
Marty Schoch
092e30a38e tried to word the instructions for static and dynamic linking 2014-08-25 10:54:15 -04:00
deoxxa
22b7b3bc24 compile libcld2 statically 2014-08-24 03:44:57 +10:00
Marty Schoch
6a951b9372 added analyzer test for english 2014-08-14 14:28:24 -04:00
Marty Schoch
c526a38369 major refactor of analysis files, now wired up to registry
ultimately this is make it more convenient for us to wire up
different elements of the analysis pipeline, without having to
preload everything into memory before we need it

separately the index layer now has a mechanism for storing
internal key/value pairs.  this is expected to be used to
store the mapping, and possibly other pieces of data by the
top layer, but not exposed to the user at the top.
2014-08-13 21:14:47 -04:00
Marty Schoch
3481ec9cef added hindi stemmer
closes #40
2014-08-11 22:29:47 -04:00
Marty Schoch
c65f7415ff added hindi normalizer
closes #64
2014-08-11 19:51:47 -04:00
Marty Schoch
cd0e3fd85b added german normalizer
updated german analyzer to use this normalizer
closes #65
2014-08-11 19:25:37 -04:00
Marty Schoch
4ccd69ed45 added arabic normalizer
closes #63
2014-08-11 18:35:35 -04:00
Marty Schoch
73b252f6a6 added persian normalizer
closes #67
2014-08-11 18:15:41 -04:00
Marty Schoch
e21b7f4436 added sorani normalizer and stemmer, now have analyzer
closes #43
2014-08-08 09:38:28 -04:00
Marty Schoch
ef35ea1985 added czech stop word list
closes #36
2014-08-07 22:32:49 -04:00
Marty Schoch
0e54fbd8da added keyword marker filter
updated stemmer filter to not stem tokens marked as keyword
closes #48
2014-08-07 08:13:00 -04:00
Marty Schoch
c19270108c added ngram and edge ngram token filters
closes #46 and closes #47
2014-08-06 22:11:42 -04:00
Marty Schoch
9a777aaa80 added token truncate filter
closes #49
2014-08-06 20:39:42 -04:00
Marty Schoch
d84187fd24 added apostrophe filter to improve turkish analyzer
closes #27
2014-08-06 08:50:00 -04:00
Marty Schoch
79ab2b9b3d added unicode normalization filter 2014-08-04 21:59:57 -04:00
Marty Schoch
2c0bf23fac added elision filter
defined article word maps for french, italian, irish and catalan
defined elision filters for these same languages
updated analyers for french and italian to use this new filter
irish and catalan still depend on other missing pieces
closes #25
2014-08-03 19:17:35 -04:00
Marty Schoch
0960cab0ae refactored StopWordsMap into WordMap so it can be reused
the ElisionFilter will need a word list of articles and plan to reuse this
2014-08-03 17:46:35 -04:00
Marty Schoch
25540c736a introduced token type 2014-07-31 13:54:12 -04:00
Marty Schoch
3eb63a887b improved stop word support and related config
stop words can be loaded from files/bytes, closes #19
stop words loaded for large list of languages, closes #20
defined language specific analyzers for as much as possible right now, closes #21
opened new issues for some of the remaining gaps
2014-07-30 19:29:52 -04:00
Marty Schoch
2968d3538a major refactor, apologies for the large commit
removed analyzers (these are now built as needed through config)
removed html chacter filter (now built as needed through config)
added missing license header
changed constructor signature of filters that cannot return errors
filter constructors that can have errors, now have Must variant which panics
change cdl2 tokenizer into filter (should only see lower-case input)
new top level index api, closes #5
refactored index tests to not rely directly on analyzers
moved query objects to top-level
new top level search api, closes #12
top score collector allows skipping results
index mapping supports _all by default, closes #3 and closes #6
index mapping supports disabled sections, closes #7
new http sub package with reusable http.Handler's, closes #22
2014-07-30 12:30:38 -04:00
Marty Schoch
3d842dfaf2 initial commit 2014-04-17 16:55:53 -04:00