Commit Graph

185 Commits

Author SHA1 Message Date
Ethan Koenig 012d436dd7 Add UniqueTerm token filter 2018-01-16 22:24:51 -08:00
Marty Schoch 09a61a7a38 add analyzers for several languages
Having pure Go snowball stemmers allows us to add support for
many languages into the core of bleve.  Specifically we just
added: Russian, Danish, Finnish, Hungarian, Dutch, Norwegian,
Romanian, Swedish, Turkish
2018-01-10 16:00:29 -05:00
Marty Schoch a9532e510a refactor slightly to use our new hosted snowball stemmers
rather than having each package include it directly inside of
bleve, we have decided to host them all in one repo

https://github.com/blevesearch/snowballstem

this makes them easier for the rest of the community to use
outside of bleve contexts
2018-01-10 15:15:31 -05:00
Marty Schoch af198c833f Merge branch 'ru_analyzer' of https://github.com/sokolovstas/bleve into sokolovstas-ru_analyzer 2018-01-10 10:29:15 -05:00
Ethan Koenig 0433f05d9c Fix test 2017-06-22 18:56:28 -04:00
Ethan Koenig 8994ad2e00 Fix token start/end/position values in camelCase tokenizer 2017-06-22 17:42:39 -04:00
Stanislav Sokolov d8d57e6990 Added Russian analyzer with snowball stemmer 2017-06-05 18:01:01 +05:00
Marty Schoch 8435ce5054 Merge pull request #587 from mschoch/add-spanish
add a pure Go Spanish analyzer
2017-04-29 19:48:18 -04:00
Marty Schoch b8de2df68d add a pure Go Spanish analyzer
this introduces a new light Spanish stemmer
and moves the other pure Go Spanish analysis components back
in from the blevex package

the libstemmer version of the stemmer will remain in blevex
2017-04-29 19:31:43 -04:00
Marty Schoch ce901a8870 add a pure Go German analyzer
this introduces a new German light stemmer
and moves the other pure Go German analysis components back
in from the blevex package

the libstemmer version of the stemmer will remain in blevex
2017-04-29 18:46:58 -04:00
Marty Schoch b9db744def add validation which checks the type of char/token filters
when specified in the custom type of analyzer
2017-02-24 15:57:10 -05:00
Marty Schoch 782dbecfe1 fix edge ngram output when side=Back and input token len=max
edge condition was incorrectly checked
fixes #523
2017-01-30 20:29:20 -05:00
Sho Minagawa 5537688394 fix the confusing variable name 2017-01-14 20:26:08 +09:00
Steve Yen 6a38fa3719 go fmt 2016-10-12 09:39:43 -07:00
Michael Nitschinger 7e656dad32 Address special unicode sigma at end of term when lowercasing.
Σ maps to σ, except at the end of a word where it maps to ς.
This is the only conditional (contextual) but language-independent
mapping in unicode.
2016-10-11 12:37:08 +02:00
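
The sigma rule described in commit 7e656dad32 can be illustrated with a minimal sketch. This is not the code from the commit: the function name `lowerTerm` is hypothetical, and it assumes each call receives exactly one term, so "end of term" simply means the last rune.

```go
package main

import (
	"fmt"
	"unicode"
)

const (
	capitalSigma = '\u03A3' // Σ
	finalSigma   = '\u03C2' // ς
)

// lowerTerm lowercases a single term (hypothetical helper).
// A capital sigma in the last position becomes final sigma (ς);
// unicode.ToLower on its own would map it to σ everywhere.
func lowerTerm(term string) string {
	runes := []rune(term)
	for i, r := range runes {
		if r == capitalSigma && i == len(runes)-1 {
			runes[i] = finalSigma
			continue
		}
		runes[i] = unicode.ToLower(r)
	}
	return string(runes)
}

func main() {
	fmt.Println(lowerTerm("ΟΔΟΣ"))  // sigma at the end of the term
	fmt.Println(lowerTerm("ΣΟΦΙΑ")) // sigma at the start of the term
}
```

The sketch allocates a rune slice for clarity; it makes no attempt to reproduce the allocation behaviour of the real filter.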
Michael Nitschinger ff35d75aa4 Skip already lowercased runes on transformation.
The LowerCaseFilter works on the original slice to avoid allocations,
so skipping already lowercased runes avoids unnecessary work.

benchmark                      old ns/op     new ns/op     delta
BenchmarkLowerCaseFilter-8     1302          815           -37.40%
2016-10-11 12:03:26 +02:00
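
A rough sketch of the idea in commit ff35d75aa4, under stated assumptions: the helper `lowerInPlace` is hypothetical, it rewrites the caller's byte slice directly, and it falls back to `bytes.ToLower` (which copies) when a lowercase rune needs a different number of bytes than the original. The actual filter may handle that case differently.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode"
	"unicode/utf8"
)

// lowerInPlace lowercases term inside its own backing array where possible.
// Runes that are already lowercase are skipped without any write.
func lowerInPlace(term []byte) []byte {
	for i := 0; i < len(term); {
		r, size := utf8.DecodeRune(term[i:])
		l := unicode.ToLower(r)
		if l == r {
			// Already lowercase (or caseless): nothing to do.
			i += size
			continue
		}
		if utf8.RuneLen(l) != size {
			// Encoded width changes, so an in-place write is unsafe.
			// Assumed fallback: let bytes.ToLower build a fresh copy.
			return bytes.ToLower(term)
		}
		utf8.EncodeRune(term[i:], l)
		i += size
	}
	return term
}

func main() {
	term := []byte("Hello WORLD")
	fmt.Printf("%s\n", lowerInPlace(term))
}
```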
Marty Schoch f3dc89699d address golint warnings 2016-10-02 10:47:40 -04:00
Marty Schoch cd6b409971 fix code i carelessly broke 2016-10-02 10:39:20 -04:00
Marty Schoch d4d3e7a043 address golint naming issues 2016-10-02 10:35:24 -04:00
Marty Schoch 2332455bd2 nicer formatting of license header 2016-10-02 10:13:14 -04:00
Marty Schoch 6bf9dd59ab BREAKING CHANGE - additional package renaming
i recently learned that package names should also prefer the
singular form, not the plural form
2016-10-01 17:20:59 -04:00
Marty Schoch 35da361bfa BREAKING CHANGE - renamed packages to be shorter and not use _
this commit only addresses the analysis sub-package
2016-09-30 12:36:10 -04:00
Marty Schoch c5159251a9 make shingle token filter stateless
the previous implementation was incorrectly stateful, which
violates the contract for token filters

fixes #431
2016-09-15 08:59:43 -04:00
Marty Schoch ffee3c3764 fixed regexp tokenizers to not produce empty tokens 2016-09-14 16:22:20 -04:00
Marty Schoch ee61b2e866 Merge pull request #425 from mschoch/porterfaster
improve perf of porter stemmer
2016-09-11 20:22:23 -04:00
Marty Schoch f8e8c9d065 Merge pull request #426 from mschoch/fasterbuildterms
encode runes directly into buffer
2016-09-11 20:19:09 -04:00
Marty Schoch 44ff6ced8a improve perf of porter stemmer
1.  porter stemmer offers method to NOT do lowercasing, however
to use this we must convert to runes first ourselves, so we did this

2.  now we can invoke the version that skips lowercasing, we
already do this ourselves before stemming through separate filter

due to the fact that the stemmer modifies the runes in place
we have no way to know if there were changes, thus we must
always encode back into the term byte slice

added unit test which catches the problem found

NOTE this uses analysis.BuildTermFromRunes so perf gain is
only visible with other PR also merged

future gains are possible if we update the stemmer to let us
know if changes were made, thus skipping re-encoding to
[]byte when no changes were actually made
2016-09-11 20:13:15 -04:00
Marty Schoch c13626be45 encode runes directly into buffer
avoid allocating unnecessary intermediate buffer

also introduce new method to let a user optimistically
try and encode back into an existing buffer, if it isn't
large enough, it silently allocates a new one and returns it
2016-09-11 20:10:03 -04:00
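
The "optimistic" encoding described in commit c13626be45 might look roughly like the following. The name `encodeRunesInto` and its signature are assumptions made for illustration, not the method added by the commit. The sketch relies on `append`: it writes into the caller's buffer while capacity lasts and allocates a larger one only when it runs out.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeRunesInto encodes runes as UTF-8, reusing buf's backing array when
// it has enough capacity. The returned slice must be used by the caller,
// because it may point at a newly allocated array.
func encodeRunesInto(buf []byte, runes []rune) []byte {
	out := buf[:0]
	var tmp [utf8.UTFMax]byte
	for _, r := range runes {
		n := utf8.EncodeRune(tmp[:], r)
		out = append(out, tmp[:n]...)
	}
	return out
}

func main() {
	buf := make([]byte, 0, 16)
	term := encodeRunesInto(buf, []rune("héllo"))
	fmt.Printf("%s (%d bytes)\n", term, len(term))
}
```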
Marty Schoch 56c7b9f831 Merge pull request #423 from mschoch/stopfilterfaster
avoid allocation in stop token filter
2016-09-11 13:59:31 -04:00
Marty Schoch 9e9f172f81 speed up english possessive filter
previous impl always did full utf8 decode of rune
if we assume most tokens are not possessive this is unnecessary
and even if they are, we only need to chop off last two runes
so, now we only decode last rune of token, and if it looks like
s/S then we proceed to decode second to last rune, and then
only if it looks like any form of apostrophe, do we make any
changes to token, again by just reslicing original to chop
off the possessive extension
2016-09-11 12:55:03 -04:00
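
The approach described in commit 9e9f172f81 can be sketched as follows. The function name `stripPossessive` is hypothetical, and the set of apostrophe forms (ASCII, right single quotation mark, fullwidth apostrophe) is an assumption; the commit message only says "any form of apostrophe".

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// stripPossessive removes a trailing 's / 'S by reslicing term.
// It decodes at most the last two runes and never copies.
func stripPossessive(term []byte) []byte {
	last, lastSize := utf8.DecodeLastRune(term)
	if last != 's' && last != 'S' {
		return term // most tokens exit here after a single decode
	}
	rest := term[:len(term)-lastSize]
	prev, prevSize := utf8.DecodeLastRune(rest)
	switch prev {
	case '\'', '\u2019', '\uFF07': // assumed apostrophe variants
		return rest[:len(rest)-prevSize]
	}
	return term
}

func main() {
	for _, t := range []string{"marty's", "cats", "s"} {
		fmt.Printf("%q -> %q\n", t, stripPossessive([]byte(t)))
	}
}
```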
Marty Schoch faa07ac3a6 avoid allocation in stop token filter
the token stream resulting from the removal of stop words must
be shorter or the same length as the original, so we just
reuse it and truncate it at the end.
2016-09-11 12:29:33 -04:00
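
A minimal sketch of the reuse-and-truncate idea from commit faa07ac3a6. The `Token` type here is a stripped-down stand-in, and `removeStopWords` with its map-based stop list is hypothetical; only the technique of compacting the input slice in place is taken from the commit message.

```go
package main

import "fmt"

// Token is a simplified stand-in for an analysis token.
type Token struct {
	Term []byte
}

// removeStopWords filters input in place. Kept tokens are moved toward the
// front of the same slice, which is then truncated, so no new slice is
// allocated for the result.
func removeStopWords(input []*Token, stop map[string]struct{}) []*Token {
	j := 0
	for _, tok := range input {
		if _, isStop := stop[string(tok.Term)]; !isStop {
			input[j] = tok
			j++
		}
	}
	return input[:j]
}

func main() {
	stop := map[string]struct{}{"the": {}, "a": {}}
	tokens := []*Token{
		{Term: []byte("the")},
		{Term: []byte("quick")},
		{Term: []byte("fox")},
	}
	for _, tok := range removeStopWords(tokens, stop) {
		fmt.Printf("%s\n", tok.Term)
	}
}
```

Because the write index never overtakes the read index, overwriting earlier slots cannot clobber tokens that have not been examined yet.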
Marty Schoch 9089de251f remove byte_array_converters
fixes #392
fixes #100
2016-07-01 10:21:41 -04:00
Marty Schoch fedb46269e updated whitespace to behave more like lucene/es 2016-06-10 15:30:43 -04:00
Marty Schoch 9c9dbcc90a fix another test issue 2016-06-10 13:21:27 -04:00
Marty Schoch 80f1117a6c add couchbase copyright and license now that CLA has been signed 2016-06-10 13:08:50 -04:00
Marty Schoch 043a3bfb7c change cjk analyzer to use unicode tokenizer
change cjk bigram analyzer to work with multi-rune terms
add cjk width filter replaces full unicode normalization

these changes make the cjk analyzer behave more like elasticsearch
they also remove the dependency on the whitespace analyzer
which is now free to also behave more like lucene/es

fixes #33
2016-06-10 13:04:40 -04:00
a-little-srdjan efe573bc10 removing duplicate code by reusing util.go in analysis 2016-06-09 15:13:30 -04:00
Marty Schoch 5722d7b1d1 Merge pull request #384 from a-little-srdjan/ngram_int_bounds
Preventing panic on ngram initialization, and extending the type conversion
2016-06-08 23:45:32 -04:00
a-little-srdjan 9341cc835e Preventing panic on ngram initialization, and extending the type conversion. 2016-06-06 15:54:18 -04:00
a-little-srdjan 3f2701a97c init. simple camel case parser. 2016-06-03 11:04:21 -04:00
Marty Schoch 2a703376ea fix ineffectual assignments 2016-04-02 22:42:56 -04:00
Marty Schoch 7892882519 fix typos 2016-04-02 21:59:30 -04:00
Marty Schoch 194ee82c80 gofmt simplifications 2016-04-02 21:54:33 -04:00
Marty Schoch 0b171c85da change "simple" analyzer to use "letter" tokenizer
this change improves compatibility with the simple analyzer
defined by Lucene.  this has important implications for
some perf tests as well as they often use the simple
analyzer.
2016-03-31 15:13:17 -04:00
Ben Campbell 4fafb2be3f Merge branch 'master' into documenting 2016-03-23 10:48:09 +13:00
Marty Schoch cecdfcbc69 moving japanese analyzer to blevex package 2016-03-13 18:05:05 -04:00
ikawaha fcebff60e9 Add a test case 2016-02-21 19:59:52 +09:00
ikawaha 4fe7688431 Use a small version of kagome 2016-02-21 19:58:36 +09:00
Ben Campbell 47dbd85551 Merge branch 'master' into documenting 2016-01-29 09:31:30 +13:00
Marty Schoch fc34a97875 copy locations on merge for more safe/predictable behavior
fixes #328
2016-01-19 14:21:48 -05:00