bleve

Author	SHA1	Message	Date
Ethan Koenig	012d436dd7	Add UniqueTerm token filter	2018-01-16 22:24:51 -08:00
Marty Schoch	09a61a7a38	add analyzers for several languages Having pure Go snowball stemmers allows us to add support for many languages into the core of bleve. Specifically we just added: Russian, Danish, Finnish, Hungarian, Dutch, Norwegian, Romanian, Swedish, Turkish	2018-01-10 16:00:29 -05:00
Marty Schoch	a9532e510a	refactor slightly to use our new hosted snowball stemmers rather than having each package include it directly inside of bleve, we have decided to host them all in one repo https://github.com/blevesearch/snowballstem this makes them easier for the rest of the community to use outside of bleve contexts	2018-01-10 15:15:31 -05:00
Marty Schoch	af198c833f	Merge branch 'ru_analyzer' of https://github.com/sokolovstas/bleve into sokolovstas-ru_analyzer	2018-01-10 10:29:15 -05:00
Ethan Koenig	0433f05d9c	Fix test	2017-06-22 18:56:28 -04:00
Ethan Koenig	8994ad2e00	Fix token start/end/position values in camelCase tokenizer	2017-06-22 17:42:39 -04:00
Stanislav Sokolov	d8d57e6990	Added Russian analyzer with snowball stemmer	2017-06-05 18:01:01 +05:00
Marty Schoch	8435ce5054	Merge pull request #587 from mschoch/add-spanish add a pure Go Spanish analyzer	2017-04-29 19:48:18 -04:00
Marty Schoch	b8de2df68d	add a pure Go Spanish analyzer this introduces a new light Spanish stemmer and move the other pure Go Spanish analysis components back in from the blevex package the libstemmer version of the stemmer will remain in blevex	2017-04-29 19:31:43 -04:00
Marty Schoch	ce901a8870	add a pure Go German analyzer this introduces new German light stemmer and moves the other pure Go german analysis components back in from the blevex package the libstemmer version of the stemmer will remain in blevex	2017-04-29 18:46:58 -04:00
Marty Schoch	b9db744def	add validation which checks the type of char/token filters when specified in the custom type of analyzer	2017-02-24 15:57:10 -05:00
Marty Schoch	782dbecfe1	fix edge ngram output when side=Back and input token len=max edge condition was incorrectly checked fixes #523	2017-01-30 20:29:20 -05:00
Sho Minagawa	5537688394	fix the confusing variable name	2017-01-14 20:26:08 +09:00
Steve Yen	6a38fa3719	go fmt	2016-10-12 09:39:43 -07:00
Michael Nitschinger	7e656dad32	Address special unicode sigma at end of term when lowercasing. Σ maps to σ, except at the end of a word where it maps to ς. This is the only conditional (contextual) but language-independent mapping in unicode.	2016-10-11 12:37:08 +02:00
Michael Nitschinger	ff35d75aa4	Skip already lowercased runes on transformation. The LowerCaseFilter works on the original slice to avoid allocations, so skipping already lowercased runes avoids unnecessary work. Benchmark: BenchmarkLowerCaseFilter-8, old 1302 ns/op, new 815 ns/op, delta -37.40%	2016-10-11 12:03:26 +02:00
Marty Schoch	f3dc89699d	address golint warnings	2016-10-02 10:47:40 -04:00
Marty Schoch	cd6b409971	fix code i carelessly broke	2016-10-02 10:39:20 -04:00
Marty Schoch	d4d3e7a043	address golint naming issues	2016-10-02 10:35:24 -04:00
Marty Schoch	2332455bd2	nicer formatting of license header	2016-10-02 10:13:14 -04:00
Marty Schoch	6bf9dd59ab	BREAKING CHANGE - additional package renaming i recently learned that package names should also prefer the singular form, not the plural form	2016-10-01 17:20:59 -04:00
Marty Schoch	35da361bfa	BREAKING CHANGE - renamed packages to be shorter and not use _ this commit only addresses the analysis sub-package	2016-09-30 12:36:10 -04:00
Marty Schoch	c5159251a9	make shingle token filter stateless the previous implementation was incorrectly stateful, which violates the contract for token filters fixes #431	2016-09-15 08:59:43 -04:00
Marty Schoch	ffee3c3764	fixed regexp tokenizers to not produce empty tokens	2016-09-14 16:22:20 -04:00
Marty Schoch	ee61b2e866	Merge pull request #425 from mschoch/porterfaster improve perf of porter stemmer	2016-09-11 20:22:23 -04:00
Marty Schoch	f8e8c9d065	Merge pull request #426 from mschoch/fasterbuildterms encode runes directly into buffer	2016-09-11 20:19:09 -04:00
Marty Schoch	44ff6ced8a	improve perf of porter stemmer 1. porter stemmer offers method to NOT do lowercasing, however to use this we must convert to runes first ourselves, so we did this 2. now we can invoke the version that skips lowercasing, we already do this ourselves before stemming through separate filter due to the fact that the stemmer modifies the runes in place we have no way to know if there were changes, thus we must always encode back into the term byte slice added unit test which catches the problem found NOTE this uses analysis.BuildTermFromRunes so perf gain is only visible with other PR also merged future gains are possible if we update the stemmer to let us know if changes were made, thus skipping re-encoding to []byte when no changes were actually made	2016-09-11 20:13:15 -04:00
Marty Schoch	c13626be45	encode runes directly into buffer avoid allocating unnecessary intermediate buffer also introduce new method to let a user optimistically try and encode back into an existing buffer, if it isn't large enough, it silently allocates a new one and returns it	2016-09-11 20:10:03 -04:00
Marty Schoch	56c7b9f831	Merge pull request #423 from mschoch/stopfilterfaster avoid allocation in stop token filter	2016-09-11 13:59:31 -04:00
Marty Schoch	9e9f172f81	speed up english possessive filter previous impl always did full utf8 decode of rune if we assume most tokens are not possessive this is unnecessary and even if they are, we only need to chop off last two runes so, now we only decode last rune of token, and if it looks like s/S then we proceed to decode second to last rune, and then only if it looks like any form of apostrophe, do we make any changes to token, again by just reslicing original to chop off the possessive extension	2016-09-11 12:55:03 -04:00
Marty Schoch	faa07ac3a6	avoid allocation in stop token filter the token stream resulting from the removal of stop words must be shorter or the same length as the original, so we just reuse it and truncate it at the end.	2016-09-11 12:29:33 -04:00
Marty Schoch	9089de251f	remove byte_array_conveters fixes #392 fixes #100	2016-07-01 10:21:41 -04:00
Marty Schoch	fedb46269e	updated whitespace to behave more like lucene/es	2016-06-10 15:30:43 -04:00
Marty Schoch	9c9dbcc90a	fix another test issue	2016-06-10 13:21:27 -04:00
Marty Schoch	80f1117a6c	add couchbase copyright and license now that CLA has been signed	2016-06-10 13:08:50 -04:00
Marty Schoch	043a3bfb7c	change cjk analyzer to use unicode tokenizer change cjk bigram analyzer to work with multi-rune terms add cjk width filter replaces full unicode normalization these changes make the cjk analyzer behave more like elasticsearch they also remove the dependency on the whitespace analyzer which is now free to also behave more like lucene/es fixes #33	2016-06-10 13:04:40 -04:00
a-little-srdjan	efe573bc10	removing duplicate code by reusing util.go in analysis	2016-06-09 15:13:30 -04:00
Marty Schoch	5722d7b1d1	Merge pull request #384 from a-little-srdjan/ngram_int_bounds Preventing panic on ngram initialization, and extending the type conversion	2016-06-08 23:45:32 -04:00
a-little-srdjan	9341cc835e	Preventing panic on ngram initialization, and extending the type conversion.	2016-06-06 15:54:18 -04:00
a-little-srdjan	3f2701a97c	init. simple camel case parser.	2016-06-03 11:04:21 -04:00
Marty Schoch	2a703376ea	fix ineffectual assignments	2016-04-02 22:42:56 -04:00
Marty Schoch	7892882519	fix typos	2016-04-02 21:59:30 -04:00
Marty Schoch	194ee82c80	gofmt simplifications	2016-04-02 21:54:33 -04:00
Marty Schoch	0b171c85da	change "simple" analyzer to use "letter" tokenizer this change improves compatibility with the simple analyzer defined by Lucene. this has important implications for some perf tests as well as they often use the simple analyzer.	2016-03-31 15:13:17 -04:00
Ben Campbell	4fafb2be3f	Merge branch 'master' into documenting	2016-03-23 10:48:09 +13:00
Marty Schoch	cecdfcbc69	moving japanese analyzer to blevex package	2016-03-13 18:05:05 -04:00
ikawaha	fcebff60e9	Add a test case	2016-02-21 19:59:52 +09:00
ikawaha	4fe7688431	Use a small version of kagome	2016-02-21 19:58:36 +09:00
Ben Campbell	47dbd85551	Merge branch 'master' into documenting	2016-01-29 09:31:30 +13:00
Marty Schoch	fc34a97875	copy locations on merge for more safe/predictable behavior fixes #328	2016-01-19 14:21:48 -05:00

185 Commits