1. the porter stemmer offers a method to NOT do lowercasing, but
to use it we must first convert to runes ourselves, so we now do that
2. now we invoke the version that skips lowercasing, since we
already lowercase through a separate filter before stemming
because the stemmer modifies the runes in place,
we have no way to know whether anything changed, so we must
always encode back into the term byte slice
added a unit test which catches the problem found
NOTE: this uses analysis.BuildTermFromRunes, so the perf gain is
only visible once the other PR is also merged
future gains are possible if we update the stemmer to let us
know whether changes were made, thus skipping re-encoding to
[]byte when no changes were actually made
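a rough sketch of the resulting filter logic; StemWithoutLowerCasing is an assumed name for the stemmer's no-lowercasing entry point, and the exact signature may differ:

```go
package porter

import (
	"bytes"

	"github.com/blevesearch/bleve/analysis"
	"github.com/blevesearch/go-porterstemmer"
)

type PorterStemmer struct{}

func (s *PorterStemmer) Filter(input analysis.TokenStream) analysis.TokenStream {
	for _, token := range input {
		// convert to runes ourselves, since the no-lowercasing
		// variant operates on []rune
		runes := bytes.Runes(token.Term)
		// lowercasing already happened in a separate filter, so skip
		// it; the stemmer mutates the runes in place, so we cannot
		// tell whether anything changed
		stemmed := porterstemmer.StemWithoutLowerCasing(runes)
		// therefore always encode back into the term byte slice
		token.Term = analysis.BuildTermFromRunes(stemmed)
	}
	return input
}
```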
avoid allocating an unnecessary intermediate buffer
also introduce a new method that lets a user optimistically
try to encode back into an existing buffer; if it isn't
large enough, it silently allocates a new one and returns it
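a sketch of what this could look like; the name BuildTermFromRunesOptimistic and the exact fallback behavior are assumptions based on the description above:

```go
package analysis

import "unicode/utf8"

// BuildTermFromRunes encodes the runes directly into one buffer,
// avoiding the per-rune intermediate buffer used previously.
func BuildTermFromRunes(runes []rune) []byte {
	rv := make([]byte, len(runes)*utf8.UTFMax)
	used := 0
	for _, r := range runes {
		used += utf8.EncodeRune(rv[used:], r)
	}
	return rv[:used]
}

// BuildTermFromRunesOptimistic tries to encode into the buffer it is
// given; if the runes do not fit, it silently allocates a new buffer
// and returns that instead.
func BuildTermFromRunesOptimistic(buf []byte, runes []rune) []byte {
	rv := buf[:0]
	used := 0
	for _, r := range runes {
		nextLen := utf8.RuneLen(r)
		if nextLen < 0 || used+nextLen > cap(rv) {
			// not enough room (or invalid rune), fall back
			return BuildTermFromRunes(runes)
		}
		rv = rv[:used+nextLen]
		utf8.EncodeRune(rv[used:], r)
		used += nextLen
	}
	return rv
}
```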
the previous impl always did a full utf8 decode of the token into runes
if we assume most tokens are not possessive, this is unnecessary,
and even if they are, we only need to chop off the last two runes
so now we only decode the last rune of the token, and if it looks like
s/S we proceed to decode the second-to-last rune; then,
only if that looks like any form of apostrophe, do we make any
changes to the token, again by just reslicing the original to chop
off the possessive suffix
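roughly, the new logic looks like this; the specific apostrophe code points checked (U+0027, U+2019, U+FF07) are assumptions about which forms the filter accepts:

```go
package en

import (
	"unicode/utf8"

	"github.com/blevesearch/bleve/analysis"
)

const (
	apostrophe          = '\''     // U+0027
	rightSingleQuote    = '\u2019' // U+2019
	fullWidthApostrophe = '\uff07' // U+FF07
)

// stripPossessive chops a trailing possessive suffix off the token by
// reslicing the original term, decoding at most two runes from the end.
func stripPossessive(token *analysis.Token) {
	lastRune, lastSize := utf8.DecodeLastRune(token.Term)
	if lastRune != 's' && lastRune != 'S' {
		return // most tokens bail out here without any further decoding
	}
	secondRune, secondSize := utf8.DecodeLastRune(token.Term[:len(token.Term)-lastSize])
	if secondRune == apostrophe || secondRune == rightSingleQuote || secondRune == fullWidthApostrophe {
		token.Term = token.Term[:len(token.Term)-lastSize-secondSize]
	}
}
```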
the token stream resulting from the removal of stop words must
be shorter than or the same length as the original, so we just
reuse it and truncate it at the end.
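a minimal sketch of the reuse-and-truncate approach (the stop word set representation here is illustrative):

```go
package stop

import "github.com/blevesearch/bleve/analysis"

type StopTokensFilter struct {
	stopWords map[string]struct{} // illustrative stop word set
}

func (f *StopTokensFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
	j := 0
	for _, token := range input {
		if _, isStop := f.stopWords[string(token.Term)]; !isStop {
			// write survivors back into the same slice
			input[j] = token
			j++
		}
	}
	// the output can never be longer than the input, so just truncate
	return input[:j]
}
```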
change the cjk bigram filter to work with multi-rune terms
add a cjk width filter, which replaces full unicode normalization
these changes make the cjk analyzer behave more like elasticsearch
they also remove the dependency on the whitespace tokenizer,
which is now free to also behave more like lucene/es
fixes #33
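for reference, the width folding involved is roughly the following (an illustration, not the actual filter; halfwidth katakana folding is omitted):

```go
package cjk

// foldWidth maps fullwidth ASCII variants (U+FF01..U+FF5E) back to
// their basic latin counterparts; the real filter also folds
// halfwidth katakana forms, which is omitted here.
func foldWidth(r rune) rune {
	if r >= '\uff01' && r <= '\uff5e' {
		return r - '\uff01' + '!'
	}
	return r
}
```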
this change improves compatibility with the simple analyzer
defined by Lucene. this has important implications for
some perf tests as well, since they often use the simple
analyzer.
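for context, Lucene's simple analyzer is essentially a letter tokenizer plus lowercasing; a standalone illustration of that behavior (not bleve's implementation):

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// simpleTokenize splits on any non-letter rune and lowercases,
// mirroring the letter-tokenizer + lowercase combination.
func simpleTokenize(input string) []string {
	return strings.FieldsFunc(strings.ToLower(input), func(r rune) bool {
		return !unicode.IsLetter(r)
	})
}

func main() {
	fmt.Println(simpleTokenize("Bleve 1.0, released TODAY!"))
	// prints: [bleve released today]
}
```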
Previously, unicode.Tokenize() would allocate Tokens one at a time,
on an as-needed basis.
This change allocates a "backing array" of Tokens, so that it goes to
the runtime object allocator much less often. It takes a heuristic
guess as to the backing array size by using the average token
(segment) length seen so far.
Results from micro-benchmarks (null-firestorm, bleve-blast) suggest
a throughput improvement of perhaps less than ~0.5 MB/second.
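A sketch of the batch-allocation idea, with illustrative names (the real tokenizer derives the batch size from the average segment length seen so far, as described above):

```go
package unicode

import "github.com/blevesearch/bleve/analysis"

// tokenBatch hands out *analysis.Token values from a pre-allocated
// backing array, hitting the runtime allocator once per batch instead
// of once per token.
type tokenBatch struct {
	backing []analysis.Token
}

func (b *tokenBatch) next(guessRemaining int) *analysis.Token {
	if len(b.backing) == 0 {
		if guessRemaining < 1 {
			guessRemaining = 1
		}
		b.backing = make([]analysis.Token, guessRemaining)
	}
	t := &b.backing[0]
	b.backing = b.backing[1:]
	return t
}

// estimateRemaining guesses how many tokens are left by dividing the
// unprocessed byte count by the average segment length seen so far.
func estimateRemaining(bytesLeft, segmentsSeen, segmentBytesSeen int) int {
	avg := 8 // arbitrary fallback before any segments have been seen
	if segmentsSeen > 0 {
		avg = segmentBytesSeen / segmentsSeen
		if avg < 1 {
			avg = 1
		}
	}
	return bytesLeft/avg + 1
}
```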
The goal of the "web" tokenizer is to recognize web things like
- email addresses
- URLs
- twitter @handles and #hashtags
This implementation uses regexp exceptions. There will most
likely be endless debate about the regular expressions. These
were chosen as "good enough for now".
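For illustration, the exception patterns could look something like this (sketches for discussion, not the exact expressions shipped):

```go
package web

import "regexp"

var (
	emailPattern   = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)
	urlPattern     = regexp.MustCompile(`https?://\S+`)
	twitterPattern = regexp.MustCompile(`[@#][A-Za-z0-9_]+`)
)

// anything matching an exception is emitted as a single token; the
// remaining text is handed to the fallback ("unicode") tokenizer
var exceptions = regexp.MustCompile(
	emailPattern.String() + `|` + urlPattern.String() + `|` + twitterPattern.String(),
)
```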
There is also a "web" analyzer. This is just the "standard"
analyzer, but using the "web" tokenizer instead of the "unicode"
one. NOTE: after processing the exceptions, it still falls back
to the standard "unicode" one.
For many users, you can simply set your mapping's default analyzer
to be "web".
closes #269