bleve

Author	SHA1	Message	Date
Marty Schoch	ee61b2e866	Merge pull request #425 from mschoch/porterfaster improve perf of porter stemmer	2016-09-11 20:22:23 -04:00
Marty Schoch	44ff6ced8a	improve perf of porter stemmer: 1. the porter stemmer offers a method that does NOT do lowercasing, however to use it we must convert to runes ourselves first, so we did this. 2. now we can invoke the version that skips lowercasing, since we already do this ourselves before stemming through a separate filter. due to the fact that the stemmer modifies the runes in place, we have no way to know if there were changes, thus we must always encode back into the term byte slice. added unit test which catches the problem found. NOTE: this uses analysis.BuildTermFromRunes, so the perf gain is only visible with the other PR also merged. future gains are possible if we update the stemmer to let us know if changes were made, thus skipping re-encoding to []byte when no changes were actually made	2016-09-11 20:13:15 -04:00
Marty Schoch	faa07ac3a6	avoid allocation in stop token filter: the token stream resulting from the removal of stop words must be shorter or the same length as the original, so we just reuse it and truncate it at the end.	2016-09-11 12:29:33 -04:00
Marty Schoch	80f1117a6c	add couchbase copyright and license now that CLA has been signed	2016-06-10 13:08:50 -04:00
a-little-srdjan	efe573bc10	removing duplicate code by reusing util.go in analysis	2016-06-09 15:13:30 -04:00
Marty Schoch	5722d7b1d1	Merge pull request #384 from a-little-srdjan/ngram_int_bounds Preventing panic on ngram initialization, and extending the type conversion	2016-06-08 23:45:32 -04:00
a-little-srdjan	9341cc835e	Preventing panic on ngram initialization, and extending the type conversion.	2016-06-06 15:54:18 -04:00
a-little-srdjan	3f2701a97c	init. simple camel case parser.	2016-06-03 11:04:21 -04:00
Ben Campbell	994f4b4d11	added some godoc documentation for the en analyzer	2015-11-18 15:28:57 +13:00
Patrick Mezard	ff03874f19	token_map: document it along with stop_token_filter	2015-11-05 14:07:54 +01:00
Patrick Mezard	eb26402924	elision_filter: correctly strip multi-bytes quotation marks	2015-11-04 10:59:10 +01:00
Patrick Mezard	bae2079eb2	token_filters: fix typo in right single quotation mark name	2015-11-04 10:29:56 +01:00
Marty Schoch	f81b2be334	major refactor of bleve configuration see #221 for full details	2015-09-16 17:10:59 -04:00
Marty Schoch	a9f153bac7	fix typo in unicode normalization form constant also adjusted incorrect tests fixes #149	2015-01-26 14:09:20 -05:00
Marty Schoch	530613a239	rewrite map access to take advantage of optimization	2015-01-14 12:57:34 -05:00
Marty Schoch	890b1abfe6	new version of lower case filter which tries to avoid copying bytes	2015-01-14 11:34:30 -05:00
Marty Schoch	7cc544adf2	switched to bytes.ToLower for minor speedup	2015-01-14 09:28:57 -05:00
Marty Schoch	f000092201	added benchmark for lowercase filter	2015-01-14 09:28:57 -05:00
Steve Yen	db82eae3f4	go fmt	2015-01-13 11:04:45 -08:00
Silvan Jegen	ef18dfe4cd	Fix typos in comments and strings	2014-12-18 18:43:12 +01:00
Sergey Avseyev	570109a983	Update "code.google.com" import paths https://github.com/couchbase/sync_gateway/issues/492	2014-12-10 01:17:49 +03:00
Marty Schoch	d452b2a10e	add support for dictionary based compound word filter partially addresses #115	2014-11-18 15:18:42 -05:00
Marty Schoch	c4d1782689	new pure go porter stemmer integrated renamed original libstemmer porter to "stemmer_porter_classic" new pure go stemmer is "stemmer_porter"	2014-10-20 16:55:24 -04:00
Marty Schoch	1dc466a800	modified token filters to avoid creating new token stream. often the result stream was the same length, so we can reuse the existing token stream. also, in cases where a new stream was required, set capacity to the length of the input stream. most output streams are at least as long as the input, so this may avoid some subsequent resizing	2014-09-23 18:41:32 -04:00
Marty Schoch	8debf26cb7	changed many components to not have defaults. many of these defaults were arbitrary, and not having defaults lets us more easily flag them for configuration. added a shingle filter. introduce new token type for shingles	2014-09-09 18:15:14 -04:00
Marty Schoch	d534b0836b	converted ALL_CAPS constants to CamelCase	2014-09-03 17:48:40 -04:00
Marty Schoch	7a7eb2e94c	add newline between license and package this avoids cluttering godocs with the license	2014-09-02 10:54:50 -04:00
Marty Schoch	1dcd06e412	add ability to define custom analysis as part of index mapping. now, as part of your index mapping you can create custom analysis components. these custom analysis components are serialized as part of the mapping, and reused as you would expect on subsequent accesses.	2014-09-01 13:55:23 -04:00
Marty Schoch	1161361bea	rename imports from couchbaselabs to blevesearch	2014-08-28 15:38:57 -04:00
Marty Schoch	21ef6e9878	added build tag for things depending on libstemmer	2014-08-25 12:06:10 -04:00
Marty Schoch	08db2eae42	added alternate build tag 'full' which will be an alias to enable all	2014-08-25 11:40:58 -04:00
Marty Schoch	f37bb77794	added build tag to enable cld2	2014-08-25 11:24:20 -04:00
Marty Schoch	092e30a38e	tried to word the instructions for static and dynamic linking	2014-08-25 10:54:15 -04:00
deoxxa	22b7b3bc24	compile libcld2 statically	2014-08-24 03:44:57 +10:00
Marty Schoch	6a951b9372	added analyzer test for english	2014-08-14 14:28:24 -04:00
Marty Schoch	c526a38369	major refactor of analysis files, now wired up to registry. ultimately this is to make it more convenient for us to wire up different elements of the analysis pipeline, without having to preload everything into memory before we need it. separately, the index layer now has a mechanism for storing internal key/value pairs. this is expected to be used to store the mapping, and possibly other pieces of data by the top layer, but not exposed to the user at the top.	2014-08-13 21:14:47 -04:00
Marty Schoch	3481ec9cef	added hindi stemmer closes #40	2014-08-11 22:29:47 -04:00
Marty Schoch	c65f7415ff	added hindi normalizer closes #64	2014-08-11 19:51:47 -04:00
Marty Schoch	cd0e3fd85b	added german normalizer updated german analyzer to use this normalizer closes #65	2014-08-11 19:25:37 -04:00
Marty Schoch	4ccd69ed45	added arabic normalizer closes #63	2014-08-11 18:35:35 -04:00
Marty Schoch	73b252f6a6	added persian normalizer closes #67	2014-08-11 18:15:41 -04:00
Marty Schoch	e21b7f4436	added sorani normalizer and stemmer, now have analyzer closes #43	2014-08-08 09:38:28 -04:00
Marty Schoch	ef35ea1985	added czech stop word list closes #36	2014-08-07 22:32:49 -04:00
Marty Schoch	0e54fbd8da	added keyword marker filter updated stemmer filter to not stem tokens marked as keyword closes #48	2014-08-07 08:13:00 -04:00
Marty Schoch	c19270108c	added ngram and edge ngram token filters closes #46 and closes #47	2014-08-06 22:11:42 -04:00
Marty Schoch	9a777aaa80	added token truncate filter closes #49	2014-08-06 20:39:42 -04:00
Marty Schoch	d84187fd24	added apostrophe filter to improve turkish analyzer closes #27	2014-08-06 08:50:00 -04:00
Marty Schoch	79ab2b9b3d	added unicode normalization filter	2014-08-04 21:59:57 -04:00
Marty Schoch	2c0bf23fac	added elision filter. defined article word maps for french, italian, irish and catalan. defined elision filters for these same languages. updated analyzers for french and italian to use this new filter. irish and catalan still depend on other missing pieces. closes #25	2014-08-03 19:17:35 -04:00
Marty Schoch	0960cab0ae	refactored StopWordsMap into WordMap so it can be reused. the ElisionFilter will need a word list of articles and plan to reuse this	2014-08-03 17:46:35 -04:00

54 Commits