bleve

Author	SHA1	Message	Date
slavikm	680be52f87	Implemented boolean field support	2016-01-11 17:18:03 -08:00
Steve Yen	89d17f01ef	analyze locations only if includeTermVectors enabled With this change, TermLocations are computed and maintained only if includeTermVectors is enabled, for higher performance.	2016-01-05 12:46:46 -08:00
Steve Yen	325a616993	unicode.Tokenize() avoids array growth via array of arrays	2016-01-02 12:21:25 -08:00
Steve Yen	918732f3d8	unicode.Tokenize() allocs backing array of Tokens Previously, unicode.Tokenize() would allocate a Token one-by-one, on an as-needed basis. This change allocates a "backing array" of Tokens, so that it goes to the runtime object allocator much less often. It takes a heuristic guess as to the backing array size by using the average token (segment) length seen so far. Results from micro-benchmark (null-firestorm, bleve-blast) seem to give perhaps less than ~0.5 MB/second throughput improvement.	2016-01-02 12:21:25 -08:00
Steve Yen	a345e7951e	TokenFrequency() alloc's all TokenLocations up front	2016-01-02 12:21:17 -08:00
Marty Schoch	9777846206	Merge branch 'master' into firestorm	2015-11-30 15:02:46 -05:00
Marty Schoch	e472b3e807	add support for a "web" tokenizer/analyzer The goal of the "web" tokenizer is to recognize web things like - email addresses - URLs - twitter @handles and #hashtags This implementation uses regexp exceptions. There will most likely be endless debate about the regular expressions. These were chosen as "good enough for now". There is also a "web" analyzer. This is just the "standard" analyzer, but using the "web" tokenizer instead of the "unicode" one. NOTE: after processing the exceptions, it still falls back to the standard "unicode" one. For many users, you can simply set your mapping's default analyzer to be "web". closes #269	2015-11-30 14:27:18 -05:00
Marty Schoch	a707d44e0b	Merge branch 'master' into firestorm	2015-11-24 09:44:47 -05:00
Ben Campbell	994f4b4d11	added some godoc documentation for the en analyzer	2015-11-18 15:28:57 +13:00
Patrick Mezard	ff03874f19	token_map: document it along with stop_token_filter	2015-11-05 14:07:54 +01:00
Patrick Mezard	eb26402924	elision_filter: correctly strip multi-bytes quotation marks	2015-11-04 10:59:10 +01:00
Patrick Mezard	bae2079eb2	token_filters: fix typo in right single quotation mark name	2015-11-04 10:29:56 +01:00
Marty Schoch	01526e971f	Merge branch 'master' into firestorm	2015-10-28 11:26:01 -04:00
Patrick Mezard	f95f1d29a0	exception: fail if pattern is empty, name tokenizer in error	2015-10-27 18:53:03 +01:00
Patrick Mezard	8b17787a65	analysis: document "exception" tokenizer, and Tokenizer interface	2015-10-27 18:53:03 +01:00
Patrick Mezard	e2fa3d6351	doc: document Token, TokenFrequencies and Field structs It helps understanding what is going on in indexing code. ArrayPositions() was particularly puzzling.	2015-10-09 12:32:44 +02:00
Marty Schoch	c3a4fab911	Merge pull request #238 from ikawaha/ja-morph-analyzer fix compilation with the latest changes to kagome	2015-09-28 17:05:46 -04:00
ikawaha	89af7978a9	fix compilation with the latest changes to kagome	2015-09-28 15:53:08 +09:00
Marty Schoch	66aa1b020a	Merge branch 'master' into firestorm	2015-09-23 11:32:25 -07:00
Marty Schoch	f81b2be334	major refactor of bleve configuration see #221 for full details	2015-09-16 17:10:59 -04:00
Marty Schoch	37aa5cb027	Merge branch 'master' into firestorm	2015-09-09 09:03:42 -04:00
Marty Schoch	d00bc91dc9	minor speed up in token frequency calculations benchmark old ns/op new ns/op delta BenchmarkAnalysis-4 1599218 1540991 -3.64% benchmark old allocs new allocs delta BenchmarkAnalysis-4 5353 5318 -0.65% benchmark old bytes new bytes delta BenchmarkAnalysis-4 370495 362983 -2.03%	2015-09-04 18:57:39 -04:00
Marty Schoch	84811cf5a0	made index type configurable + first version of firestorm	2015-08-25 14:52:42 -04:00
Donald Huang	767831d87c	move custom_analyzer to custom_analyzer package	2015-08-11 21:22:03 +00:00
Marty Schoch	3682c25467	update to correctly work with composite fields also updated search results to return array positions	2015-07-31 11:16:11 -04:00
Marty Schoch	1f4ef3da8b	move elision filter after lowercase filter this affects all languages using the elision filter languages fr and it are updated now languages ca and ga are still missing other components and do not yet have an analyzer, but they should follow this lead once they are ready fixes #218	2015-07-21 10:43:53 -04:00
Marty Schoch	65556f45c7	added additional tests for bug #214	2015-07-06 18:00:05 -04:00
Marty Schoch	0f16eccd6b	new tokenizer that allows you to pre-identify tokens with regexp name "exception" configure with list of regexp string "exceptions" these exceptions regexps that match sequences you want treated as a single token. these sequences are NOT sent to the underlying tokenizer configure "tokenizer" is the named tokenizer that should be used for processing all text regions not matching exceptions An example configuration with simple patterns to match URLs and email addresses: map[string]interface{}{ "type": "exception", "tokenizer": "unicode", "exceptions": []interface{}{ `[hH][tT][tT][pP][sS]?://(\S)`, `[fF][iI][lL][eE]://(\S)`, `[fF][tT][pP]://(\S)*`, `\S+@\S+`, } }	2015-04-08 15:31:58 -04:00
Marty Schoch	93e01a803e	fix issues identified by errcheck part of #169	2015-04-07 14:52:00 -04:00
Marty Schoch	50bd082257	fixed issues with portuguese analyzer fixes #70	2015-03-11 14:22:11 -04:00
Marty Schoch	7970f42c29	fix issues with italian analyzer switch it to not require icu/libstemmer fixes #69	2015-03-11 11:48:13 -04:00
Marty Schoch	eeaf514848	switch fr to not require icu/libstemmer also corrected copy/paste bug in test	2015-03-11 11:46:33 -04:00
Marty Schoch	8ae30fb6f0	fix issues with lucene stemmer fixes issue #68	2015-03-11 11:14:29 -04:00
Marty Schoch	300ec79c96	first pass at checking errors that were ignored part of #169	2015-03-06 14:46:29 -05:00
Salmān Aljammāz	9444af9366	arabic: add unicode normalization to analyzer	2015-02-06 19:50:58 +03:00
Salmān Aljammāz	91a8d5da9f	arabic: check minimum length before stemming This involves converting tokens to a rune slice in the filter, but at least we're now compatible with Lucene's stemmer.	2015-02-06 19:50:58 +03:00
Salmān Aljammāz	0470f93955	arabic: add more stemmer tests These came from org.apache.lucene.analysis.ar.	2015-02-06 19:49:30 +03:00
Salmān Aljammāz	e461fed92a	arabic stemmer: strip multiple suffixes updates #150	2015-02-05 16:07:58 +03:00
Marty Schoch	4be974f489	added first implementation of arabic analyzer one test case is not passing and is commented out temporarily updates #150	2015-02-05 07:44:55 -05:00
Marty Schoch	b9c22fe50d	Merge pull request #154 from saljam/arabic add arabic light stemmer	2015-02-05 07:09:54 -05:00
Salmān Aljammāz	945ef8158f	add arabic light stemmer fixes #28 updates #150	2015-02-05 13:24:30 +03:00
Marty Schoch	dd1cd189a7	added initial implementation of hindi analyzer closes #66	2015-02-04 15:12:08 -05:00
Marty Schoch	a9f153bac7	fix typo in unicode normalization form constant also adjusted incorrect tests fixes #149	2015-01-26 14:09:20 -05:00
Marty Schoch	530613a239	rewrite map access to take advantage of optimization	2015-01-14 12:57:34 -05:00
Marty Schoch	890b1abfe6	new version of lower case filter which tries to avoid copying bytes	2015-01-14 11:34:30 -05:00
Marty Schoch	7cc544adf2	switched to bytes.ToLower for minor speedup	2015-01-14 09:28:57 -05:00
Marty Schoch	f000092201	added benchmark for lowercase filter	2015-01-14 09:28:57 -05:00
Steve Yen	db82eae3f4	go fmt	2015-01-13 11:04:45 -08:00
Marty Schoch	ed06dd0581	switching to unicode tokenizer now that it's faster than regexp	2015-01-12 18:04:34 -05:00
Marty Schoch	0a4844f9d0	change unicode tokenizer to use direct segmenter api	2015-01-12 17:57:45 -05:00
Sacheendra Talluri	4b3967a68e	rewrite custom analyzer without using reflect	2015-01-08 00:25:16 +05:30
Sacheendra Talluri	4abf2a638e	adds handling of []string type attributes to custom analyzer	2015-01-08 00:08:20 +05:30
Marty Schoch	0ddfa774ec	clean up logging to use package level *log.Logger by default messages go to ioutil.Discard	2014-12-28 12:14:48 -08:00
Silvan Jegen	ef18dfe4cd	Fix typos in comments and strings	2014-12-18 18:43:12 +01:00
Sergey Avseyev	570109a983	Update "code.google.com" import paths https://github.com/couchbase/sync_gateway/issues/492	2014-12-10 01:17:49 +03:00
Silvan Jegen	412049d63c	Remove unneeded import statements	2014-11-29 14:25:24 +01:00
Marty Schoch	fcab645f96	add test to cover kana/ideographic case	2014-11-26 08:42:40 -05:00
Marty Schoch	d452b2a10e	add support for dictionary based compound word filter partially addresses #115	2014-11-18 15:18:42 -05:00
Marty Schoch	40a8154bab	changed en analyzer to use pure go components behavior should be similar with unicode segmentation and a porter stemmer	2014-10-21 16:38:58 -04:00
Marty Schoch	c4d1782689	new pure go porter stemmer integrated renamed original libstemmer porter to "stemmer_porter_classic" new pure go stemmer is "stemmer_porter"	2014-10-20 16:55:24 -04:00
Marty Schoch	cf3643f292	added pure go tokenizer to do unicode word boundary segmentation	2014-10-17 18:07:48 -04:00
Marty Schoch	dcb90ad176	added benchmark for tokenizing English text	2014-10-17 18:07:01 -04:00
Marty Schoch	febb8d2df1	renamed unicode_word_boundary package to icu this is in preparation of alternative unicode word boundary impls	2014-10-17 15:15:13 -04:00
Marty Schoch	19d45dfdb6	fix compilation with the latest changes to kagome	2014-10-10 19:59:24 -07:00
Marty Schoch	1dc466a800	modified token filters to avoid creating new token stream often the result stream was the same length, so can reuse the existing token stream also, in cases where a new stream was required, set capacity to the length of the input stream. most output stream are at least as long as the input, so this may avoid some subsequent resizing	2014-09-23 18:41:32 -04:00
Marty Schoch	95e6e37e67	added build tag to fix running tests without tag	2014-09-16 11:28:44 -04:00
Marty Schoch	55c0e84665	relocated kagome tokenizer and introduced ja analyzer	2014-09-16 11:21:29 -04:00
Silvan Jegen	29bdc094a9	Use byte positions instead of character positions	2014-09-14 13:19:30 +02:00
Silvan Jegen	a8ec7f7af2	Add tests for the Kagome tokenizer	2014-09-13 17:45:22 +02:00
Silvan Jegen	ebf100c097	Add the Kagome tokenizer for Japanese	2014-09-13 17:45:19 +02:00
Marty Schoch	1a1cf32a86	introducing cjk_bigram filter and cjk analyzer closes #34	2014-09-11 10:39:05 -04:00
Marty Schoch	cb5ccd2b1d	fix whitespace tokenizer previously would fail to split ascii running into ideographic	2014-09-11 10:38:02 -04:00
Marty Schoch	8debf26cb7	changed many components to not have defaults many of these defaults were arbitrary, and not having defaults lets us more easily flag them for configuration added a shingle filter introduce new token type for shingles	2014-09-09 18:15:14 -04:00
Marty Schoch	6b4c86b35a	changed whitespace tokenizer to work better on cjk input now it will return each cjk character as a separate token this will pair well with a cjk bigram filter for indexing	2014-09-07 14:11:01 -04:00
Marty Schoch	933d99c576	rename the configurable token map from standard to custom this makes it consistent with the "custom" analyzer which operates similarly also, added it to the config.go so it's registered and available for use	2014-09-07 14:09:38 -04:00
Marty Schoch	9e78643bad	icu tokenizer uses brk status to set token type part of #34	2014-09-07 10:24:02 -04:00
Marty Schoch	377ae090d0	additional golint issues resolved	2014-09-03 18:17:26 -04:00
Marty Schoch	d534b0836b	converted ALL_CAPS constants to CamelCase	2014-09-03 17:48:40 -04:00
Marty Schoch	7a7eb2e94c	add newline between license and package this avoids cluttering godocs with the license	2014-09-02 10:54:50 -04:00
Marty Schoch	1dcd06e412	add ability to define custom analysis as part of index mapping now, as part of your index mapping you can create custom analysis components. these custom analysis components are serialized as part of the mapping, and reused as you would expect on subsequent accesses.	2014-09-01 13:55:23 -04:00
Marty Schoch	7bfad18d40	moved byte array converts into the analysis package	2014-08-29 19:23:21 -04:00
Marty Schoch	1161361bea	rename imports from couchbaselabs to blevesearch	2014-08-28 15:38:57 -04:00
Marty Schoch	e8959d03ae	added build tag 'icu' to enable functionality dependent on it	2014-08-25 12:22:01 -04:00
Marty Schoch	21ef6e9878	added build tag for things depending on libstemmer	2014-08-25 12:06:10 -04:00
Marty Schoch	08db2eae42	added alternate build tag 'full' which will be an alias to enable all	2014-08-25 11:40:58 -04:00
Marty Schoch	f37bb77794	added build tag to enable cld2	2014-08-25 11:24:20 -04:00
Marty Schoch	092e30a38e	tried to word the instructions for static and dynamic linking	2014-08-25 10:54:15 -04:00
deoxxa	22b7b3bc24	compile libcld2 statically	2014-08-24 03:44:57 +10:00
Marty Schoch	b48dc87afa	added test case clarifying whitespace tokenizer on empty input	2014-08-19 10:43:52 -04:00
Marty Schoch	5dcd39ade7	added turkish analyzer test	2014-08-14 16:42:41 -04:00
Marty Schoch	21408e49eb	added thai analyzer test	2014-08-14 16:39:37 -04:00
Marty Schoch	599ef6edce	added swedish analyzer test	2014-08-14 16:12:48 -04:00
Marty Schoch	64255e3eb9	added russian analyzer test	2014-08-14 16:11:23 -04:00
Marty Schoch	8896de2039	added romanian analyzer test	2014-08-14 16:06:17 -04:00
Marty Schoch	c2937b4b81	added portuguese analyzer test discrepancies found, logged in #70 failing tests commented out for now	2014-08-14 16:04:29 -04:00
Marty Schoch	81a9d325a2	added norwegian analyzer test	2014-08-14 16:01:03 -04:00
Marty Schoch	a3a97a09d3	added dutch analyzer test	2014-08-14 15:59:39 -04:00
Marty Schoch	6714d5d765	added italian analyzer test discrepancies found between us and lucene, documented in #69 failing tests commented out for now	2014-08-14 15:56:47 -04:00
Marty Schoch	b9c0477762	added hungarian analyzer test	2014-08-14 15:51:55 -04:00
Marty Schoch	6a9f8e85ae	added french analyzer test many discrepancies noted, opened issue #68 to track this failing tests commented out for now	2014-08-14 15:48:32 -04:00

185 Commits