Commit Graph

155 Commits

Author SHA1 Message Date
Marty Schoch
c13626be45 encode runes directly into buffer
avoid allocating unnecessary intermediate buffer

also introduce a new method that lets a user optimistically
try to encode back into an existing buffer; if it isn't
large enough, it silently allocates a new one and returns it
2016-09-11 20:10:03 -04:00
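A minimal sketch of the optimistic-reuse pattern described in the commit above: encode runes into a caller-supplied buffer and only allocate a replacement when it is too small. encodeRunes is a hypothetical helper for illustration, not the method added by this commit.

package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeRunes encodes runes into buf, reusing buf when it has enough capacity
// and silently allocating a replacement (which is returned) when it does not.
func encodeRunes(buf []byte, runes []rune) []byte {
	need := 0
	for _, r := range runes {
		need += utf8.RuneLen(r)
	}
	if cap(buf) < need {
		buf = make([]byte, need) // too small: allocate a new backing buffer
	}
	buf = buf[:need]
	i := 0
	for _, r := range runes {
		i += utf8.EncodeRune(buf[i:], r)
	}
	return buf
}

func main() {
	buf := make([]byte, 0, 4)
	buf = encodeRunes(buf, []rune("héllo")) // reused or replaced as needed
	fmt.Println(string(buf))
}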
Marty Schoch
9089de251f remove byte_array_converters
fixes #392
fixes #100
2016-07-01 10:21:41 -04:00
Marty Schoch
fedb46269e updated whitespace to behave more like lucene/es 2016-06-10 15:30:43 -04:00
Marty Schoch
9c9dbcc90a fix another test issue 2016-06-10 13:21:27 -04:00
Marty Schoch
80f1117a6c add couchbase copyright and license now that CLA has been signed 2016-06-10 13:08:50 -04:00
Marty Schoch
043a3bfb7c change cjk analyzer to use unicode tokenizer
change cjk bigram analyzer to work with multi-rune terms
add cjk width filter, which replaces full unicode normalization

these changes make the cjk analyzer behave more like elasticsearch
they also remove the dependency on the whitespace analyzer
which is now free to also behave more like lucene/es

fixes #33
2016-06-10 13:04:40 -04:00
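A hedged sketch of composing an analyzer along the lines described above (unicode tokenizer plus CJK width and bigram filters) through bleve's custom analyzer support; the filter names "cjk_width", "to_lower" and "cjk_bigram" are assumptions about how they are registered and may differ between versions.

package main

import (
	"log"

	"github.com/blevesearch/bleve"
)

func main() {
	im := bleve.NewIndexMapping()
	// register a CJK-like analyzer: unicode tokenizer, then width folding,
	// lowercasing, and bigramming (filter names assumed, see note above)
	err := im.AddCustomAnalyzer("cjk_like", map[string]interface{}{
		"type":      "custom",
		"tokenizer": "unicode",
		"token_filters": []interface{}{
			"cjk_width",
			"to_lower",
			"cjk_bigram",
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}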
a-little-srdjan
efe573bc10 removing duplicate code by reusing util.go in analysis 2016-06-09 15:13:30 -04:00
Marty Schoch
5722d7b1d1 Merge pull request #384 from a-little-srdjan/ngram_int_bounds
Preventing panic on ngram initialization, and extending the type conversion
2016-06-08 23:45:32 -04:00
a-little-srdjan
9341cc835e Preventing panic on ngram initialization, and extending the type conversion. 2016-06-06 15:54:18 -04:00
a-little-srdjan
3f2701a97c init. simple camel case parser. 2016-06-03 11:04:21 -04:00
Marty Schoch
2a703376ea fix ineffectual assignments 2016-04-02 22:42:56 -04:00
Marty Schoch
7892882519 fix typos 2016-04-02 21:59:30 -04:00
Marty Schoch
194ee82c80 gofmt simplifications 2016-04-02 21:54:33 -04:00
Marty Schoch
0b171c85da change "simple" analyzer to use "letter" tokenizer
this change improves compatibility with the simple analyzer
defined by Lucene.  this also has important implications for
some perf tests, as they often use the simple
analyzer.
2016-03-31 15:13:17 -04:00
Ben Campbell
4fafb2be3f Merge branch 'master' into documenting 2016-03-23 10:48:09 +13:00
Marty Schoch
cecdfcbc69 moving japanese analyzer to blevex package 2016-03-13 18:05:05 -04:00
ikawaha
fcebff60e9 Add a test case 2016-02-21 19:59:52 +09:00
ikawaha
4fe7688431 Use a small version of kagome 2016-02-21 19:58:36 +09:00
Ben Campbell
47dbd85551 Merge branch 'master' into documenting 2016-01-29 09:31:30 +13:00
Marty Schoch
fc34a97875 copy locations on merge for more safe/predictable behavior
fixes #328
2016-01-19 14:21:48 -05:00
slavikm
680be52f87 Implemented boolean field support 2016-01-11 17:18:03 -08:00
Steve Yen
89d17f01ef analyze locations only if includeTermVectors enabled
With this change, TermLocations are computed and maintained only if
includeTermVectors is enabled, for higher performance.
2016-01-05 12:46:46 -08:00
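A small sketch of the setting this change keys off: a text field mapping with term vectors disabled, so token locations are never computed for that field. The mapping calls reflect bleve's mapping API as recalled here and should be treated as assumptions.

package main

import "github.com/blevesearch/bleve"

func main() {
	fm := bleve.NewTextFieldMapping()
	fm.IncludeTermVectors = false // skip TermLocation analysis for this field

	im := bleve.NewIndexMapping()
	im.DefaultMapping.AddFieldMappingsAt("body", fm)
}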
Steve Yen
325a616993 unicode.Tokenize() avoids array growth via array of arrays 2016-01-02 12:21:25 -08:00
Steve Yen
918732f3d8 unicode.Tokenize() allocs backing array of Tokens
Previously, unicode.Tokenize() would allocate Tokens one by one, on
an as-needed basis.

This change allocates a "backing array" of Tokens, so that it goes to
the runtime object allocator much less often.  It takes a heuristic
guess as to the backing array size by using the average token
(segment) length seen so far.

Results from micro-benchmark (null-firestorm, bleve-blast) seem to
give perhaps less than ~0.5 MB/second throughput improvement.
2016-01-02 12:21:25 -08:00
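A minimal sketch, not the actual unicode.Tokenize() code, of the backing-array idea: guess the token count from the average token (segment) length seen so far, allocate the Tokens in one slice, and append into it instead of allocating each Token individually.

package main

// Token is a simplified stand-in for an analysis token.
type Token struct {
	Term  []byte
	Start int
	End   int
}

// preallocTokens sizes one backing slice from a heuristic guess based on the
// average token length observed so far.
func preallocTokens(inputLen, avgTokenLen int) []Token {
	if avgTokenLen < 1 {
		avgTokenLen = 1
	}
	guess := inputLen/avgTokenLen + 1
	return make([]Token, 0, guess) // one allocation backs many tokens
}

func main() {
	backing := preallocTokens(1024, 6)
	// append fills the backing array; the runtime allocator is only hit
	// again if the guess was too low and append must grow the slice
	backing = append(backing, Token{Term: []byte("example"), Start: 0, End: 7})
	_ = backing
}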
Steve Yen
a345e7951e TokenFrequency() alloc's all TokenLocations up front 2016-01-02 12:21:17 -08:00
Marty Schoch
9777846206 Merge branch 'master' into firestorm 2015-11-30 15:02:46 -05:00
Marty Schoch
e472b3e807 add support for a "web" tokenizer/analyzer
The goal of the "web" tokenizer is to recognize web things like
- email addresses
- URLs
- twitter @handles and #hashtags

This implementation uses regexp exceptions.  There will most
likely be endless debate about the regular expressions. These
were chosen as "good enough for now".

There is also a "web" analyzer.  This is just the "standard"
analyzer, but using the "web" tokenizer instead of the "unicode"
one.  NOTE: after processing the exceptions, it still falls back
to the standard "unicode" one.

For many users, you can simply set your mapping's default analyzer
to be "web".

closes #269
2015-11-30 14:27:18 -05:00
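Sketch of the usage suggested in the commit above: point the index mapping's default analyzer at "web" so URLs, email addresses, @handles and #hashtags survive tokenization intact. It assumes the "web" analyzer registered by this commit and uses the bleve top-level API.

package main

import "github.com/blevesearch/bleve"

func main() {
	im := bleve.NewIndexMapping()
	im.DefaultAnalyzer = "web" // analyzer name registered by this commit

	idx, err := bleve.New("example.bleve", im)
	if err != nil {
		panic(err)
	}
	defer idx.Close()

	_ = idx.Index("1", map[string]interface{}{
		"text": "ping @blevesearch about https://blevesearch.com #search",
	})
}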
Marty Schoch
a707d44e0b Merge branch 'master' into firestorm 2015-11-24 09:44:47 -05:00
Ben Campbell
994f4b4d11 added some godoc documentation for the en analyzer 2015-11-18 15:28:57 +13:00
Patrick Mezard
ff03874f19 token_map: document it along with stop_token_filter 2015-11-05 14:07:54 +01:00
Patrick Mezard
eb26402924 elision_filter: correctly strip multi-bytes quotation marks 2015-11-04 10:59:10 +01:00
Patrick Mezard
bae2079eb2 token_filters: fix typo in right single quotation mark name 2015-11-04 10:29:56 +01:00
Marty Schoch
01526e971f Merge branch 'master' into firestorm 2015-10-28 11:26:01 -04:00
Patrick Mezard
f95f1d29a0 exception: fail if pattern is empty, name tokenizer in error 2015-10-27 18:53:03 +01:00
Patrick Mezard
8b17787a65 analysis: document "exception" tokenizer, and Tokenizer interface 2015-10-27 18:53:03 +01:00
Patrick Mezard
e2fa3d6351 doc: document Token, TokenFrequencies and Field structs
It helps with understanding what is going on in the indexing code.
ArrayPositions() was particularly puzzling.
2015-10-09 12:32:44 +02:00
Marty Schoch
c3a4fab911 Merge pull request #238 from ikawaha/ja-morph-analyzer
fix compilation with the latest changes to kagome
2015-09-28 17:05:46 -04:00
ikawaha
89af7978a9 fix compilation with the latest changes to kagome 2015-09-28 15:53:08 +09:00
Marty Schoch
66aa1b020a Merge branch 'master' into firestorm 2015-09-23 11:32:25 -07:00
Marty Schoch
f81b2be334 major refactor of bleve configuration
see #221 for full details
2015-09-16 17:10:59 -04:00
Marty Schoch
37aa5cb027 Merge branch 'master' into firestorm 2015-09-09 09:03:42 -04:00
Marty Schoch
d00bc91dc9 minor speed up in token frequency calculations
benchmark               old ns/op     new ns/op     delta
BenchmarkAnalysis-4     1599218       1540991       -3.64%

benchmark               old allocs     new allocs     delta
BenchmarkAnalysis-4     5353           5318           -0.65%

benchmark               old bytes     new bytes     delta
BenchmarkAnalysis-4     370495        362983        -2.03%
2015-09-04 18:57:39 -04:00
Marty Schoch
84811cf5a0 made index type configurable + first version of firestorm 2015-08-25 14:52:42 -04:00
Donald Huang
767831d87c move custom_analyzer to custom_analyzer package 2015-08-11 21:22:03 +00:00
Marty Schoch
3682c25467 update to correctly work with composite fields
also updated search results to return array positions
2015-07-31 11:16:11 -04:00
Marty Schoch
1f4ef3da8b move elision filter after lowercase filter
this affects all languages using the elision filter
languages fr and it are updated now
languages ca and ga are still missing other components and
do not yet have an analyzer, but they should follow this lead
once they are ready

fixes #218
2015-07-21 10:43:53 -04:00
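A hedged sketch of the ordering this commit establishes, lowercasing before elision, expressed as a custom analyzer configuration; the filter names "to_lower", "elision_fr" and "stop_fr" are assumptions and may not match the registered names exactly.

package main

import (
	"log"

	"github.com/blevesearch/bleve"
)

func main() {
	im := bleve.NewIndexMapping()
	err := im.AddCustomAnalyzer("fr_like", map[string]interface{}{
		"type":      "custom",
		"tokenizer": "unicode",
		"token_filters": []interface{}{
			"to_lower",
			"elision_fr", // elision now runs after lowercasing
			"stop_fr",
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}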
Marty Schoch
65556f45c7 added additional tests for bug #214 2015-07-06 18:00:05 -04:00
Marty Schoch
0f16eccd6b new tokenizer that allows you to pre-identify tokens with regexps
name "exception"
configure with a list of regexp strings under "exceptions"
these exception regexps match sequences you want treated
as a single token.  these sequences are NOT sent to the
underlying tokenizer
the "tokenizer" setting names the tokenizer that should be
used for processing all text regions not matching exceptions

An example configuration with simple patterns to match URLs and
email addresses:

map[string]interface{}{
	"type":      "exception",
	"tokenizer": "unicode",
	"exceptions": []interface{}{
		`[hH][tT][tT][pP][sS]?://(\S)*`,
		`[fF][iI][lL][eE]://(\S)*`,
		`[fF][tT][pP]://(\S)*`,
		`\S+@\S+`,
	},
}
2015-04-08 15:31:58 -04:00
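A hedged usage sketch for the configuration above: register the "exception" tokenizer under a name and reference it from a custom analyzer. The AddCustomTokenizer/AddCustomAnalyzer calls are bleve's custom-registration API as recalled here; treat the analyzer and tokenizer names as illustrative.

package main

import (
	"log"

	"github.com/blevesearch/bleve"
)

func main() {
	im := bleve.NewIndexMapping()

	// register an "exception" tokenizer that protects URLs and emails,
	// delegating everything else to the "unicode" tokenizer
	err := im.AddCustomTokenizer("web_exceptions", map[string]interface{}{
		"type":      "exception",
		"tokenizer": "unicode",
		"exceptions": []interface{}{
			`[hH][tT][tT][pP][sS]?://(\S)*`,
			`\S+@\S+`,
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	// reference the registered tokenizer from a custom analyzer
	err = im.AddCustomAnalyzer("web_like", map[string]interface{}{
		"type":          "custom",
		"tokenizer":     "web_exceptions",
		"token_filters": []interface{}{"to_lower"},
	})
	if err != nil {
		log.Fatal(err)
	}
}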
Marty Schoch
93e01a803e fix issues identified by errcheck
part of #169
2015-04-07 14:52:00 -04:00
Marty Schoch
50bd082257 fixed issues with portuguese analyzer
fixes #70
2015-03-11 14:22:11 -04:00