Commit Graph

146 Commits

Author SHA1 Message Date
a-little-srdjan
9341cc835e Preventing panic on ngram initialization, and extending the type conversion. 2016-06-06 15:54:18 -04:00
Marty Schoch
2a703376ea fix ineffectual assignments 2016-04-02 22:42:56 -04:00
Marty Schoch
7892882519 fix typos 2016-04-02 21:59:30 -04:00
Marty Schoch
194ee82c80 gofmt simplifications 2016-04-02 21:54:33 -04:00
Marty Schoch
0b171c85da change "simple" analyzer to use "letter" tokenizer
This change improves compatibility with the simple analyzer
defined by Lucene. This has important implications for
some perf tests as well, since they often use the simple
analyzer.
2016-03-31 15:13:17 -04:00
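For reference, a minimal sketch of what the new behavior is equivalent to, expressed as a custom analyzer; the registry names "letter" and "to_lower" are assumptions, and this is not the commit's actual code:

package main

import (
	"fmt"

	"github.com/blevesearch/bleve"
)

func main() {
	m := bleve.NewIndexMapping()
	// An analyzer equivalent to the new "simple": letter tokenizer plus
	// lowercasing. "letter" and "to_lower" are assumed registry names.
	err := m.AddCustomAnalyzer("simple_like", map[string]interface{}{
		"type":          "custom",
		"tokenizer":     "letter",
		"token_filters": []string{"to_lower"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("registered simple_like")
}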
Ben Campbell
4fafb2be3f Merge branch 'master' into documenting 2016-03-23 10:48:09 +13:00
Marty Schoch
cecdfcbc69 moving japanese analyzer to blevex package 2016-03-13 18:05:05 -04:00
ikawaha
fcebff60e9 Add a test case 2016-02-21 19:59:52 +09:00
ikawaha
4fe7688431 Use a small version of kagome 2016-02-21 19:58:36 +09:00
Ben Campbell
47dbd85551 Merge branch 'master' into documenting 2016-01-29 09:31:30 +13:00
Marty Schoch
fc34a97875 copy locations on merge for safer, more predictable behavior
fixes #328
2016-01-19 14:21:48 -05:00
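The fix is a defensive copy: merge into a fresh slice so the result aliases neither input's backing array. A generic sketch of the technique with a stand-in Location type, not bleve's actual merge code:

package main

import "fmt"

// Location is a stand-in for a term-location record.
type Location struct {
	Pos, Start, End int
}

// mergeLocations returns a freshly allocated slice, so later appends or
// edits to the result cannot mutate either input's backing array.
func mergeLocations(a, b []*Location) []*Location {
	out := make([]*Location, 0, len(a)+len(b))
	out = append(out, a...)
	out = append(out, b...)
	return out
}

func main() {
	a := []*Location{{Pos: 1}}
	b := []*Location{{Pos: 2}}
	fmt.Println(len(mergeLocations(a, b))) // 2
}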
slavikm
680be52f87 Implemented boolean field support 2016-01-11 17:18:03 -08:00
Steve Yen
89d17f01ef analyze locations only if includeTermVectors enabled
With this change, TermLocations are computed and maintained only if
includeTermVectors is enabled, for higher performance.
2016-01-05 12:46:46 -08:00
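A sketch of the gating described above, using stand-in types; this illustrates the pattern, not the actual index code:

package main

import "fmt"

// Token is a minimal stand-in for an analysis token.
type Token struct {
	Term     string
	Position int
}

// termStats counts term frequencies, and records positions (the expensive
// part) only when includeTermVectors is set.
func termStats(tokens []Token, includeTermVectors bool) (map[string]int, map[string][]int) {
	freqs := make(map[string]int, len(tokens))
	var locs map[string][]int
	if includeTermVectors {
		locs = make(map[string][]int, len(tokens))
	}
	for _, t := range tokens {
		freqs[t.Term]++
		if includeTermVectors {
			locs[t.Term] = append(locs[t.Term], t.Position)
		}
	}
	return freqs, locs
}

func main() {
	toks := []Token{{"foo", 1}, {"bar", 2}, {"foo", 3}}
	freqs, locs := termStats(toks, true)
	fmt.Println(freqs, locs)
}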
Steve Yen
325a616993 unicode.Tokenize() avoids array growth via array of arrays 2016-01-02 12:21:25 -08:00
Steve Yen
918732f3d8 unicode.Tokenize() allocs backing array of Tokens
Previously, unicode.Tokenize() would allocate Tokens one by one,
on an as-needed basis.

This change allocates a "backing array" of Tokens, so that it goes to
the runtime object allocator much less often.  It takes a heuristic
guess as to the backing array size by using the average token
(segment) length seen so far.

Results from micro-benchmarks (null-firestorm, bleve-blast) suggest
a modest throughput improvement, perhaps under ~0.5 MB/second.
2016-01-02 12:21:25 -08:00
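A sketch of the backing-array idea under simplified assumptions: segmentation is faked with bytes.Fields, and avgSegLen stands in for the running average segment length the commit mentions.

package main

import (
	"bytes"
	"fmt"
)

// Token is a minimal stand-in for an analysis token.
type Token struct {
	Term []byte
}

// tokenize guesses the token count from an assumed average segment length,
// allocates all Tokens in one slab, and hands out pointers into it,
// falling back to a fresh slab when the guess runs out.
func tokenize(input []byte, avgSegLen int) []*Token {
	if avgSegLen <= 0 {
		avgSegLen = 1
	}
	words := bytes.Fields(input) // stand-in for real Unicode segmentation
	guess := len(input)/avgSegLen + 1
	backing := make([]Token, guess) // one allocation instead of one per token
	used := 0
	out := make([]*Token, 0, len(words))
	for _, w := range words {
		if used == len(backing) { // guess too low: allocate another slab
			backing = make([]Token, guess)
			used = 0
		}
		backing[used].Term = w
		out = append(out, &backing[used])
		used++
	}
	return out
}

func main() {
	for _, t := range tokenize([]byte("hello bleve world"), 5) {
		fmt.Printf("%s ", t.Term)
	}
}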
Steve Yen
a345e7951e TokenFrequency() alloc's all TokenLocations up front 2016-01-02 12:21:17 -08:00
Marty Schoch
9777846206 Merge branch 'master' into firestorm 2015-11-30 15:02:46 -05:00
Marty Schoch
e472b3e807 add support for a "web" tokenizer/analyzer
The goal of the "web" tokenizer is to recognize web things like
- email addresses
- URLs
- twitter @handles and #hashtags

This implementation uses regexp exceptions.  There will most
likely be endless debate about the regular expressions. These
were chosen as "good enough for now".

There is also a "web" analyzer.  This is just the "standard"
analyzer, but using the "web" tokenizer instead of the "unicode"
one.  NOTE: after processing the exceptions, it still falls back
to the standard "unicode" one.

Many users can simply set their mapping's default analyzer
to "web".

closes #269
2015-11-30 14:27:18 -05:00
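As the message suggests, opting in is a one-line mapping change; a minimal sketch, where the index path and document are examples:

package main

import "github.com/blevesearch/bleve"

func main() {
	mapping := bleve.NewIndexMapping()
	// Use the "web" analyzer (standard analyzer + "web" tokenizer)
	// wherever a field does not specify its own analyzer.
	mapping.DefaultAnalyzer = "web"

	index, err := bleve.New("example.bleve", mapping)
	if err != nil {
		panic(err)
	}
	defer index.Close()
	_ = index.Index("id1", map[string]interface{}{
		"text": "contact @bleve or see https://blevesearch.com",
	})
}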
Marty Schoch
a707d44e0b Merge branch 'master' into firestorm 2015-11-24 09:44:47 -05:00
Ben Campbell
994f4b4d11 added some godoc documentation for the en analyzer 2015-11-18 15:28:57 +13:00
Patrick Mezard
ff03874f19 token_map: document it along with stop_token_filter 2015-11-05 14:07:54 +01:00
Patrick Mezard
eb26402924 elision_filter: correctly strip multi-bytes quotation marks 2015-11-04 10:59:10 +01:00
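The underlying issue is byte indexing versus rune decoding: the ASCII apostrophe is one byte, but U+2019 (right single quotation mark) is three. A generic sketch of the rune-aware version, not the filter's actual code:

package main

import (
	"fmt"
	"unicode/utf8"
)

// stripElision removes a leading article plus apostrophe ("l'", "d'", ...),
// decoding the apostrophe as a rune so the multi-byte U+2019 right single
// quotation mark is handled, not just the one-byte ASCII '\''.
func stripElision(term string) string {
	for i, r := range term {
		if r == '\'' || r == '\u2019' {
			return term[i+utf8.RuneLen(r):]
		}
	}
	return term
}

func main() {
	fmt.Println(stripElision("l\u2019avion")) // avion
	fmt.Println(stripElision("l'avion"))      // avion
}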
Patrick Mezard
bae2079eb2 token_filters: fix typo in right single quotation mark name 2015-11-04 10:29:56 +01:00
Marty Schoch
01526e971f Merge branch 'master' into firestorm 2015-10-28 11:26:01 -04:00
Patrick Mezard
f95f1d29a0 exception: fail if pattern is empty, name tokenizer in error 2015-10-27 18:53:03 +01:00
Patrick Mezard
8b17787a65 analysis: document "exception" tokenizer, and Tokenizer interface 2015-10-27 18:53:03 +01:00
Patrick Mezard
e2fa3d6351 doc: document Token, TokenFrequencies and Field structs
It helps with understanding what is going on in the indexing code.
ArrayPositions() was particularly puzzling.
2015-10-09 12:32:44 +02:00
Marty Schoch
c3a4fab911 Merge pull request #238 from ikawaha/ja-morph-analyzer
fix compilation with the latest changes to kagome
2015-09-28 17:05:46 -04:00
ikawaha
89af7978a9 fix compilation with the latest changes to kagome 2015-09-28 15:53:08 +09:00
Marty Schoch
66aa1b020a Merge branch 'master' into firestorm 2015-09-23 11:32:25 -07:00
Marty Schoch
f81b2be334 major refactor of bleve configuration
see #221 for full details
2015-09-16 17:10:59 -04:00
Marty Schoch
37aa5cb027 Merge branch 'master' into firestorm 2015-09-09 09:03:42 -04:00
Marty Schoch
d00bc91dc9 minor speed up in token frequency calculations
benchmark               old ns/op     new ns/op     delta
BenchmarkAnalysis-4     1599218       1540991       -3.64%

benchmark               old allocs     new allocs     delta
BenchmarkAnalysis-4     5353           5318           -0.65%

benchmark               old bytes     new bytes     delta
BenchmarkAnalysis-4     370495        362983        -2.03%
2015-09-04 18:57:39 -04:00
Marty Schoch
84811cf5a0 made index type configurable + first version of firestorm 2015-08-25 14:52:42 -04:00
Donald Huang
767831d87c move custom_analyzer to custom_analyzer package 2015-08-11 21:22:03 +00:00
Marty Schoch
3682c25467 update to correctly work with composite fields
also updated search results to return array positions
2015-07-31 11:16:11 -04:00
Marty Schoch
1f4ef3da8b move elision filter after lowercase filter
this affects all languages using the elision filter;
languages fr and it are updated now.
languages ca and ga are still missing other components and
do not yet have an analyzer, but they should follow this lead
once they are ready

fixes #218
2015-07-21 10:43:53 -04:00
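The reason for the move: elision filters match lowercase articles, so lowercasing must run first. A sketch of the corrected chain as a custom analyzer; the filter registry names used here are assumptions:

package main

import "github.com/blevesearch/bleve"

func main() {
	m := bleve.NewIndexMapping()
	// Lowercase before elision, so the elision filter sees "l'avion",
	// never "L'avion". "to_lower" and "elision_fr" are assumed names.
	err := m.AddCustomAnalyzer("fr_ordered", map[string]interface{}{
		"type":          "custom",
		"tokenizer":     "unicode",
		"token_filters": []string{"to_lower", "elision_fr"},
	})
	if err != nil {
		panic(err)
	}
}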
Marty Schoch
65556f45c7 added additional tests for bug #214 2015-07-06 18:00:05 -04:00
Marty Schoch
0f16eccd6b new tokenizer that allows you to pre-identify tokens with regexp
name "exception"
configure with list of regexp string "exceptions"
these exceptions regexps that match sequences you want treated
as a single token.  these sequences are NOT sent to the
underlying tokenizer
configure "tokenizer" is the named tokenizer that should be
used for processing all text regions not matching exceptions

An example configuration with simple patterns to match URLs and
email addresses:

map[string]interface{}{
	"type":      "exception",
	"tokenizer": "unicode",
	"exceptions": []interface{}{
		`[hH][tT][tT][pP][sS]?://(\S)*`,
		`[fF][iI][lL][eE]://(\S)*`,
		`[fF][tT][pP]://(\S)*`,
		`\S+@\S+`,
	},
}
2015-04-08 15:31:58 -04:00
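A sketch of wiring that configuration into a mapping; the names "url_email" and "web_ish" are invented for the example, and "to_lower" is an assumed filter name:

package main

import "github.com/blevesearch/bleve"

func main() {
	m := bleve.NewIndexMapping()
	err := m.AddCustomTokenizer("url_email", map[string]interface{}{
		"type":      "exception",
		"tokenizer": "unicode",
		"exceptions": []interface{}{
			`[hH][tT][tT][pP][sS]?://(\S)*`,
			`\S+@\S+`,
		},
	})
	if err != nil {
		panic(err)
	}
	// Wrap the tokenizer in an analyzer so field mappings can reference it.
	err = m.AddCustomAnalyzer("web_ish", map[string]interface{}{
		"type":          "custom",
		"tokenizer":     "url_email",
		"token_filters": []string{"to_lower"},
	})
	if err != nil {
		panic(err)
	}
}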
Marty Schoch
93e01a803e fix issues identified by errcheck
part of #169
2015-04-07 14:52:00 -04:00
Marty Schoch
50bd082257 fixed issues with portuguese analyzer
fixes #70
2015-03-11 14:22:11 -04:00
Marty Schoch
7970f42c29 fix issues with italian analyzer
switch it to not require icu/libstemmer
fixes #69
2015-03-11 11:48:13 -04:00
Marty Schoch
eeaf514848 switch fr to not require icu/libstemmer
also corrected copy/paste bug in test
2015-03-11 11:46:33 -04:00
Marty Schoch
8ae30fb6f0 fix issues with lucene stemmer
fixes issue #68
2015-03-11 11:14:29 -04:00
Marty Schoch
300ec79c96 first pass at checking errors that were ignored
part of #169
2015-03-06 14:46:29 -05:00
Salmān Aljammāz
9444af9366 arabic: add unicode normalization to analyzer 2015-02-06 19:50:58 +03:00
Salmān Aljammāz
91a8d5da9f arabic: check minimum length before stemming
This involves converting tokens to a rune slice in the filter, but
at least we're now compatible with Lucene's stemmer.
2015-02-06 19:50:58 +03:00
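The length check has to count runes, not bytes, since Arabic letters are multi-byte in UTF-8. A generic sketch of the guard; minStemLen and the filter shape are illustrative, not the stemmer's real code:

package main

import "fmt"

const minStemLen = 4 // assumed threshold, for illustration only

// maybeStem skips stemming for terms shorter than minStemLen *runes*;
// len(term) alone would count UTF-8 bytes and overestimate Arabic text.
func maybeStem(term string, stem func([]rune) []rune) string {
	runes := []rune(term)
	if len(runes) < minStemLen {
		return term
	}
	return string(stem(runes))
}

func main() {
	drop1 := func(rs []rune) []rune { return rs[:len(rs)-1] } // toy "stemmer"
	fmt.Println(maybeStem("كتب", drop1))    // 3 runes: left unstemmed
	fmt.Println(maybeStem("الكتاب", drop1)) // 6 runes: stemmed
}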
Salmān Aljammāz
0470f93955 arabic: add more stemmer tests
These came from org.apache.lucene.analysis.ar.
2015-02-06 19:49:30 +03:00
Salmān Aljammāz
e461fed92a arabic stemmer: strip multiple suffixes
updates #150
2015-02-05 16:07:58 +03:00
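A generic sketch of stripping suffixes repeatedly rather than once; the suffix list below is a placeholder, not the Arabic stemmer's real table:

package main

import (
	"fmt"
	"strings"
)

// stripSuffixes keeps removing matching suffixes until none apply,
// so stacked endings come off in a single pass through the filter.
func stripSuffixes(term string, suffixes []string) string {
	for changed := true; changed; {
		changed = false
		for _, s := range suffixes {
			if strings.HasSuffix(term, s) && len(term) > len(s) {
				term = strings.TrimSuffix(term, s)
				changed = true
			}
		}
	}
	return term
}

func main() {
	fmt.Println(stripSuffixes("walkings", []string{"s", "ing"})) // walk
}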
Marty Schoch
4be974f489 added first implementation of arabic analyzer
one test case is not passing and is commented out temporarily
updates #150
2015-02-05 07:44:55 -05:00