Commit Graph

167 Commits

Author SHA1 Message Date
Marty Schoch
d4d3e7a043 address golint naming issues 2016-10-02 10:35:24 -04:00
Marty Schoch
2332455bd2 nicer formatting of license header 2016-10-02 10:13:14 -04:00
Marty Schoch
6bf9dd59ab BREAKING CHANGE - additional package renaming
I recently learned that package names should also prefer the
singular form, not the plural form
2016-10-01 17:20:59 -04:00
Marty Schoch
35da361bfa BREAKING CHANGE - renamed packages to be shorter and not use _
this commit only addresses the analysis sub-package
2016-09-30 12:36:10 -04:00
Marty Schoch
c5159251a9 make shingle token filter stateless
the previous implementation was incorrectly stateful, which
violates the contract for token filters

fixes #431
2016-09-15 08:59:43 -04:00
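The contract referenced here is that a token filter may be reused and shared across token streams, so working state must live inside each Filter call rather than on the filter struct. A minimal sketch of that shape, using simplified stand-in Token/TokenStream types and a toy 2-shingle rule rather than the package's real filter:

    package main

    import "fmt"

    // Simplified stand-ins for the analysis package's token types.
    type Token struct{ Term []byte }
    type TokenStream []*Token

    // ShingleFilter emits pairs of adjacent tokens (2-shingles).
    // All working state (the previous token) lives in a local variable
    // inside Filter, so one filter value can be reused across streams.
    type ShingleFilter struct{ sep string }

    func (f *ShingleFilter) Filter(input TokenStream) TokenStream {
        rv := make(TokenStream, 0, len(input))
        var prev *Token // per-call state, not a struct field
        for _, tok := range input {
            if prev != nil {
                shingle := append(append(append([]byte{}, prev.Term...), f.sep...), tok.Term...)
                rv = append(rv, &Token{Term: shingle})
            }
            prev = tok
        }
        return rv
    }

    func main() {
        f := &ShingleFilter{sep: " "}
        for _, t := range f.Filter(TokenStream{{Term: []byte("the")}, {Term: []byte("quick")}, {Term: []byte("fox")}}) {
            fmt.Println(string(t.Term))
        }
    }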
Marty Schoch
ffee3c3764 fixed regexp tokenizers to not produce empty tokens 2016-09-14 16:22:20 -04:00
Marty Schoch
ee61b2e866 Merge pull request #425 from mschoch/porterfaster
improve perf of porter stemmer
2016-09-11 20:22:23 -04:00
Marty Schoch
f8e8c9d065 Merge pull request #426 from mschoch/fasterbuildterms
encode runes directly into buffer
2016-09-11 20:19:09 -04:00
Marty Schoch
44ff6ced8a improve perf of porter stemmer
1.  the porter stemmer offers a method that does NOT lowercase, however
to use it we must convert to runes first ourselves, so we now do this

2.  we can now invoke the version that skips lowercasing, since we
already do this ourselves before stemming through a separate filter

because the stemmer modifies the runes in place we have no way
to know whether there were changes, thus we must always encode
back into the term byte slice

added a unit test which catches the problem found

NOTE: this uses analysis.BuildTermFromRunes, so the perf gain is
only visible with the other PR also merged

future gains are possible if we update the stemmer to let us
know whether changes were made, thus skipping re-encoding to
[]byte when no changes were actually made
2016-09-11 20:13:15 -04:00
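A rough sketch of the flow described above; stemWithoutLowerCasing is a toy stand-in with a hypothetical name for the stemmer entry point, and a plain string round-trip stands in for analysis.BuildTermFromRunes:

    package main

    import "fmt"

    // stemWithoutLowerCasing is a toy stand-in (hypothetical name) for the
    // stemmer entry point the commit refers to: one that assumes its input
    // is already lower-cased and may modify the rune slice in place.
    func stemWithoutLowerCasing(input []rune) []rune {
        if n := len(input); n > 2 && input[n-1] == 's' {
            return input[:n-1] // toy rule: strip a trailing 's'
        }
        return input
    }

    // stemTerm shows the flow: decode the term to runes ourselves, stem
    // without re-lowercasing, then always encode back to bytes, since we
    // cannot tell whether the stemmer changed anything.
    // (analysis.BuildTermFromRunes plays the re-encoding role in the PR.)
    func stemTerm(term []byte) []byte {
        runes := []rune(string(term))
        stemmed := stemWithoutLowerCasing(runes)
        return []byte(string(stemmed))
    }

    func main() {
        fmt.Println(string(stemTerm([]byte("filters")))) // filter
    }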
Marty Schoch
c13626be45 encode runes directly into buffer
avoid allocating an unnecessary intermediate buffer

also introduce a new method that lets a user optimistically
try to encode back into an existing buffer; if it isn't
large enough, it silently allocates a new one and returns it
2016-09-11 20:10:03 -04:00
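A sketch of the optimistic re-encode idea described here; buildTermFromRunesOptimistic is a hypothetical name for illustration, not the package's actual helper:

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    // buildTermFromRunesOptimistic tries to encode the runes back into buf;
    // only if buf is too small does it allocate a fresh slice and return that.
    func buildTermFromRunesOptimistic(buf []byte, runes []rune) []byte {
        need := 0
        for _, r := range runes {
            need += utf8.RuneLen(r) // invalid runes ignored for brevity
        }
        if need > cap(buf) {
            buf = make([]byte, need)
        }
        buf = buf[:need]
        i := 0
        for _, r := range runes {
            i += utf8.EncodeRune(buf[i:], r)
        }
        return buf
    }

    func main() {
        term := []byte("Observed")
        fmt.Println(string(buildTermFromRunesOptimistic(term, []rune("observed"))))
    }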
Marty Schoch
56c7b9f831 Merge pull request #423 from mschoch/stopfilterfaster
avoid allocation in stop token filter
2016-09-11 13:59:31 -04:00
Marty Schoch
9e9f172f81 speed up english possessive filter
the previous impl always did a full utf8 decode of every rune
if we assume most tokens are not possessive this is unnecessary
and even if they are, we only need to chop off the last two runes
so now we only decode the last rune of the token, and if it looks like
s/S we proceed to decode the second to last rune, and then
only if it looks like any form of apostrophe do we make any
changes to the token, again by just reslicing the original to chop
off the possessive extension
2016-09-11 12:55:03 -04:00
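A sketch of the reslicing approach described above; the set of apostrophe-like runes shown is illustrative, not necessarily the filter's exact list:

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    // Illustrative apostrophe-like runes.
    const (
        apostrophe          = '\''
        rightSingleQuote    = '\u2019'
        fullWidthApostrophe = '\uff07'
    )

    // stripPossessive decodes only the final rune, and only when it is s/S
    // decodes the one before it; when that looks like an apostrophe it
    // returns the term resliced to drop the possessive ending, allocating nothing.
    func stripPossessive(term []byte) []byte {
        lastRune, lastSize := utf8.DecodeLastRune(term)
        if lastRune != 's' && lastRune != 'S' {
            return term
        }
        prevRune, prevSize := utf8.DecodeLastRune(term[:len(term)-lastSize])
        switch prevRune {
        case apostrophe, rightSingleQuote, fullWidthApostrophe:
            return term[:len(term)-lastSize-prevSize]
        }
        return term
    }

    func main() {
        fmt.Println(string(stripPossessive([]byte("marty's")))) // marty
        fmt.Println(string(stripPossessive([]byte("filters")))) // filters
    }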
Marty Schoch
faa07ac3a6 avoid allocation in stop token filter
the token stream resulting from the removal of stop words must
be shorter or the same length as the original, so we just
reuse it and truncate it at the end.
2016-09-11 12:29:33 -04:00
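A minimal sketch of the in-place truncation trick, using simplified stand-ins for the package's Token/TokenStream types:

    package main

    import "fmt"

    // Simplified stand-ins for the analysis package's token types.
    type Token struct{ Term []byte }
    type TokenStream []*Token

    // removeStopTokens filters in place: survivors are appended over the
    // front of the same backing array and the stream is truncated, so no new
    // slice is allocated. Safe because the output can never outgrow the input.
    func removeStopTokens(input TokenStream, stop map[string]bool) TokenStream {
        rv := input[:0]
        for _, tok := range input {
            if !stop[string(tok.Term)] {
                rv = append(rv, tok)
            }
        }
        return rv
    }

    func main() {
        stop := map[string]bool{"the": true, "a": true}
        out := removeStopTokens(TokenStream{{Term: []byte("the")}, {Term: []byte("quick")}, {Term: []byte("fox")}}, stop)
        for _, t := range out {
            fmt.Println(string(t.Term))
        }
    }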
Marty Schoch
9089de251f remove byte_array_converters
fixes #392
fixes #100
2016-07-01 10:21:41 -04:00
Marty Schoch
fedb46269e updated whitespace to behave more like lucene/es 2016-06-10 15:30:43 -04:00
Marty Schoch
9c9dbcc90a fix another test issue 2016-06-10 13:21:27 -04:00
Marty Schoch
80f1117a6c add couchbase copyright and license now that CLA has been signed 2016-06-10 13:08:50 -04:00
Marty Schoch
043a3bfb7c change cjk analyzer to use unicode tokenizer
change cjk bigram analyzer to work with multi-rune terms
add cjk width filter, which replaces full unicode normalization

these changes make the cjk analyzer behave more like elasticsearch
they also remove the dependency on the whitespace analyzer
which is now free to also behave more like lucene/es

fixes #33
2016-06-10 13:04:40 -04:00
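A toy sketch of the pipeline shape this commit rearranges (a tokenizer followed by an ordered filter chain, here width then bigram); the types and stand-in filters are simplified for illustration and are not the package's own:

    package main

    import (
        "fmt"
        "strings"
    )

    // Simplified stand-ins for the pipeline pieces named in the commit.
    type TokenStream []string
    type Tokenizer func(text string) TokenStream
    type TokenFilter func(TokenStream) TokenStream

    // Analyzer composes one tokenizer with an ordered chain of token filters:
    // unicode tokenizer -> width filter -> bigram filter.
    type Analyzer struct {
        Tokenizer Tokenizer
        Filters   []TokenFilter
    }

    func (a Analyzer) Analyze(text string) TokenStream {
        ts := a.Tokenizer(text)
        for _, f := range a.Filters {
            ts = f(ts)
        }
        return ts
    }

    func main() {
        // Toy stand-ins: whitespace split, identity "width" filter, and a
        // bigram filter that pairs adjacent runes within each multi-rune term.
        tokenize := func(s string) TokenStream { return strings.Fields(s) }
        width := func(ts TokenStream) TokenStream { return ts }
        bigram := func(ts TokenStream) TokenStream {
            var out TokenStream
            for _, t := range ts {
                r := []rune(t)
                if len(r) < 2 {
                    out = append(out, t)
                    continue
                }
                for i := 0; i+1 < len(r); i++ {
                    out = append(out, string(r[i:i+2]))
                }
            }
            return out
        }
        a := Analyzer{Tokenizer: tokenize, Filters: []TokenFilter{width, bigram}}
        fmt.Println(a.Analyze("東京都 大阪市")) // [東京 京都 大阪 阪市]
    }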
a-little-srdjan
efe573bc10 removing duplicate code by reusing util.go in analysis 2016-06-09 15:13:30 -04:00
Marty Schoch
5722d7b1d1 Merge pull request #384 from a-little-srdjan/ngram_int_bounds
Preventing panic on ngram initialization, and extending the type conversion
2016-06-08 23:45:32 -04:00
a-little-srdjan
9341cc835e Preventing panic on ngram initialization, and extending the type conversion. 2016-06-06 15:54:18 -04:00
a-little-srdjan
3f2701a97c init. simple camel case parser. 2016-06-03 11:04:21 -04:00
Marty Schoch
2a703376ea fix ineffectual assignments 2016-04-02 22:42:56 -04:00
Marty Schoch
7892882519 fix typos 2016-04-02 21:59:30 -04:00
Marty Schoch
194ee82c80 gofmt simplifications 2016-04-02 21:54:33 -04:00
Marty Schoch
0b171c85da change "simple" analyzer to use "letter" tokenizer
this change improves compatibility with the simple analyzer
defined by Lucene.  this also has important implications for
some perf tests, as they often use the simple
analyzer.
2016-03-31 15:13:17 -04:00
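A sketch of the behavior a "letter" tokenizer plus lowercasing gives, which is roughly what Lucene's SimpleAnalyzer does; this is an illustration, not the package's actual tokenizer:

    package main

    import (
        "fmt"
        "strings"
        "unicode"
    )

    // letterTokenize mimics a "letter" tokenizer: terms are maximal runs of
    // letter runes, so digits and punctuation act as separators.
    func letterTokenize(text string) []string {
        return strings.FieldsFunc(text, func(r rune) bool { return !unicode.IsLetter(r) })
    }

    // simpleAnalyze then only needs to lower-case each term.
    func simpleAnalyze(text string) []string {
        terms := letterTokenize(text)
        for i, t := range terms {
            terms[i] = strings.ToLower(t)
        }
        return terms
    }

    func main() {
        fmt.Println(simpleAnalyze("Bleve 0.5: Full-Text Search")) // [bleve full text search]
    }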
Ben Campbell
4fafb2be3f Merge branch 'master' into documenting 2016-03-23 10:48:09 +13:00
Marty Schoch
cecdfcbc69 moving japanese analyzer to blevex package 2016-03-13 18:05:05 -04:00
ikawaha
fcebff60e9 Add a test case 2016-02-21 19:59:52 +09:00
ikawaha
4fe7688431 Use a small version of kagome 2016-02-21 19:58:36 +09:00
Ben Campbell
47dbd85551 Merge branch 'master' into documenting 2016-01-29 09:31:30 +13:00
Marty Schoch
fc34a97875 copy locations on merge for safer/more predictable behavior
fixes #328
2016-01-19 14:21:48 -05:00
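A sketch of the copy-on-merge idea, assuming a simplified Location type: the merged result gets its own backing array instead of aliasing either input:

    package main

    import "fmt"

    // Location is a simplified stand-in for a term location.
    type Location struct{ Pos, Start, End int }

    // mergeLocations returns a freshly allocated slice holding copies of both
    // inputs rather than appending onto either input's backing array, so
    // later mutation or reuse of the inputs cannot change the merged result.
    func mergeLocations(a, b []Location) []Location {
        merged := make([]Location, 0, len(a)+len(b))
        merged = append(merged, a...)
        return append(merged, b...)
    }

    func main() {
        a := []Location{{Pos: 1, Start: 0, End: 5}}
        b := []Location{{Pos: 7, Start: 30, End: 36}}
        fmt.Println(mergeLocations(a, b))
    }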
slavikm
680be52f87 Implemented boolean field support 2016-01-11 17:18:03 -08:00
Steve Yen
89d17f01ef analyze locations only if includeTermVectors enabled
With this change, TermLocations are computed and maintained only if
includeTermVectors is enabled, for higher performance.
2016-01-05 12:46:46 -08:00
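A simplified sketch of the conditional described here, with stand-in TokenFreq/TokenLocation types; locations are only accumulated when term vectors are requested:

    package main

    import (
        "fmt"
        "strings"
    )

    // Simplified stand-ins for the index's frequency/location types.
    type TokenLocation struct{ Position int }
    type TokenFreq struct {
        Frequency int
        Locations []TokenLocation
    }

    // tokenFrequencies counts terms, and only tracks per-occurrence locations
    // when includeTermVectors is true, mirroring the conditional described here.
    func tokenFrequencies(terms []string, includeTermVectors bool) map[string]*TokenFreq {
        rv := make(map[string]*TokenFreq)
        for pos, term := range terms {
            tf, ok := rv[term]
            if !ok {
                tf = &TokenFreq{}
                rv[term] = tf
            }
            tf.Frequency++
            if includeTermVectors {
                tf.Locations = append(tf.Locations, TokenLocation{Position: pos + 1})
            }
        }
        return rv
    }

    func main() {
        terms := strings.Fields("a b a")
        fmt.Println(tokenFrequencies(terms, false)["a"].Locations) // []
        fmt.Println(tokenFrequencies(terms, true)["a"].Locations)  // [{1} {3}]
    }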
Steve Yen
325a616993 unicode.Tokenize() avoids array growth via array of arrays 2016-01-02 12:21:25 -08:00
Steve Yen
918732f3d8 unicode.Tokenize() allocs backing array of Tokens
Previously, unicode.Tokenize() would allocate a Token one-by-one, on
an as-needed basis.

This change allocates a "backing array" of Tokens, so that it goes to
the runtime object allocator much less often.  It takes a heuristic
guess as to the backing array size by using the average token
(segment) length seen so far.

Results from micro-benchmark (null-firestorm, bleve-blast) seem to
give perhaps less than ~0.5 MB/second throughput improvement.
2016-01-02 12:21:25 -08:00
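A sketch of the backing-array idea, with a simplified Token type and a fixed chunk size where the real code guesses from the average segment length seen so far:

    package main

    import "fmt"

    // Simplified stand-in for the analysis package's Token.
    type Token struct {
        Term  []byte
        Start int
        End   int
    }

    // tokenArena hands out *Token values from a pre-allocated backing array,
    // growing it a chunk at a time instead of hitting the allocator per token.
    type tokenArena struct {
        backing []Token
        used    int
    }

    func (a *tokenArena) next(chunk int) *Token {
        if a.used >= len(a.backing) {
            a.backing = make([]Token, chunk)
            a.used = 0
        }
        tok := &a.backing[a.used]
        a.used++
        return tok
    }

    func main() {
        var arena tokenArena
        text := []byte("hello world")
        segments := [][2]int{{0, 5}, {6, 11}} // pretend segmenter output
        for _, seg := range segments {
            tok := arena.next(64)
            tok.Term, tok.Start, tok.End = text[seg[0]:seg[1]], seg[0], seg[1]
            fmt.Println(string(tok.Term), tok.Start, tok.End)
        }
    }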
Steve Yen
a345e7951e TokenFrequency() allocates all TokenLocations up front 2016-01-02 12:21:17 -08:00
Marty Schoch
9777846206 Merge branch 'master' into firestorm 2015-11-30 15:02:46 -05:00
Marty Schoch
e472b3e807 add support for a "web" tokenizer/analyzer
The goal of the "web" tokenizer is to recognize web things like
- email addresses
- URLs
- twitter @handles and #hashtags

This implementation uses regexp exceptions.  There will most
likely be endless debate about the regular expressions. These
were chosen as "good enough for now".

There is also a "web" analyzer.  This is just the "standard"
analyzer, but using the "web" tokenizer instead of the "unicode"
one.  NOTE: after processing the exceptions, it still falls back
to the standard "unicode" one.

For many users, you can simply set your mapping's default analyzer
to be "web".

closes #269
2015-11-30 14:27:18 -05:00
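A sketch of the exception-tokenizer idea described above: regexp matches are emitted as whole tokens and the remaining text is handed to a fallback tokenizer. The patterns and names here are illustrative, not the project's exact ones:

    package main

    import (
        "fmt"
        "regexp"
        "strings"
    )

    // Illustrative exception patterns for emails, URLs, @handles and #hashtags.
    var webPatterns = regexp.MustCompile(`([\w.+-]+@[\w.-]+)|(https?://\S+)|([@#]\w+)`)

    // exceptionTokenize emits each pattern match as a single whole token and
    // hands the text in between to a fallback tokenizer (the "unicode" one in
    // the real analyzer; strings.Fields stands in for it below).
    func exceptionTokenize(text string, fallback func(string) []string) []string {
        var tokens []string
        last := 0
        for _, m := range webPatterns.FindAllStringIndex(text, -1) {
            tokens = append(tokens, fallback(text[last:m[0]])...)
            tokens = append(tokens, text[m[0]:m[1]]) // exception kept whole
            last = m[1]
        }
        return append(tokens, fallback(text[last:])...)
    }

    func main() {
        fmt.Println(exceptionTokenize("email me@example.com or ping @mschoch #bleve", strings.Fields))
    }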
Marty Schoch
a707d44e0b Merge branch 'master' into firestorm 2015-11-24 09:44:47 -05:00
Ben Campbell
994f4b4d11 added some godoc documentation for the en analyzer 2015-11-18 15:28:57 +13:00
Patrick Mezard
ff03874f19 token_map: document it along with stop_token_filter 2015-11-05 14:07:54 +01:00
Patrick Mezard
eb26402924 elision_filter: correctly strip multi-bytes quotation marks 2015-11-04 10:59:10 +01:00
Patrick Mezard
bae2079eb2 token_filters: fix typo in right single quotation mark name 2015-11-04 10:29:56 +01:00
Marty Schoch
01526e971f Merge branch 'master' into firestorm 2015-10-28 11:26:01 -04:00
Patrick Mezard
f95f1d29a0 exception: fail if pattern is empty, name tokenizer in error 2015-10-27 18:53:03 +01:00
Patrick Mezard
8b17787a65 analysis: document "exception" tokenizer, and Tokenizer interface 2015-10-27 18:53:03 +01:00
Patrick Mezard
e2fa3d6351 doc: document Token, TokenFrequencies and Field structs
It helps in understanding what is going on in the indexing code.
ArrayPositions() was particularly puzzling.
2015-10-09 12:32:44 +02:00
Marty Schoch
c3a4fab911 Merge pull request #238 from ikawaha/ja-morph-analyzer
fix compilation with the latest changes to kagome
2015-09-28 17:05:46 -04:00
ikawaha
89af7978a9 fix compilation with the latest changes to kagome 2015-09-28 15:53:08 +09:00