Commit Graph

5 Commits

Author SHA1 Message Date
Steve Yen
325a616993 unicode.Tokenize() avoids array growth via array of arrays 2016-01-02 12:21:25 -08:00
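A minimal sketch of the array-of-arrays idea behind this commit, assuming a simplified Token type; the tokenChunks name and chunk size are hypothetical, not bleve's actual internals. The point is that when a backing array fills up, a fresh one is appended rather than growing (and copying) a single slice:

```go
package main

import "fmt"

// Token is a stand-in for bleve's analysis.Token (the real struct also
// carries Term, Position, and Type).
type Token struct {
	Start int
	End   int
}

// tokenChunks (hypothetical) hands out Tokens from fixed-size backing
// arrays. When a chunk fills up, a new one is appended to the outer
// slice instead of growing a single slice, so nothing is copied and
// pointers into earlier chunks stay valid.
type tokenChunks struct {
	chunks    [][]Token // array of backing arrays
	used      int       // Tokens handed out from the newest chunk
	chunkSize int
}

func (tc *tokenChunks) next() *Token {
	if len(tc.chunks) == 0 || tc.used == tc.chunkSize {
		tc.chunks = append(tc.chunks, make([]Token, tc.chunkSize))
		tc.used = 0
	}
	t := &tc.chunks[len(tc.chunks)-1][tc.used]
	tc.used++
	return t
}

func main() {
	tc := &tokenChunks{chunkSize: 4}
	for i := 0; i < 10; i++ {
		t := tc.next()
		t.Start, t.End = i, i+1
	}
	fmt.Println(len(tc.chunks), "chunks for 10 tokens") // 3 chunks, no reallocation copies
}
```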
Steve Yen
918732f3d8 unicode.Tokenize() allocs backing array of Tokens
Previously, unicode.Tokenize() would allocate Tokens one by one, on
an as-needed basis.

This change allocates a "backing array" of Tokens, so that it goes to
the runtime object allocator much less often.  It takes a heuristic
guess as to the backing array size by using the average token
(segment) length seen so far.

Results from micro-benchmarks (null-firestorm, bleve-blast) suggest a
modest throughput improvement, likely a bit under ~0.5 MB/second.
2016-01-02 12:21:25 -08:00
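A sketch of the batch-allocation heuristic this message describes, under stated assumptions: Token is simplified, the segments parameter (start/end byte offsets) stands in for the segmenter's output, and the default guess of 8 bytes per token is arbitrary:

```go
package main

import "fmt"

// Token is a stand-in for bleve's analysis.Token.
type Token struct {
	Start, End int
}

// tokenize sketches the heuristic: rather than one allocation per Token,
// guess a backing-array size from the average segment length seen so far
// and carve Tokens out of that array until it runs dry.
func tokenize(input []byte, segments [][2]int) []*Token {
	out := make([]*Token, 0, len(segments))
	var backing []Token
	bytesSeen, tokensSeen := 0, 0
	for _, seg := range segments {
		if len(backing) == 0 {
			avg := 8 // arbitrary default before any tokens are seen
			if tokensSeen > 0 && bytesSeen/tokensSeen > 0 {
				avg = bytesSeen / tokensSeen
			}
			// guess how many tokens the rest of the input holds
			backing = make([]Token, (len(input)-seg[0])/avg+1)
		}
		t := &backing[0]
		backing = backing[1:]
		t.Start, t.End = seg[0], seg[1]
		out = append(out, t)
		bytesSeen += seg[1] - seg[0]
		tokensSeen++
	}
	return out
}

func main() {
	input := []byte("hello wide world")
	segs := [][2]int{{0, 5}, {6, 10}, {11, 16}}
	fmt.Println(len(tokenize(input, segs)), "tokens allocated in batches")
}
```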
Marty Schoch
0a4844f9d0 change unicode tokenizer to use direct segmenter api 2015-01-12 17:57:45 -05:00
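A sketch of what the direct API buys, assuming the scanner-style interface (Segment/Bytes/Type/Err) that github.com/blevesearch/segment exposes; NewWordSegmenterDirect consumes a byte slice directly, skipping the io.Reader indirection of segment.NewWordSegmenter:

```go
package main

import (
	"fmt"

	"github.com/blevesearch/segment"
)

func main() {
	segmenter := segment.NewWordSegmenterDirect([]byte("hello 世界"))
	// Iterate word segments bufio.Scanner-style.
	for segmenter.Segment() {
		fmt.Printf("%q type=%d\n", segmenter.Bytes(), segmenter.Type())
	}
	if err := segmenter.Err(); err != nil {
		panic(err)
	}
}
```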
Marty Schoch
fcab645f96 add test to cover kana/ideographic case 2014-11-26 08:42:40 -05:00
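The shape of such a test might look like the following sketch; the exact expected segments are an assumption about the package's UAX#29 behavior, under which a katakana run such as "テスト" stays one segment while each ideograph in "世界" stands alone:

```go
package unicode_test

import (
	"reflect"
	"testing"

	"github.com/blevesearch/segment"
)

// Sketch of a kana/ideographic coverage test.
func TestKanaIdeographic(t *testing.T) {
	segmenter := segment.NewWordSegmenterDirect([]byte("テスト世界"))
	var got []string
	for segmenter.Segment() {
		got = append(got, string(segmenter.Bytes()))
	}
	if err := segmenter.Err(); err != nil {
		t.Fatal(err)
	}
	want := []string{"テスト", "世", "界"}
	if !reflect.DeepEqual(got, want) {
		t.Fatalf("got %v, want %v", got, want)
	}
}
```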
Marty Schoch
cf3643f292 added pure go tokenizer to do unicode word boundary segmentation 2014-10-17 18:07:48 -04:00
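For context, using the tokenizer this commit introduced looks roughly like the sketch below; the import path reflects the repository layout of that era and may differ in later bleve versions:

```go
package main

import (
	"fmt"

	"github.com/blevesearch/bleve/analysis/tokenizers/unicode"
)

func main() {
	tokenizer := unicode.NewUnicodeTokenizer()
	// Tokenize returns an analysis.TokenStream of word-boundary segments.
	for _, tok := range tokenizer.Tokenize([]byte("hello 世界")) {
		fmt.Printf("%s [%d:%d] type=%d\n", tok.Term, tok.Start, tok.End, tok.Type)
	}
}
```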