753 Commits

Author SHA1 Message Date
Marty Schoch
e5c1af4164 add travis config to run integration tests against firestorm 2016-01-05 13:00:36 -05:00
Marty Schoch
ab67b2f642 Merge pull request #267 from pmezard/doc-dump-methods
index: document DumpAll, DumpDoc and DumpFields methods
2016-01-05 09:55:35 -05:00
Marty Schoch
db7363fba1 Merge pull request #305 from steveyen/WIP-perf-20160102
perf 20160102
2016-01-05 08:54:47 -05:00
Steve Yen
70b7e73c82 firestorm compensator inFlight.Get() might return nil 2016-01-03 10:21:54 -08:00
Steve Yen
fb8c9a7475 firestorm.Batch() collects [][]IndexRows instead of []IndexRow
Rather than append() all received rows into a flat []IndexRow during
the result gathering loop, this change instead collects the analysis
result rows into a [][]IndexRow, which avoids extra copying.

As part of this, firestorm batchRows() now takes the [][]IndexRow as
its input.
2016-01-02 12:30:47 -08:00
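The idea in the commit above can be illustrated with a minimal Go sketch. The `IndexRow` type, `gatherNested`, and `countRows` here are hypothetical stand-ins, not the actual firestorm code: each analysis result slice is kept as-is in a `[][]IndexRow`, so no rows are copied while gathering.

```go
package main

import "fmt"

// IndexRow is a hypothetical stand-in for the firestorm row type.
type IndexRow struct{ Key string }

// gatherNested receives n result slices and stores each slice header
// in a [][]IndexRow. The rows themselves are never copied.
func gatherNested(results <-chan []IndexRow, n int) [][]IndexRow {
	all := make([][]IndexRow, 0, n)
	for i := 0; i < n; i++ {
		all = append(all, <-results)
	}
	return all
}

// countRows walks the nested slices, the way a consumer that takes
// [][]IndexRow as input might.
func countRows(nested [][]IndexRow) int {
	total := 0
	for _, rows := range nested {
		total += len(rows)
	}
	return total
}

func main() {
	results := make(chan []IndexRow, 2)
	results <- []IndexRow{{Key: "a"}, {Key: "b"}}
	results <- []IndexRow{{Key: "c"}}
	nested := gatherNested(results, 2)
	fmt.Println(len(nested), countRows(nested))
}
```

The trade-off is that every consumer has to iterate two levels instead of one.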
Steve Yen
1c5b84911d firestorm DictUpdater NotifyBatch is more async 2016-01-02 12:21:25 -08:00
Steve Yen
b241242465 firestorm.Analyze() preallocs rows, with analyzeField() func
The new analyzeField() helper func is used for both regular fields and
for composite fields.

With this change, all analysis is done up front, for both regular
fields and composite fields.

After analysis, this change counts up all the row capacity needed and
extends the AnalysisResult.Rows in one shot, as opposed to the
previous approach of dynamically growing the array as needed during
append()'s.

Also, in this change, the TermFreqRow for _id is added first, which
seems more correct.
2016-01-02 12:21:25 -08:00
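A rough sketch of the "count first, then extend in one shot" approach described above, assuming a hypothetical `Row` type and `collectRows` helper (neither is taken from the firestorm source):

```go
package main

import "fmt"

// Row is a hypothetical stand-in for an analysis result row.
type Row struct{ Key string }

// collectRows first sums the capacity needed across all fields, then
// allocates the destination once, so append never has to regrow it.
func collectRows(perField [][]Row) []Row {
	total := 0
	for _, rows := range perField {
		total += len(rows)
	}
	out := make([]Row, 0, total)
	for _, rows := range perField {
		out = append(out, rows...)
	}
	return out
}

func main() {
	perField := [][]Row{
		{{Key: "title:a"}, {Key: "title:b"}},
		{{Key: "body:c"}},
	}
	out := collectRows(perField)
	fmt.Println(len(out), cap(out))
}
```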
Steve Yen
325a616993 unicode.Tokenize() avoids array growth via array of arrays 2016-01-02 12:21:25 -08:00
Steve Yen
918732f3d8 unicode.Tokenize() allocs backing array of Tokens
Previously, unicode.Tokenize() would allocate a Token one-by-one, on
an as-needed basis.

This change allocates a "backing array" of Tokens, so that it goes to
the runtime object allocator much less often.  It takes a heuristic
guess as to the backing array size by using the average token
(segment) length seen so far.

Results from a micro-benchmark (null-firestorm, bleve-blast) suggest a
throughput improvement of less than ~0.5 MB/second.
2016-01-02 12:21:25 -08:00
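The "backing array" technique from the commit above might look roughly like the following. This is a sketch under stated assumptions: the `Token` type, the `tokenAllocator` helper, the whitespace splitting, and the initial average length of 5 are all hypothetical and only illustrate the allocation pattern, not the real unicode tokenizer.

```go
package main

import (
	"fmt"
	"strings"
)

// Token is a hypothetical, simplified token.
type Token struct{ Term string }

// tokenAllocator hands out Tokens from a shared backing array and only
// allocates a new array when the current one is used up.
type tokenAllocator struct{ backing []Token }

func (a *tokenAllocator) next(chunk int) *Token {
	if len(a.backing) == 0 {
		a.backing = make([]Token, chunk)
	}
	t := &a.backing[0]
	a.backing = a.backing[1:]
	return t
}

// tokenize splits on whitespace (an assumption for brevity) and sizes
// each new backing array from the average token length seen so far.
func tokenize(input string) []*Token {
	var alloc tokenAllocator
	var out []*Token
	seenBytes, seenTokens := 0, 0
	for _, word := range strings.Fields(input) {
		avg := 5 // initial guess before any token has been seen
		if seenTokens > 0 {
			avg = seenBytes / seenTokens
		}
		guess := (len(input)-seenBytes)/avg + 1
		t := alloc.next(guess)
		t.Term = word
		out = append(out, t)
		seenBytes += len(word) + 1
		seenTokens++
	}
	return out
}

func main() {
	for _, t := range tokenize("a quick brown fox") {
		fmt.Println(t.Term)
	}
}
```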
Steve Yen
5b2bc1c20f firestorm.indexField() check for includeTermVectors moved out of loop 2016-01-02 12:21:25 -08:00
Steve Yen
45e9eaaacb firestorm.indexField() allocs up-front array of TermFreqRow's
This uses the "backing array" technique to allocate many TermFreqRow's
at the front of firestorm.indexField(), instead of the previous
one-by-one, as-needed TermFreqRow allocation approach.

In a micro-benchmark (null-firestorm, bleve-blast), this change
produces an improvement of roughly 0.5 MB/sec.
2016-01-02 12:21:24 -08:00
Steve Yen
7ae696d661 firestorm lookuper notified via batch
Previously, the firestorm.Batch() would notify the lookuper goroutine
on a document by document basis.  If the lookuper input channel became
full, then that would block the firestorm.Batch() operation.

With this change, lookuper is notified once, with a "batch" that is an
[]*InFlightItem.

This change also reuses that same []*InFlightItem to invoke the
compensator.MutateBatch().

This also has the advantage of only converting the docID's from string
to []byte just once, outside of the lock that's used by the
compensator.

A micro-benchmark of this change with null-firestorm bleve-blast shows
no large impact, neither degradation nor improvement.
2016-01-02 12:21:24 -08:00
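As a minimal sketch of the batch notification described above: the `InFlightItem` fields and the `notifyBatch` helper are assumptions for illustration, and the real lookuper and compensator are not modeled. The point is that the channel is written to once per batch rather than once per document, and that each docID is converted to `[]byte` exactly once.

```go
package main

import "fmt"

// InFlightItem is a hypothetical, simplified in-flight record.
type InFlightItem struct{ DocID []byte }

// notifyBatch converts every docID to []byte once, then sends the whole
// batch over the channel in a single operation. The same slice is
// returned so a caller can reuse it for a later step.
func notifyBatch(lookuper chan<- []*InFlightItem, docIDs []string) []*InFlightItem {
	batch := make([]*InFlightItem, 0, len(docIDs))
	for _, id := range docIDs {
		batch = append(batch, &InFlightItem{DocID: []byte(id)})
	}
	lookuper <- batch
	return batch
}

func main() {
	lookuper := make(chan []*InFlightItem, 1)
	batch := notifyBatch(lookuper, []string{"doc1", "doc2", "doc3"})
	received := <-lookuper
	fmt.Println(len(batch), len(received))
}
```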
Steve Yen
38d50ed8b5 renamed var to docsUpdated to match docsDeleted naming 2016-01-02 12:21:24 -08:00
Steve Yen
3feeb14b7d firestorm.batchRows reuses buf for all IndexRows 2016-01-02 12:21:24 -08:00
Steve Yen
0a7f7e3df8 firestorm.Analyze() converts docID to bytes only once 2016-01-02 12:21:24 -08:00
Steve Yen
fd81d0364c firestorm.indexField() uses capacity of len(tokenFreqs) 2016-01-02 12:21:24 -08:00
Steve Yen
a345e7951e TokenFrequency() alloc's all TokenLocations up front 2016-01-02 12:21:17 -08:00
Steve Yen
ee5ccda112 use KeyTo/ValueTo in firestorm.batchRows
After this change, with null kvstore micro-benchmark...

  GOMAXPROCS=8 ./bleve-blast -source=../../tmp/enwiki.txt \
    -count=100000 -numAnalyzers=8 -numIndexers=8 \
    -config=../../configs/null-firestorm.json -batch=100

Then the TermFreqRow key and value methods disappear as large boxes from
the cpu profile graphs.
2016-01-01 09:57:59 -08:00
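The general pattern behind a `KeyTo`-style method is to serialize into a caller-supplied buffer instead of allocating a new slice per row. The sketch below assumes a hypothetical `termRow` type with a made-up key layout; it is not the actual TermFreqRow encoding.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// termRow is a hypothetical row with a made-up key layout:
// one prefix byte, a 2-byte field id, then the term bytes.
type termRow struct {
	field uint16
	term  []byte
}

// KeySize reports how many bytes KeyTo will write.
func (r *termRow) KeySize() int { return 1 + 2 + len(r.term) }

// KeyTo writes the key into buf and returns the number of bytes used,
// so one buffer can be reused across many rows.
func (r *termRow) KeyTo(buf []byte) (int, error) {
	if len(buf) < r.KeySize() {
		return 0, errors.New("buffer too small")
	}
	buf[0] = 't'
	binary.LittleEndian.PutUint16(buf[1:3], r.field)
	n := copy(buf[3:], r.term)
	return 3 + n, nil
}

func main() {
	rows := []*termRow{
		{field: 1, term: []byte("cat")},
		{field: 2, term: []byte("horse")},
	}
	buf := make([]byte, 16) // shared by every row in the loop
	for _, r := range rows {
		n, err := r.KeyTo(buf)
		fmt.Println(n, err)
	}
}
```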
Steve Yen
fd287bdfa4 firestorm.md markdown fixes 2016-01-01 09:57:59 -08:00
Steve Yen
b605224106 use shorter go idiom 2015-12-29 22:14:45 -08:00
Marty Schoch
6ddcde4c04 Merge pull request #294 from Shugyousha/fuzzytest
Add tests for fuzzy search
2015-12-25 11:38:35 -08:00
Marty Schoch
8ae2aee0bc Merge pull request #297 from aybabtme/firestorm-dont-gc-if-no-documents
Firestorm: don't gc if no documents
2015-12-25 11:23:49 -08:00
Antoine Grondin
6806343677 firestorm: fix #296 for division by zero on GC 2015-12-25 11:34:19 +07:00
Antoine Grondin
a6f7abdfa3 firestorm: reproducer for division by zero on GC 2015-12-25 11:33:46 +07:00
Marty Schoch
8efbd556a3 fix indexing bug with data coming from arrays
fixes #295
2015-12-21 14:59:32 -05:00
Marty Schoch
7bb58e1be4 add ability for integration test to check hit locations 2015-12-21 14:42:43 -05:00
Silvan Jegen
84c755cdb0 Add tests for fuzzy search 2015-12-20 17:00:46 +01:00
Marty Schoch
f7698f1f15 support match_all, match_none and docid queries via JSON
also fixed a bug in docIDQuery execution that failed to match
the highest docID passed in, even when it was in fact a
valid ID
2015-12-16 14:53:14 -05:00
Marty Schoch
849b69c318 more enhancements to bleve_query 2015-12-16 14:52:33 -05:00
Marty Schoch
cf67fe2cbc fix major synchronization issue in the field_cache
The field cache is expected to be the authority on which field
names are identified by which identifier.  This code was
optimized for the most common case in which fields already
exist.  However, if we determine the field is missing with
the read lock (shared), we incorrectly immediately proceed
to create a new row with the write lock (exclusive).  The
problem is that multiple goroutines might have come to
the same conclusion, and they all proceed to add rows.  The two
choices were to do the whole operation with the write lock, or
recheck the value again with the write lock.  We have chosen
to repeat the check inside the write-lock, as this optimizes
for what we believe to be the most common case, in which most
fields will already exist.
2015-12-15 16:39:38 -05:00
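The fix described above is the usual "recheck under the write lock" pattern. A minimal sketch, assuming a hypothetical `fieldCache` type with illustrative field names (this is not the actual bleve field cache):

```go
package main

import (
	"fmt"
	"sync"
)

// fieldCache maps field names to numeric identifiers (hypothetical).
type fieldCache struct {
	mu    sync.RWMutex
	index map[string]uint16
	next  uint16
}

func newFieldCache() *fieldCache {
	return &fieldCache{index: make(map[string]uint16)}
}

// fieldIndex takes the cheap read lock for the common case where the
// field already exists. On a miss it takes the write lock and checks
// again, because another goroutine may have added the field between
// the two locks.
func (c *fieldCache) fieldIndex(name string) uint16 {
	c.mu.RLock()
	idx, ok := c.index[name]
	c.mu.RUnlock()
	if ok {
		return idx
	}

	c.mu.Lock()
	defer c.mu.Unlock()
	if idx, ok := c.index[name]; ok {
		return idx // someone else created it first
	}
	idx = c.next
	c.next++
	c.index[name] = idx
	return idx
}

func main() {
	c := newFieldCache()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.fieldIndex("title")
		}()
	}
	wg.Wait()
	fmt.Println(len(c.index), c.next)
}
```

Without the second check, several goroutines that all missed under the read lock would each assign a new identifier to the same field name.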
Marty Schoch
84ec206fec add some tests for index names in results 2015-12-08 14:38:46 -05:00
Marty Schoch
d73beac3b9 search result hits now have a field with the name of the index
this allows you to figure out where a result actually came
from when using aliases
2015-12-08 13:55:04 -05:00
Marty Schoch
9d30e1c96b Merge branch 'master' into give_indexes_names 2015-12-08 11:56:53 -05:00
Marty Schoch
b4d4ee2fff fix incorrect results returned by phrase search
previously phrase searcher would not validate that consecutive
terms were actually occurring in the same array position

fixes #292
2015-12-06 15:55:00 -05:00
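To illustrate the kind of check this fix describes, here is a small sketch. The `location` type and both helpers are hypothetical: two terms count as consecutive in a phrase only if their positions are adjacent and they come from the same array element.

```go
package main

import "fmt"

// location is a hypothetical term occurrence: its position within the
// field value, plus the array indexes leading to that value.
type location struct {
	pos            int
	arrayPositions []int
}

// sameArrayElement reports whether two occurrences share identical
// array positions.
func sameArrayElement(a, b location) bool {
	if len(a.arrayPositions) != len(b.arrayPositions) {
		return false
	}
	for i := range a.arrayPositions {
		if a.arrayPositions[i] != b.arrayPositions[i] {
			return false
		}
	}
	return true
}

// followsInPhrase requires both an adjacent position and the same
// array element.
func followsInPhrase(prev, next location) bool {
	return next.pos == prev.pos+1 && sameArrayElement(prev, next)
}

func main() {
	first := location{pos: 1, arrayPositions: []int{0}}
	sameElem := location{pos: 2, arrayPositions: []int{0}}
	otherElem := location{pos: 2, arrayPositions: []int{1}}
	fmt.Println(followsInPhrase(first, sameElem))
	fmt.Println(followsInPhrase(first, otherElem))
}
```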
Marty Schoch
6e9da3bab7 allow running prefix queries through bleve_query command 2015-12-06 14:01:53 -05:00
Marty Schoch
aa7658bbb0 give indexes names, make stats available via expvar by default 2015-12-06 14:01:03 -05:00
Marty Schoch
a73a178923 fix incorrect prefix search behavior
avoids double incrementing of end term when reading term dict
fixes #293
2015-12-04 14:07:16 -05:00
Marty Schoch
699c86073a make existing integration tests work with firestorm 2015-12-01 12:29:56 -05:00
Marty Schoch
9777846206 Merge branch 'master' into firestorm 2015-11-30 15:02:46 -05:00
Marty Schoch
e472b3e807 add support for a "web" tokenizer/analyzer
The goal of the "web" tokenizer is to recognize web things like
- email addresses
- URLs
- twitter @handles and #hashtags

This implementation uses regexp exceptions.  There will most
likely be endless debate about the regular expressions. These
were chosen as "good enough for now".

There is also a "web" analyzer.  This is just the "standard"
analyzer, but using the "web" tokenizer instead of the "unicode"
one.  NOTE: after processing the exceptions, it still falls back
to the standard "unicode" one.

For many users, you can simply set your mapping's default analyzer
to be "web".

closes #269
2015-11-30 14:27:18 -05:00
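The "exceptions first, then fall back" approach can be sketched as follows. The regular expression, the `fallback` splitter, and `tokenize` are illustrative assumptions, not the patterns or code used by the "web" tokenizer: spans matching an exception pattern are emitted whole, and the text between them goes through an ordinary word splitter.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode"
)

// exceptions is a deliberately simple, hypothetical pattern covering
// URLs, email addresses, @handles and #hashtags.
var exceptions = regexp.MustCompile(
	`https?://\S+|[\w.+-]+@[\w-]+(\.[\w-]+)+|[@#]\w+`)

// fallback stands in for the regular tokenizer: it splits on anything
// that is not a letter or digit.
func fallback(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// tokenize keeps each exception match as a single token and sends the
// text around the matches through the fallback splitter.
func tokenize(input string) []string {
	var out []string
	last := 0
	for _, m := range exceptions.FindAllStringIndex(input, -1) {
		out = append(out, fallback(input[last:m[0]])...)
		out = append(out, input[m[0]:m[1]])
		last = m[1]
	}
	return append(out, fallback(input[last:])...)
}

func main() {
	for _, tok := range tokenize("mail bob@example.com or see #golang") {
		fmt.Println(tok)
	}
}
```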
Marty Schoch
6d851cfcc2 fix bug in warmup which led to docs being deleted 2015-11-30 10:18:14 -05:00
Marty Schoch
aa8d98f5fa include space after prefix in log output 2015-11-30 10:17:48 -05:00
Marty Schoch
68d8742826 correctly prefix internal rows with 'i' and print them in debug 2015-11-30 10:17:15 -05:00
Marty Schoch
17cfe8cff0 Merge branch 'master' into firestorm 2015-11-30 07:25:33 -05:00
Marty Schoch
b2ac05c6d0 support metrics through bleve query 2015-11-30 07:24:31 -05:00
Marty Schoch
c93de9734e fix issues identified by errcheck 2015-11-24 14:32:33 -05:00
Marty Schoch
bbef1980d8 Merge branch 'master' into firestorm 2015-11-24 13:04:36 -05:00
Marty Schoch
808f2c1e43 remove exceptions from errcheck 2015-11-24 12:52:46 -05:00
Marty Schoch
ff11f83842 properly handle errors inside metrics kvstore reporting 2015-11-24 12:52:03 -05:00
Marty Schoch
a707d44e0b Merge branch 'master' into firestorm 2015-11-24 09:44:47 -05:00