0
0
Commit Graph

180 Commits

Author SHA1 Message Date
Marty Schoch
27ba6187bc adds support for more complex field sorts with object (not string)
previously from JSON we would just deserialize strings like
"-abv" or "city" or "_id" or "_score" as simple sorts
on fields, ids or scores respectively

while this is simple and compact, it can be ambiguous (for
example if you have a field starting with - or if you have a field
named "_id" already.  also, this simple syntax doesnt allow us
to specify more cmoplex options to deal with type/mode/missing

we keep support for the simple string syntax, but now also
recognize a more expressive syntax like:

{
  "by": "field",
  "field": "abv",
  "desc": true,
  "type": "string",
  "mode": "min",
  "missing": "first"
}

type, mode and missing are optional and default to
"auto", "default", and "last" respectively
2016-08-17 14:33:51 -07:00
Marty Schoch
750e0ac16c change sort field impl to use indexed values not stored values 2016-08-17 09:20:44 -07:00
Marty Schoch
d7405a4d79 updated attempt to reuse []byte
previous attempt was flawed (but maked by Reset() method)
new approach is to do this work in the Reset() method itself,
logically this is where it belongs.

but further we acknowledge that IndexInternalID []byte lifetime
lives beyond the TermFieldDoc, so another copy is made into
the DocumentMatch.  Although this introduces yet another copy
the theory being tested is that it allows each of these
structuress to reuse memory without additional allocation.
2016-08-03 17:01:27 -04:00
Marty Schoch
89d83cb5a1 reuse memory already allocated for copies of docids
when the term field reader is copying ID values out of the
kv store's iterator, it is already attempting to reuse the
term frequency row data structure.  this change allows us
to also attempt to reuse the []byte allocated for previous
copies of the docid.  we reset the slice length to zero
then copy the data into the existing slice, avoiding
new allocation and garbage collection in the cases where
there is already enough space
2016-08-03 13:45:48 -04:00
Marty Schoch
36de4a7097 cleaner fix for the TermFrequencyRow reuse bug
reset to nil first, let remaining logic work as before
2016-08-01 17:17:29 -04:00
Marty Schoch
cfce9c5fc5 initialize term vector list in parseV
otherwise reusing previous term frequency row causes us to
keep tacking on to one gigantic list
2016-08-01 17:01:34 -04:00
Marty Schoch
172ca7e69e need to copy the doc ID for it to survive past next iteration 2016-08-01 17:01:04 -04:00
Marty Schoch
1aacd9bad5 changed approach
IndexInternalID is now []byte
this is still opaque, and should still work for any future
index implementations as it is a least common denominator
choice, all implementations must internally represent the
id as []byte at some point for storage to disk
2016-08-01 14:26:50 -04:00
Marty Schoch
5aa9e95468 major refactor of index/search API
index id's are now opaque (until finally returned to top-level user)
 - the TermFieldDoc's returned by TermFieldReader no longer contain doc id
 - instead they return an opaque IndexInternalID
 - items returned are still in the "natural index order"
 - but that is no longer guaranteed to be "doc id order"
 - correct behavior requires that they all follow the same order
 - but not any particular order

 - new API FinalizeDocID which converts index internal ID's to public string ID

 - APIs used internally which previously took doc id now take IndexInternalID
     - that is DocumentFieldTerms() and DocumentFieldTermsForFields()
 - however, APIs that are used externally do not reflect this change
     - that is Document()

 - DocumentIDReader follows the same changes, but this is less obvious
     - behavior clarified, used to iterate doc ids, BUT NOT in doc id order
     - method STILL available to iterate doc ids in range
     - but again, you won't get them in any meaningful order
     - new method to iterate actual doc ids from list of possible ids
         - this was introduced to make the DocIDSearcher continue working

searchers now work with the new opaque index internal doc ids
 - they return new DocumentMatchInternal (which does not have string ID)
scorerers also work with these opaque index internal doc ids
 - they return DocumentMatchInternal (which does not have string ID)
collectors now also perform a final step of converting the final result
 - they STILL return traditional DocumentMatch (with string ID)
 - but they now also require an IndexReader (so that they can do the conversion)
2016-07-31 13:46:18 -04:00
Marty Schoch
47ee69ae82 term field reader supports optionally omitting 3 details
at the time you create the term field reader, you can specify
that you don't need the term freq, the norm, or the term vectors

in that case, the index implementation can choose to not return
them in its subsequently returned values

this is advisory only, some simple implementations may ignore this
and continue to return the values anyway (as the current impl of
upside_down does today)

this change will allow future index implementations the
opportunity to do less work when it isn't required
2016-07-30 10:26:42 -04:00
Steve Yen
4822cff63a optimize Advance() with pre-allocated in-out param
This perf-related change helps the code and API reach more similarity
with the Next() methods, which now take a pre-allocate param.
2016-07-29 14:15:00 -07:00
Steve Yen
3c82086805 optimize upside_down reader & 64-bit struct alignments
The UpsideDownCouchTermFieldReader.Next() only needs the doc ID from
the key, so this change provides a specialized parseKDoc() method for
that optimization.

Additionally, fields in various structs are more 64-bit aligned, in an
attempt to reduce the invocations of runtime.typedmemmove() and
runtime.heapBitsBulkBarrier(), which the go compiler seems to
automatically insert to transparently handle misaligned data.
2016-07-23 10:37:40 -07:00
Steve Yen
5271a0f62b optimize termFieldVectorsFromTermVectors when empty 2016-07-21 11:46:14 -07:00
Steve Yen
b744148449 optimization to actually reuse the TermFrequencyRow 2016-07-21 11:10:49 -07:00
Steve Yen
39d3e2f028 optimize upside_down reader Next() with TermFieldDoc reuse
This optimization changes the index.TermFieldReader.Next() interface
API, adding an optional, pre-allocated *TermFieldDoc parameter, which
can help prevent garbage creation.
2016-07-21 11:10:49 -07:00
Steve Yen
2498ccc913 optimize upside_down reader Next() to reuse TermFrequencyRow
Before this change, upside down's reader would alloc a new
TermFrequencyRow on every Next(), which would be immediately
transformed into an index.TermFieldDoc{}.  This change reuses a
pre-allocated TermFrequencyRow that's a field in the reader.
2016-07-21 11:10:49 -07:00
Steve Yen
68af6aef62 optimize upside_down reader Next() when 0-length term field vectors
From some bleve-query perf profiling, term field vectors appeared to
be alloc'ed, which was unnecessary as term field vectors are disabled
in the bleve-blast/bleve-query tests.
2016-07-21 11:10:49 -07:00
slavikm
fc990bc2d1 Remove the field IDs from outside of the index 2016-07-19 20:42:45 -07:00
slavikm
ce64c17be1 Do field cache only once per search 2016-07-17 16:29:17 -07:00
slavikm
9a9b630a6d Make facets much faster 2016-07-17 15:31:35 -07:00
Marty Schoch
b8a2fbb887 fix data race in bleve batch reuse
Currently bleve batch is build by user goroutine
Then read by bleve gourinte
This is still safe when used correctly
However, Reset() will modify the map, which is now a data race

This fix is to simply make batch.Reset() alloc new maps.
This provides a data-access pattern that can be used safely.
Also, this thread argues that creating a new map may be faster
than trying to reuse an existing one:

https://groups.google.com/d/msg/golang-nuts/UvUm3LA1u8g/jGv_FobNpN0J

Separate but related, I have opted to remove the "unsafe batch"
checking that we did.  This was always limited anyway, and now
users of Go 1.6 are just as likely to get a panic from the
runtime for concurrent map access anyway.  So, the price paid
by us (additional mutex) is not worth it.

fixes #360 and #260
2016-04-08 15:32:13 -04:00
Marty Schoch
7892882519 fix typos 2016-04-02 21:59:30 -04:00
Marty Schoch
194ee82c80 gofmt simplifications 2016-04-02 21:54:33 -04:00
Marty Schoch
3dc64de478 moved fields requiring 64-bit alignment to start of struct
several data structures had a pointer at the start of the struct
on some 32-bit systems, this causes the remaining fields no longer
be aligned on 64-bit boundaries

the fix identifed by @pmezard is to put the counters first in the
struct, which guarantees correct alignment

fixes #359
2016-03-20 10:38:28 -04:00
Steve Yen
be2800a8e4 MB-18715 - moss Merge() didn't bump bufUsed correctly
And, also allocate more memory for both the partial and full merges.
2016-03-15 17:09:40 -07:00
Marty Schoch
d7292ed891 add support for gathering stats via map for easier consumption 2016-03-07 18:37:46 -05:00
Marty Schoch
23a323bc9d add support for numPlainTextBytesIndexed metric 2016-03-05 14:05:08 -05:00
Marty Schoch
81780f97d0 add term search stats 2016-03-05 07:50:25 -05:00
Steve Yen
a29dd25a48 upside_down dict row value size accounts for large uvarint's
This is somewhat unlikely, but if a term is (incredibly) popular, its
uvarint count value representation might go beyond 8 bytes.

Some KVStore implementations (like forestdb) provide a BatchEx cgo
optimization that depends on proper preallocated counting, so this
change provides a proper worst-case estimate based on the max-unvarint
of 10 bytes instead of the previously incorrect 8 bytes.
2016-02-22 11:52:51 -08:00
Marty Schoch
40c95513b7 add support for including kvstore stats 2016-02-05 12:26:19 -05:00
Marty Schoch
c5dea9e882 fix accessing store via Advanced() method which was broken 2016-02-02 11:54:18 -05:00
Marty Schoch
fc34a97875 copy locations on merge for more safe/predictable behavior
fixes #328
2016-01-19 14:21:48 -05:00
Steve Yen
035d9d0e40 unneeded cast and parens 2016-01-17 00:16:05 -08:00
Marty Schoch
1335eb2a7b Merge pull request #322 from steveyen/WIP-perf-20160113
KVReader.MultiGet and KVWriter.NewBatchEx API's
2016-01-15 14:28:59 -05:00
Silvan Jegen
d326898f7b Remove unneeded brackets 2016-01-14 16:41:41 +01:00
Steve Yen
6849e538be upside_down and firestorm use new NewBatchEx() API
With this change, the upside_down batchRows() and firestorm
batchRows() now use the new KVWriter.NewBatchEx() API, which can
improve performance by reducing the number of cgo hops.
2016-01-13 23:08:20 -08:00
Steve Yen
fe39b3fd13 avoid fieldTermFreqs loop if no composite fields 2016-01-13 14:45:04 -08:00
Marty Schoch
af25e724f6 Merge branch 'master' of https://github.com/slavikm/bleve into slavikm-master 2016-01-13 16:10:59 -05:00
Steve Yen
0e72b949b3 upside_down batchRows() takes array of arrays
In order to spend less time in append(), this change in upside_down
(similar to another recent performance change in firestorm) builds up
an array of arrays as the eventual input to batchRows().
2016-01-11 18:11:21 -08:00
slavikm
680be52f87 Implemented boolean field support 2016-01-11 17:18:03 -08:00
Steve Yen
7ce7d98cba upside_down merge dictionary deltas before using batch.Merge()
This change performs more dictionary delta incr/decr math in
batchRows() instead of in the KVStore ExecuteBatch() machinery.
2016-01-11 16:52:07 -08:00
Steve Yen
94273d5fa9 upside_down process internal rows earlier
With this change, internal rows are processed while we're waiting for
backIndex rows to be retrieved.
2016-01-11 16:25:35 -08:00
Steve Yen
bb5cd8f3d6 upside_down merge backIndexRow concurrently
Previously, the code would gather all the backIndexRows before
processing them.  This change instead merges the backIndexRows
concurrently on the theory that we might as well make progress on
compute & processing tasks while waiting for the rest of the back
index rows to be fetched from the KVStore.
2016-01-10 18:50:42 -08:00
Steve Yen
c3b5246b0c upside_down track analysis time tighter; and comments 2016-01-10 15:36:54 -08:00
Steve Yen
d3dd40d334 upside_down retrieves backindex concurrently with analysis
Start backindex reading concurrently with analysi to try to utilize
more I/O bandwidth.

The analysis time vs indexing time stats tracking are also now "off",
since there's now concurrency between those actiivties.

One tradeoff is that the lock area in upside_down Batch() is increased
as part of this change.
2016-01-10 15:18:28 -08:00
Steve Yen
860de28a28 fix memory leak by closing batches in batchRows() 2016-01-07 17:59:42 -08:00
Steve Yen
846912d083 upside_down udc.termVectorsFromTokenFreq rows append optimization 2016-01-07 00:48:34 -08:00
Steve Yen
8b980bd2ef firestorm avoid extra goroutine, similar to upside_down 2016-01-07 00:43:27 -08:00
Steve Yen
fbd0e7bfe9 upside_down backIndexTermEntries precalloc'ed capacity 2016-01-07 00:23:25 -08:00
Steve Yen
4eee8821f9 upside_down storeField/indexField append to provided arrays
Taking another optimization from firestorm, upside_down's
storeField()/indexField() funcs now also append() to passed-in arrays
rather than always allocating their own arrays.
2016-01-07 00:13:46 -08:00