
53 Commits

Author SHA1 Message Date
abhinavdangeti 65fed52d0b Do not account IndexReader's size in the query RAM estimate
Since it's just the pointer size of the IndexReader that is
being accounted for while estimating the RAM needed to
execute a search query, get rid of the Size() API in the
IndexReader interface.
2018-03-15 13:23:58 -07:00
abhinavdangeti 7e36109b3c MB-28162: Provide API to estimate memory needed to run a search query
This API (unexported) will estimate the amount of memory needed to execute
a search query over an index before the collector begins data collection.

Sample estimates for certain queries:
{Size: 10, BenchmarkUpsidedownSearchOverhead}
                                                           ESTIMATE    BENCHMEM
TermQuery                                                  4616        4796
MatchQuery                                                 5210        5405
DisjunctionQuery (Match queries)                           7700        8447
DisjunctionQuery (Term queries)                            6514        6591
ConjunctionQuery (Match queries)                           7524        8175
Nested disjunction query (disjunction of disjunctions)     10306       10708
…
2018-03-06 13:53:42 -08:00
Marty Schoch 272da43c16 phrase searcher don't allow advance after end 2017-12-27 10:24:33 -08:00
Steve Yen c7a342bc7d scorch conjuncts match phrase test passes
The conjunction searcher Advance() method now checks whether its curr
doc-matches suffice before advancing them.
2017-12-23 09:19:40 -08:00
Steve Yen d425a3be86 scorch fix disjunction searcher Advance()
Found with "versus" test (TestScorchVersusUpsideDownBoltSmallMNSAM),
which had a boolean query with a MustNot that was the same as the Must
parameters.  This replicates a situation found by
Aruna/Mihir/testrunner/RQG (MB-27291).  Example:

  "query": {
    "must_not": {"disjuncts": [
      {"field": "body", "match": "hello"}
    ]},
    "must": {"conjuncts": [
      {"field": "body", "match": "hello"}
    ]}
  }

The nested searchers along the MustNot pathway would end up looking
roughly like...

  booleanSearcher
    MustNot
      => disjunctionSearcher
         => disjunctionSearcher
            => termSearcher

On the first Next() call by the collector, the two disjunction
searchers would run through their respective Next() method processing,
which includes their initSearcher() processing on the first time.
This has the effect of driving the leaf termSearcher through two
Next() invocations.

That is, if there were 3 docs (doc-1, doc-2, doc-3), the leaf
termSearcher would at this point have moved to point to doc-3, while
the topmost MustNot would have received doc-1.

Next, the booleanSearcher's Must searcher would produce doc-2, so the
booleanSearcher would try to Advance() the MustNot searcher to doc-2.

But, in scorch, the leafmost termSearcher had already gotten past
doc-2 and would return its doc-3.

In upsidedown, in contrast, the leaf termSearcher would then drive the
KVStore iterator with a Seek(doc-2), and the KVStore iterator would
perform a backwards seek to reach doc-2.

In scorch, however, backwards iteration seeking isn't supported.

So, this fix checks the state of the disjunction searcher to see if we
already have the necessary state so that we don't have to perform
actual Advance()'es on the underlying searchers.  This not only fixes
the behavior w.r.t. scorch, but also can have an effect of potentially
making upsidedown slightly faster as we're avoiding some backwards
KVStore iterator seeks.
2017-12-21 18:20:04 -08:00
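The state check described in d425a3be86 can be illustrated with a much-reduced model. This is only a sketch under stated assumptions: `leaf` and `disjunction` are hypothetical types with a single child and plain `int` doc numbers, not bleve's actual searchers. The point it shows is that a searcher which buffers its child's current doc should answer Advance() from that buffer when the buffer is already at or past the target, because a forward-only child cannot go back.

```go
package main

import "fmt"

// leaf is a hypothetical forward-only iterator over sorted doc numbers,
// standing in for a term searcher that cannot seek backwards.
type leaf struct {
	docs []int
	pos  int
}

func (l *leaf) Next() (int, bool) {
	if l.pos >= len(l.docs) {
		return 0, false
	}
	d := l.docs[l.pos]
	l.pos++
	return d, true
}

// Advance only moves forward; docs already consumed cannot be revisited.
func (l *leaf) Advance(target int) (int, bool) {
	for {
		d, ok := l.Next()
		if !ok || d >= target {
			return d, ok
		}
	}
}

// disjunction (hypothetical, single child) buffers the child's current doc,
// so the child always runs one doc ahead of what the caller has seen.
type disjunction struct {
	child   *leaf
	curr    int
	hasCurr bool
	started bool
}

func (d *disjunction) Next() (int, bool) {
	if !d.started {
		d.curr, d.hasCurr = d.child.Next()
		d.started = true
	}
	doc, ok := d.curr, d.hasCurr
	if ok {
		d.curr, d.hasCurr = d.child.Next() // pre-fetch the following doc
	}
	return doc, ok
}

func (d *disjunction) Advance(target int) (int, bool) {
	// The check: if the buffered state already satisfies the target (or the
	// child is exhausted), answer from it instead of advancing the child.
	if d.started && (!d.hasCurr || d.curr >= target) {
		return d.Next()
	}
	d.curr, d.hasCurr = d.child.Advance(target)
	d.started = true
	return d.Next()
}

func main() {
	d := &disjunction{child: &leaf{docs: []int{1, 2, 3}}}
	fmt.Println(d.Next())     // the child has now been driven through two Next() calls
	fmt.Println(d.Advance(2)) // answered from buffered state, doc 2 is not skipped
}
```

Without the buffered-state check, the second call would forward `Advance(2)` to the child, which has already consumed doc 2 and would return doc 3.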
Steve Yen 33687260ca children of conjunct/disjunct's are not necessarily termSearchers
Rename termSearcher loop variable to searcher, as the child searchers
of a conjunction/disjunction searcher aren't necessarily
termSearchers.
2017-12-21 16:45:43 -08:00
Marty Schoch c048833fcd added stringer method to phrase part
a failing test was producing unhelpful pointer addresses as
the only debug output.  this changes the output to print
the terms and locations as readable text

part of #629
2017-09-01 09:16:08 -04:00
Marty Schoch 4c801f2f01 fix issue with numeric range queries in query string
previously the query string queries were modified to aid in
compatibility with other search systems.  this change:
f391b991c2
has a problem when combined with:
77101ae424
due to the introduction of MatchNoneSearchers being returned
in a case where previously they never would.

the fix for now is to simply return disjunction queries on 0
terms instead.  this ultimately also matches nothing, but avoids
triggering the logic which handles match none searchers in a
special way.
2017-06-06 16:03:05 -04:00
Marty Schoch 77101ae424 filter numeric range terms against the term dictionary
previously, all numeric terms required to implement a numeric
range search were passed to the disjunction query (possibly
exceeding the disjunction clause limit)

now, after producing the list of terms, we filter them against
the terms which actually exist in the term dictionary.  the
theory is that this will often greatly reduce the number of terms
and therefore reduce the likelihood that you would run into the
disjunction term limit in practice.

because the term dictionary interface does not have a seek API
and we're reluctant to add that now, i chose to do a binary
search of the terms, which either finds the term, or not. then
subsequent binary searches can proceed from that position,
since both the list of terms and the term dictionary are sorted.
2017-05-31 13:15:13 -04:00
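The filtering idea in 77101ae424 can be sketched as follows. This is an illustration under assumptions, not the commit's code: `filterExisting` is a hypothetical helper, and it works on sorted `[]string` slices rather than bleve's term dictionary interface and prefix-coded numeric terms. It shows how each binary search can start from where the previous one ended when both lists are sorted.

```go
package main

import (
	"fmt"
	"sort"
)

// filterExisting (hypothetical) keeps only the candidates that appear in dict.
// Both slices must be sorted in ascending order.
func filterExisting(candidates, dict []string) []string {
	var out []string
	start := 0 // everything before dict[start] is smaller than the current candidate
	for _, c := range candidates {
		rest := dict[start:]
		// Binary search only the part of the dictionary not yet ruled out.
		i := sort.SearchStrings(rest, c)
		start += i
		if i < len(rest) && rest[i] == c {
			out = append(out, c)
		}
		if start >= len(dict) {
			break // no dictionary terms left to match
		}
	}
	return out
}

func main() {
	candidates := []string{"a", "c", "d", "z"}
	dict := []string{"b", "c", "d", "e"}
	fmt.Println(filterExisting(candidates, dict)) // [c d]
}
```

Because the search window only shrinks, the surviving term list passed on to the disjunction is never larger than the number of terms that actually exist in the dictionary.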
Marty Schoch 87f693fc57 fix panic in term range search
if min and max are the same term
and the term is in dictionary
and both min and max are set to exclusive
then we would panic attempting to access element -1 of a slice.

now, after trimming the slice, we recheck that the length is > 0
2017-05-05 23:13:04 -04:00
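A minimal sketch of the recheck described in 87f693fc57, assuming a hypothetical helper `trimRange` that applies exclusive bounds by trimming the ends of a sorted term slice (the names and signature are illustrative, not bleve's):

```go
package main

import "fmt"

// trimRange (hypothetical) takes the sorted dictionary terms that fall within
// [min, max] and drops the endpoints when the bounds are exclusive.
func trimRange(terms []string, min, max string, inclusiveMin, inclusiveMax bool) []string {
	if len(terms) == 0 {
		return terms
	}
	if !inclusiveMin && terms[0] == min {
		terms = terms[1:]
		// Recheck: trimming may have emptied the slice, in which case
		// indexing len(terms)-1 below would be element -1 and panic.
		if len(terms) == 0 {
			return terms
		}
	}
	if !inclusiveMax && terms[len(terms)-1] == max {
		terms = terms[:len(terms)-1]
	}
	return terms
}

func main() {
	// min == max, the term exists, and both bounds are exclusive.
	fmt.Println(len(trimRange([]string{"apple"}, "apple", "apple", false, false))) // 0
	fmt.Println(trimRange([]string{"a", "b", "c"}, "a", "c", false, false))        // [b]
}
```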
Marty Schoch 8df8d4e797 fix geo point distance search
there was a bug where if the circle described by the point
distance query crossed the poles, then we incorrectly built
a box around it.  this resulted in incorrect search results.
2017-04-27 17:28:07 -04:00
Marty Schoch 6f62489f21 add option for multi term searcher to skip max disjunction check
- geo searches now use this option and skip the check
- export ComputeGeoTerms for geo debug visualizations
2017-04-04 10:46:57 -04:00
Marty Schoch 1eba5541f2 introduce new query TermRange
The term range query is not often used in full-text queries, but
can be useful when filtering on keyword indexed text terms in
the index.

The JSON syntax to do a TermRange query is the same as for
NumericRange, but the min/max values must be string and not
float64.
2017-03-31 22:04:00 -04:00
Marty Schoch f8fdfebb6c refactor searchers
- TermSearcher has alternate constructor if term is []byte, this can avoid
  copying in some cases.  TermScorer updated to accept []byte term. Also
  removed a few struct fields which were not being used.

- New MultiTermSearcher searches for documents containing any of a list of
  terms.  Current implementation simply uses DisjunctionSearcher.

- Several other searcher constructors now simply build a list of terms and
  then delegate to the MultiTermSearcher
  - NewPrefixSearcher
  - NewRegexpSearcher
  - NewFuzzySearcher
  - NewNumericRangeSearcher

- NewGeoBoundingBoxSearcher and NewGeoPointDistanceSearcher make use of
  the MultiTermSearcher internally, and follow the pattern of returning
  an existing search.Searcher, as opposed to their own wrapping struct.

- Callback filter functions used in NewGeoBoundingBoxSearcher and
  NewGeoPointDistanceSearcher have been extracted into separate functions
  which makes the code much easier to read.
2017-03-31 17:21:46 -04:00
Marty Schoch 6554e9624f geo review comments from sreekanth
also one fix came from steve, i must have forgotten to push that
commit up before merging
2017-03-31 08:41:40 -04:00
Marty Schoch 6507e31787 improved geo searcher unit tests
also added flag for bounding box searcher to optionally not
check boundaries.  this is useful when other searchers are going
to check every point anyway by some other criteria.
2017-03-29 16:57:58 -04:00
Marty Schoch fdbe669fd5 several more items on the geo checklist
- added readme pointing back to lucene origins
- improved documentation of exported methods in geo package
- improved test coverage to 100% on geo package
- added support for parsing geojson style points
- removed some duplicated code in the geo bounding box searcher
2017-03-29 14:21:59 -04:00
Marty Schoch a16efa5e78 add experimental support for indexing/query geo points
New field type GeoPointField, or "geopoint" in mapping JSON.

Currently structs and maps are considered when a mapping explicitly
marks a field as type "geopoint".  Several variants of "lon", "lng", and "lat"
are looked for in map keys, struct field names, or method names.

New query type GeoBoundingBoxQuery searches for documents which have a
GeoPointField indexed with a value that is inside the specified bounding box.

New query type GeoDistanceQuery searches for documents which have a
GeoPointField indexed with a value that is less than or equal to the
specified distance from the specified location.

New sort by method "geo_distance".  Hits can be sorted by their distance
from the specified location.

New geo utility package with all routines ported from Lucene.

New FilteringSearcher, which wraps an existing Searcher, but filters
all hits with a user-provided callback.
2017-03-24 17:22:21 -07:00
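The FilteringSearcher idea from a16efa5e78 can be sketched in a reduced form. Everything here is an assumption for illustration: the `Searcher` interface, `FilterFunc`, and `sliceSearcher` are simplified stand-ins that return string doc IDs, not bleve's real `search.Searcher` interface.

```go
package main

import "fmt"

// Searcher is a simplified, hypothetical stand-in for a searcher interface.
type Searcher interface {
	Next() (id string, ok bool)
}

// FilterFunc decides whether a hit should be kept.
type FilterFunc func(id string) bool

// filteringSearcher wraps an existing Searcher and drops rejected hits.
type filteringSearcher struct {
	child  Searcher
	accept FilterFunc
}

func (f *filteringSearcher) Next() (string, bool) {
	for {
		id, ok := f.child.Next()
		if !ok {
			return "", false // child exhausted
		}
		if f.accept(id) {
			return id, true
		}
		// Rejected by the callback: keep pulling from the child.
	}
}

// sliceSearcher is a trivial child searcher used only for the demo.
type sliceSearcher struct {
	ids []string
	pos int
}

func (s *sliceSearcher) Next() (string, bool) {
	if s.pos >= len(s.ids) {
		return "", false
	}
	id := s.ids[s.pos]
	s.pos++
	return id, true
}

func main() {
	child := &sliceSearcher{ids: []string{"a1", "b2", "a3"}}
	fs := &filteringSearcher{child: child, accept: func(id string) bool { return id[0] == 'a' }}
	for id, ok := fs.Next(); ok; id, ok = fs.Next() {
		fmt.Println(id)
	}
}
```

In the geo case, the callback would presumably be where each candidate point is checked against the bounding box or distance.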
Marty Schoch 2ba915b929 add additional parens to clarify logic 2017-02-10 20:22:32 -05:00
Marty Schoch c6085d8cdc address initial code review comments 2017-02-10 15:22:14 -05:00
Marty Schoch 09d00829db phrase searcher now supports multi-phrase
backwards compatibility maintained through previous constructor
very basic test added (not sufficient)
2017-02-10 15:17:50 -05:00
Marty Schoch 9c8e1e82de add initial low-level support for multi-phrase
this adds basic multi-phrase support,
a shim to keep the top-level working
and unit tests for new multi-phrase cases
2017-02-10 13:16:05 -05:00
Marty Schoch 4e38c49287 move phrase search logic into phrase searcher
the logic of how a phrase search works should be an internal
detail of the phrase searcher.  further, these changes will
allow proper scoring of phrase matches, which require access
to the underlying searcher objects, which were hidden in the
previous approach.
2017-02-10 12:05:01 -05:00
Marty Schoch 8096d9fb90 remove use of float64 to represent int things
this originated from a misunderstanding of mine going back
several years.  the values need not be float64 just because
we plan to serialize them as json.

there are still larger questions about what the right type should
be, and where should any conversions go.  but, this commit
simply attempts to address the most egregious problems
2017-02-09 20:15:59 -05:00
Marty Schoch 232fc80dad add support for phrase slop to internals of phrase searcher
phrase slop is not yet supported on the frontend
added lots of tests around slop
2017-02-09 15:59:51 -05:00
Marty Schoch f82638c117 refactor phrase search to be recursive
a more correct solution that will enable us to extend in two
important ways:

1) support slop
2) support multi-phrase
2017-02-03 16:05:21 -05:00
Marty Schoch 12a7257b5f remove duplicate code suggested by review from @steveyen 2017-01-31 15:12:06 -05:00
Marty Schoch 7fd8aeb50a refactor phrase search into separate methods
at the core, the Next() method moves another searcher forward
and checks each hit to see if it also satisfies the phrase
constraints.  the current implementation has 4 nested for loops.
these nested loops make it harder to read (indentation) and harder
to reason about (complexity).

this refactor does not remove any loops, it simply moves some
of the inner loops into separate methods so that one can
more easily reason about the parts separately.
2017-01-31 13:32:46 -05:00
Marty Schoch b55c9043b9 improve performance of regular expression and wildcard queries
While researching an observed performance issue with wildcard
queries, it was observed that the LiteralPrefix() method on
the regexp.Regexp struct did not always behave as expected.

In particular, when the pattern starts with ^, AND involves
some backtracking, the LiteralPrefix() seems to always be the
empty string.

The side-effect of this is that we rely on having a helpful
prefix, to reduce the number of terms in the term dictionary
that need to be visited.

This change now makes the searcher enforce start/end on the term
directly, by using FindStringIndex() instead of Match().
Next, we also modified WildcardQuery and RegexpQuery to no
longer include the ^ and $ modifiers.

Documentation was also updated to instruct users that they should
not include the ^ and $ modifiers in their patterns.
2017-01-18 16:22:16 -05:00
Marty Schoch 8cd6040b63 Merge pull request #512 from steveyen/master
API change: optional SearchRequest.IncludeLocations flag
2017-01-09 14:19:17 -05:00
Steve Yen 89a1cefde1 API change: optional SearchRequest.IncludeLocations flag
This is a change in search result behavior in that location
information is no longer provided by default with search results.

Although this looks like a wide-ranging change, it's mostly a
mechanical replacement of the explain bool flag with a new
search.SearcherOptions struct, which holds both the Explain bool flag
and the IncludeTermVectors bool flag.
2017-01-05 21:11:22 -08:00
Silvan Jegen 1a6a4c493b Check locations in the phrase searcher as well 2016-11-08 20:05:36 +01:00
Silvan Jegen 33e2432fc6 Initialize the return value as late as possible 2016-11-08 20:05:36 +01:00
Silvan Jegen 3dd363afaa Don't search the same term twice
We have searched for the first term in the phrase query already so we
can skip it. Before doing so we have to add the location of the first
term.
2016-11-08 20:05:04 +01:00
Silvan Jegen d87b4f88bf Refactor phrase searching
Reduce nesting by using early continues.
2016-11-08 20:04:28 +01:00
Steve Yen adc409e823 optimize NewRegexpSearcher to return its disjunction searcher
This minor optimization removes an unnecessary wrapper around the
disjunction searcher.
2016-10-27 13:16:41 -07:00
Steve Yen 58c3b5c9b8 revert optimization that trims search-disjunction child searchers
This commit reverts a previous optimization attempt 3f588cd4a that
tried to trim or shrink the array of child searchers in a
search-disjunction.

I am not sure why at the moment, but that optimization broke
higher-level boolean queries, so I am reverting it so that
functionality is restored.
2016-10-18 14:38:34 -07:00
Marty Schoch 5c7a2264a2 Merge pull request #473 from steveyen/reuse-incrementBytes-in-moss-kv-integration
reuse incrementBytes() in moss KV store integration
2016-10-13 14:03:46 +02:00
Marty Schoch cee18d302e Merge pull request #475 from steveyen/phrase-searcher-simplifications-dry
some simplification / DRY for phrase searcher
2016-10-12 23:07:35 +02:00
Steve Yen 1a994ce2a7 end fuzzy searcher prefixTerm construction loop early 2016-10-12 09:51:36 -07:00
Steve Yen 6a38fa3719 go fmt 2016-10-12 09:39:43 -07:00
Steve Yen 8230a7195f some simplification / DRY for phrase searcher 2016-10-12 09:26:31 -07:00
Marty Schoch bddc064069 Merge pull request #471 from steveyen/remove-extra-indirection-LevenshteinDistance
removed extra level of pointer indirection from LevenshteinDistance()'s params
2016-10-12 14:05:34 +02:00
Marty Schoch 483f06ef5b Merge pull request #467 from steveyen/optimize-disjunction-searcher-shrink-children
optimize disjunction searcher to trim child searchers array earlier
2016-10-12 14:00:19 +02:00
Marty Schoch b76cbc805e Merge pull request #465 from steveyen/cleanup-when-PrefixSearcher-error
close resources when we encounter an error on PrefixSearcher initialization
2016-10-12 13:39:28 +02:00
Steve Yen b6c97ddbfe removed extra ptr indirection from LevenshteinDistance 2016-10-11 08:49:10 -07:00
Steve Yen 3f588cd4ae optimize disjunction searcher to trim child searchers array earlier
Disjunction searchers are used heavily by higher-level searchers, like
prefix searchers.  In that case, a disjunction searcher might have
many thousands of child searchers.

This commit adds an optimization to close each child term searcher as
soon as a child searcher is finished and remove it from the
disjunction searcher's children.
2016-10-10 22:47:11 -07:00
Steve Yen 535b746b41 close resources when error on PrefixSearcher initialization 2016-10-10 17:29:59 -07:00
Steve Yen 2a022830f0 check FieldDictPrefix err result in prefix searcher 2016-10-10 15:35:54 -07:00
Marty Schoch 8e784c362b another golint suggestion 2016-10-02 11:54:04 -04:00