Previously bleve allowed you to create a memory-only index by
simply passing "" as the path argument to the New() method.
This was not clear when reading the code, and led to some
problematic error cases as well.
Now, to create a memory-only index one should use the
NewMemOnly() method. Passing "" as the path argument
to the New() method will now return os.ErrInvalid.
Advanced users calling NewUsing() can create disk-based or
memory-only indexes, but the change here is that passing ""
as the path argument no longer defaults you into getting
a memory-only index. Instead, the KV store is selected
manually, just as it is for the disk-based solutions.
Here is an example use of the NewUsing() method to create
a memory-only index:
NewUsing("", indexMapping, Config.DefaultIndexType,
Config.DefaultMemKVStore, nil)
Config.DefaultMemKVStore is just a new default value
added to the configuration. It currently points to
gtreap.Name (which could have been used directly
instead for more control).
closes #427
the default configuration, which sets the default kv engine
to boltdb, is now done in a file protected with the !appengine
build tag. this at least lets the analysis-wizzard app
run locally in the appengine simulator.
this still has not been tested on the real appengine, and further
changes may be required.
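a minimal sketch of what such a build-tag protected file could look like.
the variable name defaultKVStore is hypothetical and only stands in for the
real configuration value. the //go:build line is the current spelling of
the constraint; older toolchains used the "// +build !appengine" form.

```go
//go:build !appengine

package main

// defaultKVStore is a hypothetical package-level default used only in
// this sketch; it stands in for the real configuration field.
var defaultKVStore string

func init() {
	// This file is only compiled when the appengine tag is absent, so
	// appengine builds never pick up the boltdb default from here.
	defaultKVStore = "boltdb"
}
```

a sibling file guarded by the opposite constraint could set a different
default for appengine builds.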
parsing of date ranges in queries no longer consults the
index mapping. it was determined that this wasn't very useful
and led to overly complicated query syntax/behavior.
instead, applications can set the datetime parser used for
date range queries with the top-level config QueryDateTimeParser.
also, we now support querying date ranges in the query string,
the syntax is:
field:>"date"
the >, >=, <, and <= operators are supported
the date must be surrounded by quotes
and must parse in the configured date format
this lays the foundation for supporting the new firestorm
indexing scheme. i'm merging these changes ahead of
the rest of the firestorm branch so i can continue
to make changes to the analysis pipeline in parallel
refactor to share code in emulated batch
refactor to share code in emulated merge
refactor index kvstore benchmarks to share more code
refactor index kvstore benchmarks to be more repeatable
name "exception"
configure "exceptions" with a list of regexp strings.
these exceptions are regexps that match sequences you want treated
as a single token. these sequences are NOT sent to the
underlying tokenizer
configure "tokenizer" with the name of the tokenizer that should be
used for processing all text regions not matching exceptions
An example configuration with simple patterns to match URLs and
email addresses:
map[string]interface{}{
    "type":      "exception",
    "tokenizer": "unicode",
    "exceptions": []interface{}{
        `[hH][tT][tT][pP][sS]?://(\S)*`,
        `[fF][iI][lL][eE]://(\S)*`,
        `[fF][tT][pP]://(\S)*`,
        `\S+@\S+`,
    },
}
1. text analysis is now done before the write lock is acquired
2. there is now a pool of analysis workers
3. the size of this pool is configurable
4. this allows for documents in a batch to be analyzed concurrently
as a part of benchmarking these changes i've also introduced a new
null storage implementation. this should never be used, as it
does not actually build an index. it does however let us go
through all the normal indexing machinery, without incurring
any indexing I/O. this is very helpful in measuring improvements
made to the text analysis pipeline, which are often overshadowed
by indexing times in benchmarks actually building an index.
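the idea of a null store can be sketched in a few lines. the kvStore
interface below is a hypothetical, heavily reduced stand-in for a real
store interface, and the write counter is an addition of this sketch so
that a benchmark can confirm the indexing machinery actually ran.

```go
package main

// kvStore is a hypothetical minimal store interface for this sketch.
type kvStore interface {
	Set(key, val []byte) error
	Get(key []byte) ([]byte, error)
}

// nullStore accepts every write and discards it, so nothing is ever
// stored and no I/O happens.
type nullStore struct {
	writes int // number of Set calls seen, for benchmark sanity checks
}

// Compile-time check that nullStore satisfies the interface.
var _ kvStore = (*nullStore)(nil)

// Set drops the data and only counts the call.
func (n *nullStore) Set(key, val []byte) error {
	n.writes++
	return nil
}

// Get never finds anything, because nothing was stored.
func (n *nullStore) Get(key []byte) ([]byte, error) {
	return nil, nil
}
```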
many of these defaults were arbitrary, and not having
defaults lets us more easily flag them for configuration
added a shingle filter
introduce new token type for shingles
now, as part of your index mapping you can create custom
analysis components. these custom analysis components
are serialized as part of the mapping, and reused
as you would expect on subsequent accesses.
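a sketch of what "serialized as part of the mapping" can mean in practice,
using only encoding/json. the indexMapping struct, its single field and its
JSON key are hypothetical and far smaller than a real mapping.

```go
package main

import "encoding/json"

// indexMapping is a hypothetical, heavily reduced mapping that carries
// custom tokenizer configurations keyed by name.
type indexMapping struct {
	CustomTokenizers map[string]map[string]interface{} `json:"tokenizers"`
}

// saveMapping serializes the mapping, custom components included.
func saveMapping(m indexMapping) ([]byte, error) {
	return json.Marshal(m)
}

// loadMapping restores the mapping so the same custom components are
// available on a later access.
func loadMapping(data []byte) (indexMapping, error) {
	var m indexMapping
	err := json.Unmarshal(data, &m)
	return m, err
}
```

because the component configuration is plain data, a save followed by a
load gives back the same named components without any extra registration.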
this started initially to relocate highlighting into
a self contained package, which would then also use
the registry
however, it turned into a much larger refactor in
order to avoid cyclic imports
now facets, searchers, scorers and collectors
are also broken out into subpackages of search
by default we now use the pure go boltdb kv store
it is less tested at this point, but it appears to work:
tests pass, and it moves us closer to the goal of being
able to just "go get" bleve
New is now used to create new indexes
Open is used to open existing indexes
calls to Open no longer specify a mapping because the mapping
is serialized and stored along with the index
ultimately this makes it more convenient for us to wire up
different elements of the analysis pipeline, without having to
preload everything into memory before we need it
separately the index layer now has a mechanism for storing
internal key/value pairs. this is expected to be used to
store the mapping, and possibly other pieces of data by the
top layer, but not exposed to the user at the top.
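a rough sketch of how internal key/value pairs could sit next to the
regular index rows and carry the mapping. everything here is hypothetical:
the index type, the "_mapping" key, and the create/open helpers are
illustrative names, and plain maps stand in for the real storage.

```go
package main

import "errors"

// mappingKey is a hypothetical internal key under which the serialized
// mapping is stored.
const mappingKey = "_mapping"

// index keeps internal pairs in a separate keyspace from index rows, so
// they never show up as user-visible data.
type index struct {
	rows     map[string][]byte
	internal map[string][]byte
}

func (i *index) setInternal(key string, val []byte) { i.internal[key] = val }

func (i *index) getInternal(key string) ([]byte, bool) {
	v, ok := i.internal[key]
	return v, ok
}

// create builds a new index and stores the serialized mapping with it.
func create(mapping []byte) *index {
	idx := &index{rows: map[string][]byte{}, internal: map[string][]byte{}}
	idx.setInternal(mappingKey, mapping)
	return idx
}

// open needs no mapping argument: it reads the stored one back.
func open(idx *index) ([]byte, error) {
	m, ok := idx.getInternal(mappingKey)
	if !ok {
		return nil, errors.New("index has no stored mapping")
	}
	return m, nil
}
```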