1. text analysis is now done before the write lock is acquired
2. there is now a pool of analysis workers
3. the size of this pool is configurable
4. this allows for documents in a batch to be analyzed concurrently
as a part of benchmarking these changes i've also introduce a new
null storage implementation. this should never be used, as it
does not actualy build an index. it does however let us go
through all the normal indexing machinery, without incuring
any indexing I/O. this is very helpful in measuring improvements
made to the text analsysis pipeline, which are often overshadowed
by indexing times in benchmarks actually building an index.
many of these defaults were arbitrary, and not having
defaults lets us more easily flag them for configuration
added a shingle filter
introduce new toke type for shingles
now, as part of your index mapping you can create custom
analysis components. these custome analysis components
are serialized as part of the mapping, and reused
as you would expect on subsequent accesses.
this started initially to relocate highlighting into
a self contained package, which would then also use
the registry
however, it turned into a much larger refactor in
order to avoid cyclic imports
now facets, searchers, scorers and collectors
are also broken out into subpackages of search
by default we now use the pure go boltdb kv store
it is less tested at this point but appears to work
test pass, and moves us closer to the goal of being
able to just "go get" bleve
New is now used to create new indexes
Open is used to open existing indexes
calls to Open no longer specify a mapping because the mapping
is serialized and stored along with the index
ultimately this is make it more convenient for us to wire up
different elements of the analysis pipeline, without having to
preload everything into memory before we need it
separately the index layer now has a mechanism for storing
internal key/value pairs. this is expected to be used to
store the mapping, and possibly other pieces of data by the
top layer, but not exposed to the user at the top.
introduced concept of byte array converters
right now only wired up to top-level index mapping
allowing the removal of the JSON methods, now at the top level
we default to parsing []byte as JSON, override if thats not
the behavior you want.
future enhancements will allow use of these byte array converters
to control how byte arrays are handled elsewhere in documents
this would allow for handing binary attachments, etc in the future
closes#59
defined article word maps for french, italian, irish and catalan
defined elision filters for these same languages
updated analyers for french and italian to use this new filter
irish and catalan still depend on other missing pieces
closes#25
stop words can be loaded from files/bytes, closes#19
stop words loaded for large list of languages, closes#20
defined language specific analyzers for as much as possible right now, closes#21
opened new issues for some of the remaining gaps
removed analyzers (these are now built as needed through config)
removed html chacter filter (now built as needed through config)
added missing license header
changed constructor signature of filters that cannot return errors
filter constructors that can have errors, now have Must variant which panics
change cdl2 tokenizer into filter (should only see lower-case input)
new top level index api, closes#5
refactored index tests to not rely directly on analyzers
moved query objects to top-level
new top level search api, closes#12
top score collector allows skipping results
index mapping supports _all by default, closes#3 and closes#6
index mapping supports disabled sections, closes#7
new http sub package with reusable http.Handler's, closes#22