606fd6344b
Previously term entries were encoded pairwise (field/term), so you'd have data like:

    F1/T1 F1/T2 F1/T3 F2/T4 F3/T5

As you can see, even though field 1 has 3 terms, we repeat the F1 part in the encoded data. This is a bit wasteful.

In the new format we encode it as a list of terms for each field:

    F1/T1,T2,T3 F2/T4 F3/T5

When fields have multiple terms, this saves space. In unit tests there is no additional waste even when a field has only a single value.

Here are the results of an indexing test case (beer-search):

    $ benchcmp indexing-before.txt indexing-after.txt
    benchmark               old ns/op       new ns/op       delta
    BenchmarkIndexing-4     11275835988     10745514321     -4.70%

    benchmark               old allocs     new allocs     delta
    BenchmarkIndexing-4     25230685       22480494       -10.90%

    benchmark               old bytes      new bytes      delta
    BenchmarkIndexing-4     4802816224     4741641856     -1.27%

And here are the results of a MatchAll search building a facet on the "abv" field:

    $ benchcmp facet-before.txt facet-after.txt
    benchmark             old ns/op     new ns/op     delta
    BenchmarkFacets-4     439762100     228064575     -48.14%

    benchmark             old allocs     new allocs     delta
    BenchmarkFacets-4     9460208        3723286        -60.64%

    benchmark             old bytes     new bytes     delta
    BenchmarkFacets-4     260784261     151746483     -41.81%

Although we expect the index to be smaller in many cases, the beer-search index is about the same size. However, this may be due to the underlying storage (boltdb).

Finally, the index version was bumped from 5 to 7, skipping 6 because smolder also used version 6, which could lead to some confusion.
15 lines
315 B
Protocol Buffer
message BackIndexTermsEntry {
	required uint32 field = 1;
	repeated string terms = 2;
}

message BackIndexStoreEntry {
	required uint32 field = 1;
	repeated uint64 arrayPositions = 2;
}

message BackIndexRowValue {
	repeated BackIndexTermsEntry termsEntries = 1;
	repeated BackIndexStoreEntry storedEntries = 2;
}
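The grouping that BackIndexTermsEntry expresses can be illustrated with a minimal Go sketch. It is not the code from this commit: `termPair`, `termsEntry`, and `groupByField` are hypothetical names, and plain structs stand in for the generated protobuf types. It only shows how pairwise field/term data collapses into one entry per field, so the field number is stored once rather than once per term.

```go
package main

import "fmt"

// termPair is a hypothetical stand-in for the old pairwise (field/term) layout.
type termPair struct {
	Field uint32
	Term  string
}

// termsEntry mirrors the shape of BackIndexTermsEntry: one field, many terms.
type termsEntry struct {
	Field uint32
	Terms []string
}

// groupByField collapses pairwise entries into one entry per field,
// preserving the order in which fields and terms first appear.
func groupByField(pairs []termPair) []termsEntry {
	var entries []termsEntry
	index := make(map[uint32]int) // field -> position in entries
	for _, p := range pairs {
		i, ok := index[p.Field]
		if !ok {
			i = len(entries)
			index[p.Field] = i
			entries = append(entries, termsEntry{Field: p.Field})
		}
		entries[i].Terms = append(entries[i].Terms, p.Term)
	}
	return entries
}

func main() {
	// The example data from the commit message: F1/T1 F1/T2 F1/T3 F2/T4 F3/T5.
	pairs := []termPair{{1, "T1"}, {1, "T2"}, {1, "T3"}, {2, "T4"}, {3, "T5"}}
	for _, e := range groupByField(pairs) {
		fmt.Printf("F%d/%v\n", e.Field, e.Terms)
	}
}
```

With the five pairs above, the grouped form holds three entries instead of five, which is where the space saving for multi-term fields comes from.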