
remove junk from end of scorch readme

Marty Schoch 2018-01-06 21:09:53 -05:00
parent 0456569b62
commit 1788a03803
1 changed file with 0 additions and 143 deletions


@@ -365,146 +365,3 @@ A few simple principles have been identified.
- Segments with all items deleted/obsoleted can be dropped.
- Merging of a segment should be able to proceed even if that segment is held by an ongoing snapshot; it should only delay its removal.

## TODO

- need reference counting on the segments, to know when we can safely remove?
- how well will bitmaps perform when large and possibly mmap'd?

-----

thinking out loud on storage

- fields
  - field name - field id
- term dictionary
  - field id - FST (values are posting ids)
- postings
  - posting id - postings list
- freqs
  - posting id - freqs list
- norms
  - posting id - norms list
- stored
  - docNum
    - field id - field values
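
Purely as an illustration (not scorch's actual types; all names below are made up), the mappings above could be shaped roughly like this in Go:

```go
// Illustrative only: the shape of the mappings sketched above.
package scorchnotes

import "github.com/couchbase/vellum"

type storageSketch struct {
	fields   map[string]uint16              // field name - field id
	dicts    map[uint16]*vellum.FST         // field id - FST (term - posting id)
	postings map[uint64][]uint64            // posting id - postings list (doc numbers)
	freqs    map[uint64][]uint64            // posting id - freqs list
	norms    map[uint64][]float32           // posting id - norms list
	stored   map[uint64]map[uint16][][]byte // docNum - field id - field values
}
```
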
----

race dialog with steve:

state: 2, 4, 8

- introducing new segment X
- deleted bitmasks: 2, 4, 8
- merger: merge 4 and 8 into new segment Y

merger wins:

state: 2, 9

introducer: needs to recompute the bitmask for 9, and could lose again and keep losing the race

introducer wins:

state: 2, 4, 8, X
2-X, 4-X, 8-X, nil

merger finishes: new segment Y is not valid, needs to be recomputed

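
To make the race concrete, here is a hypothetical sketch (not the scorch implementation; every name below is made up) of the single-writer idea: one introducer goroutine owns the segment list, so introduce and merge requests are serialized, and whichever side loses simply recomputes against the winner's result.

```go
// Hypothetical sketch of serializing the race: a single goroutine owns the
// segment list, so whichever request arrives second sees the winner's changes.
package scorchnotes

type segmentRef struct {
	id      uint64
	deleted []uint64 // stand-in for a roaring bitmap of deleted doc numbers
}

type introduction struct {
	newSeg *segmentRef // e.g. X, with deleted bitmasks computed against 2, 4, 8
	done   chan struct{}
}

type mergeResult struct {
	replaced []uint64    // e.g. 4 and 8
	merged   *segmentRef // e.g. Y
	done     chan struct{}
}

type introducer struct {
	introductions chan *introduction
	merges        chan *mergeResult
	segments      []*segmentRef // only this goroutine touches it
}

func (i *introducer) run() {
	for {
		select {
		case intro := <-i.introductions:
			// introducer wins: its bitmasks were computed against the
			// current list, so they are still valid.
			i.segments = append(i.segments, intro.newSeg)
			close(intro.done)
		case m := <-i.merges:
			// merger wins: swap the replaced segments for the merged one.
			// A later introduction must recompute its bitmask against it.
			keep := i.segments[:0]
			for _, s := range i.segments {
				if !containsID(m.replaced, s.id) {
					keep = append(keep, s)
				}
			}
			i.segments = append(keep, m.merged)
			close(m.done)
		}
	}
}

func containsID(ids []uint64, id uint64) bool {
	for _, x := range ids {
		if x == id {
			return true
		}
	}
	return false
}
```
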
### Bolt Segment Proposal

Bucket

    "f" field storage
        Key                      Val
        field name               field id (var uint16)
        // TODO field location bits

    "d" term dictionary storage
        Key                      Val
        field id (var uint16)    Vellum FST (mapping term to posting id uint64)

    "p" postings list storage
        Key                      Val
        posting id (var uint64)  Roaring Bitmap Serialization (doc numbers) - see FromBuffer

    "x" chunked data storage
        Key                      Val
        chunk id (var uint64)    sub-bucket
            Key                      Val
            posting id (var uint64)  sub-bucket
                ALL values use Compressed Integer Encoding of []uint64
                Key    Val
                "f"    freqs      1 value per hit
                "n"    norms      1 value per hit
                "i"    fields     <freq> values per hit
                "s"    start      <freq> values per hit
                "e"    end        <freq> values per hit
                "p"    pos        <freq> values per hit
                "a"    array pos  <freq> entries; each entry is a count
                                  followed by <count> uint64

    "s" stored field data
        Key                      Val
        doc num (var uint64)     sub-bucket
            Key    Val
            "m"    mossy-like meta, packed:
                       16 bits - field id
                       8 bits  - field type
                       2? bits - array pos length
                       X bits  - offset
                       X bits  - length
            "d"    raw []byte data (possibly compressed, need segment level config?)
            "a"    array position info, packed slice uint64
Notes:

It is assumed that each IndexReader (snapshot) starts a new Bolt read-only TX immediately, and holds it open until it is no longer needed. This allows us to (unsafely) use the raw bytes coming out of BoltDB as return values. Bolt guarantees they remain valid for the duration of the transaction (which we arrange to be the life of the index snapshot).
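
For instance, the long-lived read-only transaction could be held like this (a sketch; the snapshot type and its methods are hypothetical):

```go
// Sketch of the long-lived read-only transaction idea: the snapshot owns a
// bolt.Tx and rolls it back only when the snapshot itself is released.
package scorchnotes

import "github.com/boltdb/bolt"

type indexSnapshot struct {
	tx *bolt.Tx // read-only; keeps mmap'd pages valid for returned []byte
}

func newIndexSnapshot(db *bolt.DB) (*indexSnapshot, error) {
	tx, err := db.Begin(false) // false = read-only transaction
	if err != nil {
		return nil, err
	}
	return &indexSnapshot{tx: tx}, nil
}

// Close releases the snapshot; only now do byte slices handed out from the
// transaction become invalid.
func (s *indexSnapshot) Close() error {
	return s.tx.Rollback()
}
```
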
Only physically store the field name - field id mapping in one direction, even though at runtime we need both directions. Upon opening the index, we read in all the k/v pairs in the "f" bucket and use the unsafe package to create a []string inverted mapping that points at the underlying []byte BoltDB values.
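
A sketch of building that inverted mapping (illustrative; it assumes the field id value is a uvarint and that the read-only TX stays open for as long as the []string is used):

```go
// Illustrative: build field id -> field name without copying bolt's bytes.
package scorchnotes

import (
	"encoding/binary"
	"unsafe"

	"github.com/boltdb/bolt"
)

func loadFieldNames(tx *bolt.Tx) []string {
	var fields []string
	b := tx.Bucket([]byte("f"))
	if b == nil {
		return fields
	}
	_ = b.ForEach(func(k, v []byte) error {
		fieldID, _ := binary.Uvarint(v) // value is the var uint16 field id
		for uint64(len(fields)) <= fieldID {
			fields = append(fields, "")
		}
		// Point the string at bolt's mmap'd bytes directly; only valid
		// while the read-only transaction stays open.
		fields[fieldID] = *(*string)(unsafe.Pointer(&k))
		return nil
	})
	return fields
}
```
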
The term dictionary is stored opaquely as a Vellum FST per field. When accessing these keys, the []byte returned to us is mmap'd by bolt under the hood. We then pass it to vellum using its []byte API, which operates on it without ever forcing the whole thing into memory unless needed.
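
A sketch of the lookup path (vellum.Load and FST.Get are vellum's real []byte API; the surrounding helper and key encoding are assumptions):

```go
// Illustrative: look up one term's posting id via the per-field FST.
package scorchnotes

import (
	"encoding/binary"

	"github.com/boltdb/bolt"
	"github.com/couchbase/vellum"
)

func termPostingID(tx *bolt.Tx, fieldID uint16, term []byte) (uint64, bool, error) {
	key := make([]byte, binary.MaxVarintLen16)
	n := binary.PutUvarint(key, uint64(fieldID))
	b := tx.Bucket([]byte("d"))
	if b == nil {
		return 0, false, nil
	}
	fstBytes := b.Get(key[:n])
	if fstBytes == nil {
		return 0, false, nil
	}
	// vellum.Load works directly on the mmap'd []byte bolt returned.
	fst, err := vellum.Load(fstBytes)
	if err != nil {
		return 0, false, err
	}
	return fst.Get(term)
}
```
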
We do not need to persist the dictkeys slice, since it is only there to support the dictionary iterator prefix/range searches, which are supported directly by the FST.

Theory of operation of chunked storage is as follows. The postings list iterators only allow starting at the beginning and have no "advance" capability. In the memory version, this means we always know the Nth hit in the postings list is the Nth entry in some other densely packed slice. While that is OK when everything is in RAM, it is not as suitable for a structure on disk, where wading through detailed info of records you don't care about is too expensive. Instead, we assume some fixed chunking, say 1024: all detailed info for document number N can be found inside chunk N/1024. Now, the Advance operation still has to Next its way through the postings list, but when it reaches a hit, it knows the chunk index as well as the hit index inside that chunk. Further, we push the chunk offsets to the top of the bolt structure, under the theory that we're likely to access data inside a chunk at the same time. For example, you're likely to access the frequency and norm values for a document hit together, so by organizing by chunk first, we increase the likelihood that this info is nearby on disk.
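
A tiny sketch of that chunk bookkeeping (the 1024 chunk size comes from the text above; the iterator shape is hypothetical):

```go
// Illustrative chunk arithmetic for the scheme described above.
package scorchnotes

const chunkSize = 1024

// locate walks the postings (doc numbers in ascending order, as an iterator
// would yield them) until it reaches target, tracking which chunk we are in
// and the hit's index inside that chunk, so the per-chunk "f"/"n" slices can
// be consulted without decoding unrelated chunks.
func locate(postings []uint64, target uint64) (chunk uint64, hitInChunk int, found bool) {
	for _, docNum := range postings {
		c := docNum / chunkSize
		if c != chunk {
			chunk = c
			hitInChunk = 0
		}
		if docNum == target {
			return chunk, hitInChunk, true
		}
		if docNum > target {
			return 0, 0, false
		}
		hitInChunk++
	}
	return 0, 0, false
}
```
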
The "f" and "n" sub-buckets inside a posting have 1 entry for each hit. (you must next-next-next within the chunk)
The "i", "s", "e", "p", sub-buckets have <freq> entries for each hit. (you must have read and know the freq)
The "a" sub-bucket has <freq> groupings, where each grouping starts with a count, followed by <count> entries.
For example, let's say hit docNum 27 has a freq of 2. The first location for the hit has array positions (0, 1), length 2, and the second location for the hit has array positions (1, 3, 2), length 3. The entries in the slice for this hit look like:

    2 0 1 3 1 3 2
    ^     ^
    |     next entry, number of ints to follow for it
    number of ints to follow for this entry
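
Decoding that layout is just "read a count, then read `<count>` positions", repeated `<freq>` times; a sketch (hypothetical helper, not scorch's API):

```go
// Illustrative decoder for the "a" (array positions) slice of one hit.
package scorchnotes

// decodeArrayPositions consumes freq groupings from data, where each grouping
// is a count followed by that many uint64 positions.
func decodeArrayPositions(data []uint64, freq int) [][]uint64 {
	out := make([][]uint64, 0, freq)
	i := 0
	for h := 0; h < freq && i < len(data); h++ {
		count := int(data[i])
		i++
		end := i + count
		if end > len(data) {
			end = len(data)
		}
		out = append(out, data[i:end])
		i = end
	}
	return out
}
```

For the example above, `decodeArrayPositions([]uint64{2, 0, 1, 3, 1, 3, 2}, 2)` yields `[[0 1] [1 3 2]]`.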