
remove junk from end of scorch readme

Marty Schoch 2018-01-06 21:09:53 -05:00
parent 0456569b62
commit 1788a03803
1 changed file with 0 additions and 143 deletions


@@ -365,146 +365,3 @@ A few simple principles have been identified.
- Segments with all items deleted/obsoleted can be dropped.
- Merging of a segment should be able to proceed even if that segment is held by an ongoing snapshot; it should only delay its removal.

## TODO

- need reference counting on the segments, to know when we can safely remove?
- how well will bitmaps perform when large and possibly mmap'd?

-----

thinking out loud on storage

- fields
  - field name - field id
- term dictionary
  - field id - FST (values are posting ids)
- postings
  - posting id - postings list
- freqs
  - posting id - freqs list
- norms
  - posting id - norms list
- stored
  - docNum
    - field id - field values
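
Purely as an illustration (not scorch's actual types; all names below are made up), the mappings above could be shaped roughly like this in Go:

```go
// Illustrative only: the shape of the mappings sketched above.
package scorchnotes

import "github.com/couchbase/vellum"

type storageSketch struct {
	fields   map[string]uint16              // field name - field id
	dicts    map[uint16]*vellum.FST         // field id - FST (term - posting id)
	postings map[uint64][]uint64            // posting id - postings list (doc numbers)
	freqs    map[uint64][]uint64            // posting id - freqs list
	norms    map[uint64][]float32           // posting id - norms list
	stored   map[uint64]map[uint16][][]byte // docNum - field id - field values
}
```
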
----

race dialog with steve:

state: 2, 4, 8

- introducing new segment X
- deleted bitmasks: 2, 4, 8
- merger: merge 4 and 8 into new segment Y

merger wins:

state: 2, 9

introducer: needs to recompute the bitmask for 9, and could lose again and keep losing the race

introducer wins:

state: 2, 4, 8, X
2-X, 4-X, 8-X, nil

merger finishes: new segment Y is not valid, needs to be recomputed

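
To make the race concrete, here is a hypothetical sketch (not the scorch implementation; every name below is made up) of the single-writer idea: one introducer goroutine owns the segment list, so introduce and merge requests are serialized, and whichever side loses simply recomputes against the winner's result.

```go
// Hypothetical sketch of serializing the race: a single goroutine owns the
// segment list, so whichever request arrives second sees the winner's changes.
package scorchnotes

type segmentRef struct {
	id      uint64
	deleted []uint64 // stand-in for a roaring bitmap of deleted doc numbers
}

type introduction struct {
	newSeg *segmentRef // e.g. X, with deleted bitmasks computed against 2, 4, 8
	done   chan struct{}
}

type mergeResult struct {
	replaced []uint64    // e.g. 4 and 8
	merged   *segmentRef // e.g. Y
	done     chan struct{}
}

type introducer struct {
	introductions chan *introduction
	merges        chan *mergeResult
	segments      []*segmentRef // only this goroutine touches it
}

func (i *introducer) run() {
	for {
		select {
		case intro := <-i.introductions:
			// introducer wins: its bitmasks were computed against the
			// current list, so they are still valid.
			i.segments = append(i.segments, intro.newSeg)
			close(intro.done)
		case m := <-i.merges:
			// merger wins: swap the replaced segments for the merged one.
			// A later introduction must recompute its bitmask against it.
			keep := i.segments[:0]
			for _, s := range i.segments {
				if !containsID(m.replaced, s.id) {
					keep = append(keep, s)
				}
			}
			i.segments = append(keep, m.merged)
			close(m.done)
		}
	}
}

func containsID(ids []uint64, id uint64) bool {
	for _, x := range ids {
		if x == id {
			return true
		}
	}
	return false
}
```
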
### Bolt Segment Proposal

Bucket

    "f" field storage
        Key                      Val
        field name               field id (var uint16)
        // TODO field location bits

    "d" term dictionary storage
        Key                      Val
        field id (var uint16)    Vellum FST (mapping term to posting id uint64)

    "p" postings list storage
        Key                      Val
        posting id (var uint64)  Roaring Bitmap Serialization (doc numbers) - see FromBuffer

    "x" chunked data storage
        Key                      Val
        chunk id (var uint64)    sub-bucket
            Key                      Val
            posting id (var uint64)  sub-bucket
                ALL values use Compressed Integer Encoding of []uint64
                Key    Val
                "f"    freqs      1 value per hit
                "n"    norms      1 value per hit
                "i"    fields     <freq> values per hit
                "s"    start      <freq> values per hit
                "e"    end        <freq> values per hit
                "p"    pos        <freq> values per hit
                "a"    array pos  <freq> entries; each entry is a count
                                  followed by <count> uint64

    "s" stored field data
        Key                      Val
        doc num (var uint64)     sub-bucket
            Key    Val
            "m"    mossy-like meta, packed:
                       16 bits - field id
                       8 bits  - field type
                       2? bits - array pos length
                       X bits  - offset
                       X bits  - length
            "d"    raw []byte data (possibly compressed, need segment level config?)
            "a"    array position info, packed slice uint64
Notes:

It is assumed that each IndexReader (snapshot) starts a new Bolt read-only TX immediately, and holds it open until it is no longer needed. This allows us to (unsafely) use the raw bytes coming out of BoltDB as return values. Bolt guarantees they remain valid for the duration of the transaction (which we arrange to be the life of the index snapshot).
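
For instance, the long-lived read-only transaction could be held like this (a sketch; the snapshot type and its methods are hypothetical):

```go
// Sketch of the long-lived read-only transaction idea: the snapshot owns a
// bolt.Tx and rolls it back only when the snapshot itself is released.
package scorchnotes

import "github.com/boltdb/bolt"

type indexSnapshot struct {
	tx *bolt.Tx // read-only; keeps mmap'd pages valid for returned []byte
}

func newIndexSnapshot(db *bolt.DB) (*indexSnapshot, error) {
	tx, err := db.Begin(false) // false = read-only transaction
	if err != nil {
		return nil, err
	}
	return &indexSnapshot{tx: tx}, nil
}

// Close releases the snapshot; only now do byte slices handed out from the
// transaction become invalid.
func (s *indexSnapshot) Close() error {
	return s.tx.Rollback()
}
```
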
Only physically store the field name - field id mapping in one direction, even though at runtime we need both directions. Upon opening the index, we read in all the k/v pairs in the "f" bucket and use the unsafe package to create a []string inverted mapping that points at the underlying []byte BoltDB values.
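
A sketch of building that inverted mapping (illustrative; it assumes the field id value is a uvarint and that the read-only TX stays open for as long as the []string is used):

```go
// Illustrative: build field id -> field name without copying bolt's bytes.
package scorchnotes

import (
	"encoding/binary"
	"unsafe"

	"github.com/boltdb/bolt"
)

func loadFieldNames(tx *bolt.Tx) []string {
	var fields []string
	b := tx.Bucket([]byte("f"))
	if b == nil {
		return fields
	}
	_ = b.ForEach(func(k, v []byte) error {
		fieldID, _ := binary.Uvarint(v) // value is the var uint16 field id
		for uint64(len(fields)) <= fieldID {
			fields = append(fields, "")
		}
		// Point the string at bolt's mmap'd bytes directly; only valid
		// while the read-only transaction stays open.
		fields[fieldID] = *(*string)(unsafe.Pointer(&k))
		return nil
	})
	return fields
}
```
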
The term dictionary is stored opaquely as a Vellum FST per field. When accessing these keys, the []byte returned to us is mmap'd by bolt under the hood. We then pass it to vellum using its []byte API, which operates on it without ever forcing the whole thing into memory unless needed.
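
A sketch of the lookup path (vellum.Load and FST.Get are vellum's real []byte API; the surrounding helper and key encoding are assumptions):

```go
// Illustrative: look up one term's posting id via the per-field FST.
package scorchnotes

import (
	"encoding/binary"

	"github.com/boltdb/bolt"
	"github.com/couchbase/vellum"
)

func termPostingID(tx *bolt.Tx, fieldID uint16, term []byte) (uint64, bool, error) {
	key := make([]byte, binary.MaxVarintLen16)
	n := binary.PutUvarint(key, uint64(fieldID))
	b := tx.Bucket([]byte("d"))
	if b == nil {
		return 0, false, nil
	}
	fstBytes := b.Get(key[:n])
	if fstBytes == nil {
		return 0, false, nil
	}
	// vellum.Load works directly on the mmap'd []byte bolt returned.
	fst, err := vellum.Load(fstBytes)
	if err != nil {
		return 0, false, err
	}
	return fst.Get(term)
}
```
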
We do not need to persist the dictkeys slice, since it is only there to support the dictionary iterator prefix/range searches, which are supported directly by the FST.

Theory of operation of chunked storage is as follows. The postings list iterators only allow starting at the beginning and have no "advance" capability. In the memory version, this means we always know the Nth hit in the postings list is the Nth entry in some other densely packed slice. While that is OK when everything is in RAM, it is not as suitable for a structure on disk, where wading through detailed info of records you don't care about is too expensive. Instead, we assume some fixed chunking, say 1024: all detailed info for document number N can be found inside chunk N/1024. Now, the Advance operation still has to Next its way through the postings list, but when it reaches a hit, it knows the chunk index as well as the hit index inside that chunk. Further, we push the chunk offsets to the top of the bolt structure, under the theory that we're likely to access data inside a chunk at the same time. For example, you're likely to access the frequency and norm values for a document hit together, so by organizing by chunk first, we increase the likelihood that this info is nearby on disk.
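
A tiny sketch of that chunk bookkeeping (the 1024 chunk size comes from the text above; the iterator shape is hypothetical):

```go
// Illustrative chunk arithmetic for the scheme described above.
package scorchnotes

const chunkSize = 1024

// locate walks the postings (doc numbers in ascending order, as an iterator
// would yield them) until it reaches target, tracking which chunk we are in
// and the hit's index inside that chunk, so the per-chunk "f"/"n" slices can
// be consulted without decoding unrelated chunks.
func locate(postings []uint64, target uint64) (chunk uint64, hitInChunk int, found bool) {
	for _, docNum := range postings {
		c := docNum / chunkSize
		if c != chunk {
			chunk = c
			hitInChunk = 0
		}
		if docNum == target {
			return chunk, hitInChunk, true
		}
		if docNum > target {
			return 0, 0, false
		}
		hitInChunk++
	}
	return 0, 0, false
}
```
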
The "f" and "n" sub-buckets inside a posting have 1 entry for each hit. (you must next-next-next within the chunk)
The "i", "s", "e", "p", sub-buckets have <freq> entries for each hit. (you must have read and know the freq)
The "a" sub-bucket has <freq> groupings, where each grouping starts with a count, followed by <count> entries.
For example, let's say hit docNum 27 has a freq of 2. The first location for the hit has array positions (0, 1), length 2, and the second location for the hit has array positions (1, 3, 2), length 3. The entries in the slice for this hit look like:

    2 0 1 3 1 3 2
    ^     ^
    |     next entry, number of ints to follow for it
    number of ints to follow for this entry
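
Decoding that layout is just "read a count, then read `<count>` positions", repeated `<freq>` times; a sketch (hypothetical helper, not scorch's API):

```go
// Illustrative decoder for the "a" (array positions) slice of one hit.
package scorchnotes

// decodeArrayPositions consumes freq groupings from data, where each grouping
// is a count followed by that many uint64 positions.
func decodeArrayPositions(data []uint64, freq int) [][]uint64 {
	out := make([][]uint64, 0, freq)
	i := 0
	for h := 0; h < freq && i < len(data); h++ {
		count := int(data[i])
		i++
		end := i + count
		if end > len(data) {
			end = len(data)
		}
		out = append(out, data[i:end])
		i = end
	}
	return out
}
```

For the example above, `decodeArrayPositions([]uint64{2, 0, 1, 3, 1, 3, 2}, 2)` yields `[[0 1] [1 3 2]]`.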