0
0
bleve/index/scorch/segment/zap
2017-12-09 14:28:33 -05:00
..
cmd/zap add initial version of zap file format 2017-12-09 14:28:33 -05:00
build_test.go add initial version of zap file format 2017-12-09 14:28:33 -05:00
build.go add initial version of zap file format 2017-12-09 14:28:33 -05:00
count.go add initial version of zap file format 2017-12-09 14:28:33 -05:00
dict_test.go add initial version of zap file format 2017-12-09 14:28:33 -05:00
dict.go add initial version of zap file format 2017-12-09 14:28:33 -05:00
posting.go add initial version of zap file format 2017-12-09 14:28:33 -05:00
README.md add initial version of zap file format 2017-12-09 14:28:33 -05:00
segment_test.go add initial version of zap file format 2017-12-09 14:28:33 -05:00
segment.go add initial version of zap file format 2017-12-09 14:28:33 -05:00

zap file format

stored fields section

  • for each document
    • preparation phase:
      • produce a slice of metadata bytes and data bytes
      • produce these slices in field id order
      • field value is appended to the data slice
      • metadata slice is govarint encoded with the following values for each field value
        • field id (uint16)
        • field type (byte)
        • field value start offset in uncompressed data slice (uint64)
        • field value length (uint64)
        • field number of array positions (uint64)
        • one additional value for each array position (uint64)
        • compress the data slice using snappy
    • file writing phase:
      • remember the start offset for this document
      • write out meta data length (varint uint64)
      • write out compressed data length (varint uint64)
      • write out the metadata bytes
      • write out the compressed data bytes

stored fields idx

  • for each document
    • write start offset (remembered from previous section) of stored data (big endian uint64)

With this index and a known document number, we have direct access to all the stored field data.

posting details (freq/norm) section

  • for each posting list
    • produce a slice containing multiple consecutive chunks (each chunk is govarint stream)
    • produce a slice remembering offsets of where each chunk starts
    • preparation phase:
      • for each hit in the posting list
      • if this hit is in next chunk close out encoding of last chunk and record offset start of next
      • encode term frequency (uint64)
      • encode norm factor (float32)
    • file writing phase:
      • remember start position for this posting list details
      • write out number of chunks that follow (varint uint64)
      • write out length of each chunk (each a varint uint64)
      • write out the byte slice containing all the chunk data

If you know the doc number you're interested in, this format lets you jump to the correct chunk (docNum/chunkFactor) directly and then seek within that chunk until you find it.

posting details (location) section

  • for each posting list
    • produce a slice containing multiple consecutive chunks (each chunk is govarint stream)
    • produce a slice remembering offsets of where each chunk starts
    • preparation phase:
      • for each hit in the posting list
      • if this hit is in next chunk close out encoding of last chunk and record offset start of next
      • encode field (uint16)
      • encode field pos (uint64)
      • encode field start (uint64)
      • encode field end (uint64)
      • encode number of array positions to follow (uint64)
      • encode each array position (each uint64)
    • file writing phase:
      • remember start position for this posting list details
      • write out number of chunks that follow (varint uint64)
      • write out length of each chunk (each a varint uint64)
      • write out the byte slice containing all the chunk data

If you know the doc number you're interested in, this format lets you jump to the correct chunk (docNum/chunkFactor) directly and then seek within that chunk until you find it.

postings list section

  • for each posting list
    • preparation phase:
      • encode roaring bitmap posting list to bytes (so we know the length)
    • file writing phase:
      • remember the start position for this posting list
      • write freq/norm details offset (remembered from previous, as varint uint64)
      • write location details offset (remembered from previous, as varint uint64)
      • write length of encoded roaring bitmap
      • write the serialized roaring bitmap data

dictionary

  • for each field
    • preparation phase:
      • encode vellum FST with dictionary data pointing to file offset of posting list (remembered from previous)
    • file writing phase:
      • remember the start position of this persistDictionary
      • write length of vellum data (varint uint64)
      • write out vellum data

fields section

  • for each field
    • file writing phase:
      • remember start offset for each field
      • write 1 if field has location info indexed, 0 if not (varint uint64)
      • write dictionary address (remembered from previous) (varint uint64)
      • write length of field name (varint uint64)
      • write field name bytes

fields idx

  • for each field
    • file writing phase:
      • write big endian uint64 of start offset for each field

NOTE: currently we don't know or record the length of this fields index. Instead we rely on the fact that we know it immediately precedes a footer of known size.

  • file writing phase
    • write number of docs (big endian uint64)
    • write stored field index location (big endian uint64)
    • write field index location (big endian uint64)
    • write out chunk factor (big endian uint32)
    • write out version (big endian uint32)
    • write out file CRC of everything preceding this (big endian uint32)