jcsherin

Dremel: Interactive Analysis of Web-Scale Datasets

columnar storage

Normalizing and recombining such data at web scale is usually prohibitive.

Dremel uses a column-striped storage representation, which enables it to read less data from secondary storage and reduce CPU cost due to cheaper compression.

Nested Columnar Storage

Challenges:

  1. Lossless representation of record structure in columnar format
  2. fast encoding
  3. efficient reassembly

Values alone do not convey the structure of a record. Given two values of a repeated field, we do not know at what ‘level’ the value repeated (e.g., whether these values are from two different records, or two repeated values in the same record). Likewise, given a missing optional field, we do not know which enclosing records were defined explicitly. (p. 3)

Repetition Levels

It tells us at what repeated field in the field’s path the value has repeated. (p. 3)

In general though, determining the level up to which nested records exist requires extra information.

Definition Levels

Each value of a field with path p, esp. every NULL, has a definition level specifying how many fields in p that could be undefined (because they are optional or repeated) are actually present in the record.

We use integer definition levels as opposed to is-null bits so that the data for a leaf field (e.g., Name.Language.Country) contains the information about the occurrences of its parent fields; (p. 3)

Encoding

Splitting Records into Columns

Record Assembly

Not Particularly Interesting To Me (So maybe)