jcsherin

Dremel: Interactive Analysis of Web-Scale Datasets

columnar storage

Normalizing and recombining such data at web scale is usually prohibitive.

Dremel uses a column-striped storage representation, which enables it to
read less data from secondary storage and reduce CPU cost due to cheaper
compression.

Nested Columnar Storage

Challenges:

  1. Lossless representation of record structure in columnar format
  2. fast encoding
  3. efficient reassembly

Values alone do not convey the structure of a record. Given two values of a
repeated field, we do not know at what ‘level’ the value repeated (e.g.,
whether
these values are from two different records, or two repeated values in the
same
record). Likewise, given a missing optional field, we do not know which
enclosing records were defined explicitly. (p. 3)

Repetition Levels

It tells us at what repeated field in the field’s path the value has
repeated. (p. 3)

In general though, determining the level up to which nested records exist
requires extra information.

Definition Levels

Each value of a field with path p, esp. every NULL, has a definition level
specifying how many fields in p that could be undefined (because they are
optional or repeated) are actually present in the record.

We use integer definition levels as opposed to is-null bits so that the data
for a leaf field (e.g., Name.Language.Country) contains the information about
the occurrences of its parent fields; (p. 3)

Encoding

Splitting Records into Columns

Record Assembly

Not Particularly Interesting To Me (So maybe)