Working title
In 2010, Google published the Dremel paper showing how to shred nested data into columnar format, delivering 100x speedups on web-scale analytics. Until then, there was no widely adopted way to efficiently query nested data on large datasets.
Nested data in the artist's imagination:
```ts
type Contact = {
  name?: string;
  phones: Array<{ number: string; phone_type?: string }>;
};
```
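To make that concrete, here is one hypothetical value of the `Contact` type above (the data is made up): `name` is optional, and `phones` can hold any number of entries. Optional and repeated fields like these are exactly what make nested data awkward for a flat, tabular layout.

```ts
// A hypothetical value of the Contact type defined above (illustrative data).
const contact: Contact = {
  name: "Ada",
  phones: [
    { number: "555-0100", phone_type: "home" },
    { number: "555-0199" }, // phone_type omitted: it is optional
  ],
};
```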
Even today, you'll find yourself waiting for hours if you try running aggregations like SUM and GROUP BY on multi-billion-row datasets in Postgres or MySQL. This is a design decision; OLTP databases are meant to run thousands of concurrent transactions per second, and to read and create entire rows instantly. But that design works against analytical queries, which rely on whole-table column data rather than individual rows.
For such queries, we turn to OLAP databases like DuckDB, DataFusion, and ClickHouse. They store data in a columnar manner. This exploits the physical reality of how bytes are arranged on disk, so that we can grab large chunks of useful data in one go, without having to read any irrelevant data.
The challenge with nested data, however, is that it has no direct representation in a columnar format. This is the problem the Dremel paper solved. Before we look at Dremel in detail, let's briefly visualize row and columnar storage.
Code for the curious
If you are familiar with the Dremel landscape, you can jump directly into a from-scratch educational implementation of the Dremel shredding algorithm here: github.com/jcsherin/denester. The core, parser.rs, is only around 300 lines.
There is also parquet-parallel-nested, which I wrote to explore the upper limits of shredding performance. It uses the Rust implementation of Arrow and exploits parallelism to generate, shred, and write 10 million nested documents in approximately 450 ms (on a 16-core AMD Ryzen 7 Pro).
Physical layout of data in row and column storages
Consider the following SQL query:
```sql
SELECT SUM(salary) FROM employees;
```
To execute this query, we need access to all the values of the `salary` column, and nothing else. Queries like this, which operate on the entirety of specific columns, are the norm in business reporting, analytics, and dashboards.
Let us now look at both row-oriented and column-oriented storage, and how each lays out bytes physically on disk. This arrangement decides which workloads each supports best.
Row-oriented storage
Here all rows are stored one after the other.
You might have noticed that we've split up the rows into different pages. This is because, at the hardware level, disk reads happen only in large chunks called "pages", typically 4 KB or more. The byte-wise read APIs are an illusion cast by the operating system. So to read data from even one row, the database ends up loading the entire page containing that row's data, which also includes other, irrelevant data from that page.
In the above visualization, every page that is actually read is marked with an orange border. The useful data - in this case the `salary` values we need to execute our query - is marked with a green background.
Here all pages have been read, even though we need only a few bytes of the `salary` column from each row in those pages. This is simply a consequence of how the data is laid out on disk. The layout is optimized for transactional workloads: adding and updating entire rows. But it is deeply inefficient for analytical queries that need to read specific columns across the table.
Column-oriented storage
Column-oriented storage turns the page-based reading of disks into a major advantage. Since all values for a single attribute (like `salary`) are stored together, the database can read just the pages for the columns it needs, ignoring all the other data in the table.
In this visualization we can see that only a single page is read (orange border), which contains only useful data (green background), demonstrating the immense I/O savings.
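To put rough numbers on this, here is a back-of-the-envelope sketch. Every figure in it (row size, value size, page size, row count) is made up purely for illustration.

```ts
// Back-of-the-envelope I/O estimate; all numbers are illustrative.
const PAGE_SIZE = 4 * 1024; // bytes per disk page
const ROWS = 1_000_000;     // rows in the employees table
const ROW_BYTES = 100;      // assumed size of a full employee row
const SALARY_BYTES = 8;     // one 64-bit salary value

// Row store: salary values are scattered across every page of the table,
// so SUM(salary) ends up touching essentially all of them.
const rowStorePages = Math.ceil((ROWS * ROW_BYTES) / PAGE_SIZE);

// Column store: salary values are packed contiguously in their own pages.
const columnStorePages = Math.ceil((ROWS * SALARY_BYTES) / PAGE_SIZE);

console.log({ rowStorePages, columnStorePages });
// { rowStorePages: 24415, columnStorePages: 1954 } -- roughly 12x fewer pages
```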
Motivation for Dremel
As we can see from the above visualization, there is a fundamental, physical reason why columnar storage is essential for efficient analytical queries. In the early 2000s, Google was confronting web-scale data that needed exactly this. But its first solution was MapReduce, which was more of a distributed computing framework and didn't exploit columnar storage. There was one good reason for that: there was no widely adopted way to store nested data in a columnar format. Because Protocol Buffers were used pervasively inside Google, most of its data consisted of non-relational, nested values, and so no existing OLAP database was fit for purpose. This is what led to the creation of Dremel.
But why wouldn't a straightforward columnar mapping work? One simple approach might be to flatten the nested data by treating every unique path in the tree as a column. For example, let's consider this nested data:
```json
[
  { "name": { "first": "John", "last": "Doe" } },
  { "name": { "first": "Jane", "last": "Smith" } }
]
```
We can store it in a columnar fashion directly, like this:
| name.first | name.last |
|---|---|
| John | Doe |
| Jane | Smith |
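As code, the naive mapping might look something like the sketch below (the `flatten` helper and its behavior are my own illustration, not from the paper). Running it on the two records above reproduces exactly the table we just saw.

```ts
// A minimal sketch of the naive "one column per unique path" flattening.
type Columns = Record<string, unknown[]>;

function flatten(records: Array<Record<string, unknown>>): Columns {
  const columns: Columns = {};

  const walk = (value: unknown, path: string) => {
    if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      // Interior node: recurse, extending the dotted path.
      for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
        walk(child, path ? `${path}.${key}` : key);
      }
    } else {
      // Leaf: append the value to the column named by its path.
      if (!(path in columns)) columns[path] = [];
      columns[path].push(value);
    }
  };

  for (const record of records) walk(record, "");
  return columns;
}

console.log(flatten([
  { name: { first: "John", last: "Doe" } },
  { name: { first: "Jane", last: "Smith" } },
]));
// { "name.first": ["John", "Jane"], "name.last": ["Doe", "Smith"] }
```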
But this stops working the moment we have repeated or optional fields. We will be unable to represent them efficiently and unambiguously with this mapping.
Consider this:
```
// Record 1
{
  "profile": {
    "name": "Alice",
    "age": 30
  }
}
// Record 2
{
  "profile": null // Missing profile entirely!
}
// Record 3
{
  "profile": {
    "name": "Charlie"
    // Missing age!
  }
}
```
Here is its naive columnar representation:
| column | values |
|---|---|
| profile.name | ["Alice", null, "Charlie"] |
| profile.age | [30, null, null] |
So we dealt with missing values by putting `null` wherever a value is absent. That, however, is a lossy transformation that destroys information.
How do we distinguish between a missing leaf value and a missing parent node? For example, in the second record it is not that `profile.name` or `profile.age` is null; there is no value at all for the `profile` field. That is impossible to tell from this representation.
And in the third record, the `profile` node exists, but the leaf value `age` is missing. That information cannot be inferred by looking at just the values of the `age` column: since the value is `null`, either the `age` leaf is missing or its entire parent node is missing. Note that in OLAP we work with individual columns, and we should be able to answer structural questions from a column alone. Reconstructing rows to recover this information would defeat the very efficiency that columnar storage buys us.
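To see the ambiguity concretely, here is a small sketch that applies the naive null-padding to the three records above (the `Rec` type is just an illustrative way to give all three records one shape):

```ts
// The three records from above under one illustrative TypeScript type.
type Rec = { profile: { name?: string; age?: number } | null };

const records: Rec[] = [
  { profile: { name: "Alice", age: 30 } }, // node and leaf both present
  { profile: null },                       // whole `profile` node missing
  { profile: { name: "Charlie" } },        // node present, `age` leaf missing
];

// Naive mapping: pad every missing value with null.
const profileAge = records.map((r) => r.profile?.age ?? null);

console.log(profileAge); // [30, null, null]
// The two nulls are indistinguishable, even though they arise from
// structurally different records.
```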
It doesn't stop there. The introduction of repeating elements (arrays) makes the ambiguity worse:
```
// Record 1
{
  "doc_id": 10,
  "links": {
    "forward": ["http://A", "http://B"]
  }
}
// Record 2
{
  "doc_id": 20,
  "links": {
    "forward": ["http://C"]
  }
}
```
Let's look at only the `forward` column.
| column | values |
|---|---|
| links.forward | ["http://A", "http://B", "http://C"] |
But now we've lost the record boundaries. How do we know that "http://A" and "http://B" belong to record 1, and "http://C" belongs to record 2?
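We can't. Here is a small sketch that makes the loss concrete: two different datasets (`docsB` is a hypothetical variation of the records above) flatten to an identical `links.forward` column.

```ts
// Two datasets that split the same URLs across different record boundaries.
const docsA = [
  { doc_id: 10, links: { forward: ["http://A", "http://B"] } },
  { doc_id: 20, links: { forward: ["http://C"] } },
];
// docsB is hypothetical: the same values, grouped differently.
const docsB = [
  { doc_id: 10, links: { forward: ["http://A"] } },
  { doc_id: 20, links: { forward: ["http://B", "http://C"] } },
];

// Naive columnar mapping: concatenate every repeated value into one column.
const flattenForward = (docs: typeof docsA) => docs.flatMap((d) => d.links.forward);

console.log(flattenForward(docsA)); // ["http://A", "http://B", "http://C"]
console.log(flattenForward(docsB)); // ["http://A", "http://B", "http://C"]
// Identical columns -- the record boundaries are unrecoverable.
```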
When nested values have both optional and repeated elements, this naive mapping introduces so much ambiguity that it becomes impossible to reconstruct the original record. While OLAP doesn't require row-wise iteration, the mapping is only correct if the original value can, in principle, be reconstructed from it.
The Dremel paper was Google's solution to this exact problem - it lets us accurately and efficiently represent nested data with repeated and optional values in a columnar structure. Google was dealing with billions of rows of hierarchical web data that broke traditional columnar assumptions. The results were dramatic: a query over 85 billion records that previously took over an hour as a MapReduce job now completed in under 10 seconds in Dremel.
Although it hasn't captured the popular imagination like its peer MapReduce, Dremel has had a profound impact on all modern analytical SQL databases. It is what enables today's OLAP databases to query billions of rows of nested data at almost the same speed as flat relational data. And it has stood the test of time - literally: in 2020 the Dremel paper received the VLDB Test of Time award.