A sequence of data-driven performance optimizations applied to a Parquet nested-dataset generator. Each optimization step is guided by profiling, hardware performance counters, and coarse-grained benchmarks. The series reinforces the importance of measuring performance changes to verify whether each one actually produced a speedup (or a slowdown). Sometimes an optimization makes the program more efficient at utilizing the underlying CPU cores rather than making it faster.
Concurrency models (latch crabbing among them) prove that concurrent operations on a B+Tree index are possible with fine-grained latching. This contributes significantly to performance while still guaranteeing correctness and safety. However, real-world implementations have to bridge the gap between the model and useful features that directly conflict with its safety properties. For instance, a naive bi-directional range scan implementation can introduce data races and deadlocks.
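As a rough illustration of the protocol (a sketch, not code from the post), a read-only descent with latch crabbing can be expressed as follows; the `Node` layout and the `crabbing_lookup`/`make_demo_tree` helpers are hypothetical names chosen for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical, simplified node for illustrating latch crabbing; a real
// B+Tree node would also hold values, sibling pointers, etc.
struct Node {
    std::mutex latch;
    bool is_leaf = false;
    std::vector<int> keys;                        // separators (inner) or data keys (leaf)
    std::vector<std::unique_ptr<Node>> children;  // empty for leaves
};

// Read-side latch crabbing: descend root-to-leaf holding at most two
// latches at a time. The child is latched before the parent is released,
// so no writer can restructure the path underneath the reader.
bool crabbing_lookup(Node* root, int key) {
    root->latch.lock();
    Node* cur = root;
    while (!cur->is_leaf) {
        // Pick the child whose key range covers `key`.
        std::size_t i = 0;
        while (i < cur->keys.size() && key >= cur->keys[i]) ++i;
        Node* child = cur->children[i].get();
        child->latch.lock();  // grab the child...
        cur->latch.unlock();  // ...then let go of the parent (the "crab" step)
        cur = child;
    }
    bool found = false;
    for (int k : cur->keys) found = found || (k == key);
    cur->latch.unlock();
    return found;
}

// Tiny two-level demo tree: root separator {10}, leaves {3, 7} and {10, 15}.
std::unique_ptr<Node> make_demo_tree() {
    auto root = std::make_unique<Node>();
    root->keys = {10};
    auto left = std::make_unique<Node>();
    left->is_leaf = true;
    left->keys = {3, 7};
    auto right = std::make_unique<Node>();
    right->is_leaf = true;
    right->keys = {10, 15};
    root->children.push_back(std::move(left));
    root->children.push_back(std::move(right));
    return root;
}
```

A writer follows the same pattern but may only release the parent once the child is "safe" (cannot split or merge); a backward-moving scan breaks the strict parent-to-child latch ordering, which is exactly where the deadlock risk mentioned above comes from.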
A B+Tree node requires a contiguous, cache-friendly memory layout together with a dynamic number of entries, but standard C++ tools fall short on both ends. This post explores how to solve this classic problem using the "struct hack" and manual memory management, while also exposing the added complexity and drawbacks involved when chasing raw performance.
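A minimal sketch of the idiom in question (the `LeafNode` name and its `create`/`destroy` helpers are illustrative assumptions, not the post's actual code). The trailing one-element array is the classic pre-C99 "struct hack": formally undefined behavior by the letter of the standard, but widely relied upon in practice:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical leaf-node layout: a fixed header immediately followed by a
// variable number of keys in the same allocation, so all keys sit in one
// contiguous, cache-friendly block.
struct LeafNode {
    unsigned short count;     // keys currently stored
    unsigned short capacity;  // keys this node can hold
    int keys[1];              // "struct hack": really `capacity` ints follow

    static LeafNode* create(unsigned short capacity) {
        // One allocation covering the header plus `capacity` keys; the
        // compiler never sees the true array size, hence the manual sizing.
        std::size_t bytes = offsetof(LeafNode, keys) + capacity * sizeof(int);
        auto* node = static_cast<LeafNode*>(std::malloc(bytes));
        node->count = 0;
        node->capacity = capacity;
        return node;
    }

    static void destroy(LeafNode* node) { std::free(node); }
};
```

C99 later standardized the idea as the flexible array member (`int keys[];`); C++ has no standard equivalent, which is part of why manual memory management enters the picture.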
The stable logarithmic performance of a B+Tree is an outcome of its self-balancing property. When an inner or leaf node falls below its minimum occupancy threshold during a deletion, the tree's rebalancing procedure is triggered: the node either merges with a sibling or borrows (redistributes) keys from one. Neither choice is inherently better; rather, the decision signals the fundamental trade-off preferred by a specific implementation. A survey of major OLTP systems reveals a design space far more nuanced than this simple binary choice suggests.
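The binary choice can be phrased as a small policy function. The sketch below, including the `min_keys` threshold parameter, is an illustrative assumption rather than any surveyed system's code; real implementations layer further heuristics on top:

```cpp
#include <cassert>
#include <cstddef>

// Possible outcomes when a node drops below its occupancy threshold.
enum class Rebalance { None, Borrow, Merge };

// Illustrative policy (assumed min_keys threshold): borrow when a sibling
// has keys to spare, merge when both nodes sit at the minimum, so the
// merged node still fits within a single node's capacity.
Rebalance choose_rebalance(std::size_t node_keys,
                           std::size_t sibling_keys,
                           std::size_t min_keys) {
    if (node_keys >= min_keys) return Rebalance::None;     // still legal, no work
    if (sibling_keys > min_keys) return Rebalance::Borrow; // sibling can spare a key
    return Rebalance::Merge;                               // both at minimum: combine
}
```

Borrowing touches fewer nodes and leaves occupancy higher, while merging shrinks the tree and can cascade upward; which branch a system prefers is exactly the trade-off the survey examines.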