Database Indexing Mechanics: B-Trees, LSM-Trees, and Sequential Scans
Database Indexing Mechanics: From to
Understanding how databases retrieve data efficiently is one of the most critical skills in backend engineering. At its core, database indexing is about avoiding full-table scans — transitioning a query's time complexity from (examining every row) to (narrowing the search space exponentially per step) or even in specialized cases.
This deep-dive explores three fundamental access patterns that underpin virtually every modern database engine:
| Access Pattern | Structure | Read Complexity | Write Complexity | Best For |
|---|---|---|---|---|
| Sequential Scan | Raw heap / unordered table | append | Full-table analytics, small tables | |
| B-Tree Index | Balanced multi-way search tree | OLTP, point lookups, range queries | ||
| LSM-Tree | Sorted runs + memtable | point; worst-case | amortized | Write-heavy workloads, time-series |
The key insight: no single structure dominates all workloads. Engineering the right indexing strategy requires understanding the structural mechanics and complexity trade-offs of each approach.
B-Trees vs LSM Trees: How Databases Actually Store Your Data
The Sequential Table Scan: Baseline
A sequential scan is the simplest — and often the most expensive — access path. The database reads every page of the heap table, evaluating each row against the query predicate.
Why Sequential Scans Still Matter
Despite their complexity, sequential scans are not always wrong:
-
Small tables: If a table fits in a few pages, the overhead of index traversal (random I/O) exceeds the cost of scanning all pages (sequential I/O). PostgreSQL's query planner, for instance, has a built-in random_page_cost threshold and will prefer sequential scans for small datasets .
-
High selectivity queries: When a query returns a large percentage of rows (e.g., >10–15%), reading the index + random heap fetches becomes more expensive than just scanning everything sequentially. This is because each index-based row access requires a separate random I/O, whereas a sequential scan reads contiguous blocks .
-
No useful index exists: If no index matches the query's filter columns, there is no alternative.
I/O Cost Model
The fundamental comparison:
Where = number of rows, = rows per page, = number of matching rows, = branching factor, = cost of sequential I/O, and = cost of random I/O.
Since random I/O is typically 10–100× slower than sequential I/O on HDDs (and still significant on SSDs), there exists a selectivity threshold beyond which the full scan wins.
Footnotes
The Selectivity Trap
Adding an index does NOT guarantee faster queries. If your query matches a large fraction of rows, the index introduces random I/O overhead that actually slows down the query compared to a sequential scan. Always check your EXPLAIN plans — databases like PostgreSQL and MySQL will correctly choose sequential scans when they're cheaper.
B-Tree Indexing: From to
The B-Tree is the most ubiquitous index structure in relational databases — used by PostgreSQL, MySQL (InnoDB), SQLite, Oracle, and SQL Server. It provides lookups, insertions, and deletions.
Structural Anatomy
Unlike a binary search tree (which has branching factor 2), B-Trees have a high branching factor (typically 100–1000+), meaning the tree is extremely shallow. A B-Tree with branching factor and keys has height:
For a table with 1 billion rows and branching factor 500:
This means only 4 disk reads are needed to find any row — the difference between seconds and microseconds.
| Property | Value |
|---|---|
| Node size | Typically matches disk page (4KB–16KB) |
| Branching factor () | 100–1000+ depending on key size |
| Height for 1B rows | ~3–4 levels |
| Point lookup cost | disk reads |
| Range scan cost | where = result rows |
| Insert/Delete cost | + possible node split/merge |
B-Tree Mechanics: Search, Insert, Delete
Search: Start at the root. At each node, perform a binary search within the node's sorted keys to select the correct child pointer. Descend until a leaf node is reached.
Insert: Traverse to the correct leaf. If the leaf has room, insert the key. If it's full, split the leaf into two halves and propagate the median key upward. This may cause cascading splits up to the root (which grows the tree by one level) .
Delete: Find the key, remove it. If the node falls below minimum occupancy, merge with a sibling or redistribute keys from a sibling. This may cascade upward.
B+Tree: The Practical Variant
Most databases actually implement B+Trees, not pure B-Trees. The critical differences:
- Internal nodes contain only keys (no data) — maximizing branching factor
- All data is in the leaf level — making scans uniform
- Leaf nodes are linked in a doubly-linked list — enabling efficient range scans without re-traversing the tree
Where is the number of matching rows and is the page size. The linked-list structure means once you reach the first matching leaf, you scan contiguous leaves — no more tree traversals .
Footnotes
B-Tree Point Lookup Algorithm
- 1Step 1
Load the root page from disk. This is the first I/O operation. The root is typically cached in memory after the first access.
- 2Step 2
Perform binary search on the sorted keys within the current node to find the child pointer that falls within the target key's range. With branching factor , this is typically 9 comparisons.
- 3Step 3
Follow the identified child pointer, loading the next page from disk. Repeat search within this node.
- 4Step 4
After levels, arrive at the leaf containing the target key. Total disk reads: — typically 2–4 for even massive tables.
- 5Step 5
The leaf contains the key and a heap pointer (CTID in PostgreSQL, primary key in InnoDB's clustered index). Perform one additional random I/O to retrieve the full row from the heap.
LSM-Trees: Engineering for Write-Heavy Workloads
The LSM-Tree (Log-Structured Merge Tree) was introduced by Patrick O'Neil et al. in 1996 as a fundamentally different approach: optimize for writes, not reads .
The Write Problem with B-Trees
B-Trees are read-optimized but write-hostile. Every insert, update, or delete requires:
- Traversing levels to find the target leaf
- Loading the page (random I/O)
- Potentially triggering page splits, rebalancing, and WAL logging
- Each write incurs random I/O — the worst case for disk performance
For write-heavy workloads (time-series data, event logs, IoT telemetry), this random I/O pattern becomes the bottleneck.
LSM-Tree Architecture: The Write Path
LSM-Trees flip this model by making writes sequential:
| Component | Storage | Data Structure | Role |
|---|---|---|---|
| MemTable | Memory | Skip list / RB-tree | Absorb writes at memory speed |
| WAL (Write-Ahead Log) | Disk (sequential) | Append-only log | Durability guarantee |
| SSTable | Disk (sequential) | Sorted flat file | Persistent sorted data |
| Bloom Filter | Memory | Probabilistic bit array | Speed up point lookups |
Write Complexity: amortized — writes go to the in-memory MemTable (no disk I/O for most writes). The MemTable is flushed to disk as a new SSTable only when it reaches a size threshold .
The Read Path: Where Complexity Increases
While writes are cheap, reads become more complex. To find a key:
- Check the MemTable (memory):
- Search Level 0 SSTables (newest to oldest): — each L0 SSTable may overlap
- Search Level 1+ SSTables: per level (one SSTable per level due to key range partitioning)
- Use Bloom filters to skip SSTables that definitely don't contain the key
Where is the number of un-compacted SSTables and is the maximum level. Without bloom filters, this can degrade toward — a critical concern.
Compaction: The Background Garbage Collector
Compaction is the mechanism that keeps read performance from degrading:
| Compaction Type | Trigger | Effect |
|---|---|---|
| Minor (flush) | MemTable full | Write one new L0 SSTable |
| Leveled compaction | L0 or L size exceeds threshold | Merge overlapping SSTables, push to L |
| Tiered compaction | Level too large | Merge entire tier at once, less write amplification |
| Major | Admin or threshold | Rewrite entire database into one sorted run |
Write amplification in leveled compaction can be significant: a single write may be re-written times during compaction. For LevelDB/RocksDB with 7 levels and size ratio 10, write amplification can reach 30–50× .
Footnotes
LSM-Tree Write Amplification
A single logical write in an LSM-Tree with leveled compaction across 6 levels and a 10× size ratio can be physically rewritten up to 30+ times. At , writing 1 GB of data physically writes 30 GB to disk. This impacts SSD lifespan and total I/O bandwidth. Consider tiered compaction for write-heavy workloads where read latency is acceptable.
B-Tree vs LSM-Tree vs Sequential Scan: Capability Comparison
Relative performance across key dimensions (higher = better for that dimension)
Complexity Transition Analysis:
The fundamental question: when does an index actually help, and when does it hurt?
The Cross-Over Point
Consider a table of rows stored across pages (where = rows per page). The cost of sequential scan:
The cost of B-Tree index scan returning rows:
Where is the tree height and and are sequential and random I/O times respectively.
Setting and solving for the selectivity threshold :
| Storage Type | Rows/Page () | Selectivity Threshold | |
|---|---|---|---|
| HDD | ~50 | 100 | ~0.02% (~1 in 5000 rows) |
| SATA SSD | ~10 | 100 | ~0.1% (~1 in 1000 rows) |
| NVMe SSD | ~2 | 100 | ~0.5% (~1 in 200 rows) |
On HDDs, an index scan is only beneficial when retrieving fewer than 0.02% of rows. On NVMe SSDs, the advantage window widens significantly because random I/O is much faster .
Big-O Summary
When Each Structure Wins: Decision Matrix
| Scenario | Best Access Path | Why |
|---|---|---|
| Point query on primary key | B-Tree (clustered) | , no extra heap fetch |
| Range query (e.g., date range) | B-Tree | Linked leaf pages enable sequential scan of range |
| High-volume inserts (>10K/sec) | LSM-Tree | Writes are memory inserts, no page splits |
| Full-table analytic aggregation | Sequential Scan | Sequential I/O is optimal for reading all rows |
| Key-value point lookup | LSM + Bloom Filter | Bloom filter eliminates most false SSTable reads |
| Small table (<1000 rows) | Sequential Scan | Index overhead > scan cost |
Footnotes
-
MySQL 8.0 Reference Manual and PostgreSQL 16 Documentation — Query Planner Cost Model ↩
Evolution of Database Indexing Structures
ISAM Introduced
1970IBM's Indexed Sequential Access Method — a static two-level tree structure. Lay the groundwork for ordered indexing, but couldn't handle dynamic inserts without overflow chains."
B-Tree Invented
1972Bayer & McCreight publish 'Organization and Maintenance of Large Ordered Indices.' The self-balancing multi-way tree becomes the foundation of all modern database indexing."
B+Tree Formalized
1979Variants with data only in leaves and linked leaf pointers appear, becoming the de-facto standard in System R, Oracle, and later PostgreSQL and MySQL."
LSM-Tree Paper Published
1996O'Neil, Cheng, Gollnick, & O'Neil publish the LSM-Tree paper in Acta Informatica, introducing the write-optimized paradigm that powers Bigtable, Cassandra, RocksDB, and LevelDB."
Bigtable & SSTables
2006Google publishes the Bigtable paper, popularizing SSTables + LSM architecture at web scale. Inspires Cassandra (2008), HBase (2008), and LevelDB (2011)."
RocksDB & HyperDatabases
2016+RocksDB (Facebook) optimizes LSM compaction (leveled, tiered, FIFO). New hybrids emerge: WiredTiger (MongoDB) offers both B-Tree and LSM modes."
1-- Create a B-Tree index 2CREATE INDEX idx_users_email ON users (email); 3 4-- Force index usage to see the plan 5EXPLAIN (ANALYZE, BUFFERS) 6SELECT * FROM users WHERE email = 'alice@example.com'; 7 8-- Expected: Index Scan using idx_users_email 9-- Cost: O(log N) ≈ 0.29..8.51 for 1M row table
Key: PostgreSQL uses B+Trees by default. The EXPLAIN output shows Index Scan for point queries and Index Scan Backward for ORDER BY ... DESC.
Advanced Topics & Edge Cases
Database Indexing Mechanics — Key Concepts
Knowledge Check
A B+Tree with branching factor 200 and 100 million rows has approximately how many levels?
References
Explore Related Topics
Algorithms: Foundations, Analysis, Design Paradigms, and Core Applications
Paging with Translation Lookaside Buffer (TLB) Scheme
Breadth-First Search as an Uninformed Search Strategy
Breadth‑First Search (BFS) expands the shallowest frontier nodes first using a FIFO queue and does not employ any heuristic function, making it an uninformed (blind) search strategy.
- Classified as uninformed search because it relies only on the problem definition, not on or other estimates.
- Complete for finite branching factors and optimal when all step costs are equal.
- Tree‑search time and space are ; graph‑search runs in time and space.
- Main weakness is exponential memory growth, so it suits shallow goals with ample memory.
- If step costs vary, uniform‑cost search should be used instead of BFS.