Database Indexing Mechanics: B-Trees, LSM-Trees, and Sequential Scans

Jun 18, 2026

Database Indexing Mechanics: From $O(N)$ to $O(\log N)$

Understanding how databases retrieve data efficiently is one of the most critical skills in backend engineering. At its core, database indexing is about avoiding full-table scans: moving a query's time complexity from $O(N)$ (examining every row) to $O(\log N)$ (shrinking the search space by a constant factor at each step) or even $O(1)$ in specialized cases.

This deep-dive explores three fundamental access patterns that underpin virtually every modern database engine:

Access Pattern	Structure	Read Complexity	Write Complexity	Best For
Sequential Scan	Raw heap / unordered table	$O(N)$	$O(1)$ append	Full-table analytics, small tables
B-Tree Index	Balanced multi-way search tree	$O(\log N)$	$O(\log N)$	OLTP, point lookups, range queries
LSM-Tree	Sorted runs + memtable	$O(\log N)$ point; $O(N)$ worst-case	$O(1)$ amortized	Write-heavy workloads, time-series

The key insight: no single structure dominates all workloads. Engineering the right indexing strategy requires understanding the structural mechanics and complexity trade-offs of each approach.

B-Trees vs LSM Trees: How Databases Actually Store Your Data

The Sequential Table Scan: $O(N)$ Baseline

A sequential scan is the simplest — and often the most expensive — access path. The database reads every page of the heap table, evaluating each row against the query predicate.

Why Sequential Scans Still Matter

Despite their $O(N)$ complexity, sequential scans are not always wrong:

Small tables: If a table fits in a few pages, the overhead of index traversal (random I/O) exceeds the cost of scanning all pages (sequential I/O). PostgreSQL's query planner, for instance, weighs the seq_page_cost and random_page_cost settings when costing each plan, and typically prefers a sequential scan for small tables.
Low-selectivity queries: When a query returns a large percentage of rows (a common rule of thumb is more than 10–15%), reading the index plus random heap fetches becomes more expensive than scanning everything sequentially. Each index-based row access may require a separate random I/O, whereas a sequential scan reads contiguous blocks.
No useful index exists: If no index matches the query's filter columns, there is no alternative.

I/O Cost Model

The fundamental comparison:

$C_{\text{seq}} = \frac{N}{P} \cdot C_{\text{seq\_io}}$

$C_{\text{index}} = \log_B(N) \cdot C_{\text{rand\_io}} + S \cdot C_{\text{rand\_io}}$

Where $N$ = number of rows, $P$ = rows per page, $S$ = number of matching rows, $B$ = branching factor, $C_{\text{seq\_io}}$ = cost of sequential I/O, and $C_{\text{rand\_io}}$ = cost of random I/O.

Since random I/O is typically 10–100× slower than sequential I/O on HDDs (and still significant on SSDs), there exists a selectivity threshold beyond which the full scan wins.
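The two cost formulas above can be compared numerically. The following is a minimal sketch, assuming illustrative values that are not from the source ($B = 500$, a random I/O costing 50 sequential I/Os, 100 rows per page); the function names are hypothetical.

```python
import math

def seq_scan_cost(n_rows, rows_per_page, c_seq_io=1.0):
    """C_seq = (N / P) * C_seq_io: every page is read once, sequentially."""
    return (n_rows / rows_per_page) * c_seq_io

def index_scan_cost(n_rows, matching_rows, branching=500, c_rand_io=50.0):
    """C_index = log_B(N) * C_rand_io + S * C_rand_io."""
    tree_levels = math.log(n_rows, branching)
    return (tree_levels + matching_rows) * c_rand_io

# Assumed workload: 1M rows, 100 rows per page.
N, P = 1_000_000, 100
for s in (10, 100, 1_000, 10_000):
    winner = "index" if index_scan_cost(N, s) < seq_scan_cost(N, P) else "seq scan"
    print(f"S={s:>6}: seq={seq_scan_cost(N, P):.0f} index={index_scan_cost(N, s):.0f} -> {winner}")
```

Under these assumed costs the index wins for 100 matching rows and loses for 1,000, which is the selectivity threshold the paragraph above describes.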

PostgreSQL Documentation, Chapter 52 - Overview of PostgreSQL Internals
MySQL 8.0 Reference Manual - Optimization and Indexes

The Selectivity Trap

Adding an index does NOT guarantee faster queries. If your query matches a large fraction of rows, the index introduces random I/O overhead that makes the query slower than a sequential scan. Always check your EXPLAIN plans: planners such as PostgreSQL's and MySQL's will usually choose a sequential scan when it is cheaper, provided table statistics are up to date.

B-Tree Indexing: From $O(N)$ to $O(\log N)$

The B-Tree is the most ubiquitous index structure in relational databases — used by PostgreSQL, MySQL (InnoDB), SQLite, Oracle, and SQL Server. It provides $O(\log N)$ lookups, insertions, and deletions.

Structural Anatomy

Unlike a binary search tree (which has branching factor 2), B-Trees have a high branching factor (typically 100–1000+), meaning the tree is extremely shallow. A B-Tree with branching factor $B$ and $N$ keys has height:

$h \leq \lceil \log_B(N) \rceil$

For a table with 1 billion rows and branching factor 500:

$h \leq \lceil \log_{500}(10^9) \rceil \approx 4 \text{ levels}$

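This means at most 4 page reads are needed to reach any leaf entry, and fewer when the upper levels are cached in memory. A full scan of the same table would have to read every page, which for a billion rows is on the order of millions of pages.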
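The height bound can be checked with integer arithmetic, which avoids floating-point rounding in the logarithm. This is a small sketch; `btree_height_bound` is a hypothetical helper name, not part of any database API.

```python
def btree_height_bound(n_keys, branching):
    """Smallest h such that branching**h >= n_keys, i.e. ceil(log_B(N))."""
    height, capacity = 0, 1
    while capacity < n_keys:
        capacity *= branching
        height += 1
    return height

print(btree_height_bound(10**9, 500))  # 4, matching the example above
print(btree_height_bound(10**9, 2))    # 30, a binary tree over the same keys
```

The comparison with branching factor 2 shows why wide nodes matter: the same billion keys need 30 levels in a binary tree.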

Property	Value
Node size	Typically matches disk page (4KB–16KB)
Branching factor ( $B$ )	100–1000+ depending on key size
Height for 1B rows	~3–4 levels
Point lookup cost	$O(\log_B N)$ disk reads
Range scan cost	$O(\log_B N + K)$ where $K$ = result rows
Insert/Delete cost	$O(\log_B N)$ + possible node split/merge

B-Tree Mechanics: Search, Insert, Delete

Search: Start at the root. At each node, perform a binary search within the node's sorted keys to select the correct child pointer. Descend until a leaf node is reached.

Insert: Traverse to the correct leaf. If the leaf has room, insert the key. If it's full, split the leaf into two halves and propagate the median key upward. This may cause cascading splits up to the root (which grows the tree by one level).

Delete: Find the key, remove it. If the node falls below minimum occupancy, merge with a sibling or redistribute keys from a sibling. This may cascade upward.
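The leaf-split step of an insert can be sketched as follows. This is a simplified illustration, assuming a tiny hypothetical capacity (`MAX_KEYS = 4`) and B+Tree-style splitting, where the separator key is copied upward and also stays in the right-hand leaf. Propagating the separator into the parent, and cascading splits, are left out.

```python
import bisect

MAX_KEYS = 4  # hypothetical tiny leaf capacity so a split is easy to trigger

def insert_into_leaf(leaf, key):
    """Insert key into a sorted leaf (in place).

    Returns None if the leaf still fits, otherwise (left, separator, right),
    where separator is the median key that the parent must receive.
    """
    bisect.insort(leaf, key)
    if len(leaf) <= MAX_KEYS:
        return None
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    return left, right[0], right

print(insert_into_leaf([10, 20, 30, 40], 25))  # ([10, 20], 25, [25, 30, 40])
```

A real implementation would also update sibling pointers and handle the case where the parent itself is full.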

B+Tree: The Practical Variant

Most databases actually implement B+Trees, not pure B-Trees. The critical differences:

Internal nodes contain only keys (no data) — maximizing branching factor
All data is in the leaf level — making scans uniform
Leaf nodes are linked in a doubly-linked list — enabling efficient range scans without re-traversing the tree

$\text{Range scan cost}: O(\log_B N + K/P)$

Where $K$ is the number of matching rows and $P$ is the number of entries per leaf page. The linked-list structure means that once you reach the first matching leaf, you follow sibling pointers from leaf to leaf, with no further tree traversals.
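A range scan over linked leaves might look like the sketch below. It assumes the starting leaf has already been found by a tree descent, and the `Leaf` class and `link` helper are hypothetical names used only for illustration.

```python
import bisect
from dataclasses import dataclass
from typing import Optional

@dataclass
class Leaf:
    keys: list                      # sorted keys stored in this leaf page
    next: Optional["Leaf"] = None   # sibling pointer to the next leaf

def link(leaves):
    """Chain leaves left to right and return the first one."""
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves[0]

def range_scan(start_leaf, lo, hi):
    """Collect keys in [lo, hi] by walking sibling pointers; also count leaf pages read."""
    results, pages_read, leaf = [], 0, start_leaf
    while leaf is not None:
        pages_read += 1
        first = bisect.bisect_left(leaf.keys, lo)
        for key in leaf.keys[first:]:
            if key > hi:
                return results, pages_read
            results.append(key)
        leaf = leaf.next
    return results, pages_read

head = link([Leaf([1, 3, 5]), Leaf([7, 9, 11]), Leaf([13, 15, 17])])
print(range_scan(head, 4, 13))  # ([5, 7, 9, 11, 13], 3)
```

Five matching keys at three keys per leaf touch three leaf pages, in line with the $K/P$ term.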

Database Internals, Alex Petrov, O'Reilly 2019, Chapter on B-Tree Variants

B-Tree Point Lookup Algorithm

Step 1: Load the root page. This is the first page access; in practice the root is almost always cached in memory after the first access.
Step 2: Perform a binary search, $O(\log B)$ comparisons, on the sorted keys within the current node to find the child pointer whose range covers the target key. With branching factor $B \approx 500$, this takes about 9 comparisons ($\log_2 500 \approx 9$).
Step 3: Follow that child pointer, loading the next page, and repeat the search within that node.
Step 4: After $h = O(\log_B N)$ levels, arrive at the leaf containing the target key. Total page reads: $O(\log_B N)$, typically 2–4 even for very large tables.
Step 5: The leaf entry holds the key and a row locator. In PostgreSQL this is a tuple identifier (ctid) pointing into the heap, so one additional random I/O retrieves the full row. In InnoDB, a secondary index leaf stores the primary key value, which is then looked up in the clustered index, whose leaves hold the row itself.
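The descent in Steps 1–4 can be sketched on a small in-memory tree. The `Internal` and `LeafNode` classes, the separator convention (keys greater than or equal to a separator go right), and the string row pointers are all assumptions made for illustration. The heap fetch of Step 5 is not modeled.

```python
import bisect

class Internal:
    def __init__(self, keys, children):
        self.keys = keys            # sorted separator keys
        self.children = children    # len(children) == len(keys) + 1

class LeafNode:
    def __init__(self, keys, pointers):
        self.keys = keys            # sorted keys
        self.pointers = pointers    # row locator for each key

def lookup(root, key):
    """Return (row_pointer or None, pages_read) for a point lookup."""
    node, pages_read = root, 1                              # Step 1: read the root
    while isinstance(node, Internal):
        child_index = bisect.bisect_right(node.keys, key)   # Step 2: binary search in node
        node = node.children[child_index]                   # Step 3: follow child pointer
        pages_read += 1
    i = bisect.bisect_left(node.keys, key)                  # Step 4: search the leaf
    if i < len(node.keys) and node.keys[i] == key:
        return node.pointers[i], pages_read
    return None, pages_read

root = Internal(
    [20, 40],
    [
        LeafNode([5, 10], ["p5", "p10"]),
        LeafNode([20, 30], ["p20", "p30"]),
        LeafNode([40, 50], ["p40", "p50"]),
    ],
)
print(lookup(root, 30))  # ('p30', 2)
print(lookup(root, 35))  # (None, 2)
```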

LSM-Trees: Engineering for Write-Heavy Workloads

The LSM-Tree (Log-Structured Merge Tree) was introduced by Patrick O'Neil et al. in 1996 as a fundamentally different approach: optimize for writes, not reads.

The Write Problem with B-Trees

B-Trees are read-optimized but write-hostile. Every insert, update, or delete requires:

Traversing $O(\log_B N)$ levels to find the target leaf
Loading the page (random I/O)
Potentially triggering page splits, rebalancing, and WAL logging
Each write incurs random I/O — the worst case for disk performance

For write-heavy workloads (time-series data, event logs, IoT telemetry), this random I/O pattern becomes the bottleneck.

LSM-Tree Architecture: The Write Path

LSM-Trees flip this model by making writes sequential:

Component	Storage	Data Structure	Role
MemTable	Memory	Skip list / RB-tree	Absorb writes at memory speed
WAL (Write-Ahead Log)	Disk (sequential)	Append-only log	Durability guarantee
SSTable	Disk (sequential)	Sorted flat file	Persistent sorted data
Bloom Filter	Memory	Probabilistic bit array	Speed up point lookups

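Write Complexity: $O(1)$ amortized disk I/O per write. Each write is appended to the WAL (a sequential write) and inserted into the in-memory MemTable, which costs $O(\log N_{\text{mem}})$ comparisons for a skip list or balanced tree; no random disk I/O is on the write path. The MemTable is flushed to disk as a new SSTable only when it reaches a size threshold.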
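The write path can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any particular engine is implemented: `TinyLSMWriter` is a hypothetical class, a plain dict stands in for the skip list, SSTables are JSON files, and WAL rotation after a flush is omitted.

```python
import json
import os
import tempfile

class TinyLSMWriter:
    """Toy write path: WAL append, MemTable insert, flush to a sorted file."""

    def __init__(self, directory, memtable_limit=3):
        self.directory = directory
        self.memtable_limit = memtable_limit   # flush threshold, in entries
        self.memtable = {}                     # stand-in for a skip list
        self.sstables = []                     # SSTable paths, oldest first
        self.wal = open(os.path.join(directory, "wal.log"), "a", encoding="utf-8")

    def put(self, key, value):
        # Durability first: sequential append to the write-ahead log.
        self.wal.write(json.dumps([key, value]) + "\n")
        self.wal.flush()
        # Then the in-memory insert; no random disk I/O is involved.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the MemTable out as one sorted, immutable run.
        path = os.path.join(self.directory, f"sst_{len(self.sstables):04d}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(sorted(self.memtable.items()), f)
        self.sstables.append(path)
        self.memtable.clear()

    def close(self):
        self.wal.close()

with tempfile.TemporaryDirectory() as d:
    db = TinyLSMWriter(d, memtable_limit=3)
    for k in ["c", "a", "b", "d"]:
        db.put(k, k.upper())
    print(len(db.sstables), db.memtable)  # 1 {'d': 'D'}
    db.close()
```

The third `put` fills the MemTable and triggers a flush, so the fourth key lands in a fresh MemTable.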

The Read Path: Where Complexity Increases

While writes are cheap, reads become more complex. To find a key:

Check the MemTable (memory): $O(\log N_{\text{mem}})$
Search Level 0 SSTables (newest to oldest): $O(L_0 \cdot \log N_{\text{sstable}})$, because L0 SSTables may overlap in key range, so each one may have to be searched
Search Level 1+ SSTables: $O(\log N)$ per level (key ranges within a level do not overlap, so at most one SSTable per level can contain the key)
Use Bloom filters to skip SSTables that definitely don't contain the key

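$\text{Worst-case read cost}: O\big((L_0 + L_{\max}) \cdot \log N_{\text{sstable}}\big)$

Where $L_0$ is the number of un-compacted Level 0 SSTables, $L_{\max}$ is the number of deeper levels, and $N_{\text{sstable}}$ is the number of keys in an SSTable. Without Bloom filters every candidate SSTable must actually be probed, and if compaction falls far behind, the number of SSTables grows with the data, so read cost can degrade toward $O(N)$.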
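The read path can be sketched as a MemTable check followed by newest-to-oldest SSTable probes guarded by Bloom filters. Everything here is a simplified assumption: `BloomFilter`, `SSTable`, and `lsm_get` are hypothetical names, the filter sizes are arbitrary, levels are flattened into one newest-first list, and deletions (tombstones) are not handled.

```python
import bisect
import hashlib

class BloomFilter:
    """Small Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, n_bits=256, n_hashes=3):
        self.n_bits, self.n_hashes, self.bits = n_bits, n_hashes, 0

    def _positions(self, key):
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))

class SSTable:
    """Immutable sorted run with its own Bloom filter."""

    def __init__(self, items):
        pairs = sorted(items.items())
        self.keys = [k for k, _ in pairs]
        self.values = [v for _, v in pairs]
        self.bloom = BloomFilter()
        for k in self.keys:
            self.bloom.add(k)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)  # binary search within the run
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

def lsm_get(key, memtable, sstables_newest_first):
    """Return (value or None, number of SSTables actually searched)."""
    if key in memtable:
        return memtable[key], 0
    probes = 0
    for table in sstables_newest_first:
        if not table.bloom.might_contain(key):
            continue  # definitely absent from this run, skip it
        probes += 1
        value = table.get(key)
        if value is not None:
            return value, probes
    return None, probes

memtable = {"k9": "new"}
tables = [SSTable({"k1": "v1-new", "k2": "v2"}), SSTable({"k1": "v1-old", "k3": "v3"})]
print(lsm_get("k1", memtable, tables)[0])  # v1-new (the newest run wins)
print(lsm_get("k3", memtable, tables)[0])  # v3
```

Because the newest run is searched first, a key that exists in several runs resolves to its most recent value.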

Compaction: The Background Garbage Collector

Compaction is the mechanism that keeps read performance from degrading:

Compaction Type	Trigger	Effect
Minor (flush)	MemTable full	Write one new L0 SSTable
Leveled compaction	L0 or $L_i$ size exceeds threshold	Merge overlapping SSTables, push to $L_{i+1}$
Tiered compaction	Level too large	Merge entire tier at once, less write amplification
Major	Admin or threshold	Rewrite entire database into one sorted run
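At its core, compaction is a merge of sorted runs in which the newest value for each key wins. The sketch below illustrates that idea with the standard library's `heapq.merge`; the `compact` function is a hypothetical helper, and real engines additionally handle tombstones, snapshots, and level size targets.

```python
import heapq

def compact(runs_newest_first):
    """Merge sorted runs of (key, value) pairs into one sorted run.

    When a key appears in several runs, the value from the newest run is kept.
    """
    # Tag entries with the run's age (0 = newest) so equal keys sort newest-first.
    tagged = [
        [(key, age, value) for key, value in run]
        for age, run in enumerate(runs_newest_first)
    ]
    merged, last_key = [], object()  # sentinel that equals no real key
    for key, _age, value in heapq.merge(*tagged):
        if key != last_key:
            merged.append((key, value))
            last_key = key
    return merged

newer = [("a", 1), ("c", 30)]
older = [("a", 0), ("b", 2), ("c", 3)]
print(compact([newer, older]))  # [('a', 1), ('b', 2), ('c', 30)]
```

The merge streams through each input once, which is why compaction I/O is sequential even though it rewrites data.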

Write amplification in leveled compaction can be significant: a key may be rewritten each time it is merged into the next level, and merging into a level that is about 10× larger can rewrite it several times per level. For LevelDB/RocksDB with 7 levels and size ratio 10, write amplification can reach 30–50×.

O'Neil, P., et al., "The Log-Structured Merge-Tree (LSM-Tree)", Acta Informatica, 1996
RocksDB documentation, Write Amplification and Compaction Strategies

LSM-Tree Write Amplification

A single logical write in an LSM-Tree with leveled compaction across 6 levels and a 10× size ratio can be physically rewritten 30 or more times. At $\text{WA} = 30$, writing 1 GB of data physically writes 30 GB to disk. This affects SSD lifespan and total I/O bandwidth. Consider tiered compaction for write-heavy workloads where higher read latency is acceptable.
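The arithmetic above can be written out as a short sketch. The per-level rewrite count is a hypothetical tuning parameter chosen so that 6 levels give $\text{WA} = 30$; it is an assumption for illustration, not a measured figure.

```python
def leveled_wa_estimate(levels, rewrites_per_level):
    """Rough estimate: each level a key passes through rewrites it a few times."""
    return levels * rewrites_per_level

def physical_bytes_written(logical_bytes, write_amplification):
    """Bytes that actually reach the disk for a given logical write volume."""
    return logical_bytes * write_amplification

GB = 10**9
wa = leveled_wa_estimate(levels=6, rewrites_per_level=5)  # assumed per-level factor
print(wa)                                    # 30
print(physical_bytes_written(1 * GB, wa) / GB)  # 30.0
```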

Figure: B-Tree vs LSM-Tree vs Sequential Scan capability comparison. Relative performance across key dimensions (higher = better for that dimension).

Complexity Transition Analysis: $O(N) \rightarrow O(\log N)$

The fundamental question: when does an index actually help, and when does it hurt?

The Cross-Over Point

Consider a table of $N$ rows stored across $P = N / R$ pages (where $R$ = rows per page). The cost of sequential scan:

$C_{\text{seq}} = P \cdot t_{\text{seq}} = \frac{N}{R} \cdot t_{\text{seq}}$

The cost of B-Tree index scan returning $K$ rows:

$C_{\text{btree}} = h \cdot t_{\text{rand}} + K \cdot t_{\text{rand}}$

Where $h = O(\log_B N)$ is the tree height and $t_{\text{seq}}$ and $t_{\text{rand}}$ are sequential and random I/O times respectively.

Setting $C_{\text{seq}} = C_{\text{btree}}$ and solving for the selectivity threshold $\sigma = K / N$ :

$\sigma^* \approx \frac{1}{R} \cdot \frac{t_{\text{seq}}}{t_{\text{rand}}}$

Storage Type	$t_{\text{rand}} / t_{\text{seq}}$	Rows/Page ( $R$ )	Selectivity Threshold $\sigma^*$
HDD	~50	100	~0.02% (~1 in 5000 rows)
SATA SSD	~10	100	~0.1% (~1 in 1000 rows)
NVMe SSD	~2	100	~0.5% (~1 in 200 rows)

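Under this model, which assumes every matching row costs one uncached random I/O, an index scan on an HDD is only beneficial when retrieving fewer than about 0.02% of rows. On NVMe SSDs the window widens to about 0.5% because random I/O is much cheaper. Real planners see higher thresholds when matching rows cluster on the same pages or when pages are already cached.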
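The thresholds in the table follow directly from $\sigma^* \approx \frac{1}{R} \cdot \frac{t_{\text{seq}}}{t_{\text{rand}}}$. A minimal sketch, using the table's own ratios and $R = 100$; `selectivity_threshold` is a hypothetical helper name.

```python
def selectivity_threshold(rows_per_page, rand_to_seq_ratio):
    """sigma* ~= (1 / R) * (t_seq / t_rand), ignoring the tree-height term."""
    return 1.0 / (rows_per_page * rand_to_seq_ratio)

for storage, ratio in [("HDD", 50), ("SATA SSD", 10), ("NVMe SSD", 2)]:
    sigma = selectivity_threshold(100, ratio)
    print(f"{storage}: {sigma:.2%} of rows (about 1 in {round(1 / sigma)})")
```

Changing `rows_per_page` shows a second effect of the model: wider rows (fewer per page) raise the threshold, because the sequential scan has more pages to read.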

Big-O Summary

$\boxed{\text{Sequential Scan: } O(N) \text{ - always, with no overhead}}$

$\boxed{\text{B-Tree Point Lookup: } O(\log_B N) \text{ plus } O(K) \text{ random I/Os for heap fetches}}$

$\boxed{\text{LSM-Tree Point Lookup: } O(\log_B N) \text{ best case, } O(N) \text{ worst case without bloom filters}}$

$\boxed{\text{B-Tree Range Scan: } O(\log_B N + K) \text{ - logarithmic entry + linear scan of leaves}}$

$\boxed{\text{LSM-Tree Write: } O(1) \text{ amortized - constant-time in-memory insert}}$

When Each Structure Wins: Decision Matrix

Scenario	Best Access Path	Why
Point query on primary key	B-Tree (clustered)	$O(\log N)$ , no extra heap fetch
Range query (e.g., date range)	B-Tree	Linked leaf pages enable sequential scan of range
High-volume inserts (>10K/sec)	LSM-Tree	Writes are $O(1)$ memory inserts, no page splits
Full-table analytic aggregation	Sequential Scan	Sequential I/O is optimal for reading all rows
Key-value point lookup	LSM + Bloom Filter	Bloom filter eliminates most false SSTable reads
Small table (<1000 rows)	Sequential Scan	Index overhead > scan cost

MySQL 8.0 Reference Manual and PostgreSQL 16 Documentation, Query Planner Cost Model

Evolution of Database Indexing Structures

ISAM Introduced

1970

IBM's Indexed Sequential Access Method, a static two-level tree structure. It laid the groundwork for ordered indexing, but couldn't handle dynamic inserts without overflow chains.

B-Tree Invented

1972

Bayer & McCreight publish "Organization and Maintenance of Large Ordered Indices." The self-balancing multi-way tree becomes the foundation of most modern database indexing.

B+Tree Formalized

1979

Variants with data only in leaves and linked leaf pointers appear, becoming the de facto standard in System R, Oracle, and later PostgreSQL and MySQL.

LSM-Tree Paper Published

1996

O'Neil, Cheng, Gollnick, & O'Neil publish the LSM-Tree paper in Acta Informatica, introducing the write-optimized paradigm that powers Bigtable, Cassandra, RocksDB, and LevelDB.

Bigtable & SSTables

2006

Google publishes the Bigtable paper, popularizing SSTables + LSM architecture at web scale. Inspires Cassandra (2008), HBase (2008), and LevelDB (2011).

RocksDB & Hybrid Engines

2016+

RocksDB (Facebook) optimizes LSM compaction (leveled, tiered, FIFO). New hybrids emerge: WiredTiger (MongoDB) offers both B-Tree and LSM modes.

-- Create a B-Tree index
CREATE INDEX idx_users_email ON users (email);

-- Inspect the plan the planner actually chooses (EXPLAIN does not force index usage)
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE email = 'alice@example.com';

-- Expected: Index Scan using idx_users_email
-- Planner cost estimate (arbitrary units, not time), e.g. cost=0.29..8.51 for a 1M-row table; exact values vary
Key: PostgreSQL uses B+Trees by default. The EXPLAIN output shows Index Scan for point queries and Index Scan Backward for ORDER BY ... DESC.

Database Indexing Mechanics: Key Concepts

Question: What is the time complexity of a sequential table scan?

Answer: $O(N)$, because every row must be examined. Sequential I/O is fast, but the linear dependence on table size makes this expensive for large tables with selective queries.

Knowledge Check

A B+Tree with branching factor 200 and 100 million rows has approximately how many levels?
