Database Indexing Mechanics: B-Trees, LSM-Trees, and Sequential Scans

Database Indexing Mechanics: B-Trees, LSM-Trees, and Sequential Scans

Verified Sources
Jun 18, 2026

Database Indexing Mechanics: From O(N)O(N) to O(logN)O(\log N)

Understanding how databases retrieve data efficiently is one of the most critical skills in backend engineering. At its core, database indexing is about avoiding full-table scans — transitioning a query's time complexity from O(N)O(N) (examining every row) to O(logN)O(\log N) (narrowing the search space exponentially per step) or even O(1)O(1) in specialized cases.

This deep-dive explores three fundamental access patterns that underpin virtually every modern database engine:

Access PatternStructureRead ComplexityWrite ComplexityBest For
Sequential ScanRaw heap / unordered tableO(N)O(N)O(1)O(1) appendFull-table analytics, small tables
B-Tree IndexBalanced multi-way search treeO(logN)O(\log N)O(logN)O(\log N)OLTP, point lookups, range queries
LSM-TreeSorted runs + memtableO(logN)O(\log N) point; O(N)O(N) worst-caseO(1)O(1) amortizedWrite-heavy workloads, time-series

The key insight: no single structure dominates all workloads. Engineering the right indexing strategy requires understanding the structural mechanics and complexity trade-offs of each approach.

B-Trees vs LSM Trees: How Databases Actually Store Your Data

The Sequential Table Scan: O(N)O(N) Baseline

A sequential scan is the simplest — and often the most expensive — access path. The database reads every page of the heap table, evaluating each row against the query predicate.

Why Sequential Scans Still Matter

Despite their O(N)O(N) complexity, sequential scans are not always wrong:

  1. Small tables: If a table fits in a few pages, the overhead of index traversal (random I/O) exceeds the cost of scanning all pages (sequential I/O). PostgreSQL's query planner, for instance, has a built-in random_page_cost threshold and will prefer sequential scans for small datasets .

  2. High selectivity queries: When a query returns a large percentage of rows (e.g., >10–15%), reading the index + random heap fetches becomes more expensive than just scanning everything sequentially. This is because each index-based row access requires a separate random I/O, whereas a sequential scan reads contiguous blocks .

  3. No useful index exists: If no index matches the query's filter columns, there is no alternative.

I/O Cost Model

The fundamental comparison:

Cseq=NPCseq_ioC_{\text{seq}} = \frac{N}{P} \cdot C_{\text{seq\_io}}

Cindex=logB(N)Crand_io+SCrand_ioC_{\text{index}} = \log_B(N) \cdot C_{\text{rand\_io}} + S \cdot C_{\text{rand\_io}}

Where NN = number of rows, PP = rows per page, SS = number of matching rows, BB = branching factor, Cseq_ioC_{\text{seq\_io}} = cost of sequential I/O, and Crand_ioC_{\text{rand\_io}} = cost of random I/O.

Since random I/O is typically 10–100× slower than sequential I/O on HDDs (and still significant on SSDs), there exists a selectivity threshold beyond which the full scan wins.

Footnotes

  1. PostgreSQL Documentation, Chapter 52 - Overview of PostgreSQL Internals

  2. MySQL 8.0 Reference Manual - Optimization and Indexes

The Selectivity Trap

Adding an index does NOT guarantee faster queries. If your query matches a large fraction of rows, the index introduces random I/O overhead that actually slows down the query compared to a sequential scan. Always check your EXPLAIN plans — databases like PostgreSQL and MySQL will correctly choose sequential scans when they're cheaper.

B-Tree Indexing: From O(N)O(N) to O(logN)O(\log N)

The B-Tree is the most ubiquitous index structure in relational databases — used by PostgreSQL, MySQL (InnoDB), SQLite, Oracle, and SQL Server. It provides O(logN)O(\log N) lookups, insertions, and deletions.

Structural Anatomy

Unlike a binary search tree (which has branching factor 2), B-Trees have a high branching factor (typically 100–1000+), meaning the tree is extremely shallow. A B-Tree with branching factor BB and NN keys has height:

hlogB(N)h \leq \lceil \log_B(N) \rceil

For a table with 1 billion rows and branching factor 500:

hlog500(109)4 levelsh \leq \lceil \log_{500}(10^9) \rceil \approx 4 \text{ levels}

This means only 4 disk reads are needed to find any row — the difference between seconds and microseconds.

PropertyValue
Node sizeTypically matches disk page (4KB–16KB)
Branching factor (BB)100–1000+ depending on key size
Height for 1B rows~3–4 levels
Point lookup costO(logBN)O(\log_B N) disk reads
Range scan costO(logBN+K)O(\log_B N + K) where KK = result rows
Insert/Delete costO(logBN)O(\log_B N) + possible node split/merge

B-Tree Mechanics: Search, Insert, Delete

Search: Start at the root. At each node, perform a binary search within the node's sorted keys to select the correct child pointer. Descend until a leaf node is reached.

Insert: Traverse to the correct leaf. If the leaf has room, insert the key. If it's full, split the leaf into two halves and propagate the median key upward. This may cause cascading splits up to the root (which grows the tree by one level) .

Delete: Find the key, remove it. If the node falls below minimum occupancy, merge with a sibling or redistribute keys from a sibling. This may cascade upward.

B+Tree: The Practical Variant

Most databases actually implement B+Trees, not pure B-Trees. The critical differences:

  • Internal nodes contain only keys (no data) — maximizing branching factor
  • All data is in the leaf level — making scans uniform
  • Leaf nodes are linked in a doubly-linked list — enabling efficient range scans without re-traversing the tree

Range scan cost:O(logBN+K/P)\text{Range scan cost}: O(\log_B N + K/P)

Where KK is the number of matching rows and PP is the page size. The linked-list structure means once you reach the first matching leaf, you scan contiguous leaves — no more tree traversals .

Footnotes

  1. Database Internals, Alex Petrov, O'Reilly 2019 — Chapter on B-Tree Variants 2

B-Tree Point Lookup Algorithm

  1. 1
    Step 1

    Load the root page from disk. This is the first I/O operation. The root is typically cached in memory after the first access.

  2. 2
    Step 2

    Perform binary search O(logB)O(\log B) on the sorted keys within the current node to find the child pointer that falls within the target key's range. With branching factor B500B \approx 500, this is typically 9 comparisons.

  3. 3
    Step 3

    Follow the identified child pointer, loading the next page from disk. Repeat search within this node.

  4. 4
    Step 4

    After h=O(logBN)h = O(\log_B N) levels, arrive at the leaf containing the target key. Total disk reads: O(logBN)O(\log_B N) — typically 2–4 for even massive tables.

  5. 5
    Step 5

    The leaf contains the key and a heap pointer (CTID in PostgreSQL, primary key in InnoDB's clustered index). Perform one additional random I/O to retrieve the full row from the heap.

LSM-Trees: Engineering for Write-Heavy Workloads

The LSM-Tree (Log-Structured Merge Tree) was introduced by Patrick O'Neil et al. in 1996 as a fundamentally different approach: optimize for writes, not reads .

The Write Problem with B-Trees

B-Trees are read-optimized but write-hostile. Every insert, update, or delete requires:

  1. Traversing O(logBN)O(\log_B N) levels to find the target leaf
  2. Loading the page (random I/O)
  3. Potentially triggering page splits, rebalancing, and WAL logging
  4. Each write incurs random I/O — the worst case for disk performance

For write-heavy workloads (time-series data, event logs, IoT telemetry), this random I/O pattern becomes the bottleneck.

LSM-Tree Architecture: The Write Path

LSM-Trees flip this model by making writes sequential:

ComponentStorageData StructureRole
MemTableMemorySkip list / RB-treeAbsorb writes at memory speed
WAL (Write-Ahead Log)Disk (sequential)Append-only logDurability guarantee
SSTableDisk (sequential)Sorted flat filePersistent sorted data
Bloom FilterMemoryProbabilistic bit arraySpeed up point lookups

Write Complexity: O(1)O(1) amortized — writes go to the in-memory MemTable (no disk I/O for most writes). The MemTable is flushed to disk as a new SSTable only when it reaches a size threshold .

The Read Path: Where Complexity Increases

While writes are cheap, reads become more complex. To find a key:

  1. Check the MemTable (memory): O(logNmem)O(\log N_{\text{mem}})
  2. Search Level 0 SSTables (newest to oldest): O(L0Nsstable)O(L_0 \cdot N_{\text{sstable}}) — each L0 SSTable may overlap
  3. Search Level 1+ SSTables: O(logN)O(\log N) per level (one SSTable per level due to key range partitioning)
  4. Use Bloom filters to skip SSTables that definitely don't contain the key

Worst-case read cost:O(L0+logBNLmax)\text{Worst-case read cost}: O(L_0 + \log_{B} N \cdot L_{\max})

Where L0L_0 is the number of un-compacted SSTables and LmaxL_{\max} is the maximum level. Without bloom filters, this can degrade toward O(N)O(N) — a critical concern.

Compaction: The Background Garbage Collector

Compaction is the mechanism that keeps read performance from degrading:

Compaction TypeTriggerEffect
Minor (flush)MemTable fullWrite one new L0 SSTable
Leveled compactionL0 or Lii size exceeds thresholdMerge overlapping SSTables, push to Li+1i+1
Tiered compactionLevel too largeMerge entire tier at once, less write amplification
MajorAdmin or thresholdRewrite entire database into one sorted run

Write amplification in leveled compaction can be significant: a single write may be re-written LmaxL_{\max} times during compaction. For LevelDB/RocksDB with 7 levels and size ratio 10, write amplification can reach 30–50× .

Footnotes

  1. O'Neil, P., et al. — "The Log-Structured Merge-Tree (LSM-Tree)", Acta Informatica, 1996 2

  2. RocksDB documentation — Write Amplification and Compaction Strategies

LSM-Tree Write Amplification

A single logical write in an LSM-Tree with leveled compaction across 6 levels and a 10× size ratio can be physically rewritten up to 30+ times. At WA=30\text{WA} = 30, writing 1 GB of data physically writes 30 GB to disk. This impacts SSD lifespan and total I/O bandwidth. Consider tiered compaction for write-heavy workloads where read latency is acceptable.

B-Tree vs LSM-Tree vs Sequential Scan: Capability Comparison

Relative performance across key dimensions (higher = better for that dimension)

Complexity Transition Analysis: O(N)O(logN)O(N) \rightarrow O(\log N)

The fundamental question: when does an index actually help, and when does it hurt?

The Cross-Over Point

Consider a table of NN rows stored across P=N/RP = N / R pages (where RR = rows per page). The cost of sequential scan:

Cseq=Ptseq=NRtseqC_{\text{seq}} = P \cdot t_{\text{seq}} = \frac{N}{R} \cdot t_{\text{seq}}

The cost of B-Tree index scan returning KK rows:

Cbtree=htrand+KtrandC_{\text{btree}} = h \cdot t_{\text{rand}} + K \cdot t_{\text{rand}}

Where h=O(logBN)h = O(\log_B N) is the tree height and tseqt_{\text{seq}} and trandt_{\text{rand}} are sequential and random I/O times respectively.

Setting Cseq=CbtreeC_{\text{seq}} = C_{\text{btree}} and solving for the selectivity threshold σ=K/N\sigma = K / N:

σ1Rtseqtrand\sigma^* \approx \frac{1}{R} \cdot \frac{t_{\text{seq}}}{t_{\text{rand}}}

Storage Typetrand/tseqt_{\text{rand}} / t_{\text{seq}}Rows/Page (RR)Selectivity Threshold σ\sigma^*
HDD~50100~0.02% (~1 in 5000 rows)
SATA SSD~10100~0.1% (~1 in 1000 rows)
NVMe SSD~2100~0.5% (~1 in 200 rows)

On HDDs, an index scan is only beneficial when retrieving fewer than 0.02% of rows. On NVMe SSDs, the advantage window widens significantly because random I/O is much faster .

Big-O Summary

Sequential Scan: O(N) - always, with no overhead\boxed{\text{Sequential Scan: } O(N) \text{ - always, with no overhead}}

B-Tree Point Lookup: O(logBN) - but +O(K) random I/Os for fetches\boxed{\text{B-Tree Point Lookup: } O(\log_B N) \text{ - but } + O(K) \text{ random I/Os for fetches}}

LSM-Tree Point Lookup: O(logBN) best case, O(N) worst case without bloom filters\boxed{\text{LSM-Tree Point Lookup: } O(\log_B N) \text{ best case, } O(N) \text{ worst case without bloom filters}}

B-Tree Range Scan: O(logBN+K) - logarithmic entry + linear scan of leaves\boxed{\text{B-Tree Range Scan: } O(\log_B N + K) \text{ - logarithmic entry + linear scan of leaves}}

LSM-Tree Write: O(1) amortized - constant-time in-memory insert\boxed{\text{LSM-Tree Write: } O(1) \text{ amortized - constant-time in-memory insert}}

When Each Structure Wins: Decision Matrix

ScenarioBest Access PathWhy
Point query on primary keyB-Tree (clustered)O(logN)O(\log N), no extra heap fetch
Range query (e.g., date range)B-TreeLinked leaf pages enable sequential scan of range
High-volume inserts (>10K/sec)LSM-TreeWrites are O(1)O(1) memory inserts, no page splits
Full-table analytic aggregationSequential ScanSequential I/O is optimal for reading all rows
Key-value point lookupLSM + Bloom FilterBloom filter eliminates most false SSTable reads
Small table (<1000 rows)Sequential ScanIndex overhead > scan cost

Footnotes

  1. MySQL 8.0 Reference Manual and PostgreSQL 16 Documentation — Query Planner Cost Model

Evolution of Database Indexing Structures

ISAM Introduced

1970

IBM's Indexed Sequential Access Method — a static two-level tree structure. Lay the groundwork for ordered indexing, but couldn't handle dynamic inserts without overflow chains."

B-Tree Invented

1972

Bayer & McCreight publish 'Organization and Maintenance of Large Ordered Indices.' The self-balancing multi-way tree becomes the foundation of all modern database indexing."

B+Tree Formalized

1979

Variants with data only in leaves and linked leaf pointers appear, becoming the de-facto standard in System R, Oracle, and later PostgreSQL and MySQL."

LSM-Tree Paper Published

1996

O'Neil, Cheng, Gollnick, & O'Neil publish the LSM-Tree paper in Acta Informatica, introducing the write-optimized paradigm that powers Bigtable, Cassandra, RocksDB, and LevelDB."

Bigtable & SSTables

2006

Google publishes the Bigtable paper, popularizing SSTables + LSM architecture at web scale. Inspires Cassandra (2008), HBase (2008), and LevelDB (2011)."

RocksDB & HyperDatabases

2016+

RocksDB (Facebook) optimizes LSM compaction (leveled, tiered, FIFO). New hybrids emerge: WiredTiger (MongoDB) offers both B-Tree and LSM modes."

1-- Create a B-Tree index 2CREATE INDEX idx_users_email ON users (email); 3 4-- Force index usage to see the plan 5EXPLAIN (ANALYZE, BUFFERS) 6SELECT * FROM users WHERE email = 'alice@example.com'; 7 8-- Expected: Index Scan using idx_users_email 9-- Cost: O(log N) ≈ 0.29..8.51 for 1M row table

Key: PostgreSQL uses B+Trees by default. The EXPLAIN output shows Index Scan for point queries and Index Scan Backward for ORDER BY ... DESC.

Advanced Topics & Edge Cases

Database Indexing Mechanics — Key Concepts

1 / 7
14%
Question · Term

What is the time complexity of a sequential table scan?

Click to reveal
Answer · Definition

O(N)O(N) — every row must be examined. Sequential I/O is fast, but the linear dependence on table size makes this expensive for large tables with selective queries.

Knowledge Check

Question 1 of 5
Q1Single choice

A B+Tree with branching factor 200 and 100 million rows has approximately how many levels?

References