Coursify

System Design for Software Engineers

Database Replication: Availability and Read Scaling

Database Replication: Availability and Read Scaling

As your traffic grows, a single database server becomes a bottleneck. Even if you have enough CPU and RAM, a single disk can only perform so many I/O operations per second. Replication is the practice of keeping a copy of the same data on multiple machines.

Replication solves two major problems:

  1. High Availability: If the primary server fails, a replica can take over.
  2. Read Scalability: You can distribute read-only queries across multiple "read replicas."

Replication Architectures

Leader-Follower (Master-Slave)

This is the most common architecture. All writes go to a single Leader. The leader then sends these updates to one or more Followers.

  • Usage: Clients send writes to the Leader and reads to any Follower.
  • Pros: Simple to reason about, no write conflicts.
  • Cons: If the leader fails, you must promote a follower (Failover), which can be complex.

Multi-Leader

Multiple nodes can accept writes. They then replicate their changes to each other.

  • Usage: Useful for multi-region deployments where you want users to write to their nearest data center.
  • Pros: Can survive a whole data center outage.
  • Cons: High complexity; you must handle Write Conflicts (when two users update the same record in different regions).

Leaderless (Dynamo-style)

Every node can accept both reads and writes. Examples include Cassandra and Riak.

  • Usage: Often used for high-write-throughput systems.
  • Pros: High availability and fault tolerance.
  • Cons: Complex consistency models (Quorums).

Setting Up a Read-Heavy Architecture

  1. 1
    Step 1

    Check your application logs. If your reads outnumber writes (e.g., 100:1), you are a perfect candidate for read replicas.

  2. 2
    Step 2

    Create a new database instance with the same hardware specifications as your leader. Most managed services (like AWS RDS or Google Cloud SQL) allow you to do this with one click.

  3. 3
    Step 3

    Configure the leader to stream its transaction logs (e.g., binlog in MySQL) to the follower. Most production systems use Asynchronous replication to avoid slowing down writes on the leader.

  4. 4
    Step 4

    Modify your application code to use two different connection strings: one for the Leader (writes) and a load-balanced set for the Followers (reads).

  5. 5
    Step 5

    Be aware that followers might be a few milliseconds (or seconds) behind the leader. If a user updates their profile and immediately refreshes, they might see the old data if the read goes to a lagging follower.

Synchronous vs. Asynchronous Replication

  • Synchronous: The leader waits for the follower to confirm it received the write before telling the client "Success."
    • Pro: Guaranteed consistency (no data loss if leader fails).
    • Con: High latency; if the follower is slow or the network flickers, the whole system stops.
  • Asynchronous: The leader writes locally and immediately tells the client "Success," then sends the data to the follower in the background.
    • Pro: Very fast; system stays available even if followers are down.
    • Con: Potential data loss; if the leader crashes before the follower receives the update, that data is gone.

Common Mistakes

  • Reading Your Own Writes: A user posts a comment (write to leader) and is redirected to the page (read from follower). If the follower hasn't caught up, the comment is missing. Solution: Route the first read after a write to the leader.
  • Follower Overload: Sending too many complex analytical queries to a single follower, causing it to lag further and further behind.
  • Cascading Failures: If the leader fails, and the newly promoted follower isn't powerful enough to handle the traffic, it will also crash, taking the whole system down.

Recap

  • Replication increases availability and allows you to scale reads.
  • Leader-Follower is the standard; Multi-Leader is for multi-region; Leaderless is for high-write scale.
  • Asynchronous replication is faster but risks data loss during failover.
  • Always account for Replication Lag in your application logic.

Knowledge Check

Question 1 of 3
Q1Single choice

What is the primary benefit of adding 'Read Replicas' to a database cluster?