Coursify

System Design for Software Engineers

Database Sharding and Partitioning

Database Sharding and Partitioning

While replication helps with reads, it doesn't help with writes (since all writes still go to the single leader). To scale writes and handle datasets larger than a single server can store, you must use Partitioning (also known as Sharding).

Sharding is the process of breaking up a large database into smaller, more manageable parts called Shards. Each shard is its own independent database, and together they represent the whole dataset.

Partitioning vs. Sharding

  • Vertical Partitioning: Splitting a table by columns. For example, moving a large blob column (like user_bio) to a separate table/server to keep the main users table lean.
  • Horizontal Partitioning (Sharding): Splitting a table by rows. For example, users with IDs 1-1000 go to Shard A, and users with IDs 1001-2000 go to Shard B.

Sharding Strategies

Choosing the right Sharding Key is the most important part of the process:

  1. Range-Based Sharding: Assigns data based on a range of values (e.g., User IDs or Alphabetical).
    • Pro: Easy to implement.
    • Con: Leads to "Hot Spots" (e.g., if all new users are active, Shard C might be overloaded while Shard A is idle).
  2. Hash-Based Sharding: Applies a hash function to the sharding key (e.g., hash(user_id) % number_of_shards) to determine the shard.
    • Pro: Evenly distributes data.
    • Con: Hard to add new shards later (requires "Resharding" the entire dataset).
  3. Directory-Based Sharding: A lookup table tracks which shard holds which data.
    • Pro: Extremely flexible.
    • Con: The lookup table itself becomes a single point of failure and a bottleneck.

Planning Your Sharding Strategy

  1. 1
    Step 1

    Monitor your database's write IOPS and disk usage. If you are consistently above 80% capacity and vertical scaling is no longer cost-effective, it's time to shard.

  2. 2
    Step 2

    Choose a column that is present in most queries and has high cardinality (many unique values). A user_id or tenant_id is often a good choice. Avoid keys like gender or country which lead to massive, unbalanced shards.

  3. 3
    Step 3

    For most high-scale systems, Hash-Based Sharding is preferred because it prevents hotspots. Use Consistent Hashing if you expect to add or remove shards frequently in the future.

  4. 4
    Step 4

    Once data is sharded, you can no longer perform SQL JOINs across shards. You must either denormalize your data (copying data so it's all in one shard) or perform the join in your application code.

  5. 5
    Step 5

    Use a proxy like Vitess (for MySQL) or Citus (for PostgreSQL), or build the routing logic into your application's data access layer. The application must know which shard to talk to for every request.

The Challenges of Sharding

Sharding is "the nuclear option" of database scaling. It adds massive complexity:

  • Resharding: If your data grows beyond your initial shards, moving data between shards while the system is live is incredibly difficult.
  • Complexity: Your application code becomes much more complex as it has to handle multiple database connections and aggregate results.
  • Transactions: Distributed transactions across multiple shards are slow and often avoided in high-scale systems.

Common Mistakes

  • Sharding Too Early: Sharding adds so much complexity that you should only do it when you've exhausted all other options (caching, read replicas, vertical scaling).
  • Poor Key Choice: Picking a key that results in one shard being much larger or busier than the others (the "Celebrity Problem" or "Hotspot").
  • Ignoring Operational Overhead: Thinking sharding is just a software change. It requires significant changes to your backup, monitoring, and deployment pipelines.

Recap

  • Sharding splits a database horizontally by rows into independent nodes.
  • Choosing the right Sharding Key is critical for even distribution.
  • Hash-based sharding is common for evenness; Consistent hashing helps with resharding.
  • Sharding breaks JOINs and Transactions, requiring application-level workarounds.

Knowledge Check

Question 1 of 3
Q1Single choice

What is the primary reason to use Database Sharding?