Database Scaling And Reliability Strategies


Replication

Copying the same data from one database to one or more replicas (add redundancy).

Types

Primary - Replica

One primary DB that handle all writes (INSERT, UPDATE, DELETE) and one or more replica DBs that handle the reads (SELECT). Replicas get copied data from primary DB.

The Primary DB sends Changelogs (WALs) to the replicas, replicas apply changes to stay up to date.

Why use it?

  1. if you have a lot of reads, replicas can help distribute the load instead of reading from single db instances.
  2. if the primary Db crashes, replicas can take over.

Pros:

  1. Scale Reads.
  2. High Availability (even when primary DB fails you can promotes one of replica to primary DB).
  3. Backups can be taken from Replica.

Cons:

  1. Can’t scale Writes (all writes still goes to primary DB).
  2. Data Lag (Eventually consistent) -> sometimes Data in replica is not up to date with primary.
  3. If Primary DB crashes failovers takes time.

Failovers:

  1. Detection: Monitoring detect a Primary DB failure.
  2. Promotion: One of the replica is promoted to Primary.
  3. Reconfiguration: Other replicas point the the new promoted DB.

Tools for Failovers Automation:

Active - Active (Primary-Primary)

Multiple Primary DBs each of them can accept both writes and reads, they sync with each other to stay up to date.

Why use it?

  1. Write intensive Application
  2. Geographically distributed Application

Pros:

  1. Scale Writes across locations/regions.
  2. No SPOF -> if one fails, others still works
  3. lower latency for global app

Cons:

  1. Conflicts -> what if 2 or more node update the same row (require conflict resolutions)

Conflict Resolutions: Because multiple nodes can accept write operations, conflicts can occur when the same row is modified concurrently on different primaries.

Conflict Resolution Strategies:

  • Last Write Wins (LWW).
  • Application Defined (Custom)
  • Conflict-Free Replicated Data Types (CRDTs)

Strategy

Synchronous

Transactions on primary DB declared complete only when those changes have been replicated to all replicas, this ensures that all replicas have the latest data.

  • Pros:
    • Strong data consistency.
    • No data loss in case of primary failure.
  • Cons:
    • Increased latency due to waiting for acks.
    • Potential performance bottlenecks. Use when you can’t risk data loss (e.g., banking, inventory) and you’re okay with slightly slower write performance.

Asynchronous

As the name suggest, primary DB can commits transactions without waiting for replicas.

  • Pros:
    • Lower latency and higher throughput.
    • Better performance for write-heavy applications.
  • Cons:
    • Risk of data loss if the primary fails before replication.
    • Replicas may serve stale data. Use when you care more about speed than perfect sync.

Semi-synchronous:

Primary Db waits until changes have been replicated to at least one replica.

Summary

Type (Strategy) Who handles writes? Consistency Pros Cons Best Use
Primary - Replica (Async) Only one primary Eventual Simple, fast reads Stale replicas, primary bottleneck Read-heavy apps
Primary - Replica (Sync) Only one primary Strong Safe writes Slower reads, blocking Critical data
Primary-Primary Multiple Primary Needs conflict resolution Fast, no SPOF Complex, conflict risk Multi-region writes

Sharding

Splitting database horizontally across servers, where each server holds a subset of data. manage multiple independent databases (shards). App or middleware decide where to read/write.

Why use it?

  • When data is too big on 1 server.
  • To scale out and reduce load.
  • Optimize User geo-loc. Pros:
  • Linearly scales with number of shards.
  • Each shard is smaller -> better performance
  • Independent failure/recovery of shards. Cons:
  • Complex queries (e.g., joins) across shards are difficult (or in some cases impossible).
  • Application or middleware must route queries.
  • Data balancing and shard rebalancing is difficult.
  • Transactions across shards are complex (need 2 phase commit)/

Choosing Shard key:

Ideal shard key properties:

Property Why it Matters
High Cardinality Avoid putting too much data in one shard
Even Distribution Balance load and storage across all shards
Frequently Queried So that queries know which shard to go to
Stable Shouldn’t change often (don’t use usernames/emails)
Locality (if needed) If you want region-aware routing (geo-sharding)

Problems

  • *Celebrity problem (Hotspot):
    • Unbalance access to certain key (for example social apps where a single celebrity can have millions of reads per day)
      • one way to solve this by allocate each shard to single celebrity (and further partition it if needed)
  • Resharding Data
    • Needed when a single shard no longer hold more data, solution: Consistent Hashing

A celebrity user causes huge amount of writes/reads, all their data goes to 1 shard -> This shard becomes a hotspot.


References

  1. https://www.enterprisedb.com/postgres-tutorials/postgresql-replication-and-automatic-failover-tutorial
  2. https://www.enjoyalgorithms.com/blog/master-slave-replication-databases