Database Scaling And Reliability Strategies

April 20, 2025 4-minute read

software engineering • Scalability

Replication

Copying the same data from one database to one or more replicas (add redundancy).

Types

Primary - Replica

One primary DB that handle all writes (INSERT, UPDATE, DELETE) and one or more replica DBs that handle the reads (SELECT). Replicas get copied data from primary DB.

The Primary DB sends Changelogs (WALs) to the replicas, replicas apply changes to stay up to date.

Why use it?

if you have a lot of reads, replicas can help distribute the load instead of reading from single db instances.
if the primary Db crashes, replicas can take over.

Pros:

Scale Reads.
High Availability (even when primary DB fails you can promotes one of replica to primary DB).
Backups can be taken from Replica.

Cons:

Can’t scale Writes (all writes still goes to primary DB).
Data Lag (Eventually consistent) -> sometimes Data in replica is not up to date with primary.
If Primary DB crashes failovers takes time.

Failovers:

Detection: Monitoring detect a Primary DB failure.
Promotion: One of the replica is promoted to Primary.
Reconfiguration: Other replicas point the the new promoted DB.

Tools for Failovers Automation:

MySQL: MHA and Orchestrator.
PostgreSQL: Patroni and repmgr .

Active - Active (Primary-Primary)

Multiple Primary DBs each of them can accept both writes and reads, they sync with each other to stay up to date.

Why use it?

Write intensive Application
Geographically distributed Application

Pros:

Scale Writes across locations/regions.
No SPOF -> if one fails, others still works
lower latency for global app

Cons:

Conflicts -> what if 2 or more node update the same row (require conflict resolutions)

Conflict Resolutions: Because multiple nodes can accept write operations, conflicts can occur when the same row is modified concurrently on different primaries.

Conflict Resolution Strategies:

Last Write Wins (LWW).
Application Defined (Custom)
Conflict-Free Replicated Data Types (CRDTs)

Strategy

Synchronous

Transactions on primary DB declared complete only when those changes have been replicated to all replicas, this ensures that all replicas have the latest data.

Pros:
- Strong data consistency.
- No data loss in case of primary failure.
Cons:
- Increased latency due to waiting for acks.
- Potential performance bottlenecks. Use when you can’t risk data loss (e.g., banking, inventory) and you’re okay with slightly slower write performance.

Asynchronous

As the name suggest, primary DB can commits transactions without waiting for replicas.

Pros:
- Lower latency and higher throughput.
- Better performance for write-heavy applications.
Cons:
- Risk of data loss if the primary fails before replication.
- Replicas may serve stale data. Use when you care more about speed than perfect sync.

Semi-synchronous:

Primary Db waits until changes have been replicated to at least one replica.

Summary

Type (Strategy)	Who handles writes?	Consistency	Pros	Cons	Best Use
Primary - Replica (Async)	Only one primary	Eventual	Simple, fast reads	Stale replicas, primary bottleneck	Read-heavy apps
Primary - Replica (Sync)	Only one primary	Strong	Safe writes	Slower reads, blocking	Critical data
Primary-Primary	Multiple Primary	Needs conflict resolution	Fast, no SPOF	Complex, conflict risk	Multi-region writes

Sharding

Splitting database horizontally across servers, where each server holds a subset of data. manage multiple independent databases (shards). App or middleware decide where to read/write.

Why use it?

When data is too big on 1 server.
To scale out and reduce load.
Optimize User geo-loc. Pros:

Linearly scales with number of shards.
Each shard is smaller -> better performance
Independent failure/recovery of shards. Cons:
Complex queries (e.g., joins) across shards are difficult (or in some cases impossible).
Application or middleware must route queries.
Data balancing and shard rebalancing is difficult.
Transactions across shards are complex (need 2 phase commit)/

Choosing Shard key:

Ideal shard key properties:

Property	Why it Matters
High Cardinality	Avoid putting too much data in one shard
Even Distribution	Balance load and storage across all shards
Frequently Queried	So that queries know which shard to go to
Stable	Shouldn’t change often (don’t use usernames/emails)
Locality (if needed)	If you want region-aware routing (geo-sharding)

Problems

*Celebrity problem (Hotspot):
- Unbalance access to certain key (for example social apps where a single celebrity can have millions of reads per day)
  - one way to solve this by allocate each shard to single celebrity (and further partition it if needed)
Resharding Data
- Needed when a single shard no longer hold more data, solution: Consistent Hashing

A celebrity user causes huge amount of writes/reads, all their data goes to 1 shard -> This shard becomes a hotspot.

Replication

Types

Primary - Replica

Active - Active (Primary-Primary)

Strategy

Synchronous

Asynchronous

Semi-synchronous:

Summary

Sharding

Choosing Shard key:

Problems

References