Scaling
What to do when one server isn't enough
Your side project hits the front page of a forum. Forty thousand strangers arrive in ten minutes. The single server that served your first hundred users politely melts. Now what — a bigger server, or more of them?
That question has exactly two answers, and they're not mutually exclusive. We'll cover both, then the three pieces of machinery (load balancers, replication, and sharding) that make the "more of them" answer survive contact with reality.
Vertical vs horizontal scaling, load balancers to spread traffic, replication to scale reads and survive failures, and sharding to scale writes. Every concept ends with a QA testing lens: how a tester would probe or break it.
Vertical Scaling — get a bigger machine
The simplest fix: more CPU, more RAM, faster disks. Vertical scaling (scaling up) needs no code changes — you just rent a beefier box.
It's attractive because it's simple — your app never has to think about coordinating across machines, and one big server handles a surprising amount of traffic. Plenty for a startup.
But there's a hard wall:
A physical ceiling — the biggest instance is still finite
Cost grows disproportionately at the top end
A single point of failure — that one box goes down, everything goes down
Past a certain point, you stop buying bigger and start buying more.
QA Lens A vertically scaled system has one number that matters: where the ceiling is. Load-test to find it before production does — push traffic until something saturates and note which resource gives first (CPU? memory? disk I/O? connection pool?). Then test the upgrade itself: moving to a bigger box usually means downtime or a failover, and "we'll just resize the instance" plans have a habit of being tested for the first time during an incident.
Horizontal Scaling — add more machines
Horizontal scaling (scaling out) spreads load across many servers, with no theoretical upper limit. It's how every internet-scale system runs: Google, Netflix, and Amazon sit behind thousands of servers, and adding capacity is just spinning up another instance.
The catch is that your application has to be built for it:
- Zero code changes
- No distributed-systems complexity
- Hard ceiling on size
- Single point of failure
- No real upper limit
- Survives single-server failure
- Requires stateless servers + shared session store
- Needs a load balancer and a DB scaling plan
Horizontal scaling only works if any server can handle any request. That means servers must be stateless — session data lives in a shared store like Redis, never in one server's memory. The moment a user is "stuck" to the server that holds their login, you've broken the model.
QA Lens Statefulness hides until the second server appears. Run your tests against a two-node cluster, not one: log in on node A, send the next request to node B, and confirm the session survives. A test suite that always hits a single instance will pass right up until the day you scale out and users start getting randomly logged off.
Load Balancers — the traffic cop
A load balancer sits between clients and your server pool and decides who handles each request, so no single server drowns. It delivers two things at once: scalability (spread the load) and availability (route around dead servers).
Switch the algorithm and watch the per-server counters diverge — Round Robin spreads evenly,
Least Connections chases the lightest server, Weighted favors the bigger boxes (note the w3/w1
weights). Then click a server to kill it and watch the load balancer's health check reroute
traffic to the survivors. Kill them all and requests start failing.
It picks a target using an algorithm, and quietly drops servers that fail health checks:
| Algorithm | How it chooses |
|---|---|
| Round Robin | Next server in order, one after another |
| Least Connections | The server currently handling the fewest requests |
| Weighted | Beefier servers get a larger share of traffic |
| Health checks | Stop sending traffic to any server that stops responding |
Big systems stack them: one tier of load balancers for web servers, another for app servers, another in front of databases.
"We added a load balancer, so there's no single point of failure." The load balancer is one — until it's redundant itself. If that one Nginx box dies, every healthy server behind it is unreachable. Real deployments run load balancers in pairs with failover (or use a managed, inherently redundant one like an AWS ALB). Whenever you remove a SPOF, ask what you just created.
QA Lens
The load balancer's whole promise is "a dead server is invisible to users." Test it: kill a
backend mid-request and confirm traffic drains to healthy nodes with no failed requests.
Then test the inverse — a server that's slow but not dead. A naive health check sees a
200 and keeps feeding requests into a black hole.
Replication — copy data to scale reads and survive failure
Replication keeps copies of your data on multiple database servers. The common shape is primary–replica: all writes hit the primary, which streams changes to read-only replicas.
Most apps are read-heavy — think how often you scroll a feed versus post to it. Pointing reads at replicas offloads the primary and multiplies read capacity. Bonus: if a replica dies, reads shift to the others; if the primary dies, a replica is promoted to take over.
The price is replication lag — a brief delay before a write shows up on replicas:
A tweet appearing on someone's feed half a second late? Acceptable. A bank showing yesterday's balance after a withdrawal? Not acceptable. Read critical, must-be-fresh data from the primary; read everything else from replicas.
How much lag — and whether failover can lose data — depends on when the primary acknowledges a write:
| Mode | The primary says "done"… | Trade-off |
|---|---|---|
| Asynchronous | Immediately, then streams to replicas | Fast writes; a primary crash can lose the last unacked writes |
| Synchronous | Only after a replica confirms it has the write | No loss on failover; every write pays the replica round trip |
| Semi-sync | After one replica confirms | The common middle ground |
Most deployments default to async because write latency is visible to every user, while failover is rare. Just know the consequence you've signed up for: with async replication, "promote a replica" can mean quietly abandoning the last few committed writes.
QA Lens Replication lag is the source of maddening "flaky" tests. A test writes via the primary, reads via a replica a millisecond later, and intermittently sees stale data. Make the read path explicit in tests, and verify failover directly: kill the primary under write load and measure what happens — how long until a replica is promoted, and how many acknowledged writes were lost. With async replication some loss is expected; the test's job is to prove it stays within the bound the business agreed to (and that nobody assumed "zero" by default).
Sharding — split data to scale writes
Replication copies all the data to every node, which scales reads but not writes — every write still funnels through one primary. Sharding (also called horizontal partitioning — it splits by rows, where vertical partitioning splits by columns) divides the data so each node holds only a portion, distributing storage and write load.
Start on Range-based + Sequential keys and watch one shard catch fire 🔥 — monotonic keys (like auto-increment IDs) all fall into the newest range, so a single node takes every write. Now flip to Hash-based: the same keys scatter evenly. That's the whole shard-key lesson in one toggle.
A shard key decides which node owns each row. The two common strategies:
| Strategy | How it works | Watch out for |
|---|---|---|
| Hash-based | hash(key) % N picks the shard | Adding a shard reshuffles almost everything |
| Range-based | Consecutive key ranges per shard | Sequential keys create a hot shard |
Whichever you pick, the classic failure mode is the same: a bad shard key creates hot
shards, where one node eats most of the traffic while the others idle. Shard by country and
your biggest market melts one node; shard by an auto-increment ID with range-based splits and
every new write lands on the newest shard.
Consistent hashing — adding shards without reshuffling the world
The table above hides a nasty cliff: with hash(key) % N, growing from 4 shards to 5 changes
N, so almost every key now maps to a different shard — adding capacity triggers a massive
data migration at exactly the moment you're under load. Consistent hashing is the standard
fix: place shards on a conceptual ring and assign each key to the next shard clockwise. Adding a
shard then takes over only the slice of keys between it and its neighbor — roughly 1/N of the
data moves instead of nearly all of it. (In practice each node appears at many points on the
ring — "virtual nodes" — so the load spreads evenly.)
It's how Cassandra and DynamoDB distribute data, and it's a favorite interview follow-up: if you propose hash-based sharding, expect "what happens when you add a shard?" — consistent hashing is the answer.
Even with a good key and a good ring, sharding is genuinely hard: cross-shard queries (joining data on different nodes) are expensive, rebalancing is operationally delicate, and the complexity is permanent. Don't shard until replication and caching can't keep up.
QA Lens The defects sharding introduces are about distribution, not single rows. Test that your shard key spreads load evenly — synthetic traffic should hit all shards, not pile onto one. Then test the cross-shard operations explicitly: a query or transaction spanning shards is where correctness quietly breaks, because there's no single node that sees the whole truth.
How Scaling Stacks Up
You rarely pick one technique. You layer them as pressure mounts:
Test Yourself
Answer from memory first, then expand to check.
Q1. Users report being randomly logged out — but only since you added a second app server. What's the likely cause?
Session state was living in one server's memory. With two servers behind a load balancer, a user's next request lands on the other node, which has never heard of them. The fix is the stateless rule: sessions go in a shared store (Redis), so any server can handle any request.
Q2. Reads are overwhelming your database. Writes are fine. Do you replicate or shard — and why?
Replicate. Read replicas multiply read capacity without touching the schema or the application's query logic. Sharding is for when the primary can't absorb the writes — it's the last resort because it makes cross-shard queries and rebalancing your problem forever.
Q3. Your shards use hash(user_id) % 4. What goes wrong when you add a fifth shard, and what's the standard fix?
% 5 re-maps almost every key, so adding capacity triggers a near-total data migration.
Consistent hashing fixes it: shards sit on a ring, and a new shard takes over only its slice —
about 1/N of the keys move instead of nearly all of them.
Q4. With async replication, the primary dies and a replica is promoted. What did you possibly just lose, and what should a failover test measure?
The last writes the primary acknowledged but hadn't yet streamed to any replica. A real failover test kills the primary under write load and measures both recovery time and how many acknowledged writes vanished — then checks that number against what the business actually agreed to tolerate.
Quick Revision
Bigger machine. Simple, no code change — but a hard ceiling and a single point of failure.
More machines, no ceiling. Requires stateless servers + a shared session store.
Spreads traffic, runs health checks, routes around dead servers. Scalability + availability.
Primary–replica. Scales reads, enables failover. Trade-off: replication lag.
Split data across nodes to scale writes. Hard: cross-shard queries, hot shards, rebalancing.
"Reads are 100× writes, so I'll add read replicas and a cache before touching the schema. I'd only shard once a single primary can't keep up with writes — and I'd pick the shard key carefully to avoid hot shards." That sequencing is the signal interviewers look for.
Where to Next
Running many machines unlocks scale — and a whole new class of problems.
- Distributed Systems — CAP theorem, CDNs, and idempotency
- Data Storage — the storage layer you're now scaling