Skip to main content

Scaling

What to do when one server isn't enough

Your side project hits the front page of a forum. Forty thousand strangers arrive in ten minutes. The single server that served your first hundred users politely melts. Now what — a bigger server, or more of them?

That question has exactly two answers, and they're not mutually exclusive. We'll cover both, then the three pieces of machinery (load balancers, replication, and sharding) that make the "more of them" answer survive contact with reality.

What this page covers

Vertical vs horizontal scaling, load balancers to spread traffic, replication to scale reads and survive failures, and sharding to scale writes. Every concept ends with a QA testing lens: how a tester would probe or break it.


Vertical Scaling — get a bigger machine

The simplest fix: more CPU, more RAM, faster disks. Vertical scaling (scaling up) needs no code changes — you just rent a beefier box.

Scaling up — until the ceiling
requestresponse· hover a node to trace it
upgradeupgradeceiling
Small2 CPU / 4GB
Medium8 CPU / 32GB
Large64 CPU / 256GB
Hardware limitcannot scale further
Each upgrade is easy until the red node: the biggest instance is still finite, and cost soars near the top.

It's attractive because it's simple — your app never has to think about coordinating across machines, and one big server handles a surprising amount of traffic. Plenty for a startup.

But there's a hard wall:

A physical ceiling — the biggest instance is still finite
Cost grows disproportionately at the top end
A single point of failure — that one box goes down, everything goes down

Past a certain point, you stop buying bigger and start buying more.

QA Lens A vertically scaled system has one number that matters: where the ceiling is. Load-test to find it before production does — push traffic until something saturates and note which resource gives first (CPU? memory? disk I/O? connection pool?). Then test the upgrade itself: moving to a bigger box usually means downtime or a failover, and "we'll just resize the instance" plans have a habit of being tested for the first time during an incident.


Horizontal Scaling — add more machines

Horizontal scaling (scaling out) spreads load across many servers, with no theoretical upper limit. It's how every internet-scale system runs: Google, Netflix, and Amazon sit behind thousands of servers, and adding capacity is just spinning up another instance.

Scaling out — add more servers
requestresponse· hover a node to trace it
Load BalancerNginx / ALB
Server 1
Server 2
Server 3
Server N ...
Capacity is now just another green box — but only if every server is stateless enough to handle any request.

The catch is that your application has to be built for it:

Up vs Out
Vertical (Up)bigger box
  • Zero code changes
  • No distributed-systems complexity
  • Hard ceiling on size
  • Single point of failure
Best forEarly-stage apps and workloads that fit one machine.
Horizontal (Out)more boxes
  • No real upper limit
  • Survives single-server failure
  • Requires stateless servers + shared session store
  • Needs a load balancer and a DB scaling plan
Best forAnything that must grow past one machine or stay highly available.
The stateless rule

Horizontal scaling only works if any server can handle any request. That means servers must be stateless — session data lives in a shared store like Redis, never in one server's memory. The moment a user is "stuck" to the server that holds their login, you've broken the model.

QA Lens Statefulness hides until the second server appears. Run your tests against a two-node cluster, not one: log in on node A, send the next request to node B, and confirm the session survives. A test suite that always hits a single instance will pass right up until the day you scale out and users start getting randomly logged off.


Load Balancers — the traffic cop

A load balancer sits between clients and your server pool and decides who handles each request, so no single server drowns. It delivers two things at once: scalability (spread the load) and availability (route around dead servers).

Load Balancer · Live Traffic Distribution
· click a server to kill it
Client 1
Client 2
Client 3
Load Balancer4/4 healthy
Server 10
Server 20
Server 30
Server 40
Switch algorithms to see the per-server counters diverge, then kill a server and watch health checks reroute.
Play with it

Switch the algorithm and watch the per-server counters diverge — Round Robin spreads evenly, Least Connections chases the lightest server, Weighted favors the bigger boxes (note the w3/w1 weights). Then click a server to kill it and watch the load balancer's health check reroute traffic to the survivors. Kill them all and requests start failing.

It picks a target using an algorithm, and quietly drops servers that fail health checks:

AlgorithmHow it chooses
Round RobinNext server in order, one after another
Least ConnectionsThe server currently handling the fewest requests
WeightedBeefier servers get a larger share of traffic
Health checksStop sending traffic to any server that stops responding

Big systems stack them: one tier of load balancers for web servers, another for app servers, another in front of databases.

Common misconception

"We added a load balancer, so there's no single point of failure." The load balancer is one — until it's redundant itself. If that one Nginx box dies, every healthy server behind it is unreachable. Real deployments run load balancers in pairs with failover (or use a managed, inherently redundant one like an AWS ALB). Whenever you remove a SPOF, ask what you just created.

QA Lens The load balancer's whole promise is "a dead server is invisible to users." Test it: kill a backend mid-request and confirm traffic drains to healthy nodes with no failed requests. Then test the inverse — a server that's slow but not dead. A naive health check sees a 200 and keeps feeding requests into a black hole.


Replication — copy data to scale reads and survive failure

Replication keeps copies of your data on multiple database servers. The common shape is primary–replica: all writes hit the primary, which streams changes to read-only replicas.

Primary–replica — writes to one, reads from many
requestresponse· hover a node to trace it
writesreplicatesreplicatesreadsreads
App Server
Primaryread + write
Replica 1read only
Replica 2read only
All writes still funnel through one primary while reads fan out — replication scales reads, never writes.

Most apps are read-heavy — think how often you scroll a feed versus post to it. Pointing reads at replicas offloads the primary and multiplies read capacity. Bonus: if a replica dies, reads shift to the others; if the primary dies, a replica is promoted to take over.

The price is replication lag — a brief delay before a write shows up on replicas:

Lag is usually fine — until it isn't

A tweet appearing on someone's feed half a second late? Acceptable. A bank showing yesterday's balance after a withdrawal? Not acceptable. Read critical, must-be-fresh data from the primary; read everything else from replicas.

How much lag — and whether failover can lose data — depends on when the primary acknowledges a write:

ModeThe primary says "done"…Trade-off
AsynchronousImmediately, then streams to replicasFast writes; a primary crash can lose the last unacked writes
SynchronousOnly after a replica confirms it has the writeNo loss on failover; every write pays the replica round trip
Semi-syncAfter one replica confirmsThe common middle ground

Most deployments default to async because write latency is visible to every user, while failover is rare. Just know the consequence you've signed up for: with async replication, "promote a replica" can mean quietly abandoning the last few committed writes.

QA Lens Replication lag is the source of maddening "flaky" tests. A test writes via the primary, reads via a replica a millisecond later, and intermittently sees stale data. Make the read path explicit in tests, and verify failover directly: kill the primary under write load and measure what happens — how long until a replica is promoted, and how many acknowledged writes were lost. With async replication some loss is expected; the test's job is to prove it stays within the bound the business agreed to (and that nobody assumed "zero" by default).


Sharding — split data to scale writes

Replication copies all the data to every node, which scales reads but not writes — every write still funnels through one primary. Sharding (also called horizontal partitioning — it splits by rows, where vertical partitioning splits by columns) divides the data so each node holds only a portion, distributing storage and write load.

Sharding · Shard Key & Hot Shards
Writes
Shard Routerkey range / 4Shard 10Shard 20Shard 30Shard 400 records routed · load is spread evenly across shards
Try sequential keys with range-based sharding and watch one shard take every write — then flip to hash-based.
Play with it

Start on Range-based + Sequential keys and watch one shard catch fire 🔥 — monotonic keys (like auto-increment IDs) all fall into the newest range, so a single node takes every write. Now flip to Hash-based: the same keys scatter evenly. That's the whole shard-key lesson in one toggle.

A shard key decides which node owns each row. The two common strategies:

StrategyHow it worksWatch out for
Hash-basedhash(key) % N picks the shardAdding a shard reshuffles almost everything
Range-basedConsecutive key ranges per shardSequential keys create a hot shard

Whichever you pick, the classic failure mode is the same: a bad shard key creates hot shards, where one node eats most of the traffic while the others idle. Shard by country and your biggest market melts one node; shard by an auto-increment ID with range-based splits and every new write lands on the newest shard.

Consistent hashing — adding shards without reshuffling the world

The table above hides a nasty cliff: with hash(key) % N, growing from 4 shards to 5 changes N, so almost every key now maps to a different shard — adding capacity triggers a massive data migration at exactly the moment you're under load. Consistent hashing is the standard fix: place shards on a conceptual ring and assign each key to the next shard clockwise. Adding a shard then takes over only the slice of keys between it and its neighbor — roughly 1/N of the data moves instead of nearly all of it. (In practice each node appears at many points on the ring — "virtual nodes" — so the load spreads evenly.)

It's how Cassandra and DynamoDB distribute data, and it's a favorite interview follow-up: if you propose hash-based sharding, expect "what happens when you add a shard?" — consistent hashing is the answer.

Even with a good key and a good ring, sharding is genuinely hard: cross-shard queries (joining data on different nodes) are expensive, rebalancing is operationally delicate, and the complexity is permanent. Don't shard until replication and caching can't keep up.

QA Lens The defects sharding introduces are about distribution, not single rows. Test that your shard key spreads load evenly — synthetic traffic should hit all shards, not pile onto one. Then test the cross-shard operations explicitly: a query or transaction spanning shards is where correctness quietly breaks, because there's no single node that sees the whole truth.


How Scaling Stacks Up

You rarely pick one technique. You layer them as pressure mounts:

1
Scale Up
first
2
Scale Out
add servers
3
Replicate
scale reads
4
Shard
scale writes
Note the ordering: sharding comes last because its complexity is permanent — exhaust the easier layers first.

Test Yourself

Answer from memory first, then expand to check.

Q1. Users report being randomly logged out — but only since you added a second app server. What's the likely cause?

Session state was living in one server's memory. With two servers behind a load balancer, a user's next request lands on the other node, which has never heard of them. The fix is the stateless rule: sessions go in a shared store (Redis), so any server can handle any request.

Q2. Reads are overwhelming your database. Writes are fine. Do you replicate or shard — and why?

Replicate. Read replicas multiply read capacity without touching the schema or the application's query logic. Sharding is for when the primary can't absorb the writes — it's the last resort because it makes cross-shard queries and rebalancing your problem forever.

Q3. Your shards use hash(user_id) % 4. What goes wrong when you add a fifth shard, and what's the standard fix?

% 5 re-maps almost every key, so adding capacity triggers a near-total data migration. Consistent hashing fixes it: shards sit on a ring, and a new shard takes over only its slice — about 1/N of the keys move instead of nearly all of them.

Q4. With async replication, the primary dies and a replica is promoted. What did you possibly just lose, and what should a failover test measure?

The last writes the primary acknowledged but hadn't yet streamed to any replica. A real failover test kills the primary under write load and measures both recovery time and how many acknowledged writes vanished — then checks that number against what the business actually agreed to tolerate.


Quick Revision

Vertical

Bigger machine. Simple, no code change — but a hard ceiling and a single point of failure.

Horizontal

More machines, no ceiling. Requires stateless servers + a shared session store.

Load Balancer

Spreads traffic, runs health checks, routes around dead servers. Scalability + availability.

Replication

Primary–replica. Scales reads, enables failover. Trade-off: replication lag.

Sharding

Split data across nodes to scale writes. Hard: cross-shard queries, hot shards, rebalancing.

In an interview

"Reads are 100× writes, so I'll add read replicas and a cache before touching the schema. I'd only shard once a single primary can't keep up with writes — and I'd pick the shard key carefully to avoid hot shards." That sequencing is the signal interviewers look for.


Where to Next

Running many machines unlocks scale — and a whole new class of problems.