Distributed Systems
Where "it works on one machine" stops being true
On a single machine, a function call either succeeds or fails, and you know which. Across a network, there's a third outcome: the request was processed, but the reply got lost. Did the payment go through? You genuinely can't tell. Welcome to distributed systems.
Once you have multiple servers, databases, and caches talking over a network, machines fail independently and networks split without warning. This page covers the four ideas that turn that chaos into something designable: the CAP theorem, CDNs, idempotency, and timeouts, retries & circuit breakers.
The CAP theorem (the trade-off you can't escape), CDNs (serving content from near the user), idempotency (making retries safe), and timeouts, retries & circuit breakers (failing without falling over). Every concept ends with a QA testing lens: how a tester would probe or break it.
CAP Theorem — pick two, but really pick one
The CAP theorem says a distributed system can guarantee at most two of three properties:
The twist the theorem is famous for: in any real network, partitions are not optional — cables fail, packets drop. So partition tolerance is a given, and the actual choice is between Consistency and Availability when a partition happens.
While the network is healthy, both nodes agree — there's no trade-off to make. Now Cut network and write a new value to A. In CP mode, Node B refuses the read (an error) rather than hand back stale data. Switch to AP and the same read succeeds, but with a stale value. Heal the network and B catches up. That toggle is the CAP theorem.
- Returns an error instead of stale data
- Everyone always sees correct, current values
- Some requests fail during a partition
- CP-leaning defaults: HBase, ZooKeeper, etcd
- Always responds, even mid-partition
- Data may be briefly out of date
- Reconciles once the partition heals
- AP-leaning defaults: Cassandra, CouchDB
Those product names describe defaults, not destinies. Most modern databases are tunable:
Cassandra becomes CP-ish if you write and read at QUORUM; DynamoDB offers strongly consistent
reads on request; MongoDB's behavior depends on write/read concerns. The senior answer isn't
"Cassandra is AP" — it's "here's the consistency setting I'd run this workload with, and why."
There's also a follow-up worth knowing: CAP only speaks to what happens during a partition. PACELC extends it — when there's a Partition, you trade Availability vs Consistency; Else (in normal operation) you trade Latency vs Consistency. Waiting for replicas to confirm every write keeps data consistent but makes every request slower. That everyday trade-off bites far more often than actual partitions do.
The interview move isn't reciting the letters — it's saying which you'd pick for which part of your system, and why. A shopping cart can be AP; the final charge must be CP.
QA Lens You can't test CAP behavior without partition testing — deliberately severing the network between nodes (tools like a chaos proxy do this). For a CP system, confirm it returns errors rather than stale reads while split. For an AP system, confirm it stays up and that the two halves reconcile correctly once reconnected. Last-write-wins can silently eat data; prove it doesn't.
CDN — bring the content to the user
A CDN (Content Delivery Network) is a fleet of servers spread across the globe that caches content close to users. Instead of every request crossing an ocean to your origin in Virginia, users are served from the nearest edge node.
This is pull-through caching, the common default: edges fetch from origin on the first request and keep a copy for the TTL. (The alternative, push, pre-uploads content to edges — useful for big planned releases like a game patch, at the cost of managing what lives where.) CDNs mostly serve static content — images, CSS, JavaScript, video — though modern ones (Cloudflare) also cache dynamic responses and run code at the edge. The first user in a region pays the cache miss; everyone after gets the cached copy, orders of magnitude faster.
The payoff is three wins at once:
Lower latency — content travels a shorter distance
Less load on origin — edges absorb the read traffic
Better availability — cached content survives an origin outage
The reflex rule mirrors blob storage: any time your system serves media or static assets, put a CDN in front of it. CloudFront, Cloudflare, and Akamai are the names to drop.
QA Lens CDNs make "I deployed a fix but users still see the old version" a daily reality — the edge is serving a cached copy. Test cache busting (versioned URLs / cache headers) so a release actually reaches users. Verify behavior from multiple regions, and confirm a cache miss in a cold region still returns correct content, not a stale or wrong asset.
Idempotency — make retries safe
Remember the lost reply from the intro? A network timeout doesn't tell you whether the server did the work — only that you didn't hear back. So the client retries. Without protection, that retry charges the card twice. Idempotency is the guarantee that doing something N times has the same effect as doing it once.
Some methods are idempotent by nature: GET (reading twice returns the same thing) and DELETE
(deleting an already-deleted thing is a no-op). POST is not — so you attach an
idempotency key: a unique ID the client sends with the request. The server records each key's
result and, on a repeat, returns the stored result instead of doing the work again.
Stripe, PayPal, and virtually every payment API require idempotency keys for exactly this reason. Any time you discuss payments, order creation, or "send the email", mention idempotency — it's the difference between "charged once" and a furious customer.
QA Lens This is the single most valuable distributed-systems test you can write: fire the same request twice with the same idempotency key and assert exactly one side effect — one charge, one order, one email. Then test the nastier variant: two retries arriving concurrently. A naive "check then insert" has a race window that lets both through.
Timeouts, Retries & Circuit Breakers — failing without falling over
Idempotency makes retries safe. This concept is about making them smart — because the most common way distributed systems die isn't a crash, it's a slow dependency plus naive retries.
It starts with the timeout. A call with no timeout can hang forever, holding a thread, a connection, and a user. Every network call needs a deadline — and the deadline should be based on that dependency's real p99, not a default someone copied in 2019.
Then the retry. A failed call is often worth one more try — but how you retry decides whether you recover or make it worse:
"Retries make the system more reliable." Undisciplined retries do the opposite — they're an amplifier. If a service slows down and every caller retries 3×, the service now faces 4× its normal traffic at its weakest moment, which is how a slowdown becomes an outage. This is the retry storm, and it's a self-inflicted DDoS. Retries help only when paired with backoff, jitter, and idempotency.
When a dependency is truly down, even polite retries are wasted work. A circuit breaker sits in front of the call and, after enough consecutive failures, opens — failing all calls instantly without touching the network. After a cooldown it lets one probe request through (half-open); success closes the circuit, failure keeps it open. The point is to fail fast and give the dependency silence to recover in, instead of letting requests pile up and drag your own service down with it.
QA Lens Don't just test the dependency being down — test it being slow, which is far more dangerous. Inject a 30-second delay (not an error) into a downstream call and watch what happens upstream: do callers time out cleanly, or do threads pile up until your service falls over too? Then verify the breaker's full lifecycle: it opens after the threshold, fails fast while open, probes half-open, and closes again on recovery — a breaker that never closes is an outage that survives the fix.
Test Yourself
Answer from memory first, then expand to check.
Q1. "We chose CP, so our system is consistent and partition-tolerant but never available." What's wrong with this sentence?
CAP only describes behavior during a partition. A CP system is fully available whenever the network is healthy — it just chooses to return errors rather than stale data while a partition is happening. And per PACELC, the trade-off that bites daily isn't partition behavior at all: it's latency vs consistency in normal operation.
Q2. A payment request times out. Why is "just retry it" dangerous, and what two things make the retry safe?
A timeout doesn't tell you whether the server did the work — the charge may have succeeded and only the reply was lost, so a blind retry can charge twice. The retry needs an idempotency key (so the server recognizes the repeat and returns the stored result) and backoff with jitter (so a struggling service isn't hammered by synchronized retries).
Q3. You deployed a fix but users still see the old page. The servers are confirmed updated. What's going on?
The CDN edge is serving its cached copy and will keep doing so until the TTL expires. The fix is cache busting — versioned asset URLs or correct cache headers — and it's exactly why "deploy succeeded" and "users got the fix" are two different test assertions.
Q4. Why is a slow dependency more dangerous than a dead one — and which pattern defends against each?
A dead dependency fails fast; a slow one silently holds your threads and connections hostage until your own service runs out and falls over — the failure cascades. Timeouts defend against slow (every call gets a deadline); circuit breakers defend against dead (stop calling, fail instantly, probe for recovery).
Quick Revision
Partitions are inevitable, so the real choice is Consistency vs Availability. CP for money, AP for feeds.
Cache content at edge nodes near users. Lower latency, less origin load, better availability.
Make retries safe. Idempotency keys turn "maybe charged twice" into "charged exactly once."
Every call gets a deadline. Retry with backoff + jitter. Circuit breakers fail fast when a dependency dies.
On one machine you ask "will it work?" Across a network you ask "what happens when a message is lost, delayed, duplicated, or reordered?" Design for all four — because all four will happen.
Where to Next
These primitives compose into full-system blueprints. That's the last group.
- Architecture Patterns — microservices, queues, rate limiting, gateways
- Scaling — the techniques that created these distributed challenges