Data Storage
Where data lives, and how to find it before the user gives up
Your product launches. Sign-ups pour in. Then the dreaded query that ran in 5ms with a thousand rows takes 8 seconds at ten million — and the whole app grinds with it. Storage decisions made on day one quietly set the ceiling for day one thousand.
APIs shuttle data between programs. This page is about the data itself: choosing the right database, finding rows fast, splitting tables that grew too big, caching the hot stuff, trading normalization for speed, and parking big files where they belong.
Databases and their flavors, SQL vs NoSQL, indexing, vertical partitioning, caching, denormalization, and blob storage — seven decisions that decide whether your reads stay fast as data grows. Every concept ends with a QA testing lens: how a tester would probe or break it.
Databases — there is no "best" one
A database is just an organized store optimized for putting data in and getting it back out. The trap is asking "which database is best?" The honest answer is it depends on your data's shape and how you read it.
| Type | Shape | Shines at | Examples |
|---|---|---|---|
| Relational | Tables with strict relationships | Transactions, consistency, joins | PostgreSQL, MySQL |
| Document | Flexible JSON-like docs | Evolving / varied schemas | MongoDB |
| Key-Value | A giant hash map | Blazing-fast lookups by key | Redis, DynamoDB |
| Graph | Nodes + edges as first-class | Social graphs, recommendations | Neo4j |
Real systems mix them. An e-commerce app might keep orders in PostgreSQL (needs ACID) and the product catalog in MongoDB (flexible, read-heavy). The skill isn't picking one — it's explaining why each part of your system uses what it uses.
QA Lens When one app spans multiple stores, the bugs live in the seams. An order in PostgreSQL referencing a product that was deleted from MongoDB has no foreign key to save it — nothing enforces consistency across two databases. Test those cross-store references explicitly, and test each store's failure independently: catalog down but orders up should degrade gracefully, not 500 the whole checkout.
SQL vs NoSQL — the decision behind most others
The most common storage fork. SQL enforces a strict schema with ACID transactions; NoSQL trades some of that rigor for flexibility and horizontal scale.
- Structured schema, enforced at write time
- ACID transactions — strong consistency
- Rich joins and ad-hoc queries
- Scales up (vertically) more naturally than out
- Flexible / evolving schema
- Often eventually consistent
- No joins — model around access patterns
- Scales out horizontally with ease
Start with SQL (Postgres) unless you have a concrete reason not to. ACID, mature tooling, and decades of operational knowledge are hard to beat. Adopt NoSQL when a specific access pattern or scale requirement forces it — and say which one in the interview.
"NoSQL is faster than SQL." Neither is "faster" — they're fast at different things. A key-value lookup in Redis beats a five-table join, but a Postgres query with the right index beats a MongoDB aggregation that has to fan out across documents. NoSQL's headline win is horizontal scale and schema flexibility, and it pays for both by pushing joins, transactions, and consistency work up into your application code.
QA Lens The riskiest word above is "eventually." On a NoSQL store, a test that writes then immediately reads back can pass on a fast day and fail under load when the read hits a lagging replica. Test for read-your-own-writes expectations explicitly, and never assume a freshly written row is visible everywhere the instant the write returns.
Database Indexing — the back-of-the-book shortcut
Without an index, finding a row means scanning every row. Fine at 100 rows; a catastrophe at 100 million. An index is a separate data structure that turns a linear scan into a fast lookup — like the index at the back of a textbook.
Most databases default to B-tree indexes: a sorted, balanced tree giving O(log n) lookups.
Create one on email and the engine builds a tree mapping email values straight to row
locations.
But indexes aren't free — and that trade-off is the whole interview answer:
Reads on indexed columns get dramatically faster
Every INSERT/UPDATE must also update the index (slower writes)
Indexes consume extra storage
The rule of thumb: index columns that appear in WHERE clauses and JOIN conditions — and
resist indexing everything.
QA Lens
Performance tests on tiny seed data are theater — a full scan and an index lookup both finish
in 2ms when there are 50 rows. Load realistic volumes, then read the query plan
(EXPLAIN) to confirm the index is actually used. A missing index shows up
as a Seq Scan that quietly scales linearly into an outage.
Vertical Partitioning — split wide tables by how they're read
When a table sprouts dozens of columns, vertical partitioning slices it into narrower tables joined by a shared key. The organizing principle: separate columns by access pattern.
(Its sibling, horizontal partitioning — splitting by rows instead of columns — is better known as sharding, covered in the Scaling page. Vertical splits columns; horizontal splits rows.)
Your profile page needs a name and avatar, not a credit card number. Splitting the table means each query touches only the bytes it needs — fewer reads from disk, faster responses. It also lets you put stricter access controls on the billing partition and tune each table's indexes and storage independently.
Bring up vertical partitioning when you have a wide table whose columns are accessed in very different contexts and security tiers — the textbook case being splitting sensitive billing fields away from public profile fields.
QA Lens Splitting one table into three turns one write into three — and that's no longer atomic unless someone wraps it in a transaction. Test the half-written user: create an account, kill the process between the Profile insert and the Auth insert, and check whether you've made a user who exists but can't log in. Then verify the security promise actually holds: the profile-page query must not be able to select from the billing partition at all.
Caching — the single biggest read win
A cache stores frequently accessed data in a fast layer (usually memory) so you skip the database on most reads. Done right, it's the highest-leverage performance technique you have.
Let it run and watch the hit rate climb as hot keys (A and B come up most) settle into the three cache slots. The first request for any key is an amber miss that detours to the database and stores the value; every request after is a green hit served straight from cache. Hit Expire cache to flush it and watch the cold-start miss storm before it warms back up.
The pattern above is cache-aside (lazy loading): check the cache, return on a hit, and on a miss fetch from the DB and write the result back for next time. Redis and Memcached are the usual tools.
The genuinely hard part of caching isn't reading — it's invalidation. When the underlying data changes, the cache must update or expire, or users see stale data:
There are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors.
QA Lens Caching turns "stale data" into a first-class test case. After an update, assert the user sees the new value within the promised TTL — not the old one forever. Probe the cold start (empty cache under load → does the DB survive the stampede?) and the thundering herd when a hot key expires and a thousand requests miss at once.
Denormalization — trade clean schema for fast reads
Normalized schemas split data across many tables to avoid duplication. Clean — but rendering one screen might mean joining five tables. Denormalization deliberately duplicates data to cut those joins.
- Normalized (joins)
- Denormalized (one read)
Showing an order with the buyer's name and product title means joining three tables on every read. Clean to store, expensive to read.
All the display data lives in one row, so a single read returns everything. The cost: if a
user renames themselves, you must update user_name everywhere it was copied.
Denormalization is a read optimization — reach for it when the system is read-heavy and those reads drown in joins. It's the norm in NoSQL, where joins aren't supported. The interview rule: never propose denormalization without a plan to keep the duplicated data consistent.
QA Lens
Duplicated data is duplicated bugs waiting to drift. Write a test that renames a user and then
verifies every copy of user_name updates — orders, invoices, cached
summaries. Drift between copies is the classic denormalization defect, and it never shows up in
a single-table unit test.
Blob Storage — the right home for big files
Images, videos, PDFs, backups — "binary large objects" don't belong in your rows. Blob storage (Amazon S3 and friends) stores massive unstructured files cheaply and durably.
The pattern is simple and worth committing to memory: store the file in blob storage, put the returned URL in your database. Your DB stays lean (a short URL string, not a 10MB image), and the blob store handles durability, replication, and serving. S3 advertises eleven nines of durability (99.999999999%) and plugs straight into a CDN for fast delivery.
Whenever a design touches media — avatars, uploads, video, documents — say it out loud: "Files go in blob storage; the database holds the URL." It's an expected, easy point to score.
QA Lens The bug hides in the seam between the two stores. Test the orphan cases: a DB row pointing at a file that failed to upload, and a file with no row referencing it. Verify access control on the URL (is that "private" upload actually reachable by a logged-out stranger?) and that large uploads fail gracefully rather than half-writing.
Test Yourself
Answer from memory first, then expand to check.
Q1. A query that took 5ms in staging takes 8 seconds in production. Staging has 1,000 rows; production has 10 million. What's your first diagnostic step?
Run EXPLAIN on the production query and look for a sequential scan. A missing index is
invisible at small scale — both a full scan and an index lookup are instant on 1,000 rows. This
is also why performance tests on tiny seed data prove nothing.
Q2. Indexes make reads faster. Why not index every column?
Every INSERT/UPDATE must also update every index, so writes get slower with each one — and
each index costs storage. Index what appears in WHERE and JOIN clauses; leave the rest.
Q3. Vertical partitioning and sharding both "split" data. What's the difference in one sentence each?
Vertical partitioning splits a table by columns (profile fields here, billing fields there) to match access patterns. Sharding — horizontal partitioning — splits by rows (users A–M on one node, N–Z on another) to scale storage and write load across machines.
Q4. You denormalized user_name into the orders table for fast reads. A user renames themselves. What must your test now verify?
That every copy of the duplicated data gets updated — orders, invoices, cached summaries. Drift between copies is the classic denormalization defect, and no single-table unit test will catch it. Never propose denormalization without naming the mechanism that keeps copies in sync.
Q5. Where does a user's 10MB profile video live, and what goes in the database?
The video goes in blob storage (S3 or similar, fronted by a CDN); the database stores only the URL. Then test the seams: a row pointing at a failed upload, an orphaned file with no row, and whether a "private" upload's URL is really private.
Quick Revision
Relational, document, key-value, graph. Pick by data shape and access pattern.
ACID + joins vs flexible + horizontal scale. Default to SQL; justify NoSQL.
B-tree → O(log n) reads. Faster reads, slower writes. Don't index everything.
Split wide tables by access pattern; touch only the columns you need.
Cache-aside for fast reads. Invalidation (TTL / write-through) is the hard part.
Duplicate data to kill joins on read-heavy paths — with a consistency plan.
Big files in S3; store only the URL in the database. Cheap, durable, CDN-ready.
Where to Next
You can store data efficiently on one machine. What happens when one machine isn't enough?
- Scaling — vertical, horizontal, load balancers, replication, sharding
- APIs & Communication — the contracts that move this data