Data Storage

Where data lives, and how to find it before the user gives up

Your product launches. Sign-ups pour in. Then the dreaded query that ran in 5ms with a thousand rows takes 8 seconds at ten million — and the whole app grinds with it. Storage decisions made on day one quietly set the ceiling for day one thousand.

APIs shuttle data between programs. This page is about the data itself: choosing the right database, finding rows fast, splitting tables that grew too big, caching the hot stuff, trading normalization for speed, and parking big files where they belong.

What this page covers

Databases and their flavors, SQL vs NoSQL, indexing, vertical partitioning, caching, denormalization, and blob storage — seven decisions that decide whether your reads stay fast as data grows. Every concept ends with a QA testing lens: how a tester would probe or break it.

Databases — there is no "best" one

A database is just an organized store optimized for putting data in and getting it back out. The trap is asking "which database is best?" The honest answer is it depends on your data's shape and how you read it.

One app, four database shapes

requestresponse· hover a node to trace it

Notice one application talking to several stores at once — real systems mix them, and the bugs live in the seams.

Type	Shape	Shines at	Examples
Relational	Tables with strict relationships	Transactions, consistency, joins	PostgreSQL, MySQL
Document	Flexible JSON-like docs	Evolving / varied schemas	MongoDB
Key-Value	A giant hash map	Blazing-fast lookups by key	Redis, DynamoDB
Graph	Nodes + edges as first-class	Social graphs, recommendations	Neo4j

Real systems mix them. An e-commerce app might keep orders in PostgreSQL (needs ACID) and the product catalog in MongoDB (flexible, read-heavy). The skill isn't picking one — it's explaining why each part of your system uses what it uses.

QA Lens When one app spans multiple stores, the bugs live in the seams. An order in PostgreSQL referencing a product that was deleted from MongoDB has no foreign key to save it — nothing enforces consistency across two databases. Test those cross-store references explicitly, and test each store's failure independently: catalog down but orders up should degrade gracefully, not 500 the whole checkout.

SQL vs NoSQL — the decision behind most others

The most common storage fork. SQL enforces a strict schema with ACID transactions; NoSQL trades some of that rigor for flexibility and horizontal scale.

Choose by guarantee, not by hype

SQLPostgres, MySQL

Structured schema, enforced at write time
ACID transactions — strong consistency
Rich joins and ad-hoc queries
Scales up (vertically) more naturally than out

Best forBanking, inventory, orders — anywhere stale or partial data is dangerous.

NoSQLMongo, Cassandra, DynamoDB

Flexible / evolving schema
Often eventually consistent
No joins — model around access patterns
Scales out horizontally with ease

Best forHigh write throughput and unstructured data — logs, feeds, events.

The honest default

Start with SQL (Postgres) unless you have a concrete reason not to. ACID, mature tooling, and decades of operational knowledge are hard to beat. Adopt NoSQL when a specific access pattern or scale requirement forces it — and say which one in the interview.

Common misconception

"NoSQL is faster than SQL." Neither is "faster" — they're fast at different things. A key-value lookup in Redis beats a five-table join, but a Postgres query with the right index beats a MongoDB aggregation that has to fan out across documents. NoSQL's headline win is horizontal scale and schema flexibility, and it pays for both by pushing joins, transactions, and consistency work up into your application code.

QA Lens The riskiest word above is "eventually." On a NoSQL store, a test that writes then immediately reads back can pass on a fast day and fail under load when the read hits a lagging replica. Test for read-your-own-writes expectations explicitly, and never assume a freshly written row is visible everywhere the instant the write returns.

Database Indexing — the back-of-the-book shortcut

Without an index, finding a row means scanning every row. Fine at 100 rows; a catastrophe at 100 million. An index is a separate data structure that turns a linear scan into a fast lookup — like the index at the back of a textbook.

With an index vs without

requestresponse· hover a node to trace it

The dashed path is the trap: a full scan feels fine at 100 rows and becomes a catastrophe at 100 million.

Most databases default to B-tree indexes: a sorted, balanced tree giving O(log n) lookups. Create one on email and the engine builds a tree mapping email values straight to row locations.

But indexes aren't free — and that trade-off is the whole interview answer:

Reads on indexed columns get dramatically faster
Every INSERT/UPDATE must also update the index (slower writes)
Indexes consume extra storage

The rule of thumb: index columns that appear in WHERE clauses and JOIN conditions — and resist indexing everything.

QA Lens Performance tests on tiny seed data are theater — a full scan and an index lookup both finish in 2ms when there are 50 rows. Load realistic volumes, then read the query plan (EXPLAIN) to confirm the index is actually used. A missing index shows up as a Seq Scan that quietly scales linearly into an outage.

Vertical Partitioning — split wide tables by how they're read

When a table sprouts dozens of columns, vertical partitioning slices it into narrower tables joined by a shared key. The organizing principle: separate columns by access pattern.

(Its sibling, horizontal partitioning — splitting by rows instead of columns — is better known as sharding, covered in the Scaling page. Vertical splits columns; horizontal splits rows.)

Split one wide table by access pattern

requestresponse· hover a node to trace it

Each query now touches only the bytes it needs — and billing fields can sit behind stricter access controls.

Your profile page needs a name and avatar, not a credit card number. Splitting the table means each query touches only the bytes it needs — fewer reads from disk, faster responses. It also lets you put stricter access controls on the billing partition and tune each table's indexes and storage independently.

When to mention it

Bring up vertical partitioning when you have a wide table whose columns are accessed in very different contexts and security tiers — the textbook case being splitting sensitive billing fields away from public profile fields.

QA Lens Splitting one table into three turns one write into three — and that's no longer atomic unless someone wraps it in a transaction. Test the half-written user: create an account, kill the process between the Profile insert and the Auth insert, and check whether you've made a user who exists but can't log in. Then verify the security promise actually holds: the profile-page query must not be able to select from the billing partition at all.

Caching — the single biggest read win

A cache stores frequently accessed data in a fast layer (usually memory) so you skip the database on most reads. Done right, it's the highest-leverage performance technique you have.

Cache-Aside · Live Hit / Miss

hitmiss

Watch the hit rate climb as hot keys settle into cache, then press Expire to see the cold-start miss storm.

Play with it

Let it run and watch the hit rate climb as hot keys (A and B come up most) settle into the three cache slots. The first request for any key is an amber miss that detours to the database and stores the value; every request after is a green hit served straight from cache. Hit Expire cache to flush it and watch the cold-start miss storm before it warms back up.

The pattern above is cache-aside (lazy loading): check the cache, return on a hit, and on a miss fetch from the DB and write the result back for next time. Redis and Memcached are the usual tools.

The genuinely hard part of caching isn't reading — it's invalidation. When the underlying data changes, the cache must update or expire, or users see stale data:

TTL

Entries auto-expire after a set time. Simple; tolerates brief staleness.

Write-through

Update cache and database together on every write. Fresh, but slower writes.

Write-behind

Update cache first, flush to DB later. Fast, but risk losing un-flushed writes.

There are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors.

QA Lens Caching turns "stale data" into a first-class test case. After an update, assert the user sees the new value within the promised TTL — not the old one forever. Probe the cold start (empty cache under load → does the DB survive the stampede?) and the thundering herd when a hot key expires and a thousand requests miss at once.

Denormalization — trade clean schema for fast reads

Normalized schemas split data across many tables to avoid duplication. Clean — but rendering one screen might mean joining five tables. Denormalization deliberately duplicates data to cut those joins.

Normalized (joins)
Denormalized (one read)

Normalized — read = three joins

requestresponse· hover a node to trace it

Every read of one order pays for two joins — clean to store, expensive to render.

Showing an order with the buyer's name and product title means joining three tables on every read. Clean to store, expensive to read.

Denormalized — read = one row

requestresponse· hover a node to trace it

One row now returns everything — but a renamed user must be updated everywhere their name was copied.

All the display data lives in one row, so a single read returns everything. The cost: if a user renames themselves, you must update user_name everywhere it was copied.

Denormalization is a read optimization — reach for it when the system is read-heavy and those reads drown in joins. It's the norm in NoSQL, where joins aren't supported. The interview rule: never propose denormalization without a plan to keep the duplicated data consistent.

QA Lens Duplicated data is duplicated bugs waiting to drift. Write a test that renames a user and then verifies every copy of user_name updates — orders, invoices, cached summaries. Drift between copies is the classic denormalization defect, and it never shows up in a single-table unit test.

Blob Storage — the right home for big files

Images, videos, PDFs, backups — "binary large objects" don't belong in your rows. Blob storage (Amazon S3 and friends) stores massive unstructured files cheaply and durably.

Files in blob storage, URL in the database

requestresponse· hover a node to trace it

Only a short URL ever enters the database — the heavy bytes flow from the blob store to the CDN to the user.

The pattern is simple and worth committing to memory: store the file in blob storage, put the returned URL in your database. Your DB stays lean (a short URL string, not a 10MB image), and the blob store handles durability, replication, and serving. S3 advertises eleven nines of durability (99.999999999%) and plugs straight into a CDN for fast delivery.

Reflex rule

Whenever a design touches media — avatars, uploads, video, documents — say it out loud: "Files go in blob storage; the database holds the URL." It's an expected, easy point to score.

QA Lens The bug hides in the seam between the two stores. Test the orphan cases: a DB row pointing at a file that failed to upload, and a file with no row referencing it. Verify access control on the URL (is that "private" upload actually reachable by a logged-out stranger?) and that large uploads fail gracefully rather than half-writing.

Test Yourself

Answer from memory first, then expand to check.

Q1. A query that took 5ms in staging takes 8 seconds in production. Staging has 1,000 rows; production has 10 million. What's your first diagnostic step?

Run EXPLAIN on the production query and look for a sequential scan. A missing index is invisible at small scale — both a full scan and an index lookup are instant on 1,000 rows. This is also why performance tests on tiny seed data prove nothing.

Q2. Indexes make reads faster. Why not index every column?

Every INSERT/UPDATE must also update every index, so writes get slower with each one — and each index costs storage. Index what appears in WHERE and JOIN clauses; leave the rest.

Q3. Vertical partitioning and sharding both "split" data. What's the difference in one sentence each?

Vertical partitioning splits a table by columns (profile fields here, billing fields there) to match access patterns. Sharding — horizontal partitioning — splits by rows (users A–M on one node, N–Z on another) to scale storage and write load across machines.

Q4. You denormalized user_name into the orders table for fast reads. A user renames themselves. What must your test now verify?

That every copy of the duplicated data gets updated — orders, invoices, cached summaries. Drift between copies is the classic denormalization defect, and no single-table unit test will catch it. Never propose denormalization without naming the mechanism that keeps copies in sync.

Q5. Where does a user's 10MB profile video live, and what goes in the database?

The video goes in blob storage (S3 or similar, fronted by a CDN); the database stores only the URL. Then test the seams: a row pointing at a failed upload, an orphaned file with no row, and whether a "private" upload's URL is really private.

Quick Revision

Databases

Relational, document, key-value, graph. Pick by data shape and access pattern.

SQL vs NoSQL

ACID + joins vs flexible + horizontal scale. Default to SQL; justify NoSQL.

Indexing

B-tree → O(log n) reads. Faster reads, slower writes. Don't index everything.

Vertical Partitioning

Split wide tables by access pattern; touch only the columns you need.

Caching

Cache-aside for fast reads. Invalidation (TTL / write-through) is the hard part.

Denormalization

Duplicate data to kill joins on read-heavy paths — with a consistency plan.

Blob Storage

Big files in S3; store only the URL in the database. Cheap, durable, CDN-ready.

Where to Next

You can store data efficiently on one machine. What happens when one machine isn't enough?

Scaling — vertical, horizontal, load balancers, replication, sharding
APIs & Communication — the contracts that move this data

Databases — there is no "best" one​

SQL vs NoSQL — the decision behind most others​

Database Indexing — the back-of-the-book shortcut​

Vertical Partitioning — split wide tables by how they're read​

Caching — the single biggest read win​

Denormalization — trade clean schema for fast reads​

Blob Storage — the right home for big files​

Test Yourself​

Quick Revision​

Where to Next​

Databases — there is no "best" one

SQL vs NoSQL — the decision behind most others

Database Indexing — the back-of-the-book shortcut

Vertical Partitioning — split wide tables by how they're read

Caching — the single biggest read win

Denormalization — trade clean schema for fast reads

Blob Storage — the right home for big files

Test Yourself

Quick Revision

Where to Next