Skip to main content

Data Storage

Where data lives, and how to find it before the user gives up

Your product launches. Sign-ups pour in. Then the dreaded query that ran in 5ms with a thousand rows takes 8 seconds at ten million — and the whole app grinds with it. Storage decisions made on day one quietly set the ceiling for day one thousand.

APIs shuttle data between programs. This page is about the data itself: choosing the right database, finding rows fast, splitting tables that grew too big, caching the hot stuff, trading normalization for speed, and parking big files where they belong.

What this page covers

Databases and their flavors, SQL vs NoSQL, indexing, vertical partitioning, caching, denormalization, and blob storage — seven decisions that decide whether your reads stay fast as data grows. Every concept ends with a QA testing lens: how a tester would probe or break it.


Databases — there is no "best" one

A database is just an organized store optimized for putting data in and getting it back out. The trap is asking "which database is best?" The honest answer is it depends on your data's shape and how you read it.

One app, four database shapes
requestresponse· hover a node to trace it
Application
RelationalPostgreSQL, MySQL
DocumentMongoDB
Key-ValueRedis, DynamoDB
GraphNeo4j
Notice one application talking to several stores at once — real systems mix them, and the bugs live in the seams.
TypeShapeShines atExamples
RelationalTables with strict relationshipsTransactions, consistency, joinsPostgreSQL, MySQL
DocumentFlexible JSON-like docsEvolving / varied schemasMongoDB
Key-ValueA giant hash mapBlazing-fast lookups by keyRedis, DynamoDB
GraphNodes + edges as first-classSocial graphs, recommendationsNeo4j

Real systems mix them. An e-commerce app might keep orders in PostgreSQL (needs ACID) and the product catalog in MongoDB (flexible, read-heavy). The skill isn't picking one — it's explaining why each part of your system uses what it uses.

QA Lens When one app spans multiple stores, the bugs live in the seams. An order in PostgreSQL referencing a product that was deleted from MongoDB has no foreign key to save it — nothing enforces consistency across two databases. Test those cross-store references explicitly, and test each store's failure independently: catalog down but orders up should degrade gracefully, not 500 the whole checkout.


SQL vs NoSQL — the decision behind most others

The most common storage fork. SQL enforces a strict schema with ACID transactions; NoSQL trades some of that rigor for flexibility and horizontal scale.

Choose by guarantee, not by hype
SQLPostgres, MySQL
  • Structured schema, enforced at write time
  • ACID transactions — strong consistency
  • Rich joins and ad-hoc queries
  • Scales up (vertically) more naturally than out
Best forBanking, inventory, orders — anywhere stale or partial data is dangerous.
NoSQLMongo, Cassandra, DynamoDB
  • Flexible / evolving schema
  • Often eventually consistent
  • No joins — model around access patterns
  • Scales out horizontally with ease
Best forHigh write throughput and unstructured data — logs, feeds, events.
The honest default

Start with SQL (Postgres) unless you have a concrete reason not to. ACID, mature tooling, and decades of operational knowledge are hard to beat. Adopt NoSQL when a specific access pattern or scale requirement forces it — and say which one in the interview.

Common misconception

"NoSQL is faster than SQL." Neither is "faster" — they're fast at different things. A key-value lookup in Redis beats a five-table join, but a Postgres query with the right index beats a MongoDB aggregation that has to fan out across documents. NoSQL's headline win is horizontal scale and schema flexibility, and it pays for both by pushing joins, transactions, and consistency work up into your application code.

QA Lens The riskiest word above is "eventually." On a NoSQL store, a test that writes then immediately reads back can pass on a fast day and fail under load when the read hits a lagging replica. Test for read-your-own-writes expectations explicitly, and never assume a freshly written row is visible everywhere the instant the write returns.


Database Indexing — the back-of-the-book shortcut

Without an index, finding a row means scanning every row. Fine at 100 rows; a catastrophe at 100 million. An index is a separate data structure that turns a linear scan into a fast lookup — like the index at the back of a textbook.

With an index vs without
requestresponse· hover a node to trace it
indexedno index
WHERE email = ...
B-tree indexO(log n) lookup
Full table scanO(n)
Row on disk
The dashed path is the trap: a full scan feels fine at 100 rows and becomes a catastrophe at 100 million.

Most databases default to B-tree indexes: a sorted, balanced tree giving O(log n) lookups. Create one on email and the engine builds a tree mapping email values straight to row locations.

But indexes aren't free — and that trade-off is the whole interview answer:

Reads on indexed columns get dramatically faster
Every INSERT/UPDATE must also update the index (slower writes)
Indexes consume extra storage

The rule of thumb: index columns that appear in WHERE clauses and JOIN conditions — and resist indexing everything.

QA Lens Performance tests on tiny seed data are theater — a full scan and an index lookup both finish in 2ms when there are 50 rows. Load realistic volumes, then read the query plan (EXPLAIN) to confirm the index is actually used. A missing index shows up as a Seq Scan that quietly scales linearly into an outage.


Vertical Partitioning — split wide tables by how they're read

When a table sprouts dozens of columns, vertical partitioning slices it into narrower tables joined by a shared key. The organizing principle: separate columns by access pattern.

(Its sibling, horizontal partitioning — splitting by rows instead of columns — is better known as sharding, covered in the Scaling page. Vertical splits columns; horizontal splits rows.)

Split one wide table by access pattern
requestresponse· hover a node to trace it
profile readsauth pathcheckout
Users Tablename, email, password,…
Profilename, email, avatar, b…
Authpassword_hash, login_c…
Billingbilling_address, card_…
Each query now touches only the bytes it needs — and billing fields can sit behind stricter access controls.

Your profile page needs a name and avatar, not a credit card number. Splitting the table means each query touches only the bytes it needs — fewer reads from disk, faster responses. It also lets you put stricter access controls on the billing partition and tune each table's indexes and storage independently.

When to mention it

Bring up vertical partitioning when you have a wide table whose columns are accessed in very different contexts and security tiers — the textbook case being splitting sensitive billing fields away from public profile fields.

QA Lens Splitting one table into three turns one write into three — and that's no longer atomic unless someone wraps it in a transaction. Test the half-written user: create an account, kill the process between the Profile insert and the Auth insert, and check whether you've made a user who exists but can't log in. Then verify the security promise actually holds: the profile-page query must not be able to select from the billing partition at all.


Caching — the single biggest read win

A cache stores frequently accessed data in a fast layer (usually memory) so you skip the database on most reads. Done right, it's the highest-leverage performance technique you have.

Cache-Aside · Live Hit / Miss
hitmiss
on miss
App
Cache
Databasesource of truth · A B C D EHIT RATE 0% · 0 hits / 0 misses
Watch the hit rate climb as hot keys settle into cache, then press Expire to see the cold-start miss storm.
Play with it

Let it run and watch the hit rate climb as hot keys (A and B come up most) settle into the three cache slots. The first request for any key is an amber miss that detours to the database and stores the value; every request after is a green hit served straight from cache. Hit Expire cache to flush it and watch the cold-start miss storm before it warms back up.

The pattern above is cache-aside (lazy loading): check the cache, return on a hit, and on a miss fetch from the DB and write the result back for next time. Redis and Memcached are the usual tools.

The genuinely hard part of caching isn't reading — it's invalidation. When the underlying data changes, the cache must update or expire, or users see stale data:

TTL
Entries auto-expire after a set time. Simple; tolerates brief staleness.
Write-through
Update cache and database together on every write. Fresh, but slower writes.
Write-behind
Update cache first, flush to DB later. Fast, but risk losing un-flushed writes.

There are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors.

QA Lens Caching turns "stale data" into a first-class test case. After an update, assert the user sees the new value within the promised TTL — not the old one forever. Probe the cold start (empty cache under load → does the DB survive the stampede?) and the thundering herd when a hot key expires and a thousand requests miss at once.


Denormalization — trade clean schema for fast reads

Normalized schemas split data across many tables to avoid duplication. Clean — but rendering one screen might mean joining five tables. Denormalization deliberately duplicates data to cut those joins.

Normalized — read = three joins
requestresponse· hover a node to trace it
JOINJOIN
Ordersorder_id, user_id, pro…
Usersuser_id, name
Productsproduct_id, title, pri…
Every read of one order pays for two joins — clean to store, expensive to render.

Showing an order with the buyer's name and product title means joining three tables on every read. Clean to store, expensive to read.

Denormalization is a read optimization — reach for it when the system is read-heavy and those reads drown in joins. It's the norm in NoSQL, where joins aren't supported. The interview rule: never propose denormalization without a plan to keep the duplicated data consistent.

QA Lens Duplicated data is duplicated bugs waiting to drift. Write a test that renames a user and then verifies every copy of user_name updates — orders, invoices, cached summaries. Drift between copies is the classic denormalization defect, and it never shows up in a single-table unit test.


Blob Storage — the right home for big files

Images, videos, PDFs, backups — "binary large objects" don't belong in your rows. Blob storage (Amazon S3 and friends) stores massive unstructured files cheaply and durably.

Files in blob storage, URL in the database
requestresponse· hover a node to trace it
1 upload2 URL3 store URL4 replicate5 download
App Server
User
Blob StoreAmazon S3
Databaseholds the URL
CDN Edge
Only a short URL ever enters the database — the heavy bytes flow from the blob store to the CDN to the user.

The pattern is simple and worth committing to memory: store the file in blob storage, put the returned URL in your database. Your DB stays lean (a short URL string, not a 10MB image), and the blob store handles durability, replication, and serving. S3 advertises eleven nines of durability (99.999999999%) and plugs straight into a CDN for fast delivery.

Reflex rule

Whenever a design touches media — avatars, uploads, video, documents — say it out loud: "Files go in blob storage; the database holds the URL." It's an expected, easy point to score.

QA Lens The bug hides in the seam between the two stores. Test the orphan cases: a DB row pointing at a file that failed to upload, and a file with no row referencing it. Verify access control on the URL (is that "private" upload actually reachable by a logged-out stranger?) and that large uploads fail gracefully rather than half-writing.


Test Yourself

Answer from memory first, then expand to check.

Q1. A query that took 5ms in staging takes 8 seconds in production. Staging has 1,000 rows; production has 10 million. What's your first diagnostic step?

Run EXPLAIN on the production query and look for a sequential scan. A missing index is invisible at small scale — both a full scan and an index lookup are instant on 1,000 rows. This is also why performance tests on tiny seed data prove nothing.

Q2. Indexes make reads faster. Why not index every column?

Every INSERT/UPDATE must also update every index, so writes get slower with each one — and each index costs storage. Index what appears in WHERE and JOIN clauses; leave the rest.

Q3. Vertical partitioning and sharding both "split" data. What's the difference in one sentence each?

Vertical partitioning splits a table by columns (profile fields here, billing fields there) to match access patterns. Sharding — horizontal partitioning — splits by rows (users A–M on one node, N–Z on another) to scale storage and write load across machines.

Q4. You denormalized user_name into the orders table for fast reads. A user renames themselves. What must your test now verify?

That every copy of the duplicated data gets updated — orders, invoices, cached summaries. Drift between copies is the classic denormalization defect, and no single-table unit test will catch it. Never propose denormalization without naming the mechanism that keeps copies in sync.

Q5. Where does a user's 10MB profile video live, and what goes in the database?

The video goes in blob storage (S3 or similar, fronted by a CDN); the database stores only the URL. Then test the seams: a row pointing at a failed upload, an orphaned file with no row, and whether a "private" upload's URL is really private.


Quick Revision

Databases

Relational, document, key-value, graph. Pick by data shape and access pattern.

SQL vs NoSQL

ACID + joins vs flexible + horizontal scale. Default to SQL; justify NoSQL.

Indexing

B-tree → O(log n) reads. Faster reads, slower writes. Don't index everything.

Vertical Partitioning

Split wide tables by access pattern; touch only the columns you need.

Caching

Cache-aside for fast reads. Invalidation (TTL / write-through) is the hard part.

Denormalization

Duplicate data to kill joins on read-heavy paths — with a consistency plan.

Blob Storage

Big files in S3; store only the URL in the database. Cheap, durable, CDN-ready.


Where to Next

You can store data efficiently on one machine. What happens when one machine isn't enough?