System Design: A File Storage Service Like Dropbox
Design a file storage service with uploads, sync, deduplication, and sharing, scaling to petabytes while keeping reads fast and cheap.
What you'll learn
- Chunking and content-addressable storage
- How sync clients detect changes
- Metadata vs blob separation
- Deduplication strategies
- Sharing and permissions at scale
Prerequisites
- Familiarity with HTTP and databases
What and Why
A Dropbox-style service needs to upload, store, version, and sync files across many devices and users. The interesting parts are not “save bytes to disk” but how to do that efficiently when files are huge, often duplicated, and shared across organizations.
Mental Model
Separate metadata from blob storage. Metadata is small, queryable, transactional. Blobs are large, immutable, content-addressed. This split is the heart of every modern storage service.
A file becomes a list of chunks. Each chunk is identified by a hash. The metadata layer maps file paths to chunk lists. The blob layer just stores chunks.
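The file-to-chunk-list idea can be sketched in a few lines. This is a minimal illustration, not a production chunker: the function names (`iter_chunks`, `chunk_id`, `build_manifest`) are hypothetical, SHA-256 is an assumed choice of hash, and chunks are fixed-size at 4 MB.

```python
import hashlib
from typing import Iterator, List

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB fixed-size chunks (assumed default)


def iter_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield consecutive fixed-size chunks; the last one may be shorter."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


def chunk_id(chunk: bytes) -> str:
    """Content address of a chunk: hex SHA-256 of its bytes (assumed hash)."""
    return hashlib.sha256(chunk).hexdigest()


def build_manifest(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Ordered chunk ids for one file; the metadata layer would store this list."""
    return [chunk_id(chunk) for chunk in iter_chunks(data, chunk_size)]
```

Two identical chunks always get the same id, so the blob layer only needs to keep one copy regardless of how many files reference it.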
Architecture
The upload path chunks files into 4 MB pieces, hashes each chunk, and uploads only chunks the server has never seen. The metadata service writes the file record once all chunks are confirmed.
    Client -> Chunker -> Hash chunks -> Check which exist
                                              |
                                    upload missing chunks
                                              |
                                       Blob Store (S3)
                                              |
                            commit metadata (file -> chunk list)
                                              |
                                         Metadata DB

The blob store is object storage like S3. The metadata DB is typically a relational database sharded by user id. A notification service pushes change events to other devices so they can sync.
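A rough sketch of the upload path, under stated assumptions: `InMemoryBlobStore` is a hypothetical stand-in for object storage, the metadata DB is modeled as a plain dict from path to chunk list, and SHA-256 is an assumed hash. The point is the ordering: check, upload only what is missing, then commit metadata last.

```python
import hashlib
from typing import Dict, List, Set


class InMemoryBlobStore:
    """Hypothetical stand-in for object storage, keyed by chunk hash."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}

    def missing(self, hashes: List[str]) -> Set[str]:
        """Return the subset of hashes the store has never seen."""
        return {h for h in hashes if h not in self.chunks}

    def put(self, chunk_hash: str, data: bytes) -> None:
        # Refuse data that does not match its claimed content address.
        if hashlib.sha256(data).hexdigest() != chunk_hash:
            raise ValueError("chunk does not match its hash")
        self.chunks[chunk_hash] = data


def upload_file(path: str, data: bytes, blobs: InMemoryBlobStore,
                metadata: Dict[str, List[str]],
                chunk_size: int = 4 * 1024 * 1024) -> int:
    """Upload one file; return how many chunks were actually sent."""
    pieces = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    hashes = [hashlib.sha256(p).hexdigest() for p in pieces]

    to_send = blobs.missing(hashes)  # one round trip: "which of these exist?"
    uploaded = 0
    for chunk_hash, piece in zip(hashes, pieces):
        if chunk_hash in to_send:
            blobs.put(chunk_hash, piece)
            to_send.discard(chunk_hash)  # a chunk may repeat within one file
            uploaded += 1

    # Commit metadata only once every chunk is confirmed present.
    if blobs.missing(hashes):
        raise RuntimeError("some chunks were not stored; not committing")
    metadata[path] = hashes
    return uploaded
```

Committing the file record last means a crashed upload leaves only unreferenced chunks behind, which a background garbage collector can reclaim, rather than a file record pointing at missing data.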
Sync clients keep a local index keyed by path and last-known content hash. On startup they request “changes since cursor X” and apply incoming patches.
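The "changes since cursor X" exchange might look roughly like this. Everything here is illustrative: `ChangeLog`, `Change`, and `SyncClient` are hypothetical names, the cursor is assumed to be a monotonically increasing sequence number, and a `content_hash` of `None` is assumed to mean the path was deleted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Change:
    seq: int                     # position in the server's change log
    path: str
    content_hash: Optional[str]  # None means the path was deleted (assumption)


@dataclass
class ChangeLog:
    """Hypothetical server-side log of changes for one user."""
    entries: List[Change] = field(default_factory=list)

    def record(self, path: str, content_hash: Optional[str]) -> int:
        seq = len(self.entries) + 1
        self.entries.append(Change(seq, path, content_hash))
        return seq

    def changes_since(self, cursor: int) -> Tuple[List[Change], int]:
        """Return changes after the cursor plus the cursor to use next time."""
        new = [c for c in self.entries if c.seq > cursor]
        return new, (new[-1].seq if new else cursor)


@dataclass
class SyncClient:
    """Local index keyed by path, holding the last-known content hash."""
    index: Dict[str, str] = field(default_factory=dict)
    cursor: int = 0

    def sync(self, log: ChangeLog) -> int:
        changes, new_cursor = log.changes_since(self.cursor)
        for change in changes:
            if change.content_hash is None:
                self.index.pop(change.path, None)
            elif self.index.get(change.path) != change.content_hash:
                # A real client would fetch the missing chunks here.
                self.index[change.path] = change.content_hash
        self.cursor = new_cursor  # advance only after all changes are applied
        return len(changes)
```

Advancing the cursor only after the batch is applied means a client that crashes mid-sync simply replays the same changes, which is safe because applying a change is idempotent.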
Trade-offs
Content-addressed chunks save storage but conflict with per-user encryption: the same file encrypted under two different user keys produces different ciphertext, so the chunks no longer hash to the same value and cannot be deduplicated across users. Many services dedupe within an account only.
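One simple way to express "dedupe within an account only" is to scope the storage key by account. The sketch below is an assumption about how that could be done, not a description of any particular service; `global_key` and `account_scoped_key` are hypothetical helpers.

```python
import hashlib


def global_key(chunk: bytes) -> str:
    """Key for cross-user dedupe: identical bytes collapse to one object."""
    return hashlib.sha256(chunk).hexdigest()


def account_scoped_key(account_id: str, chunk: bytes) -> str:
    """Key for per-account dedupe: identical bytes collapse only within one
    account, so each account's chunks can be encrypted under its own key."""
    return f"{account_id}/{hashlib.sha256(chunk).hexdigest()}"
```

With the scoped key, two accounts uploading the same file store two copies, which costs storage but keeps each account's data independently encryptable and deletable.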
Strong consistency in the metadata layer simplifies sync but caps write throughput. Most systems pick per-user strong consistency and cross-user eventual consistency.
Aggressive client-side caching makes the UI feel instant but causes nasty bugs when the server state diverges. A clear cursor-based sync protocol is non-negotiable.
Practical Tips
Issue pre-signed URLs so clients upload chunks directly to object storage. This keeps your servers out of the data path and saves cost.
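To show the idea behind a pre-signed URL, here is a toy HMAC-based version using only the standard library. It is deliberately not the signing scheme of S3 or any real provider (with a real object store you would use the presigning call in its SDK); the host `blobs.example.com`, the secret, and the function names are all made up for illustration.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical signing key shared by API and blob tier


def presign_put(chunk_hash: str, expires_at: int, secret: bytes = SECRET,
                base: str = "https://blobs.example.com") -> str:
    """Return a URL that authorizes one PUT of one chunk until expires_at."""
    path = f"/chunks/{chunk_hash}"
    message = f"PUT\n{path}\n{expires_at}".encode()
    sig = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{base}{path}?" + urlencode({"expires": expires_at, "sig": sig})


def verify_put(url: str, now: int, secret: bytes = SECRET) -> bool:
    """What the storage tier would check before accepting the upload."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        expires_at = int(query["expires"][0])
        sig = query["sig"][0]
    except (KeyError, ValueError):
        return False
    if now > expires_at:
        return False  # link has expired
    message = f"PUT\n{parsed.path}\n{expires_at}".encode()
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```

Because the signature covers the method, the path, and the expiry, a client cannot reuse the URL for a different chunk or after the deadline, and the application servers never touch the chunk bytes.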
Use a Merkle tree of chunk hashes to quickly compare client and server state. A single root hash mismatch tells the client to walk down and find the differing subtree.
Soft-delete files with a tombstone and TTL. Hard deletion confuses users and breaks shared links. Garbage-collect orphan chunks asynchronously.
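The tombstone-then-collect flow can be sketched as three small steps. The names (`FileRecord`, `soft_delete`, `purge_expired`, `collect_orphans`) and the dict-based stores are assumptions for illustration; timestamps are passed in explicitly so the logic is easy to follow.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class FileRecord:
    chunks: List[str]
    deleted_at: Optional[float] = None  # tombstone timestamp; None means live


def soft_delete(files: Dict[str, FileRecord], path: str, now: float) -> None:
    """Mark the file deleted but keep its record and chunks for the TTL."""
    files[path].deleted_at = now


def purge_expired(files: Dict[str, FileRecord], now: float, ttl: float) -> List[str]:
    """Hard-delete records whose tombstone is older than the TTL."""
    expired = [path for path, rec in files.items()
               if rec.deleted_at is not None and now - rec.deleted_at >= ttl]
    for path in expired:
        del files[path]
    return expired


def collect_orphans(files: Dict[str, FileRecord], blobs: Dict[str, bytes]) -> Set[str]:
    """Remove chunks that no remaining record references (mark and sweep).
    Tombstoned records still count as references until they are purged."""
    referenced = {h for rec in files.values() for h in rec.chunks}
    orphans = set(blobs) - referenced
    for chunk_hash in orphans:
        del blobs[chunk_hash]
    return orphans
```

Because chunks are shared between files, a chunk is only safe to remove once no record at all references it, which is why collection runs as a separate pass rather than at delete time.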
Cap shared folder size and member count. Permissions checks on every read get expensive when a folder is shared with thousands of users; cache them.
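Caching permission checks could be as simple as a TTL cache in front of the real lookup. This is a sketch under assumptions: `PermissionCache` is a hypothetical class, `lookup` stands in for whatever query the metadata DB would run, and the clock is passed in explicitly.

```python
from typing import Callable, Dict, Tuple


class PermissionCache:
    """TTL cache in front of an expensive permission lookup."""

    def __init__(self, lookup: Callable[[str, str], bool], ttl_seconds: float) -> None:
        self._lookup = lookup
        self._ttl = ttl_seconds
        # (user_id, folder_id) -> (allowed, expiry time)
        self._entries: Dict[Tuple[str, str], Tuple[bool, float]] = {}

    def can_read(self, user_id: str, folder_id: str, now: float) -> bool:
        key = (user_id, folder_id)
        cached = self._entries.get(key)
        if cached is not None and now < cached[1]:
            return cached[0]  # fresh cache hit, no database round trip
        allowed = self._lookup(user_id, folder_id)
        self._entries[key] = (allowed, now + self._ttl)
        return allowed

    def invalidate_folder(self, folder_id: str) -> None:
        """Drop cached answers for a folder when its sharing settings change."""
        for key in [k for k in self._entries if k[1] == folder_id]:
            del self._entries[key]
```

The TTL bounds how long a revoked permission can linger if an invalidation is missed, so it is worth keeping short for sensitive folders.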
Wrap-up
A file storage service is really a metadata problem with blobs attached. Get the chunking, hashing, and sync protocol right and the rest is engineering effort. The pattern repeats across products from photo backup to source control. Once you see the chunked, content-addressed model, you will spot it everywhere.