System Design: A File Storage Service Like Dropbox

Design a file storage service with uploads, sync, deduplication, and sharing, scaling to petabytes while keeping reads fast and cheap.

3 min read · By Codeloom
Intermediate 10 min read

What you'll learn

  • Chunking and content-addressable storage
  • How sync clients detect changes
  • Metadata vs blob separation
  • Deduplication strategies
  • Sharing and permissions at scale

Prerequisites

  • Familiarity with HTTP and databases

What and Why

A Dropbox-style service needs to upload, store, version, and sync files across many devices and users. The interesting parts are not “save bytes to disk” but how to do that efficiently when files are huge, often duplicated, and shared across organizations.

Mental Model

Separate metadata from blob storage. Metadata is small, queryable, transactional. Blobs are large, immutable, content-addressed. This split is the heart of every modern storage service.

A file becomes a list of chunks. Each chunk is identified by a hash. The metadata layer maps file paths to chunk lists. The blob layer just stores chunks.
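A minimal sketch of this model, using in-memory dictionaries as stand-ins for the two layers. The function names, the choice of SHA-256, and the tiny chunk size in the usage note are assumptions made for illustration; a real service would use a database and object storage.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # default chunk size in bytes (an assumption here)

# Metadata layer: path -> ordered chunk ids. Blob layer: chunk id -> bytes.
metadata: dict[str, list[str]] = {}
blobs: dict[str, bytes] = {}


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Cut data into fixed-size pieces; the last piece may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def chunk_id(chunk: bytes) -> str:
    """Content address of a chunk: the hex SHA-256 of its bytes."""
    return hashlib.sha256(chunk).hexdigest()


def build_manifest(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Ordered list of chunk ids that describes a file."""
    return [chunk_id(c) for c in split_into_chunks(data, chunk_size)]


def put_file(path: str, data: bytes, chunk_size: int = CHUNK_SIZE) -> None:
    """Store every chunk under its id, then record the path's chunk list."""
    for chunk in split_into_chunks(data, chunk_size):
        blobs[chunk_id(chunk)] = chunk  # identical chunks collapse to one entry
    metadata[path] = build_manifest(data, chunk_size)


def get_file(path: str) -> bytes:
    """Reassemble a file by concatenating its chunks in manifest order."""
    return b"".join(blobs[cid] for cid in metadata[path])
```

Calling put_file("/a.txt", b"aaaabbbbaaaa", chunk_size=4) records three chunk ids for the path but stores only two blobs, because the first and third chunks have the same content and therefore the same id.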

Architecture

The upload path splits each file into 4 MB chunks, hashes each chunk, and uploads only the chunks the server does not already have. The metadata service writes the file record only after all chunks are confirmed stored.

Client -> Chunker -> Hash chunks -> Check which exist
                                        |
                            upload missing chunks
                                        |
                                    Blob Store (S3)
                                        |
                            commit metadata (file -> chunk list)
                                        |
                                  Metadata DB
Upload and dedup flow
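The flow in the diagram can be sketched roughly as follows. BlobStore is a hypothetical in-memory stand-in for object storage, metadata_db is a plain dictionary, and SHA-256 is an assumed hash; none of these names come from a real API.

```python
import hashlib


class BlobStore:
    """In-memory stand-in for object storage, keyed by chunk hash."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}
        self.put_count = 0  # counts real uploads, to make dedup visible

    def missing(self, hashes: list[str]) -> set[str]:
        """Return the subset of hashes the store does not hold yet."""
        return {h for h in hashes if h not in self.objects}

    def put(self, chunk_hash: str, chunk: bytes) -> None:
        # Verify the content address before accepting the bytes.
        if hashlib.sha256(chunk).hexdigest() != chunk_hash:
            raise ValueError("chunk does not match its hash")
        self.objects[chunk_hash] = chunk
        self.put_count += 1


def upload(path: str, data: bytes, store: BlobStore, metadata_db: dict,
           chunk_size: int = 4 * 1024 * 1024) -> list[str]:
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    hashes = [hashlib.sha256(c).hexdigest() for c in chunks]
    by_hash = dict(zip(hashes, chunks))

    # Steps 1 and 2: ask which chunks are missing, then upload only those.
    for h in store.missing(hashes):
        store.put(h, by_hash[h])

    # Step 3: commit metadata only once every chunk is confirmed present.
    if store.missing(hashes):
        raise RuntimeError("some chunks were not stored")
    metadata_db[path] = hashes
    return hashes
```

Uploading b"aaaabbbb" and then b"bbbbcccc" with chunk_size=4 results in three put calls rather than four, since the shared chunk is sent only once.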

The blob store is object storage like S3. The metadata DB is typically a relational database sharded by user id. A notification service pushes change events to other devices so they can sync.
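As a rough illustration of sharding by user id, the sketch below hashes the id and takes it modulo the shard count. The function name and the modulo scheme are assumptions; production systems often use a lookup table or consistent hashing instead, because plain modulo reshuffles most users when the shard count changes.

```python
import hashlib


def shard_for_user(user_id: str, shard_count: int) -> int:
    """Map a user id to a shard index in [0, shard_count)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # A stable hash keeps the mapping identical across processes and restarts
    # (Python's built-in hash() is randomized per process for strings).
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

Because all of a user's file records land on one shard, the operations a single user performs can run in one local transaction.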

Sync clients keep a local index keyed by path that stores each file's last-known content hash. On startup they ask the server for “changes since cursor X”, apply the returned changes to the local index, and save the new cursor.
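One way to picture the cursor protocol is an append-only change log on the server and a client that remembers how far it has read. The class names and the use of a sequence number as the cursor are assumptions for this sketch, not a description of any particular product's protocol.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Change:
    seq: int                      # position in the server's change log
    path: str
    content_hash: Optional[str]   # None marks a deletion


@dataclass
class ChangeLog:
    """Server side: an append-only, per-user log of metadata changes."""
    entries: list = field(default_factory=list)

    def record(self, path: str, content_hash: Optional[str]) -> int:
        seq = len(self.entries) + 1
        self.entries.append(Change(seq, path, content_hash))
        return seq

    def changes_since(self, cursor: int):
        """Return (changes after the cursor, new cursor)."""
        # seq numbers start at 1, so slicing at `cursor` skips seen entries.
        return self.entries[cursor:], len(self.entries)


@dataclass
class SyncClient:
    """Client side: local index of path -> last-known content hash."""
    index: dict = field(default_factory=dict)
    cursor: int = 0

    def sync(self, log: ChangeLog) -> list:
        """Apply unseen changes and return the paths that were touched."""
        changes, new_cursor = log.changes_since(self.cursor)
        touched = []
        for change in changes:
            if change.content_hash is None:
                self.index.pop(change.path, None)
            elif self.index.get(change.path) != change.content_hash:
                self.index[change.path] = change.content_hash
            else:
                continue  # already up to date locally
            touched.append(change.path)
        self.cursor = new_cursor  # persist only after applying the batch
        return touched
```

A second call to sync with no new log entries returns an empty list, which is what makes the protocol cheap to poll or to trigger from a notification.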

Trade-offs

Content-addressed chunks save storage but conflict with per-user encryption: if two users upload the same file and each copy is encrypted with a different per-user key, the ciphertexts differ and the server can no longer tell that the chunks are duplicates. Many services therefore dedupe within an account only.
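A small sketch of account-scoped dedup: the storage key includes the account id, so identical chunks collapse within one account but not across accounts. The class name is hypothetical and the encryption step itself is left out; the point is only the scoping of the dedup key.

```python
import hashlib


class AccountScopedStore:
    """Chunks are deduplicated only within one account.

    Because the key includes the account id, each account's chunks could be
    encrypted under that account's own key (encryption is omitted here).
    """

    def __init__(self) -> None:
        self.objects: dict[tuple[str, str], bytes] = {}

    def put(self, account_id: str, chunk: bytes) -> tuple[str, bool]:
        """Store a chunk; return (chunk_hash, was_newly_stored)."""
        chunk_hash = hashlib.sha256(chunk).hexdigest()
        key = (account_id, chunk_hash)
        if key in self.objects:
            return chunk_hash, False  # duplicate within this account
        self.objects[key] = chunk
        return chunk_hash, True
```

Storing the same bytes twice for one account writes a single object, while a second account uploading those bytes gets its own copy. That is the storage cost paid for keeping accounts isolated.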

Strong consistency in the metadata layer simplifies sync but caps write throughput. Most systems pick per-user strong consistency and cross-user eventual consistency.

Aggressive client-side caching makes the UI feel instant but causes nasty bugs when the server state diverges. A clear cursor-based sync protocol is non-negotiable.

Practical Tips

Issue pre-signed URLs so clients upload chunks directly to object storage. This keeps your servers out of the data path and saves the bandwidth and compute they would otherwise spend relaying bytes.
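To show the idea behind a pre-signed URL, here is a simplified HMAC-based scheme. It is not the signing format used by S3 or any other provider; the host name, query parameters, and secret are all made up for illustration. In practice you would call your storage provider's SDK to generate the URL.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical key shared by the API and storage tier


def _sign(method: str, path: str, expires: int) -> str:
    message = f"{method}\n{path}\n{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def presign_upload(chunk_hash: str, ttl_seconds: int = 300,
                   now: Optional[float] = None) -> str:
    """Return a URL that authorizes one PUT of this chunk until it expires."""
    current = time.time() if now is None else now
    expires = int(current) + ttl_seconds
    path = f"/chunks/{chunk_hash}"
    query = urlencode({"expires": expires, "sig": _sign("PUT", path, expires)})
    return f"https://blobs.example.com{path}?{query}"


def verify_upload(url: str, now: Optional[float] = None) -> bool:
    """Storage-side check: signature matches and the URL has not expired."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    if current > expires:
        return False
    # Constant-time comparison avoids leaking signature bytes via timing.
    return hmac.compare_digest(sig, _sign("PUT", parsed.path, expires))
```

The application server only signs a short string per chunk; the chunk bytes travel straight from the client to storage, which validates the signature on its own.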

Use a Merkle tree of chunk hashes to quickly compare client and server state. A single root hash mismatch tells the client to walk down and find the differing subtree.

Soft-delete files with a tombstone and TTL. Immediate hard deletion leaves users no way to recover from an accidental delete and breaks shared links. Garbage-collect orphan chunks asynchronously once no file references them.
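The lifecycle might look like the sketch below: a tombstone timestamp per path, a purge step that drops metadata after the retention period, and a mark-and-sweep pass over chunks. FileIndex and its method names are hypothetical, and the sketch ignores concurrency; a real collector also has to avoid deleting chunks that an in-flight upload is about to reference.

```python
import time
from typing import Optional


class FileIndex:
    def __init__(self, retention_seconds: float) -> None:
        self.retention = retention_seconds
        self.files: dict[str, list[str]] = {}   # path -> chunk hashes
        self.deleted_at: dict[str, float] = {}  # path -> tombstone timestamp
        self.chunks: dict[str, bytes] = {}      # chunk hash -> bytes

    def soft_delete(self, path: str, now: Optional[float] = None) -> None:
        """Hide the file but keep its metadata and chunks for now."""
        self.deleted_at[path] = time.time() if now is None else now

    def restore(self, path: str) -> None:
        """Undo a soft delete while the tombstone is still alive."""
        self.deleted_at.pop(path, None)

    def purge_expired(self, now: Optional[float] = None) -> list:
        """Drop metadata for files whose tombstone outlived the TTL."""
        current = time.time() if now is None else now
        expired = [p for p, t in self.deleted_at.items()
                   if current - t >= self.retention]
        for path in expired:
            del self.deleted_at[path]
            self.files.pop(path, None)
        return expired

    def collect_orphans(self) -> list:
        """Mark-and-sweep: delete chunks that no remaining file references."""
        live = {h for hashes in self.files.values() for h in hashes}
        orphans = [h for h in self.chunks if h not in live]
        for h in orphans:
            del self.chunks[h]
        return orphans
```

A chunk shared by two files survives the purge of one of them, because the sweep checks references across all remaining files rather than per deleted file.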

Cap shared folder size and member count. Permission checks on every read get expensive when a folder is shared with thousands of users, so cache the results and invalidate them when membership changes.
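A minimal sketch of a permission cache with a TTL and explicit invalidation. The lookup callable stands in for whatever query answers the question authoritatively; the class and method names are assumptions for this example.

```python
import time
from typing import Callable, Optional


class PermissionCache:
    def __init__(self, lookup: Callable[[str, str], bool],
                 ttl_seconds: float) -> None:
        self._lookup = lookup  # authoritative (slow) permission check
        self._ttl = ttl_seconds
        # (user_id, folder_id) -> (allowed, expiry time)
        self._entries: dict[tuple[str, str], tuple[bool, float]] = {}

    def can_read(self, user_id: str, folder_id: str,
                 now: Optional[float] = None) -> bool:
        current = time.monotonic() if now is None else now
        key = (user_id, folder_id)
        cached = self._entries.get(key)
        if cached is not None and cached[1] > current:
            return cached[0]  # fresh cache hit, no lookup needed
        allowed = self._lookup(user_id, folder_id)
        self._entries[key] = (allowed, current + self._ttl)
        return allowed

    def invalidate_folder(self, folder_id: str) -> None:
        """Call when a folder's membership changes."""
        for key in [k for k in self._entries if k[1] == folder_id]:
            del self._entries[key]
```

The TTL bounds how long a revoked permission can linger if an invalidation is missed, so it is worth keeping short for sensitive folders.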

Wrap-up

A file storage service is really a metadata problem with blobs attached. Get the chunking, hashing, and sync protocol right and the rest is engineering effort. The pattern repeats across products from photo backup to source control. Once you see the chunked, content-addressed model, you will spot it everywhere.