URL Shortening System Design

1. Requirements

Functional Requirements:

Generate a short URL (short link) for a given long URL.
Redirect users who open a short link to the original URL.

Non-Functional Requirements:

High availability.
Low latency for shortening and redirection.
Short links should be short (6-8 characters).
Handle a large number of requests.

Capacity Estimation:

100 million URLs shortened per day.
1 billion URL redirections per day.
Mappings are retained for 5 years.
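
The figures above translate into request rates and a total row count with simple arithmetic. The sketch below works that out. The average row size (ASSUMED_BYTES_PER_ROW) is an assumption made only for illustration and is not part of the stated requirements.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365

# Inputs taken from the capacity estimation above.
writes_per_day = 100_000_000
reads_per_day = 1_000_000_000
retention_years = 5

# Assumption (not from the requirements): average stored row size in bytes.
ASSUMED_BYTES_PER_ROW = 500

writes_per_second = writes_per_day / SECONDS_PER_DAY
reads_per_second = reads_per_day / SECONDS_PER_DAY
total_urls = writes_per_day * DAYS_PER_YEAR * retention_years
total_storage_tb = total_urls * ASSUMED_BYTES_PER_ROW / 1e12

# Smallest base 62 key length whose key space covers every stored URL.
key_length = 1
while 62 ** key_length < total_urls:
    key_length += 1

print(f"average writes/s: {writes_per_second:,.0f}")
print(f"average reads/s:  {reads_per_second:,.0f}")
print(f"total URLs:       {total_urls:,}")
print(f"storage (TB):     {total_storage_tb:,.2f}")
print(f"min key length:   {key_length}")
```

This gives roughly 1,157 writes and 11,574 reads per second on average (a 10:1 read-to-write ratio) and 182.5 billion stored URLs, which needs at least 7 base 62 characters. Peak traffic would be higher than these averages.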

2. High-Level Design

The system consists of:

URL Shortening Service: Accepts long URLs and generates short URLs.
URL Redirection Service: Resolves short URLs to their original URLs.
Data Store: Stores the mappings between short and long URLs.

[Diagram of the high-level architecture: Client <-> URL Shortening Service <-> Data Store, Client <-> URL Redirection Service <-> Data Store]

3. Data Model

Using a relational database.

URLs Table:

Field	Type	Description
id	BIGINT	Primary key, auto-increment
short_url	VARCHAR(8)	Shortened URL identifier
original_url	VARCHAR(2048)	Original URL
created_at	TIMESTAMP	Timestamp of URL shortening
expires_at	TIMESTAMP	Optional expiration timestamp

Indexes:

Unique index on short_url for fast lookups and to reject duplicate short links.
Index on created_at for archival purposes.
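
As a rough illustration, the table and indexes above could be created as follows. This sketch uses Python's built-in sqlite3 module with an in-memory database purely so it runs anywhere; the index names are hypothetical, the uniqueness constraint on short_url is an assumption, and MySQL or PostgreSQL would use their own auto-increment syntax.

```python
import sqlite3

# Schema mirroring the URLs table above. SQLite spells auto-increment as
# INTEGER PRIMARY KEY AUTOINCREMENT; other databases differ.
SCHEMA = """
CREATE TABLE urls (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    short_url    VARCHAR(8)    NOT NULL,
    original_url VARCHAR(2048) NOT NULL,
    created_at   TIMESTAMP     NOT NULL DEFAULT CURRENT_TIMESTAMP,
    expires_at   TIMESTAMP
);
-- Assumption: the short_url index is unique so duplicates are rejected.
CREATE UNIQUE INDEX idx_urls_short_url ON urls (short_url);
CREATE INDEX idx_urls_created_at ON urls (created_at);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Insert one mapping and look it up by short_url.
conn.execute(
    "INSERT INTO urls (short_url, original_url) VALUES (?, ?)",
    ("xyz123", "https://www.example.com/very/long/url"),
)
row = conn.execute(
    "SELECT original_url FROM urls WHERE short_url = ?", ("xyz123",)
).fetchone()
print(row[0])
```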

4. Endpoints

URL Shortening Endpoint:

Endpoint: /shorten
Method: POST

Request:

{
  "long_url": "https://www.example.com/very/long/url"
}

Response:

{
  "short_url": "http://short.url/xyz123"
}

URL Redirection Endpoint:

Endpoint: /{short_code} (for example, /xyz123)
Method: GET
Response: HTTP 302 Redirect to the original URL.
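
The two endpoints can be sketched as plain functions that return a status code, headers, and a body, without tying the example to any web framework. Everything here is illustrative: the class name, the in-memory dictionary standing in for the data store, the placeholder code generator, and the 400 and 404 responses are assumptions that the design above does not specify.

```python
import itertools
import json


class UrlShortenerApi:
    """In-memory sketch of POST /shorten and GET /{short_code}."""

    def __init__(self, base_url="http://short.url"):
        self.base_url = base_url
        self.mappings = {}              # short code -> long URL
        self._ids = itertools.count(1)  # stand-in for database IDs

    def post_shorten(self, body):
        """Handle POST /shorten; returns (status, headers, body)."""
        try:
            long_url = json.loads(body)["long_url"]
        except (ValueError, KeyError, TypeError):
            # Assumption: malformed requests get HTTP 400.
            return 400, {}, json.dumps({"error": "long_url is required"})
        code = f"u{next(self._ids)}"    # placeholder, not base 62
        self.mappings[code] = long_url
        payload = json.dumps({"short_url": f"{self.base_url}/{code}"})
        return 200, {"Content-Type": "application/json"}, payload

    def get_redirect(self, path):
        """Handle GET /{short_code}; returns (status, headers, body)."""
        long_url = self.mappings.get(path.lstrip("/"))
        if long_url is None:
            # Assumption: unknown short codes get HTTP 404.
            return 404, {}, ""
        return 302, {"Location": long_url}, ""


api = UrlShortenerApi()
status, headers, body = api.post_shorten(
    '{"long_url": "https://www.example.com/very/long/url"}'
)
print(status, body)
```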

5. Design Considerations

Generating Unique Short Links:

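Base 62 Encoding: Use digits (0-9), lowercase letters (a-z), and uppercase letters (A-Z) for a total of 62 characters. A 6-character short link can represent 62^6 (about 56.8 billion) distinct URLs. At 100 million new URLs per day for 5 years (about 182.5 billion URLs), 6 characters are not enough; 7 characters give 62^7 (about 3.5 trillion).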
ID Generation: Use an auto-incrementing ID from the database. Convert the ID to base 62.

Example: ID = 12345 -> short_url = "dnh" (base 62 encoded with the alphabet ordered a-z, A-Z, 0-9; with digits first, the same ID encodes to "3d7")
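
A minimal sketch of the ID-to-short-link conversion follows. The function names are hypothetical, and the alphabet order (a-z, then A-Z, then 0-9) is an assumption chosen because it reproduces the "dnh" example; any fixed ordering of the 62 characters works as long as encoding and decoding agree.

```python
import string

# Assumed alphabet order: a-z, A-Z, 0-9 (this yields 12345 -> "dnh").
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits
BASE = len(ALPHABET)  # 62


def encode_base62(n: int) -> str:
    """Convert a non-negative integer ID to its base 62 string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    # Digits were produced least significant first, so reverse them.
    return "".join(reversed(chars))


def decode_base62(s: str) -> int:
    """Convert a base 62 string back to the integer ID."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n


print(encode_base62(12345))
print(decode_base62("dnh"))
```

Because the mapping is reversible, the redirection service could decode the short link back to the ID and look the row up by primary key.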
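Collision Handling:
- Base 62 encoding of a unique ID is one-to-one, so distinct IDs never produce the same short URL. A collision check is needed only when short links come from hashes, random strings, or custom aliases. In that case, check whether the short URL already exists before inserting; if it does, generate a new candidate and try again, with a limit on the number of retries.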
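
The bounded retry described above might look like the sketch below. It assumes randomly generated candidates and uses a dictionary as a stand-in for the database; the function names and the default of 5 attempts are illustrative. Against a real database, a separate existence check followed by an insert can race with concurrent writers, so a unique constraint on short_url would be the safer arbiter.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def random_code(length: int = 7) -> str:
    """Generate a random base 62 candidate (hypothetical generator)."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def insert_with_retry(store, long_url, generate=random_code, max_attempts=5):
    """Store long_url under a free short code, retrying on collisions.

    `store` is a dict standing in for the database table.
    """
    for _ in range(max_attempts):
        code = generate()
        if code not in store:
            store[code] = long_url
            return code
    raise RuntimeError(f"no free short code after {max_attempts} attempts")


store = {}
code = insert_with_retry(store, "https://www.example.com/very/long/url")
print(code, "->", store[code])
```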

Data Structures and Databases:

Database: Relational database (e.g., MySQL, PostgreSQL) for durability and consistency.
Cache: Use a caching layer (e.g., Redis, Memcached) in front of the database to reduce latency for frequently accessed URLs.

Optimizing for Low Latency Redirection:

Caching:
- Cache short-to-long URL mappings in a distributed cache.
- Use a CDN (Content Delivery Network) to cache redirection responses closer to the users. A 302 response is cached only if it carries explicit caching headers such as Cache-Control.
Database Optimization:
- Index the short_url column in the database.
- Use database connection pooling.
Load Balancing:
- Distribute traffic across multiple redirection servers using a load balancer.
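
One common way to combine the cache and the database on the redirection path is the cache-aside pattern: check the cache first, fall back to the database on a miss, then populate the cache. The sketch below assumes that pattern; the class name is hypothetical, a dictionary stands in for Redis or Memcached, and expiry and eviction are left out.

```python
class CachedResolver:
    """Cache-aside lookup of short code -> long URL."""

    def __init__(self, db_lookup):
        # db_lookup: callable taking a short code and returning the
        # long URL, or None if the code is unknown.
        self._db_lookup = db_lookup
        self._cache = {}   # stand-in for a distributed cache
        self.db_hits = 0   # counts database lookups, for illustration

    def resolve(self, short_code):
        cached = self._cache.get(short_code)
        if cached is not None:
            return cached              # cache hit, no database access
        self.db_hits += 1
        long_url = self._db_lookup(short_code)
        if long_url is not None:
            self._cache[short_code] = long_url  # populate on miss
        return long_url


database = {"xyz123": "https://www.example.com/very/long/url"}
resolver = CachedResolver(database.get)
resolver.resolve("xyz123")   # first call reaches the database
resolver.resolve("xyz123")   # second call is served from the cache
print(resolver.db_hits)
```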

Scaling the System:

Horizontal Scaling: Add more servers to the URL Shortening and Redirection services.
Database Sharding: Shard the database based on the id or short_url to distribute the load.
Replication: Use database replication for read scalability and fault tolerance.
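
For sharding by short_url, a simple option is to hash the short code and take the result modulo the number of shards. The sketch below assumes that scheme; the function name and the shard count of 16 are illustrative. Plain modulo hashing remaps most keys when the shard count changes, which is why consistent hashing is often preferred when shards are added or removed regularly.

```python
import hashlib


def shard_for(short_code: str, num_shards: int) -> int:
    """Map a short code to a shard index in [0, num_shards)."""
    # A stable hash is used so every server picks the same shard;
    # Python's built-in hash() is randomized per process for strings.
    digest = hashlib.sha256(short_code.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


print(shard_for("xyz123", 16))
```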

6. Trade-offs

Component	Approach	Pros	Cons	Alternative	Alternative Pros	Alternative Cons
Short Link Generation	Base62 Encoding of ID	Simple, easy to implement, unique by construction (each ID maps to exactly one short link)	Requires database access for ID generation	UUID Generation	No database access for short link generation	Higher chance of collisions when truncated, longer short URLs when used in full
Data Store	Relational Database (MySQL)	Strong consistency, supports complex queries, mature technology	Can be a bottleneck for high read/write workloads, requires schema management	NoSQL Database (e.g., Cassandra)	High scalability, better suited for write-heavy workloads	Eventual consistency, more complex data modeling
Caching	Redis	Fast, in-memory data store, supports various data structures	Cached data can be lost on failure unless persistence is enabled, requires careful cache invalidation strategy	Memcached	Simpler than Redis	Fewer features than Redis
Redirection	HTTP 302 Redirects	Simple, standard, supported by all browsers	Not cached by browsers by default, so every visit costs a round trip to the service	HTTP 301 Redirects	Cached by browsers, lower latency for subsequent requests	Not suitable if the mapping changes frequently

7. Other Approaches

Approach 1: Using UUIDs

Instead of encoding an auto-incrementing ID, we can use UUIDs (Universally Unique Identifiers). We can either use the UUID directly (which is very long) or hash the UUID and take the first 6-8 characters.

Pros:

No need for a database to generate IDs.
Can generate short links independently.

Cons:

Higher chance of collisions: negligible for a full UUID, but significant once a hash is truncated to 6-8 characters.
Longer short URLs if using the full UUID.
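
To make the collision trade-off concrete, the sketch below derives a fixed-length code from a hashed UUID and estimates the chance of at least one collision with the standard birthday approximation. The function names, the use of SHA-256, and the default length of 7 are assumptions made for illustration.

```python
import hashlib
import math
import string
import uuid

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def short_code_from_uuid(u: uuid.UUID, length: int = 7) -> str:
    """Hash a UUID and keep `length` base 62 characters of the result."""
    n = int.from_bytes(hashlib.sha256(u.bytes).digest(), "big")
    chars = []
    for _ in range(length):
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(chars)


def collision_probability(num_codes: int, length: int) -> float:
    """Birthday approximation: 1 - exp(-n(n-1) / (2 * 62**length))."""
    space = 62 ** length
    return -math.expm1(-num_codes * (num_codes - 1) / (2 * space))


print(short_code_from_uuid(uuid.uuid4()))
print(f"{collision_probability(1_000_000, 7):.3f}")
print(f"{collision_probability(100_000_000, 7):.3f}")
```

With 7 characters, the estimate is already above 10% after one million codes and effectively certain after a single day at 100 million URLs, so a truncated hash needs the collision check and retry described earlier.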

Approach 2: Bloom Filters

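Use a Bloom filter to check whether a short URL may exist before querying the database. A Bloom filter has no false negatives, so a negative answer lets the service skip the database lookup entirely. A positive answer can be a false positive and must still be confirmed against the database.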
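
A minimal Bloom filter sketch is shown below to illustrate the idea. The class name, the double-hashing scheme built on SHA-256, and the sizes used in the example are assumptions; a production system would size the bit array and hash count from the expected number of URLs and the target false-positive rate.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over strings."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive all bit positions from two 64-bit
        # values taken from a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bloom = BloomFilter(num_bits=10_000, num_hashes=4)
bloom.add("xyz123")
print(bloom.might_contain("xyz123"))
```

The redirection service would query the database only when might_contain returns True.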