Scaling an Event Ticketing System
An event ticketing system is one of the most challenging distributed systems to design. When tickets for a popular concert go on sale, the system must handle massive traffic spikes, guarantee zero overselling of limited inventory, and deliver sub-second response times under extreme load. This course section explores the architecture, strategies, and trade-offs involved in scaling such a system from thousands to millions of concurrent users.
At its core, a ticketing system faces a fundamental tension: high read throughput (users browsing events) versus strict write consistency (reserving specific seats). The CAP theorem tells us that a distributed system cannot guarantee consistency, availability, and partition tolerance all at once: during a network partition it must give up either consistency or availability. In ticketing, consistency is non-negotiable, because you cannot sell the same seat to two people.
The diagram above illustrates the two-path architecture — a common pattern in ticketing systems where reads and writes follow separate pipelines. Reads are served from cached data with eventual consistency, while writes go through a serialized path that guarantees inventory integrity.
Key Scalability Challenges
| Challenge | Description | Typical Scale |
|---|---|---|
| Flash Traffic | 10x-100x normal load in seconds | 100K+ concurrent users |
| Inventory Integrity | Prevent double-booking of seats | Zero oversell tolerance |
| Seat Selection Consistency | Lock seats during user decision window | 5–15 min hold timers |
| Payment Latency | 3rd-party gateway delays | 2–30 sec per transaction |
| Search & Discovery | Faceted event search under load | Millions of event records |
Footnotes
- Gilbert, S. & Lynch, N. "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services", ACM SIGACT News, 2002. The CAP theorem's formal proof, establishing that consistency and availability cannot both be guaranteed during network partitions.
- Kleppmann, M. "Designing Data-Intensive Applications", O'Reilly, 2017. Chapter on derived data and the lambda architecture pattern for separating read/write paths.
Core Architectural Components
1. Load Balancing & Traffic Management
At the edge, a Global Server Load Balancer routes users to the nearest data center. Behind it, L7 load balancers distribute requests across API server pools.
For ticketing, weighted round-robin with connection draining is preferred over least-connections. The reason: during a sale, new servers spinning up shouldn't immediately receive reservation requests until their local caches are warm.
2. Caching Strategy
The read path for ticketing is overwhelmingly dominant — 95%+ of requests are users browsing events, viewing seat maps, or checking availability. A multi-tier caching strategy is essential:
- L1 (Browser Cache): Static event data, images, and venue maps served with `Cache-Control` headers
- L2 (CDN Edge Cache): Event listings and pricing tiers cached at edge nodes; TTL set to 30–60s during live sales
- L3 (Application Cache, Redis): Seat availability maps and event metadata; updated on every reservation via cache invalidation
- L4 (Database Query Cache): Frequent queries for event lists and venue details
During a high-demand onsale event, cache hit ratios must stay above 90% for the system to survive. Every cache miss that reaches the database adds latency and load.
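The read-through and invalidate-on-write behavior of the application cache tier can be sketched as follows. This is a minimal in-process stand-in, not a Redis client: the `TTLCache` class, its method names, and the `availability:42` key are illustrative assumptions.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Tiny in-process stand-in for a Redis-style cache with a per-key TTL."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, object]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key: str, loader: Callable[[], object]) -> object:
        entry = self._store.get(key)
        if entry is not None and entry[0] > self.clock():
            self.hits += 1
            return entry[1]
        # Miss or expired entry: fall through to the slower tier and repopulate.
        self.misses += 1
        value = loader()
        self._store[key] = (self.clock() + self.ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        # Called after a reservation so the next read reloads fresh availability.
        self._store.pop(key, None)

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Tracking `hit_ratio()` per tier is one way to watch the 90% threshold mentioned above during an onsale.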
3. The Inventory Problem — Preventing Overselling
This is the hardest problem in ticketing systems: an event has a fixed number of seats and a far larger number of concurrent buyers competing for them. Solutions include:
Pessimistic Locking (Row-Level)
Traditional approach using SELECT ... FOR UPDATE. Simple but creates severe lock contention under high concurrency.
Optimistic Concurrency Control
Use version columns: UPDATE seats SET version=version+1, status='reserved' WHERE id=X AND version=V. If zero rows affected, the seat was already taken — retry or fail.
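A minimal sketch of the version-column check, using an in-memory SQLite database as a stand-in for the production database. The `seats` schema and the `reserve_seat` helper are assumptions for illustration; the extra `status = 'available'` guard is an addition that stops a caller holding the current version from re-reserving a seat that is already taken.

```python
import sqlite3


def reserve_seat(conn: sqlite3.Connection, seat_id: int, expected_version: int) -> bool:
    """Reserve a seat only if the row is unchanged since the caller read it."""
    cur = conn.execute(
        "UPDATE seats SET version = version + 1, status = 'reserved' "
        "WHERE id = ? AND version = ? AND status = 'available'",
        (seat_id, expected_version),
    )
    conn.commit()
    # Zero affected rows means another buyer won the race.
    return cur.rowcount == 1


# Hypothetical schema with a single available seat at version 0.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seats (id INTEGER PRIMARY KEY, status TEXT, version INTEGER)")
conn.execute("INSERT INTO seats VALUES (1, 'available', 0)")
conn.commit()
```

Two buyers who both read version 0 can both call `reserve_seat(conn, 1, 0)`, but only the first call returns `True`; the second sees zero affected rows and must retry against a different seat or fail.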
Distributed Locking (Redis-based)
Use SETNX (SET if Not eXists) to atomically claim a seat in Redis, then persist to database asynchronously. This makes the Redis cluster the source of truth for availability during the sale.
Queue-Based Reservation
All reservation requests enter a message queue. A single-threaded consumer processes reservations sequentially per event, guaranteeing no conflicts. This is the approach used by large-scale systems.
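Because each partition is consumed serially, aggregate throughput is bounded by roughly $\min(C, P) / t$ reservations per second, where $C$ is the number of consumer instances, $t$ is the processing time per reservation, and $P$ is the Kafka partition count (one per event).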
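The serialization guarantee can be illustrated with a standard-library queue standing in for one Kafka partition. This is a sketch under that assumption; `run_consumer`, the tuple layout, and the `None` sentinel are hypothetical names and conventions, not part of any Kafka API.

```python
import queue
import threading
from typing import List, Set, Tuple


def run_consumer(requests: queue.Queue, sold: Set[str],
                 results: List[Tuple[str, str, bool]]) -> None:
    """Drain one event's queue sequentially; the first request for a seat wins."""
    while True:
        item = requests.get()
        if item is None:  # sentinel: no more requests for this event
            break
        user_id, seat_id = item
        # No lock is needed: only this single consumer ever touches `sold`.
        granted = seat_id not in sold
        if granted:
            sold.add(seat_id)
        results.append((user_id, seat_id, granted))
```

However many producers enqueue requests for the same seat concurrently, exactly one of them is granted, because the decision is made by a single thread in arrival order.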
Footnotes
- Nginx documentation on load balancing algorithms and connection draining: https://docs.nginx.com/nginx/admin-guide/load-balancer/
- Redis Labs case study on ticketing system caching patterns, "Achieving 99.9% Cache Hit Ratio During Peak Load", demonstrating multi-tier caching in high-concurrency event systems.
- Kafka documentation on exactly-once semantics and partition-based ordering (https://kafka.apache.org/documentation/). Using one partition per event serializes all seat reservation operations, preventing concurrency conflicts.
The Overselling Nightmare
Overselling is unrecoverable in ticketing. You cannot ask a customer to give up their seat after purchase. Every architectural decision must prioritize inventory integrity over availability. During flash sales, it is better to reject requests (fail closed) than to sell the same seat twice (fail open).
Ticket Reservation Flow at Scale
1. The user selects a seat on the interactive seat map. The frontend sends a `POST /api/events/{id}/seats/{seatId}/hold` request to the API Gateway. The request includes the user's session token and an idempotency key so that network retries are handled safely.
2. The API Gateway checks rate limits, typically 10 requests/second per user during onsales. Bots and scrapers are filtered using CAPTCHA, browser fingerprinting, and token bucket algorithms. Malicious traffic is rejected before reaching the reservation service.
3. The Reservation Service attempts to acquire a Redis distributed lock on the seat using `SET seat:{eventId}:{seatId} <userId> NX EX 600` (10-minute TTL). If the lock succeeds, the seat is temporarily held. If it fails (key already exists), the service returns 409 Conflict: the seat is held by another user.
4. After acquiring the Redis lock, the service writes the reservation to the primary PostgreSQL database with status `HELD`. An outbox pattern is used: the reservation and an event record are written in a single transaction. A CDC (Change Data Capture) pipeline picks up the outbox entry and publishes it to Kafka.
5. The user is redirected to payment and a 10–15 minute countdown timer starts. The lock TTL from step 3 must be at least as long as this timer; otherwise the lock can expire mid-payment and another user can claim the seat. The payment service calls the 3rd-party payment gateway (Stripe/PayPal). On success, the seat status becomes `CONFIRMED`. On failure or timeout, a scheduled job releases the hold and deletes the Redis lock, returning the seat to the available pool.
6. Once the reservation is confirmed, the system invalidates the Redis availability cache for that event (`DEL availability:{eventId}`). A WebSocket push notification updates the seat map for all connected users in real time. The final state is: Database = `CONFIRMED`, Redis lock = left to expire naturally, Cache = invalidated, User = notified via email + push.
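Step 2 mentions token bucket algorithms for rate limiting. A minimal single-user sketch is shown here; the `TokenBucket` class and its parameters are illustrative assumptions, and a real gateway would keep one bucket per user or session in shared storage.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `rate=10` and `capacity=10`, this matches the 10 requests/second per-user limit quoted in step 2: a burst of ten requests passes, and further requests are rejected until tokens refill.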
Handling Flash Sales — The Queue-Based Approach
The most robust pattern for extreme scale is virtual queuing — made famous by systems like Queue-it and used by Ticketmaster during onsales. Instead of letting all users hit the reservation service simultaneously, they enter a waiting room:
The key parameters for virtual queuing:
| Parameter | Description | Typical Value |
|---|---|---|
| Drain Rate | Users let through per second | 50–200/s |
| Batch Size | Users released per interval | 100–500 |
| Hold Timer | Time user has to complete purchase | 10–15 min |
| Queue TTL | Max wait time before session expires | 2–4 hours |
| Heartbeat Interval | Keep-alive polling from waiting room | 15–30s |
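The drain rate $R_{\text{drain}}$ is calculated from the reservation service capacity $C_{\text{sessions}}$ (maximum concurrent purchase sessions) and the average purchase time $T_{\text{purchase}}$, scaled by a safety factor $s$:
$$R_{\text{drain}} = \frac{s \cdot C_{\text{sessions}}}{T_{\text{purchase}}}$$
For example, if the reservation cluster can handle 5,000 concurrent sessions and the average purchase takes 5 minutes (300 s), then with $s = 0.8$:
$$R_{\text{drain}} = \frac{0.8 \times 5000}{300\ \text{s}} \approx 13.3\ \text{users/s}$$
This is deliberately conservative: the safety factor of 0.8 keeps the system from becoming saturated.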
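The drain-rate arithmetic can be checked with a small helper. It assumes the relationship is the safety factor times session capacity divided by average purchase time (Little's law applied to purchase sessions); the function and parameter names are hypothetical.

```python
def drain_rate(max_concurrent_sessions: int, avg_purchase_seconds: float,
               safety_factor: float = 0.8) -> float:
    """Users per second the waiting room can release without saturating checkout."""
    if avg_purchase_seconds <= 0:
        raise ValueError("avg_purchase_seconds must be positive")
    # Little's law: sustainable arrival rate = concurrency / time in system.
    return safety_factor * max_concurrent_sessions / avg_purchase_seconds
```

For the example figures (5,000 sessions, 5-minute purchases, safety factor 0.8), `drain_rate(5000, 300)` comes to about 13.3 users per second.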
Footnotes
- Queue-it virtual waiting room technology, used by Ticketmaster and other major ticketing platforms to manage flash sale traffic. See: https://queue-it.com/
Queue Position Optimization
Pre-sort users in the virtual queue by session quality score. Users with verified accounts, payment methods on file, and no bot signals get a higher priority within their position band. This reduces abandonments in the purchase flow and increases revenue per onsale.
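One way to read "higher priority within their position band" is sketched here: users keep their band (a block of consecutive queue positions) but are reordered inside it by score. The `QueuedUser` fields, the `band_size` parameter, and the scoring scale are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueuedUser:
    user_id: str
    position: int         # original arrival position in the queue
    quality_score: float  # hypothetical score; higher means more trusted


def order_queue(users: List[QueuedUser], band_size: int) -> List[QueuedUser]:
    """Reorder users by score within each band; bands keep their original order."""
    return sorted(
        users,
        key=lambda u: (u.position // band_size, -u.quality_score, u.position),
    )
```

Keeping the band as the primary sort key means a high score never lets a user jump ahead of an earlier band, which keeps the queue broadly first-come, first-served.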
Scaling Lifecycle of a Ticketing System
Monolith Phase
Stage 1: Single server running a Rails/Django app with a relational DB. Handles 100–1,000 concurrent users. Pessimistic locking (SELECT FOR UPDATE) works fine at this scale. Suitable for local venues and small events.
Horizontal Read Scaling
Stage 2: Add read replicas, a Redis cache layer, and a CDN for static assets. Introduce a load balancer. This extends capacity to ~10,000 concurrent users. Inventory integrity is still handled by the primary DB with row locks.
Service Decomposition
Stage 3: Split into microservices: Event Service, Inventory Service, Reservation Service, Payment Service, Notification Service. Introduce Kafka for async event-driven communication. Capacity: ~50,000 concurrent users.
Distributed Inventory
Stage 4: Move inventory availability to Redis as the source of truth during sales. Use Redis Cluster for sharding by event_id. Implement queue-based reservation. Add a virtual waiting room for onsales. Capacity: 100K–500K concurrent users.
Global Multi-Region
Stage 5: Deploy across multiple cloud regions with active-passive failover. Implement CRDT-based caches for read availability. Use dedicated queue drain clusters per region. Capacity: 1M+ concurrent users globally.
Figure: Concurrent User Capacity by Scaling Stage (maximum concurrent users handled at each architectural stage).
Database Architecture & Sharding Strategy
The database layer is where consistency meets scale. For a ticketing system, we use a hybrid approach — different database technologies for different access patterns:
Primary Database: PostgreSQL (OLTP)
PostgreSQL with serializable isolation level for all reservation and payment operations. Key optimizations:
- Partition events by date range — old events are archived to cold storage
- Index heavily on `(event_id, seat_status)` for availability queries
- Use advisory locks for event-level operations: `pg_advisory_lock(event_id)` serializes all seat modifications for a given event on one DB node
Sharding Strategy
For global scale, shard by event_id using hash-based sharding:
$$\text{shard\_id} = \operatorname{hash}(\text{event\_id}) \bmod N_{\text{shards}}$$
This ensures all seats for a single event live on the same shard, eliminating cross-shard transactions for the common case of "reserve seat for event X." The shard count should be set to 2× the number of physical DB nodes to allow for future resharding via consistent hashing.
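A minimal sketch of hash-based shard selection by event_id. The `shard_for_event` name and the choice of SHA-256 are assumptions; any hash that is stable across processes works, which rules out Python's built-in `hash()` for strings because it is randomized per process.

```python
import hashlib


def shard_for_event(event_id: int, shard_count: int) -> int:
    """Map an event to a shard; every seat of that event lands on the same shard."""
    digest = hashlib.sha256(str(event_id).encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo shard count.
    return int.from_bytes(digest[:8], "big") % shard_count
```

Note that plain modulo placement remaps most events when `shard_count` changes, which is why the text pairs a generous logical shard count with consistent hashing for resharding.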
Event Sourcing for Audit Trail
Every state transition in the ticketing lifecycle is stored as an event record:
| Event Type | Description | Example |
|---|---|---|
| SEAT_HELD | User selected seat, timer started | {seatId, userId, expiresAt} |
| SEAT_RELEASED | Hold expired or user abandoned | {seatId, reason} |
| PAYMENT_INITIATED | Payment gateway called | {seatId, amount, gatewayRef} |
| SEAT_CONFIRMED | Payment successful | {seatId, orderId, confirmationCode} |
| SEAT_REFUNDED | Refund processed | {seatId, refundAmount, reason} |
This event log enables perfect auditability — critical for the ticketing industry where disputes, fraud detection, and regulatory compliance require a complete history.
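Because every transition is recorded, the current state of each seat can be rebuilt by replaying the log in order. The sketch below assumes events arrive as `(event_type, payload)` pairs; the `TRANSITIONS` mapping is an illustrative assumption, in particular the choice to return a refunded seat to `AVAILABLE`.

```python
from typing import Dict, Iterable, Tuple

# Assumed mapping from event type to the seat status it produces.
TRANSITIONS: Dict[str, str] = {
    "SEAT_HELD": "HELD",
    "SEAT_RELEASED": "AVAILABLE",
    "PAYMENT_INITIATED": "HELD",
    "SEAT_CONFIRMED": "CONFIRMED",
    "SEAT_REFUNDED": "AVAILABLE",
}


def replay(events: Iterable[Tuple[str, dict]]) -> Dict[str, str]:
    """Fold the event log, in order, into the latest status of each seat."""
    state: Dict[str, str] = {}
    for event_type, payload in events:
        state[payload["seatId"]] = TRANSITIONS[event_type]
    return state
```

Replaying a prefix of the log gives the state at any earlier point in time, which is what makes dispute resolution and audits tractable.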
Footnotes
- Vogels, W. "Eventually Consistent", ACM Queue, 2009. Discussion of consistent hashing and its role in minimizing data movement during resharding operations in distributed systems.
Edge Cases & Advanced Topics
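```python
import redis

# Placeholder connection settings; the original elides the cluster configuration.
# For Redis Cluster, redis.cluster.RedisCluster(...) would be used instead.
r = redis.Redis(host="localhost", port=6379)


def publish_reservation_event(event_id: int, seat_id: str, user_id: str) -> None:
    """Stub: enqueue the reservation so a worker can persist it to the database."""
    ...


def hold_seat(event_id: int, seat_id: str, user_id: str) -> bool:
    lock_key = f"seat:{event_id}:{seat_id}"
    # NX = only set if the key does not exist, EX = TTL in seconds (10-minute hold)
    acquired = r.set(lock_key, user_id, nx=True, ex=600)
    if acquired:
        # Persist to DB asynchronously
        publish_reservation_event(event_id, seat_id, user_id)
        return True
    return False
```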
Real-World Capacity Planning
Based on industry data from large-scale ticketing platforms:
| Metric | Estimate | Source |
|---|---|---|
| Taylor Swift Eras Tour onsale | 3.5M verified fans queued | Ticketmaster |
| Peak concurrent users during onsale | 500K–2M | Industry reports |
| Average tickets per transaction | 2–4 | Industry average |
| Abandonment rate after hold | 15–25% | Payment analytics |
| Bot traffic during popular onsales | 60–80% of total requests | Bot management firms |
| Revenue loss from bot scalping | $15B+ annually (US) | Research estimates |
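Capacity planning must account for the worst-case concurrent write throughput:
$$W_{\text{peak}} = U \cdot f \cdot (1 + a)$$
where $U$ is peak concurrent users, $f$ is the fraction of users attempting to reserve per second (~5%), and $a$ is the hold abandonment rate (abandoned holds are released and reserved again). For 1M concurrent users, taking $a = 0.25$:
$$W_{\text{peak}} = 1{,}000{,}000 \times 0.05 \times 1.25 = 62{,}500\ \text{writes/s}$$
With a virtual queue draining at 50 users/sec and 80% converting to reservations:
$$W_{\text{queued}} = 50 \times 0.8 = 40\ \text{writes/s}$$
This is why virtual queuing is essential: it reduces the write load from 62,500/sec to 40/sec, a reduction of roughly 1,500×.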
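The figures above can be reproduced with a few lines of arithmetic. This sketch assumes the unqueued write rate is peak users times reserve fraction times (1 + abandonment rate), with abandonment taken as 0.25; that form is one reading that matches the quoted 62,500/sec, and the function names are hypothetical.

```python
def peak_write_rate(peak_users: int, reserve_fraction: float,
                    abandonment_rate: float) -> float:
    """Writes/sec without a queue, inflated by holds that are abandoned and retried."""
    return peak_users * reserve_fraction * (1 + abandonment_rate)


def queued_write_rate(drain_rate: float, conversion: float) -> float:
    """Writes/sec when a waiting room admits `drain_rate` users per second."""
    return drain_rate * conversion


unqueued = peak_write_rate(1_000_000, 0.05, 0.25)  # about 62,500 writes/sec
queued = queued_write_rate(50, 0.8)                 # about 40 writes/sec
reduction = unqueued / queued                       # about 1,560x
```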
Footnotes
- Ticketmaster Verified Fan program data and industry reports on bot traffic during major onsales, including analysis of the Taylor Swift Eras Tour onsale incident and subsequent congressional testimony.
Don't Forget the Payment Gateway
The payment gateway is often the bottleneck in a ticketing system. Third-party gateways (Stripe, Adyen) have rate limits — typically 100–500 req/sec per merchant account. Request rate limit increases well in advance of major onsales. Account for 3–5 second HTTP timeouts with retry logic and circuit breakers. A payment gateway outage at minute 5 of a 10-minute hold window is catastrophic.
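A minimal circuit breaker sketch for calls to the payment gateway. The class name, thresholds, and the half-open behavior (one trial call after the cool-down) are illustrative assumptions, not any gateway SDK's API.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when calls are short-circuited instead of reaching the gateway."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("payment gateway circuit is open")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open lets the checkout flow extend the hold or show a retry message instead of stacking up requests against a gateway that is already timing out.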
Pre-warm Everything
Before a major onsale, pre-warm all caches. Load event data, seat maps, and pricing into Redis and the CDN. Scale up API servers, reservation consumers, and DB read replicas 30 minutes before the onsale time. Enable autoscaling with a minimum floor: don't scale to zero and wait for scale-up during a sale. Pre-allocate the Redis connection pool to the maximum expected concurrency.