Case Study: Designing a Chat System
Designing a real-time chat system like WhatsApp or Slack introduces a new challenge: Stateful, Persistent Connections. Unlike traditional HTTP requests that are short-lived, chat requires a constant open connection so that messages can be pushed to the user instantly.
The Requirements
- Functional: 1-on-1 and Group chats. Online/Offline status. Message history.
- Non-Functional: Extremely low latency (< 100ms for delivery). High reliability (no lost messages). Scalability to millions of concurrent users.
The Connection: WebSockets vs. HTTP Long Polling
Traditional HTTP is "pull-based": the client must ask for data before the server can send any. For chat, we need "push-based" communication, where the server delivers a message as soon as it arrives.
- WebSockets: Provides a full-duplex, persistent connection. Once the "handshake" is done over HTTP, the connection stays open, allowing the server to push messages to the client instantly. This is the industry standard for chat.
- HTTP Long Polling: The client sends a request and the server holds it open until a message is available or a timeout expires, after which the client immediately sends a new request. It works over plain HTTP, but every message costs a fresh request, which adds overhead and latency compared with a WebSocket.
Handling Presence (Online/Offline)
How do we know if a user is online?
- Heartbeat: The client sends a small "ping" over the WebSocket every 30 seconds. If the server doesn't receive a ping for over a minute, it marks the user as offline.
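The heartbeat rule can be sketched as a small in-memory tracker. This is only an illustration under stated assumptions: the class name `PresenceTracker`, its methods, and the injectable `clock` are hypothetical, and a plain dictionary stands in for the real presence store.

```python
import time

# "Over a minute" without a ping means the user is treated as offline.
HEARTBEAT_TIMEOUT_SECONDS = 60


class PresenceTracker:
    """In-memory stand-in for a presence store (hypothetical names)."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT_SECONDS, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock      # injectable so the logic can be tested without sleeping
        self._last_ping = {}     # user_id -> (server_id, time of last ping)

    def record_ping(self, user_id, server_id):
        """Called by a chat server each time a client's ping arrives."""
        self._last_ping[user_id] = (server_id, self._clock())

    def lookup(self, user_id):
        """Return the server holding the user's connection, or None if offline."""
        entry = self._last_ping.get(user_id)
        if entry is None:
            return None
        server_id, seen_at = entry
        if self._clock() - seen_at > self._timeout:
            del self._last_ping[user_id]  # expired entry: treat the user as offline
            return None
        return server_id
```

With a 30-second ping interval and a 60-second timeout, a single lost ping does not flip the user to offline; only two consecutive misses do.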
The Path of a Chat Message
- Step 1: User A sends a message { 'to': 'user_B', 'text': 'Hello!' } over their open WebSocket connection to Chat Server 1.
- Step 2: Chat Server 1 receives the message, saves it to a NoSQL database (like Cassandra or HBase) for history, and sends an 'ACK' back to User A: 'Message received by server'.
- Step 3: Server 1 checks the Presence Store (Redis) to see where User B is. Redis says: 'User B is connected to Chat Server 2'.
- Step 4: Server 1 forwards the message to Server 2 (using a message broker like Kafka or a simple RPC call).
- Step 5: Server 2 finds the open WebSocket for User B and pushes the message: 'New message from A: Hello!'. User B's device vibrates instantly.
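The five steps can be traced in a toy simulation. It is a sketch, not a real server: the `ChatServer` class and its method names are hypothetical, plain dictionaries and lists stand in for Redis, the NoSQL database, and the WebSocket connections, and a direct method call stands in for the broker or RPC hop.

```python
class ChatServer:
    """Toy chat server; every collaborator is an in-memory stand-in."""

    def __init__(self, name, presence, history, servers):
        self.name = name
        self.presence = presence  # user_id -> server name (stand-in for Redis)
        self.history = history    # list of saved messages (stand-in for the NoSQL store)
        self.servers = servers    # server name -> ChatServer (stand-in for broker/RPC)
        self.sockets = {}         # user_id -> list acting as that user's open WebSocket
        servers[name] = self

    def connect(self, user_id):
        """Register a connection and record which server holds it."""
        self.sockets[user_id] = []
        self.presence[user_id] = self.name
        return self.sockets[user_id]

    def handle_incoming(self, sender, message):
        # Step 2: persist the message, then ACK the sender.
        self.history.append({"from": sender, **message})
        self.sockets[sender].append({"type": "ack"})
        # Step 3: ask the presence store where the recipient is connected.
        target = self.presence.get(message["to"])
        if target is None:
            return "offline"  # recipient has no live connection
        # Step 4: forward to the server that holds the recipient's socket.
        self.servers[target].deliver(sender, message)
        return "delivered"

    def deliver(self, sender, message):
        # Step 5: push the message down the recipient's open connection.
        self.sockets[message["to"]].append(
            {"type": "message", "from": sender, "text": message["text"]}
        )


presence, history, servers = {}, [], {}
server_1 = ChatServer("server-1", presence, history, servers)
server_2 = ChatServer("server-2", presence, history, servers)
socket_a = server_1.connect("user_A")
socket_b = server_2.connect("user_B")
# Step 1: User A sends the message to Chat Server 1.
server_1.handle_incoming("user_A", {"to": "user_B", "text": "Hello!"})
```

Note that the message is saved before the ACK is sent, so an acknowledged message is already in history even if the forward in Step 4 fails.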
Storage Strategy
- Message History: Chat apps have a massive number of small writes and frequent reads for the most recent messages. A Column-Family store like Cassandra is ideal because it handles high write volumes and allows for efficient "get the last 50 messages" queries.
- Presence Store: Needs very fast reads and writes plus built-in key expiration (TTL), so that entries for users who stop sending heartbeats disappear on their own. Redis is a common choice.
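The "get the last 50 messages" access pattern is easier to see with a small model. The sketch below assumes messages are grouped by conversation and kept in write order, which is roughly what a partition key plus an ordered clustering column would give in a column-family store; the class `MessageHistory` and its fields are hypothetical, and a dictionary of lists replaces the database.

```python
from collections import defaultdict


class MessageHistory:
    """In-memory model of per-conversation message storage (hypothetical names)."""

    def __init__(self):
        # chat_id -> messages in the order they were written
        self._partitions = defaultdict(list)

    def append(self, chat_id, sender, text):
        """Append a message and return its per-conversation sequence number."""
        partition = self._partitions[chat_id]
        message = {"seq": len(partition) + 1, "from": sender, "text": text}
        partition.append(message)
        return message["seq"]

    def latest(self, chat_id, limit=50):
        """Return up to `limit` most recent messages, newest first."""
        if limit <= 0:
            return []
        partition = self._partitions.get(chat_id, [])
        return partition[-limit:][::-1]
```

Because all messages for one conversation sit together and in order, reading the most recent ones touches only the tail of a single partition rather than scanning the whole data set.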
Scaling to Millions
- Distributed Chat Servers: Use a Load Balancer to distribute WebSocket connections across thousands of chat servers.
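- Message Broker: Use Kafka or RabbitMQ to handle the communication between chat servers, so that messages are buffered rather than dropped when the receiving server is overloaded or briefly unavailable.
- Ordering: Use sequence numbers or timestamps to ensure messages are displayed in the correct order, even if they arrive out of sequence. Per-conversation sequence numbers assigned by the server are the safer option, because clocks on different devices and servers can drift apart.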
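Sequence-number ordering can be illustrated with a small reorder buffer on the receiving side. This is a sketch under assumptions: sequence numbers are taken to start at 1 and increase by 1 per conversation, and the class name `OrderedInbox` is hypothetical.

```python
class OrderedInbox:
    """Releases messages in sequence order, holding back any that arrive early."""

    def __init__(self):
        self._next_seq = 1   # next sequence number that may be displayed
        self._pending = {}   # seq -> text for messages that arrived too early

    def receive(self, seq, text):
        """Accept one message; return the messages that are now ready, in order."""
        if seq < self._next_seq or seq in self._pending:
            return []  # duplicate delivery: already displayed or already buffered
        self._pending[seq] = text
        ready = []
        while self._next_seq in self._pending:
            ready.append(self._pending.pop(self._next_seq))
            self._next_seq += 1
        return ready


inbox = OrderedInbox()
first = inbox.receive(2, "world")   # arrives early, so it is held back
second = inbox.receive(1, "hello")  # fills the gap and releases both messages
```

The same check also discards duplicates, which matters when a sender retries a message whose ACK was lost.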
Common Mistakes
- Using a Relational DB for History: A single relational database tends to struggle with the sheer volume of tiny, concurrent writes from millions of users unless it is heavily sharded.
- Missing Offline Handling: What if User B is offline? The server must recognize this from the Presence Store and instead send a Push Notification (via FCM or APNs). The message stays in the history store so User B can fetch it on reconnecting.
- Infinite WebSocket Retries: If the network is bad, the client might try to reconnect 100 times a second, crashing your Load Balancer. Use Exponential Backoff on the client.
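Exponential backoff for reconnects might look like the sketch below. The function name `backoff_delay`, the 1-second base, and the 60-second cap are illustrative assumptions, and the random jitter is an addition beyond what the text describes, included so that many clients do not all retry at the same moment.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    The upper bound doubles with each attempt until it reaches `cap`;
    the actual delay is a random fraction of that bound (jitter).
    """
    exponent = min(attempt, 32)  # keep 2 ** exponent small for very long outages
    ceiling = min(cap, base * (2 ** exponent))
    return rng() * ceiling
```

A client would sleep for `backoff_delay(n)` before its n-th retry and reset `n` to 0 after a successful connection.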
Recap
- WebSockets enable real-time, bidirectional communication.
- A Presence Store (Redis) tracks which server a user is connected to.
- NoSQL (Cassandra) handles the high-volume storage of message history.
- Push Notifications bridge the gap when a user is disconnected.
Knowledge Check
Why are WebSockets preferred over standard HTTP for a chat application?