Sharad Raj

Embracing the cosmos 🌌

Thundering Herd Problem

28 Feb 2026

You have likely contributed to a “Thundering Herd” situation.

Have you ever tried to book a Tatkal ticket on IRCTC at exactly 10:00 AM?

Or have you ever rushed to buy a smartphone during Flipkart’s Big Billion Days at exactly 12:00 AM?

If you’ve experienced anything like that, you already know the feeling. If not, keep reading.

1. An example

In distributed systems:

Thousands or even millions of processes wait for a single event; when it fires, they all act at the exact same time and overwhelm the system.

2. What is it, really?

A large number of processes or threads waiting for an event are all woken up simultaneously when that event occurs. Because they wake at the exact same time, they all try to access a shared resource, like a database or a network connection, at once.

Since the resource has limited capacity, the sudden burst of requests overwhelms it, resulting in resource exhaustion, severe latency, or a complete crash.

3. Where does it occur?

  1. Load Balancers / Servers: When a crashed node comes back online or when a load balancer routes a huge burst of waiting traffic to a newly spun-up server.

  2. Caching: This is the most common case. When a highly accessed cache key expires, thousands of requests hit the underlying database at once to fetch the data.

  3. Databases: When connection pools are exhausted because too many worker nodes try to connect to the DB simultaneously after a network blip.

4. Normal Traffic Spike vs. Thundering Herd

| Feature | Normal Traffic Spike | Thundering Herd |
| --- | --- | --- |
| Nature of Load | Gradual or sustained high traffic over minutes/hours. | Instantaneous, highly synchronized burst at the exact same millisecond. |
| Trigger | Marketing push, organic growth, viral content. | A specific system event (e.g., cache TTL expiry, server restart, a specific timestamp). |
| System Behavior | System degrades slowly; autoscaling can usually handle it. | System crashes instantly; autoscaling doesn't have time to react. |

5. Cache Expiry

  1. Scenario: The IPL final is live on Hotstar, and the current score is cached in Redis with a Time-To-Live (TTL) of 60 seconds, while millions of users refresh their apps to see the score.

  2. The happy path: The server checks Redis. The score is there (Cache Hit). Redis serves millions of requests effortlessly. The DB is safe.

  3. The Disaster (TTL Expires): At exactly 60.001 seconds, the cache expires. In that exact millisecond, 50,000 users ask for the score.

  4. The Herd: The server checks Redis. The score is missing for all 50,000 requests (Cache Miss).

  5. The Crash: Because the cache is empty, all 50,000 requests bypass the cache and hit the Database directly to fetch the score. The DB connection pool is instantly exhausted. The Database crashes. The servers crash waiting for the DB. The system goes down.
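The vulnerable read path above is the classic "cache-aside" pattern. A minimal Python sketch (the key name, TTL, and DB function are illustrative) shows why every request that sees an expired entry goes straight to the database:

```python
import time

cache = {}  # {key: (value, expires_at)} — stands in for Redis

def fetch_score_from_db():
    # Hypothetical expensive database query.
    return {"team": "CSK", "runs": 182}

def get_score():
    """Naive cache-aside read: every request that sees an expired
    entry falls through to the database — this is the herd path."""
    entry = cache.get("score")
    now = time.monotonic()
    if entry is not None and entry[1] > now:
        return entry[0]                   # cache hit: DB is safe
    value = fetch_score_from_db()         # cache miss: DB takes the hit
    cache["score"] = (value, now + 60)    # repopulate with a 60 s TTL
    return value
```

Nothing in this code limits how many concurrent callers can fall into the miss branch at once, which is exactly what the prevention techniques below address.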

6. What's the Impact?

When the herd hits, the shared resource is exhausted within milliseconds: connection pools fill up, latency spikes for every user (not just the ones in the herd), and in the worst case the database and the servers waiting on it crash, taking the whole system down.

7. How to prevent it?

The architectural differences can be seen in the following diagram:

A. Jitter (Staggered Expiry)

If you set the TTL of all related cache keys to exactly 60 seconds, they will all expire at once.

Solution: Add a random “jitter” to the TTL. Instead of setting TTL to exactly 60 seconds, set it to 60 seconds + random(0, 15) seconds.

This staggers the cache expiration, spreading the database hits over a wider time window.
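A minimal sketch of jittered TTLs in Python (the helper name and the 60 s / 15 s values are illustrative, matching the example above):

```python
import random

BASE_TTL_SECONDS = 60
MAX_JITTER_SECONDS = 15

def ttl_with_jitter(base_ttl=BASE_TTL_SECONDS, max_jitter=MAX_JITTER_SECONDS):
    """Return a TTL staggered by a random jitter so related keys
    do not all expire at the same instant."""
    return base_ttl + random.uniform(0, max_jitter)

# e.g. redis_client.set("score", value, ex=int(ttl_with_jitter()))
```

Because each key now expires at a slightly different moment, the post-expiry DB hits are spread across the jitter window instead of landing in the same millisecond.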

B. Mutex / Cache Locking

When the cache expires and 50,000 requests come in, do not let all of them go to the DB.

Solution: The first request acquires a lock (mutex). The other 49,999 requests are forced to wait. The single authorized request fetches the data from the DB, populates the cache, and releases the lock. The waiting requests then read from the newly populated cache.

C. Stale-While-Revalidate (SWR)

This is heavily used by platforms like Hotstar and Netflix.

Solution: Never let the user experience a cache miss. When the TTL expires, continue serving the “stale” (old) data to the millions of users. Meanwhile, fire a single background thread to the Database to fetch the fresh data and update the cache silently. Users might see a 2-second old score, but the system stays alive. (Preference: Availability over immediate Consistency).
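A minimal in-process sketch of stale-while-revalidate (the class and field names are illustrative, not Hotstar's or Netflix's actual implementation):

```python
import threading
import time

class SWRCache:
    """Serve stale data on expiry while one background thread refreshes."""

    def __init__(self, ttl, loader):
        self.ttl = ttl
        self.loader = loader            # function that hits the DB
        self.value = None
        self.expires_at = 0.0
        self.refreshing = threading.Lock()  # at most one refresh at a time

    def get(self):
        now = time.monotonic()
        if self.value is None:          # very first call: must block and load
            self.value = self.loader()
            self.expires_at = now + self.ttl
        elif now >= self.expires_at and self.refreshing.acquire(blocking=False):
            # Expired: kick off a single background refresh, keep serving stale.
            threading.Thread(target=self._refresh, daemon=True).start()
        return self.value               # stale or fresh, never a miss

    def _refresh(self):
        try:
            self.value = self.loader()
            self.expires_at = time.monotonic() + self.ttl
        finally:
            self.refreshing.release()
```

The non-blocking `acquire` ensures that however many requests observe the expiry, only one refresh is in flight; everyone else keeps getting the slightly stale value.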

D. Request Coalescing (Collapsing)

If the server receives 1,000 identical requests for the same exact data (e.g., get_movie_details(id=5)) at the same time, it collapses them into a single request. It queries the DB once, and when the DB returns the result, it fans out the response to all 1,000 waiting clients.

E. Cache Warming (Pre-computation)

Do not wait for users to trigger the DB fetch. If you know BookMyShow opens movie bookings at 9:00 AM, or IRCTC opens Tatkal at 10:00 AM, run a script at 8:55 AM that fetches the required data from the DB and pushes it into the cache.
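A toy sketch of a warming job using Python's standard `sched` module (the data, key name, and delay are illustrative; in production this would be a cron job or scheduler firing at 9:55 AM):

```python
import sched
import time

cache = {}  # stands in for Redis

def fetch_tatkal_inventory():
    # Hypothetical DB query for the data users will ask for at 10:00 AM.
    return {"train_12345": 120}

def warm_cache():
    # Pre-populate the cache before the rush, so the first real user
    # request is already a cache hit and the DB never sees the herd.
    cache["tatkal_inventory"] = fetch_tatkal_inventory()

scheduler = sched.scheduler(time.time, time.sleep)
scheduler.enter(0.1, 1, warm_cache)  # 0.1 s here; "9:55 AM" in real life
scheduler.run()
```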

F. Exponential Backoff & Jitter on Retries

If the DB is down, clients shouldn’t immediately retry every 1 second.

Solution: Clients should retry after 1s, then 2s, 4s, 8s (Exponential Backoff). Add randomness (Jitter) so that all failed clients don’t retry at the exact same 2-second mark.
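The schedule above can be sketched as exponential backoff with "full jitter", where each delay is drawn uniformly between zero and the exponentially growing cap (the function name and default values are illustrative):

```python
import random

def backoff_delays(base=1.0, cap=30.0, attempts=5):
    """Exponential backoff with full jitter: delay for attempt n is
    drawn uniformly from [0, min(cap, base * 2**n)] seconds."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]
```

The randomness is the important part: without it, every failed client retries at the same 1 s, 2 s, 4 s marks and the herd simply reforms on each retry wave.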

G. Rate Limiting

Drop excess traffic before it reaches the core system. If the DB can only handle 1,000 QPS, rate-limit incoming requests at the API Gateway level to ensure the DB is never overwhelmed.
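One common way to enforce such a limit is a token bucket; a minimal single-process sketch (the rate and capacity numbers are illustrative):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: allows at most `rate` requests per
    second on average, with bursts of up to `capacity` requests."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, up to capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # drop (or queue) the excess request
```

An API gateway fronting a DB that handles 1,000 QPS would check `bucket.allow()` per request and reject with HTTP 429 when it returns `False`, so the burst is shed at the edge instead of crashing the core.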