Sale-Day Scaling: How We Handle 100x Traffic Spikes
Sale days separate eCommerce platforms that survive from those that show up on Twitter as a meme. The good news: handling 100x normal traffic is a solved engineering problem. The work just has to start early enough and use the right patterns. Here's the playbook.
Key takeaways
- Cache aggressively. Most reads should never touch your origin.
- Queue everything that can be async: orders, emails, recommendations.
- Pre-warm. Auto-scaling reacts too slowly for instant spikes.
- Degrade gracefully. When something breaks, switch off the affected feature; don't fail catastrophically.
- Rehearse. Run load tests at 2x your expected peak before sale day.
Why this matters
A 4-hour outage during a sale day costs trust as well as revenue. Customers who couldn't check out during a sale such as BBD often don't come back once traffic is normal again.
The architecture patterns
Cache everything readable
Product catalogs, category pages, and search results should all be cached aggressively at the CDN level. Stale-while-revalidate patterns serve the cached response immediately while a fresher copy is fetched in the background.
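As a minimal sketch of the stale-while-revalidate idea, the Python below builds a `Cache-Control` header and shows how a cache would classify a stored response by its age. The `CachePolicy` class and the 60s/300s values are illustrative assumptions, not settings from this playbook. `stale-while-revalidate` is a standard `Cache-Control` extension (RFC 5861), but check your CDN's documentation for how it is supported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachePolicy:
    max_age: int                 # seconds the response counts as fresh
    stale_while_revalidate: int  # extra seconds a stale copy may still be served

    def header(self) -> str:
        """Value for the Cache-Control response header."""
        return (
            f"public, max-age={self.max_age}, "
            f"stale-while-revalidate={self.stale_while_revalidate}"
        )

    def decide(self, age_seconds: float) -> str:
        """What a cache does with a stored response of the given age."""
        if age_seconds < self.max_age:
            return "serve_fresh"
        if age_seconds < self.max_age + self.stale_while_revalidate:
            # Answer from cache now, refresh from origin in the background.
            return "serve_stale_and_revalidate"
        return "fetch_from_origin"


# Illustrative values for a product page (assumed; tune per page type).
PRODUCT_PAGE = CachePolicy(max_age=60, stale_while_revalidate=300)
```

With these assumed values, a response up to 60s old is served as fresh, one up to 360s old is served stale while it is refreshed, and anything older goes back to the origin.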
Queue order processing
Don't process the order synchronously. Write the order to a fast write queue, return success, and process it asynchronously. The customer's checkout completes in roughly 200ms instead of 2s.
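The accept-then-process split can be sketched as follows. This is an illustration only: Python's in-memory `queue.Queue` stands in for whatever durable queue you actually run, and `accept_order`, `process_order`, and `order_status` are hypothetical names.

```python
import queue
import threading
import uuid

order_queue = queue.Queue()   # stand-in for a durable queue (assumption)
order_status = {}             # order_id -> status; stand-in for an orders table
_status_lock = threading.Lock()


def accept_order(cart: dict) -> dict:
    """Fast path: record the order, enqueue it, acknowledge immediately."""
    order_id = str(uuid.uuid4())
    with _status_lock:
        order_status[order_id] = "accepted"
    order_queue.put({"order_id": order_id, "cart": cart})
    return {"order_id": order_id, "status": "accepted"}


def process_order(order: dict) -> None:
    """Slow path placeholder: payment capture, inventory commit, emails."""
    with _status_lock:
        order_status[order["order_id"]] = "confirmed"


def worker() -> None:
    """Drain the queue until a None sentinel arrives."""
    while True:
        order = order_queue.get()
        try:
            if order is None:
                return
            process_order(order)
        finally:
            order_queue.task_done()


def run_demo() -> str:
    """Accept one order, let a worker process it, return its final status."""
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    ack = accept_order({"sku-1": 2})
    order_queue.join()        # wait until the worker has processed the order
    order_queue.put(None)     # stop the worker
    thread.join()
    return order_status[ack["order_id"]]
```

The trade-off is that the customer sees "accepted" before payment and inventory are final, so the slow path needs a way to report a later failure, such as an email or an order-status page.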
Database read replicas
Read traffic is typically 50-100× write traffic. Multiple read replicas absorb it, leaving the primary to handle writes only.
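A rough sketch of read/write routing, assuming a hypothetical `ConnectionRouter` that hands out connection names. The statement check is deliberately naive (anything that is not a plain `SELECT` goes to the primary), and the `read_your_writes` switch is our assumption for reads that cannot tolerate replication lag.

```python
import itertools


class ConnectionRouter:
    """Send reads to replicas round-robin and everything else to the primary."""

    def __init__(self, primary: str, replicas: list[str]):
        if not replicas:
            raise ValueError("at least one read replica is required")
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # simple round-robin

    def route(self, sql: str, read_your_writes: bool = False) -> str:
        words = sql.split()
        is_read = bool(words) and words[0].upper() == "SELECT"
        if is_read and not read_your_writes:
            return next(self._replicas)
        # Writes, and reads that must see the caller's own latest write.
        return self.primary


router = ConnectionRouter("primary", ["replica-1", "replica-2"])
```

A customer who has just placed an order and reloads the order page is the typical `read_your_writes=True` case, because a lagging replica may not have the row yet.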
Inventory as a separate fast store
Inventory checks happen on every product view. Use a fast key-value store such as Redis (sub-millisecond reads) or DynamoDB (single-digit-millisecond reads). Reconcile with the source of truth asynchronously.
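To make the pattern concrete, here is a small in-process sketch. A dictionary guarded by a lock stands in for the fast store, and `FastInventory`, `reserve`, and `drain_pending` are hypothetical names. A real deployment would rely on the store's own atomic operations instead of a Python lock.

```python
import threading
from collections import defaultdict


class FastInventory:
    """In-memory stand-in for a fast inventory store (illustration only)."""

    def __init__(self, stock: dict[str, int]):
        self._stock = dict(stock)
        self._pending = defaultdict(int)  # units sold since the last reconcile
        self._lock = threading.Lock()

    def available(self, sku: str) -> int:
        """Cheap read used on every product view."""
        return self._stock.get(sku, 0)

    def reserve(self, sku: str, qty: int = 1) -> bool:
        """Atomically take stock; refuse if that would oversell."""
        with self._lock:
            if self._stock.get(sku, 0) < qty:
                return False
            self._stock[sku] -= qty
            self._pending[sku] += qty
            return True

    def drain_pending(self) -> dict[str, int]:
        """Hand the accumulated deltas to an async reconciliation job."""
        with self._lock:
            batch = dict(self._pending)
            self._pending.clear()
            return batch
```

A background job would call `drain_pending()` on a schedule and apply one batched update per SKU to the source of truth, which also keeps the best-selling product from hammering a single database row.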
Pre-warm at scale
Auto-scaling reacts to load with a delay of 30-60s or more. For instant spikes that is too late: capacity has to be at sale-start scale before the sale opens. Schedule the warm-up.
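The sizing arithmetic behind a scheduled warm-up can be sketched like this. Every number is an assumption for illustration: 500 requests/s of normal traffic, 400 requests/s per instance, 30% headroom, and a 30-minute lead time. The function names are hypothetical.

```python
import math
from datetime import datetime, timedelta


def instances_needed(peak_rps: int, rps_per_instance: int, headroom_pct: int = 30) -> int:
    """Instances required to serve peak_rps with some spare capacity."""
    return math.ceil(peak_rps * (100 + headroom_pct) / (100 * rps_per_instance))


def warmup_start(sale_start: datetime, lead: timedelta = timedelta(minutes=30)) -> datetime:
    """When the scheduled scale-up should fire so capacity is ready in time."""
    return sale_start - lead


# Assumed example: 100x a normal load of 500 requests/s.
peak_rps = 500 * 100
fleet_size = instances_needed(peak_rps, rps_per_instance=400)
```

Under these assumptions `fleet_size` comes out at 163 instances (50,000 × 1.3 / 400 = 162.5, rounded up). Measure the per-instance figure in your own load tests; don't guess it.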
CDN cache poisoning prevention
Choose cache keys carefully: segment the cache by region, by logged-in vs. anonymous, and by A/B group. A key that omits one of these dimensions serves the wrong content to the wrong users at scale.
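One way to picture a well-formed cache key is the sketch below. The `cache_key` function, the `ALLOWED_PARAMS` allowlist, and the `|` separator are assumptions for illustration. The idea is to include every dimension that changes the response and drop everything that does not, such as tracking parameters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Query parameters that actually change the page (hypothetical allowlist).
ALLOWED_PARAMS = {"page", "sort"}


def cache_key(url: str, region: str, logged_in: bool, ab_group: str) -> str:
    """Build a key from the dimensions that affect the rendered content."""
    parts = urlsplit(url)
    # Drop tracking parameters and sort the rest so ordering doesn't matter.
    params = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query)
        if name in ALLOWED_PARAMS
    )
    audience = "member" if logged_in else "anon"
    return "|".join([parts.path, urlencode(params), region.lower(), audience, ab_group])
```

Note that the key carries an audience segment, not a user ID. Keying on individual users would split the cache into entries that are almost never reused.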
The operational patterns
Feature flags for graceful degradation
Recommendations down? Disable the recommendation widget. Search slow? Cache more aggressively. Reviews offline? Hide the section. Each feature should have an off switch.
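A minimal sketch of an off switch per feature, assuming an in-process `FLAGS` dictionary and hypothetical `fetch_recommendations` / `fetch_reviews` calls. A real setup would read flags from a flag service or config store so they can be flipped without a deploy.

```python
# Flag state; in practice this would come from a flag service (assumption).
FLAGS = {"recommendations": True, "reviews": True}


def guarded(flag: str, fetch, fallback):
    """Run fetch() only if the flag is on; fall back if it is off or fails."""
    if not FLAGS.get(flag, False):
        return fallback
    try:
        return fetch()
    except Exception:
        # A broken optional feature must not take the page down with it.
        return fallback


def fetch_recommendations() -> list[str]:
    return ["sku-2", "sku-3"]          # hypothetical healthy service


def fetch_reviews() -> list[str]:
    raise TimeoutError("reviews service offline")  # simulated outage


def render_product_page(sku: str) -> dict:
    return {
        "sku": sku,
        "recommendations": guarded("recommendations", fetch_recommendations, []),
        "reviews": guarded("reviews", fetch_reviews, None),  # None = hide section
    }
```

The page still renders when reviews time out, and setting `FLAGS["recommendations"] = False` removes the widget without touching the rest of the page.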
Real-time observability
Build dashboards around the five metrics that matter most: orders/min, checkout completion %, error rate, latency, and queue depth. Alert on anomalies.
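As a simple illustration of alerting on those five metrics, the sketch below checks a metrics snapshot against static limits. The metric names, the limit values, and the `breached` function are assumptions. Real anomaly detection would usually compare against a baseline, not fixed numbers.

```python
# metric name -> (direction, limit); all values are illustrative assumptions.
LIMITS = {
    "orders_per_min": ("min", 500),
    "checkout_completion_pct": ("min", 60.0),
    "error_rate_pct": ("max", 1.0),
    "p95_latency_ms": ("max", 800.0),
    "queue_depth": ("max", 10_000),
}


def breached(snapshot: dict) -> list[str]:
    """Return the names of metrics outside their limit, sorted by name."""
    alerts = []
    for name, (direction, limit) in LIMITS.items():
        value = snapshot[name]
        too_low = direction == "min" and value < limit
        too_high = direction == "max" and value > limit
        if too_low or too_high:
            alerts.append(name)
    return sorted(alerts)


healthy = {
    "orders_per_min": 4200,
    "checkout_completion_pct": 71.5,
    "error_rate_pct": 0.2,
    "p95_latency_ms": 310.0,
    "queue_depth": 1_200,
}
# Same snapshot, but errors are climbing and the order queue is backing up.
degraded = {**healthy, "error_rate_pct": 3.4, "queue_depth": 48_000}
```

Here `breached(healthy)` is empty, while `breached(degraded)` flags the error rate and the queue depth.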
War room
On sale day, have engineering present, dashboards on big screens, and a fast communication channel open. Most fires get put out in minutes if caught early.
Post-mortem the next day
Record what worked, what didn't, and what to fix before the next sale. Every sale day is the rehearsal for the next.
Common pitfalls
Trusting auto-scaling alone. It scales up too late for instant peaks.
Single-AZ or single-region deployment. A single AZ outage during a sale is catastrophic. Multi-AZ is the minimum; go multi-region for big businesses.
No graceful degradation. When one service fails, the whole checkout fails. Put non-critical features behind feature flags.
Database hot rows. The inventory row of the most-purchased product becomes a hot row. Serve it from cache and update the database asynchronously.
What we recommend
Run a load test at 2x your expected peak two weeks before sale day. Find what breaks. Fix it. Run again. The teams who don't rehearse are the ones who melt.
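A rehearsal plan can be sketched as a stepped ramp up to 2x the expected peak, so you can see at which step things start to break. The `ramp_plan` and `first_breaking_step` helpers, the four steps, and the 1% error budget are assumptions for illustration; the load itself would come from whatever load-testing tool you already use.

```python
from typing import Optional


def ramp_plan(expected_peak_rps: int, multiplier: int = 2, steps: int = 4) -> list[int]:
    """Request rates for each ramp step, ending at multiplier x expected peak."""
    target = expected_peak_rps * multiplier
    return [target * step // steps for step in range(1, steps + 1)]


def first_breaking_step(
    results: list[tuple[int, float]], max_error_pct: float = 1.0
) -> Optional[int]:
    """results holds (rps, error_pct) pairs in ramp order.

    Returns the first request rate whose error rate exceeds the budget,
    or None if every step stayed within it.
    """
    for rps, error_pct in results:
        if error_pct > max_error_pct:
            return rps
    return None


# Assumed example: expected peak of 50,000 requests/s.
plan = ramp_plan(50_000)
```

With that assumed peak, the plan is 25,000, 50,000, 75,000, and 100,000 requests/s. If errors first exceed the budget at the 75,000 step, that is the bottleneck to fix before running the test again.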
FAQs
What about CDN choice? Cloudflare, CloudFront, and Fastly can all handle 100x. The differences are pricing and feature set.
How much does sale-day infrastructure cost? Typically 3-5x normal infrastructure spend for the 24-48h of peak.
Can we just hire AWS to scale us? No. The architecture has to support scale; AWS just provides capacity.
