7. Bottlenecks and Scaling

This final step makes the design production-ready. It addresses what breaks first and how the system survives growth or failure.

What to Cover

Single points of failure.
Database and cache scaling.
Load balancing across tiers.
Replication and failover.
Service discovery and health checks.
Monitoring, logging, and alerting.
Security, cost, and operational limits.
Graceful degradation and fallbacks.
Backpressure, retry storms, and queue growth.

Reliability Checklist

Can the system survive a node failure?
Can traffic be rerouted automatically?
Are reads and writes still acceptable under partial outages?
Is there a clear recovery path for data or worker failures?
Do we know how to detect the problem quickly?
Can we slow down or shed load safely?
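
The last checklist item, shedding load safely, can be sketched with a bounded queue: once the backlog reaches a fixed limit, new work is rejected immediately instead of piling up. This is a minimal illustration, not a production design; the class name `BoundedWorkQueue` and the `max_pending` limit are hypothetical.

```python
import queue


class BoundedWorkQueue:
    """Accepts jobs up to a fixed backlog, then sheds new ones (hypothetical sketch)."""

    def __init__(self, max_pending: int):
        if max_pending < 1:
            # queue.Queue treats maxsize <= 0 as "unbounded", which defeats the purpose.
            raise ValueError("max_pending must be at least 1")
        self._queue = queue.Queue(maxsize=max_pending)
        self.shed_count = 0  # how many jobs were rejected; useful as a metric

    def submit(self, job) -> bool:
        """Return True if the job was queued, False if it was shed."""
        try:
            self._queue.put_nowait(job)
            return True
        except queue.Full:
            self.shed_count += 1
            return False

    def take(self):
        """Remove and return the oldest queued job (raises queue.Empty if none)."""
        return self._queue.get_nowait()


work = BoundedWorkQueue(max_pending=2)
accepted = [work.submit(job) for job in ("a", "b", "c")]
print(accepted, work.shed_count)
```

A caller that gets `False` back can return an error such as "try again later" to the client, which keeps queue growth and latency bounded during a spike.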

Availability Targets and SLAs

Availability is usually expressed in "nines", the percentage of time the service is up:

Target	Downtime/Year	Downtime/Month (30 days)	Use Case
99% (2 nines)	3.65 days	7.2 hours	Internal tools
99.9% (3 nines)	8.76 hours	43.2 minutes	Most web services
99.99% (4 nines)	52.6 minutes	4.32 minutes	SaaS, payments
99.999% (5 nines)	5.26 minutes	25.9 seconds	Critical infrastructure
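
The downtime figures in the table follow directly from the target: allowed downtime = period × (1 - availability). A small sketch of that arithmetic, assuming a 365-day year and a 30-day month (the function name `downtime_budget` is illustrative):

```python
from datetime import timedelta


def downtime_budget(availability_pct: float, period: timedelta) -> timedelta:
    """Return the maximum downtime allowed in `period` at the given availability."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return period * unavailable_fraction  # timedelta supports multiplication by a float


YEAR = timedelta(days=365)   # assumption: non-leap year
MONTH = timedelta(days=30)   # assumption: 30-day month

for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}%  year: {downtime_budget(target, YEAR)}  "
          f"month: {downtime_budget(target, MONTH)}")
```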

Failover Patterns

Active-Passive

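The primary handles all traffic; the passive node stays on standby.
On failure, the passive node takes over, typically after 10-30 seconds of downtime while the failure is detected and the standby is promoted.
Simple to operate; commonly used with master-slave (primary-replica) databases.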
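
A rough sketch of how active-passive failover might be driven by health checks. The class name, node names, and the threshold of three consecutive failures are all assumptions for illustration; real systems usually delegate this to a cluster manager or managed database service.

```python
class ActivePassiveRouter:
    """Routes to the primary until it fails enough health checks in a row (hypothetical)."""

    def __init__(self, primary: str, standby: str, failure_threshold: int = 3):
        self.active = primary
        self.standby = standby
        self.failure_threshold = failure_threshold
        self._consecutive_failures = 0

    def record_health_check(self, healthy: bool) -> str:
        """Feed in one health-check result; return the node that should receive traffic."""
        if healthy:
            self._consecutive_failures = 0
            return self.active
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold and self.standby is not None:
            # Promote the standby. No spare remains until an operator adds one.
            self.active, self.standby = self.standby, None
            self._consecutive_failures = 0
        return self.active


router = ActivePassiveRouter("db-primary", "db-standby")
for result in (True, False, False, False):
    print(router.record_health_check(result))
```

Requiring several consecutive failures avoids failing over on a single dropped check, but it also adds to the downtime window: with checks every 5 seconds and a threshold of 3, detection alone takes about 15 seconds.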

Active-Active

Both servers handle traffic, balanced by DNS or a load balancer.
On failure, the surviving server absorbs the traffic: service is degraded, but there is no downtime.
More complex; needs master-master databases or distributed consensus to keep writes consistent.
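
The traffic-balancing side of active-active can be sketched as a round-robin picker that skips unhealthy backends. This only covers request routing, not the harder data-consistency problem; the class and backend names are hypothetical.

```python
class ActiveActiveBalancer:
    """Round-robin over backends, skipping any marked unhealthy (hypothetical sketch)."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0

    def mark_down(self, backend: str) -> None:
        self.healthy.discard(backend)

    def mark_up(self, backend: str) -> None:
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self) -> str:
        """Return the next healthy backend, or raise if none are left."""
        for _ in range(len(self.backends)):
            candidate = self.backends[self._next % len(self.backends)]
            self._next += 1
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")


lb = ActiveActiveBalancer(["app-1", "app-2"])
print([lb.pick() for _ in range(4)])   # traffic alternates between both servers
lb.mark_down("app-1")
print([lb.pick() for _ in range(4)])   # the survivor takes all traffic
```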

Availability in Series vs Parallel

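Series (every component must be up): 99.9% × 99.9% ≈ 99.8% overall (multiply the availabilities).
Parallel (any one redundant component is enough): 1 - (1 - 99.9%)² = 99.9999% (multiply the failure probabilities), assuming the components fail independently.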

Lesson: Redundancy in parallel improves availability; avoid single points of failure.
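
The two formulas above generalize to any number of components. A short sketch, assuming failures are independent (correlated failures, such as a shared power feed, make the parallel figure optimistic); the function names are illustrative:

```python
from math import prod


def series_availability(availabilities):
    """All components must be up: multiply the availabilities."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """At least one component must be up: 1 minus the product of failure probabilities."""
    return 1.0 - prod(1.0 - a for a in availabilities)


print(f"series:   {series_availability([0.999, 0.999]):.6f}")
print(f"parallel: {parallel_availability([0.999, 0.999]):.6f}")
```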

Scaling Questions

What is the next bottleneck after the first optimization?
How will the system behave during a spike?
How are retries and failures handled safely?
How does the system recover from partial outages?
What gets degraded first if capacity runs out?
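
One common answer to the retry question is capped exponential backoff with jitter, so that many clients retrying at once do not hit a recovering service in lockstep. A minimal sketch; the function name, default delays, and the choice of which exceptions count as transient are assumptions, and it should only wrap operations that are safe to repeat.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=5.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep, rng=random.random):
    """Call `operation`, retrying transient errors with capped, jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the failure to the caller
            # Exponential ceiling (base, 2x, 4x, ...) capped at max_delay,
            # then a random fraction of it ("full jitter") to spread retries out.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * ceiling)
```

The `sleep` and `rng` parameters exist only so the behavior can be tested without real waiting. Capping the number of attempts matters as much as the delay: unbounded retries are what turn a brief outage into a retry storm.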

Operational Concerns

Metrics for latency, errors, saturation, and throughput.
Logs for debugging and audits.
Traces for distributed request flows.
Capacity planning and alert thresholds.
Safe deploys, rollbacks, and feature flags.
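
As a concrete illustration of metrics feeding alert thresholds, the sketch below checks a window of request data against a p99 latency limit and an error-rate limit. The limits (500 ms, 1%) and function names are hypothetical placeholders; real thresholds come from the service's own targets, and a monitoring system would normally do this calculation.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_alert(latencies_ms, error_count, request_count,
                 p99_limit_ms=500.0, error_rate_limit=0.01):
    """Return True if p99 latency or error rate exceeds its (hypothetical) limit."""
    error_rate = error_count / request_count if request_count else 0.0
    latency_breach = bool(latencies_ms) and percentile(latencies_ms, 99) > p99_limit_ms
    return latency_breach or error_rate > error_rate_limit


window = [120.0] * 98 + [900.0] * 2   # two slow requests out of 100
print(percentile(window, 99), should_alert(window, error_count=0, request_count=100))
```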

Output of This Step

A list of the main risks.
Concrete scaling techniques for each risk.
A production-readiness story, not just a happy-path design.
A clear explanation of how the system fails and recovers.

Common Mistakes

Stopping at the first architecture sketch.
Forgetting observability and recovery.
Treating scale as only a database problem.
Not mentioning failure modes or recovery behavior.
