7. Bottlenecks and Scaling

This final step makes the design production-ready. It addresses what breaks first and how the system survives growth or failure.

What to Cover

  • Single points of failure.
  • Database and cache scaling.
  • Load balancing across tiers.
  • Replication and failover.
  • Service discovery and health checks.
  • Monitoring, logging, and alerting.
  • Security, cost, and operational limits.
  • Graceful degradation and fallbacks.
  • Backpressure, retry storms, and queue growth.

Reliability Checklist

  • Can the system survive a node failure?
  • Can traffic be rerouted automatically?
  • Are reads and writes still acceptable under partial outages?
  • Is there a clear recovery path for data or worker failures?
  • Do we know how to detect the problem quickly?
  • Can we slow down or shed load safely?
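
The last question in the checklist, shedding load safely, can be illustrated with a bounded queue: once the backlog reaches a fixed limit, new requests are rejected immediately instead of piling up. This is only a minimal sketch; the class name `LoadShedder` and the `max_pending` parameter are hypothetical, and a real service would usually return an explicit "try again later" response when `submit` reports a shed request.

```python
import queue


class LoadShedder:
    """Accepts work until a fixed backlog limit, then rejects new requests."""

    def __init__(self, max_pending: int) -> None:
        # A bounded queue caps memory use and queueing delay under overload.
        self._pending: queue.Queue = queue.Queue(maxsize=max_pending)
        self.shed_count = 0  # how many requests were rejected

    def submit(self, request) -> bool:
        """Return True if the request was queued, False if it was shed."""
        try:
            self._pending.put_nowait(request)
            return True
        except queue.Full:
            self.shed_count += 1
            return False

    def next_request(self):
        """Return the oldest pending request, or None if nothing is queued."""
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None


shedder = LoadShedder(max_pending=2)
accepted = [shedder.submit(f"req-{i}") for i in range(4)]
print(accepted, shedder.shed_count)
```

Rejecting early keeps latency bounded for the requests that are accepted, and `shed_count` gives a direct signal to alert on.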

Availability Targets and SLAs

Availability is measured in "nines", the percentage of time the system is expected to be up:

Target              Downtime/Year   Downtime/Month   Use Case
99% (2 nines)       3.6 days        7.2 hours        Internal tools
99.9% (3 nines)     8.8 hours       43 minutes       Most web services
99.99% (4 nines)    52 minutes      4.3 minutes      SaaS, payments
99.999% (5 nines)   5 minutes       26 seconds       Critical infrastructure
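
The downtime figures in the table follow directly from the availability percentage. The sketch below recomputes them, assuming a 365-day year and a 30-day month (both are simplifications chosen here, not definitions from an SLA); the function name `downtime_seconds` is illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # assumption: ignores leap years
SECONDS_PER_MONTH = 30 * 24 * 3600   # assumption: a 30-day month


def downtime_seconds(availability_pct: float, period_seconds: int) -> float:
    """Allowed downtime, in seconds, for an availability target over a period."""
    return (1 - availability_pct / 100) * period_seconds


for target in (99.0, 99.9, 99.99, 99.999):
    per_year = downtime_seconds(target, SECONDS_PER_YEAR)
    per_month = downtime_seconds(target, SECONDS_PER_MONTH)
    print(f"{target}%: {per_year / 3600:.2f} h/year, {per_month / 60:.2f} min/month")
```

Under these assumptions, 99.9% allows roughly 8.76 hours per year and 43.2 minutes per month, which the table rounds. Each extra nine cuts the downtime budget by a factor of ten.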

Failover Patterns

Active-Passive

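  • The primary handles all traffic; the passive node is a standby that serves nothing until it is promoted.
  • Failure → the passive node takes over, typically after 10-30 seconds of downtime spent detecting the failure and switching traffic.
  • Simple to operate; commonly paired with master-slave (primary-replica) database replication.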
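
A minimal sketch of the active-passive pattern, assuming a health-check function is supplied by the caller. The names `ActivePassiveRouter`, `is_healthy`, and `failure_threshold`, and the node names in the usage lines, are hypothetical. The standby is promoted only after several consecutive failed checks, so a single dropped probe does not trigger a failover.

```python
from typing import Callable


class ActivePassiveRouter:
    """Sends traffic to one active node and promotes the standby on failure."""

    def __init__(
        self,
        primary: str,
        standby: str,
        is_healthy: Callable[[str], bool],
        failure_threshold: int = 3,
    ) -> None:
        self.active = primary
        self.standby = standby
        self._is_healthy = is_healthy
        self._threshold = failure_threshold
        self._consecutive_failures = 0

    def check(self) -> str:
        """Run one health check and return the node that should get traffic."""
        if self._is_healthy(self.active):
            self._consecutive_failures = 0
        else:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self._threshold:
                # Promote the standby; the failed node becomes the new standby.
                self.active, self.standby = self.standby, self.active
                self._consecutive_failures = 0
        return self.active


down = {"db-primary"}  # simulate a failed primary
router = ActivePassiveRouter("db-primary", "db-standby", lambda n: n not in down)
for _ in range(3):
    router.check()
print(router.active)
```

In this model the downtime window is roughly the check interval times `failure_threshold`, plus the time to promote the standby, which is why active-passive setups usually see some seconds of unavailability.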

Active-Active

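  • Both servers handle traffic, balanced by DNS or a load balancer.
  • Failure → the surviving server absorbs the load: service is degraded (less capacity), but there is no downtime.
  • More complex: needs master-master database replication or distributed consensus to keep writes consistent.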

Availability in Series vs Parallel

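  • Series (every component must be up): 99.9% × 99.9% = 99.8% overall. Multiply the availabilities, so each added dependency lowers the total.
  • Parallel (redundant copies, assuming independent failures): 1 - (1 - 99.9%)² = 99.9999%. The system is down only if every copy is down at once.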

Lesson: Redundancy in parallel improves availability; avoid single points of failure.
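
The two formulas generalize to any number of components. The sketch below assumes failures are independent, which real systems only approximate (shared power, networks, or deploys can take out redundant copies together); the function names are illustrative.

```python
from math import prod


def series_availability(components: list[float]) -> float:
    """All components must be up: multiply their availabilities."""
    return prod(components)


def parallel_availability(components: list[float]) -> float:
    """Up if at least one redundant copy is up (independent failures assumed)."""
    return 1 - prod(1 - a for a in components)


# Two 99.9% components chained, then two 99.9% copies side by side.
print(f"series:   {series_availability([0.999, 0.999]):.6f}")
print(f"parallel: {parallel_availability([0.999, 0.999]):.6f}")

# A request path through two tiers, each tier built from two redundant nodes.
tier = parallel_availability([0.999, 0.999])
print(f"two redundant tiers in series: {series_availability([tier, tier]):.6f}")
```

Combining the two functions gives a quick estimate for a whole request path: make each tier redundant first, then multiply the tiers together.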

Scaling Questions

  • What is the next bottleneck after the first optimization?
  • How will the system behave during a spike?
  • How are retries and failures handled safely?
  • How does the system recover from partial outages?
  • What gets degraded first if capacity runs out?
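
One common answer to the question about handling retries safely is a capped number of attempts with exponential backoff and random jitter, so that many clients recovering at once do not retry in lockstep. The sketch below uses the "full jitter" variant as an assumption chosen for illustration; `retry_with_backoff` and its parameters are hypothetical names, and production code should catch only errors known to be retryable rather than every `Exception`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying failures with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # simplification: treats every error as retryable
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the failure
            # Full jitter: wait a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")


attempts = {"count": 0}


def flaky_call() -> str:
    """Fails twice with a transient error, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(retry_with_backoff(flaky_call, base_delay=0.01))
```

The hard limit on attempts matters as much as the delay: without it, a long outage turns every client into a permanent source of extra load.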

Operational Concerns

  • Metrics for latency, errors, saturation, and throughput.
  • Logs for debugging and audits.
  • Traces for distributed request flows.
  • Capacity planning and alert thresholds.
  • Safe deploys, rollbacks, and feature flags.
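
As a small illustration of turning metrics into alert thresholds, the sketch below computes a p99 latency and an error rate over a window of request samples and flags the window if either exceeds its limit. The `RequestSample` type, the nearest-rank percentile method, and the threshold values are all assumptions made for this example; real monitoring systems typically compute these from histograms rather than raw samples.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    latency_ms: float
    ok: bool


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value covering pct% of samples."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def should_alert(
    samples: list[RequestSample],
    p99_limit_ms: float,
    error_rate_limit: float,
) -> bool:
    """True if p99 latency or the error rate exceeds its threshold."""
    if not samples:
        return False  # no traffic in the window; nothing to evaluate
    p99 = percentile([s.latency_ms for s in samples], 99)
    error_rate = sum(not s.ok for s in samples) / len(samples)
    return p99 > p99_limit_ms or error_rate > error_rate_limit


window = [RequestSample(latency_ms=float(i), ok=True) for i in range(1, 101)]
print(should_alert(window, p99_limit_ms=200, error_rate_limit=0.01))
```

Alerting on a high percentile rather than the average keeps slow outliers visible, since a small fraction of very slow requests barely moves the mean.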

Output of This Step

  • A list of the main risks.
  • Concrete scaling techniques for each risk.
  • A production-readiness story, not just a happy-path design.
  • A clear explanation of how the system fails and recovers.

Common Mistakes

  • Ending with the first architecture sketch.
  • Forgetting observability and recovery.
  • Treating scale as only a database problem.
  • Not mentioning failure modes or recovery behavior.