On July 4, 2012, Netflix went down. Not a small hiccup - a three-hour outage affecting millions of users during peak holiday streaming time. The culprit was an AWS Elastic Load Balancer in one region that became overwhelmed during a traffic surge. Netflix had redundancy. Netflix had failover plans. But the load balancer itself was a single point of failure their architecture assumed would never be a bottleneck. It was.
That incident reshaped how Netflix built systems. It also became one of the founding stories behind Chaos Engineering - deliberately breaking production systems in controlled ways to find the failures before traffic finds them for you.
The lesson is not that Netflix made a mistake. It is that every system has a load at which it breaks. The question is whether you know what that load is, and whether you have designed for what happens after it arrives.
Vertical vs. Horizontal Scaling
When your service starts struggling under load, you have two directions to go.
Vertical scaling means making the machine bigger. You move from a server with 8GB RAM to one with 32GB. You upgrade the CPU. For a while, this works. It requires no architectural changes, which makes it seductive. The problem: you hit a ceiling. There is a largest machine available. Past that ceiling, vertical scaling stops being an option, not because of money but because the hardware does not exist.
Horizontal scaling means adding more machines. Instead of one server handling all requests, you add two, then ten, then a hundred. Traffic is distributed across the pool. If one machine dies, the others keep serving. You can add capacity incrementally as demand grows, and you can scale back down when demand drops. The cost: your application has to be designed to run on multiple instances simultaneously, which introduces problems that a single-machine app does not have.
Most modern systems are designed for horizontal scaling from the start. Stateless services - where each request carries everything needed to handle it, with no local memory of previous requests - are the architecture that makes this possible. When a server holds no session state, any server can handle any request. You can route traffic to whichever machine has capacity without worrying about which machine the user last talked to.
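One way to externalize session state is to have each request carry a signed token, so any server in the pool can verify it without local memory. The sketch below illustrates the idea; the function names, token shape, and shared secret are hypothetical, and a production system would use a standard format like a signed cookie or JWT.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret; in practice every server in the pool
# would load the same secret from configuration, not hard-code it.
SECRET = b"shared-secret"

def sign(payload: dict) -> str:
    """HMAC the payload so servers can trust client-carried state."""
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()

def make_token(user_id: str) -> dict:
    payload = {"user": user_id}
    return {"payload": payload, "sig": sign(payload)}

def handle_request(token: dict) -> str:
    # Any server can verify and serve: no local session storage needed.
    if not hmac.compare_digest(token["sig"], sign(token["payload"])):
        return "401 unauthorized"
    return f"200 hello {token['payload']['user']}"

token = make_token("alice")
print(handle_request(token))
```

Because verification depends only on the shared secret and the request itself, the load balancer is free to send this request to any instance.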
Load Balancers
A load balancer sits in front of your pool of servers and distributes incoming traffic across them. When a request arrives, the load balancer picks a backend server and forwards the request. The requesting client never knows which server it hit.
Load balancers use different algorithms for distribution. Round-robin rotates through servers sequentially: request 1 goes to server A, request 2 to server B, request 3 to server C, request 4 back to server A. Least-connections routes new requests to whichever server is currently handling the fewest. IP hash uses a hash of the client's IP address to consistently route the same client to the same server - useful when you need sticky sessions, though that often indicates state that should have been externalized.
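The three algorithms can be sketched in a few lines each. This is a toy model, not a real load balancer: the server names are placeholders, and a real least-connections implementation would decrement the counter when a request completes.

```python
import hashlib
from itertools import cycle

servers = ["A", "B", "C"]  # hypothetical backend pool

# Round-robin: rotate through the pool sequentially.
rr = cycle(servers)
def round_robin() -> str:
    return next(rr)

# Least-connections: route to the server with the fewest active requests.
active = {s: 0 for s in servers}
def least_connections() -> str:
    choice = min(servers, key=lambda s: active[s])
    active[choice] += 1  # a real balancer decrements on completion
    return choice

# IP hash: the same client IP always maps to the same server,
# which gives sticky sessions without storing any mapping.
def ip_hash(client_ip: str) -> str:
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]
```

Note that the IP-hash mapping changes when the pool size changes, which is why systems that need stable stickiness under scaling often use consistent hashing instead.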
Key Point: A load balancer is a point of failure too. Production systems use pairs of load balancers with failover, or distributed load balancing at the DNS level. The Netflix outage happened because their load balancer was treated as infrastructure that could not fail, when it absolutely could.
Auto-Scaling and the Cold Start Problem
Cloud platforms let you define auto-scaling groups: pools of servers that grow when CPU or request rates cross a threshold and shrink when demand drops. A rule might say: when average CPU across the pool exceeds 70%, add two instances; when it drops below 30% for ten minutes, remove one.
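The rule above reduces to a small decision function. This is a minimal sketch of the logic, assuming the thresholds from the example; real cloud auto-scaling also applies cooldown windows between actions and a hard floor and ceiling on pool size.

```python
def scaling_decision(avg_cpu: float, minutes_below: int, pool_size: int,
                     scale_out_at: float = 70.0, scale_in_at: float = 30.0,
                     sustain_min: int = 10, min_size: int = 2) -> int:
    """Return the change in instance count for the example rule:
    above 70% CPU, add two instances; below 30% for ten minutes,
    remove one (but never shrink below a floor)."""
    if avg_cpu > scale_out_at:
        return +2
    if (avg_cpu < scale_in_at and minutes_below >= sustain_min
            and pool_size > min_size):
        return -1
    return 0
```

The asymmetry is deliberate: scale out immediately and aggressively, scale in slowly and cautiously, because running a spare instance is cheap compared to dropping requests.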
Auto-scaling works well when demand rises gradually. It breaks down during sudden spikes - a viral post, a product launch, a Black Friday rush - because provisioning a new server takes time. By the time the new instance is ready, the spike may have already overwhelmed the existing pool.
The mitigation is pre-warming: anticipating spikes and scaling up before they hit. You schedule your auto-scaling rules to expand the pool ahead of known events. You keep a warm baseline that can absorb sudden bursts without waiting for provisioning. And you track your auto-scaling metrics carefully so you can lower the trigger threshold ahead of a known event rather than reacting to a spike already in progress.
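Pre-warming for known events often amounts to a schedule that overrides the pool's normal floor. A minimal sketch, assuming a hypothetical calendar of events and made-up instance counts:

```python
from datetime import datetime

# Hypothetical schedule of known traffic events and the warm baseline
# each one requires, keyed by (month, day). The numbers are illustrative.
PREWARM_SCHEDULE = {
    (11, 29): 40,  # Black Friday: hold at least 40 instances
    (12, 24): 25,  # holiday streaming peak
}
DEFAULT_MIN = 10

def desired_minimum(now: datetime) -> int:
    """Pool floor for the day: scheduled events override the default."""
    return PREWARM_SCHEDULE.get((now.month, now.day), DEFAULT_MIN)
```

The reactive rules still run on top of this floor; the schedule only guarantees that capacity is already warm when the predictable spike begins.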
You cannot always predict when traffic will double. You can make it cheaper and faster to absorb the doubling when it arrives.
Key Point: Stateless services enable horizontal scaling by ensuring any server can handle any request. When services hold session state locally, you couple the user to a specific server - and that coupling makes scaling brittle.