Amazon EC2’s high-profile outage in the US East region has taught us a number of lessons. For many, the take-away has been a realization that cloud-based systems (like conventionally hosted systems) can fail. Of course, we knew that, Amazon knew that, and serious companies that performed serious availability engineering before deploying to the cloud knew that. In cloud environments, as in conventionally hosted environments, you must implement high availability if you want high availability. You can’t expect a system to be magically highly available just because it is “in the cloud.” Thorough and thoughtful high-availability engineering made it possible for EC2-based Netflix to experience no service interruptions through this event.
Only those companies that failed to perform rudimentary availability design on their EC2-based systems have experienced prolonged outages as a result of this week’s event. This is only to be expected: Amazon.com does not promise to make your application highly available. What Amazon EC2 provides is a rich set of tools that allows anyone who is serious about building a highly available application to do so on EC2.
This week’s EC2 failures have provided plenty of fodder for the cloud skeptics, as well they should. Cloud skeptics hold the feet of EC2 and other cloud services to the fire, forcing them to address real concerns with the paradigm. What is far more alarming than the told-you-sos from the cloud skeptics is the torrent of media ignorance regarding what cloud computing and EC2 fundamentally provide.
Take this article in the Wall Street Journal for example. Quoting the authors:
A main issue at the center of this controversy is why Amazon hasn’t been able to re-route capacity between data centers that would have avoided this problem and ensured the websites of its users would still operate properly.
Here the authors seem to be referring to EC2 availability zones. As most who have worked even a little with EC2 know, when you run an instance or store volumes in one availability zone, there is no automatic mechanism available to “re-route capacity” between availability zones. If you want your application to survive the failure of an availability zone, you must implement a high-availability contingency. For instance, you can frequently back up your storage volumes (using the EBS snapshot feature) so that you can re-instantiate the system in a surviving availability zone. In this week’s outage, all but one of Amazon’s four US East availability zones were functioning normally within about four hours. Only those customers with systems in the one affected zone could not reliably access the data in their Elastic Block Store (EBS) volumes. If those customers had simply performed regular backups (snapshots) of their volumes, the outage would have been confined to a few hours, not 40+ hours.
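As a rough illustration of the snapshot-and-restore contingency described above, here is a minimal Python sketch. It assumes a client object shaped like the `boto3` EC2 client, using its `create_snapshot` and `create_volume` calls; the function names `snapshot_volumes` and `restore_in_zone`, the volume IDs, and the zone names are hypothetical and chosen only for this example. It is a sketch of the idea under those assumptions, not a complete backup system.

```python
from typing import Any, Dict, List


def snapshot_volumes(ec2: Any, volume_ids: List[str]) -> Dict[str, str]:
    """Start an EBS snapshot of each volume.

    `ec2` is assumed to behave like boto3.client("ec2").
    Returns a mapping of {volume_id: snapshot_id}.
    """
    snapshots = {}
    for volume_id in volume_ids:
        response = ec2.create_snapshot(
            VolumeId=volume_id,
            Description=f"scheduled backup of {volume_id}",
        )
        snapshots[volume_id] = response["SnapshotId"]
    return snapshots


def restore_in_zone(
    ec2: Any, snapshots: Dict[str, str], surviving_zone: str
) -> Dict[str, str]:
    """Create a new volume from each snapshot in a surviving zone.

    Returns a mapping of {original_volume_id: new_volume_id}.
    """
    restored = {}
    for volume_id, snapshot_id in snapshots.items():
        response = ec2.create_volume(
            SnapshotId=snapshot_id,
            AvailabilityZone=surviving_zone,  # hypothetical example: "us-east-1b"
        )
        restored[volume_id] = response["VolumeId"]
    return restored
```

In practice, `snapshot_volumes` would run on a schedule well before any failure, with a client created by something like `boto3.client("ec2", region_name="us-east-1")`. A snapshot has to finish before a volume can be created from it, and each restored volume still has to be attached to a replacement instance launched in the surviving zone. The age of the most recent snapshot bounds how much data is lost, which is why the backups need to be frequent.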
It is hard to blame the media, though, when even the customers of EC2 showcase a complete misunderstanding of what they should expect from the infrastructure on which they have built the systems that support their very businesses. In the same article cited above, Simon Buckingham, CEO of Appitalism, is extensively quoted misunderstanding the fundamentals of EC2:
We’re past the point of this being a routine outage… Customers like myself have assumed that if part of Amazon’s data center goes down, then traffic will get transferred in an alternative capacity… The cloud is marketed as being limitless, but what this outage tells us is it’s not.
That is an interesting assumption indeed, Mr. Buckingham. I would assume that the CEO of a web-based company would spend at least enough time understanding his company’s own infrastructure to realize that he should be talking to his own engineers about why they failed to design a robust multi-zone backup solution on EC2, rather than imagining capabilities that EC2 does not have and that no one has ever claimed it has.
I hope the upshot of this event will be more comprehensive and careful engineering of solutions deployed to the cloud. I fear, however, that given the tenor of the media coverage and customer reactions, the onus will not be placed where it belongs, on the customers’ own engineers, and that the event will instead result only in undeserved bad press for Amazon.
For what it’s worth, we at Blue Gecko frequently help our customers deploy robust, highly available solutions on EC2 that would have easily recovered in the four-to-five-hour time frame, or even experienced no outage at all, rather than the 40+ hour nightmare affecting some customers of EC2.