February 15, 2008: The lessons of Amazon's S3 failure

Techie news sites have been all a-twitter about the Amazon S3 outage today. Many writers and commenters are venting their outrage and frustration, bandying about phrases like "unacceptable," "single point of failure," and "severe blow to confidence." Many seem to feel that Amazon has let them down, although Amazon’s SLA only promises 99.9% uptime.

There are a few lessons we can learn from this:

  1. 100% uptime is not possible for any continuous service. Even the electric company, which has a much more mature, redundant, and well built-out system, experiences failures. Your data network will too.
  2. Know your SLA. Really know it - exactly how it is measured and calculated, exactly who pays, and how much, when an SLA is missed. If you are the entity offering the SLA, are you sure you understand it? Are you sure you’ve done everything possible to help your customers understand it?
  3. Be ready for the downtime. Ask yourself right now - if all computing systems were down, what would happen? How would my life and my business continue to operate? What would the priorities be? Do I have a way to meet those priorities?
  4. Be ready and able to fully inform (or be informed) when downtime does occur. It’s not something you can sweep under the rug. People will notice it - and they’ll be a lot happier with candid and truthful answers.
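
To make the "know your SLA" point concrete, here is a rough sketch of how much downtime a given uptime percentage actually permits. The function name and the 30-day measurement period are my own assumptions for illustration; a real SLA defines its own measurement window, exclusions, and credits, so check the actual terms rather than relying on this arithmetic.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float = 30) -> float:
    """Return the minutes of downtime a given uptime percentage permits.

    Hypothetical helper: period_days=30 is an assumed billing month,
    not a figure taken from any particular provider's SLA.
    """
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


# Compare a few common uptime targets over an assumed 30-day month.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime")
```

Under these assumptions, 99.9% works out to roughly 43 minutes a month, or close to nine hours over a year. An outage of a few hours can feel catastrophic and still sit near what the promise allowed.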

Failure happens. The question is not if failure will happen, but when it will happen. The next question is how you’ll handle it when it does happen to you.
