On Tuesday, someone at Amazon Web Services made a small mistake that caused a three-to-four-hour service disruption for customers using its servers in Northern Virginia.
Amazon was debugging slowness in its S3 billing system. The company says a typo in a command meant to take a small number of S3 billing servers offline ended up taking a much larger number of S3 servers offline instead. Among them were servers responsible for all metadata and location information in the Northern Virginia data centers.
If you are curious, you can read the full description from Amazon here.
This wasn’t a malicious attack; it was triggered simply by human error.
Compounding the mistake, Amazon had never circled back to critically review some of its key systems after launch.
While that is harsh to say, it’s an easy trap for any company to fall into. By nature, success leads to enhancing or building new features – not carefully reviewing what is already making money and working.
Reviewing systems that are already live is a vital step that needs to be built into any software project.
The Silver Lining
Amazon has some brilliant, fast-acting people who were able to get things operational quickly. They have also already made some changes that will help prevent this kind of mistake from happening again, and they are looking into longer-term preventive measures.
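To make the idea of a preventive measure concrete, here is a minimal sketch of one common kind of safeguard: a check that rejects a removal request if it would take too many servers offline at once or leave too few running. This is not Amazon's tooling. The function name, its parameters, and the 5% default limit are all hypothetical and chosen only for illustration.

```python
def plan_removal(requested: int, active: int, min_required: int,
                 max_fraction: float = 0.05) -> int:
    """Return how many servers would remain, or raise ValueError if unsafe.

    All names and the 5% default are illustrative assumptions, not a real API.
    """
    if requested < 0 or requested > active:
        raise ValueError(f"cannot remove {requested} of {active} servers")

    remaining = active - requested

    # Never drop below the capacity the service needs to keep running.
    if remaining < min_required:
        raise ValueError(
            f"removal would leave {remaining} servers; minimum is {min_required}"
        )

    # Cap how much can be removed in one command, so a typo in the
    # count is rejected rather than executed.
    if requested > max_fraction * active:
        raise ValueError(
            f"removing {requested} servers exceeds the per-command limit"
        )

    return remaining
```

With a check like this in front of the command, asking to remove 2 of 100 servers goes through, while a mistyped request for 30 is refused before anything is taken offline.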
It’s important to be transparent about problems and show what you have learned – that keeps you honest and helps to build trust.