The Amazon Web Services outage: business continuity implications and actions
- Published: Friday, 17 March 2017 08:57
Following the recent long-running outage at Amazon Web Services, Continuity Central conducted a reader survey to ascertain the level of impact and whether the incident will result in business continuity professionals reviewing their business continuity plans. The survey also sought to capture the lessons that can be learned from the outage. This article provides a summary of the results…
On February 28th 2017 a four-hour outage impacted one of Amazon Web Services’ (AWS) largest cloud regions, US-EAST-1 in North America. Since many enterprises rely on AWS, the outage is highly concerning: it lasted many times longer than the expected annual downtime for the S3 cloud storage system where the issue occurred.
The outage, caused by high error rates affecting the Amazon Simple Storage Service (Amazon S3), commenced at 12:35 pm ET, with service fully restored by 4:49 pm ET, according to AWS. Amazon S3 is ‘object storage with a simple web service interface to store and retrieve any amount of data from anywhere on the web’, says AWS. It is marketed as being ‘designed to deliver 99.999999999% durability’; a claim which is now clearly questionable! (Strictly speaking, durability measures protection against data loss rather than availability, a distinction one respondent makes below.)
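To put the marketing nines in perspective, a quick calculation (an editorial illustration, not an AWS figure) shows how little annual downtime an availability number with that many nines would permit, and how far a roughly four-hour outage exceeds such a budget:

```python
# Annual downtime budget implied by an availability figure with n nines.
# Note: AWS's 99.999999999% figure refers to *durability* (data loss),
# not availability; nines are treated as availability here purely for
# illustration.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def allowed_downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted by n-nines availability."""
    unavailability = 10 ** -nines  # e.g. 4 nines -> 0.0001
    return MINUTES_PER_YEAR * unavailability

outage_minutes = 4 * 60 + 14  # 12:35 pm to 4:49 pm ET = 254 minutes

for n in (3, 4, 5):
    budget = allowed_downtime_minutes(n)
    print(f"{n} nines: {budget:.1f} min/year budget; "
          f"outage used {outage_minutes / budget:.1f}x that")
```

A single 254-minute outage consumes nearly five times a whole year's four-nines budget, and almost fifty times a five-nines budget.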
Following the incident Continuity Central conducted an online survey with the aim of hearing directly from business continuity professionals and understanding the actions that they took and are taking in response to the outage. The online survey was closed after 100 responses were received and the results are as follows.
The survey asked whether the respondent’s organization used any AWS services. The responses show just how widespread AWS’s reach is: 35.5 percent of respondents’ organizations use AWS and 11 percent don’t yet but plan to do so in the future.
20 percent of respondents said that their organization had been affected by the February 28th outage. These respondents were invited to briefly describe the impacts, and the substantive responses received were as follows:
- We had about 25 separate user-facing systems that were unavailable, affecting users numbering in the hundreds of thousands. When the outage first occurred, we tried to switch to an alternate region for those systems that are in multiple regions (not all of them are), but could not because we could not use the AWS load balancing service, which was also impacted by the outage. In the end we just had to wait for Amazon to resolve the problem and test it. The only effective business continuity actions we were able to take were around communications.
- Outsourced services that we use/access were unavailable during the outage, however, there was minimal if any impact. No business continuity actions were taken.
- We were minimally impacted and the outage window was well within our tolerable allowances therefore only situation status/monitoring and communication was undertaken.
- We rely on third party vendors that use the service. We quickly adopted a new technology and are now questioning if we are going to continue with the vendor that provided the initial service.
- Not directly but indirectly. For example, in the course of business we noticed links in others’ websites were not working, creating a delay in our own business process.
- It had a pretty big impact for us as all customer document submissions and UW viewing of those documents is dependent on S3.
- We could not upload attachments to Hubspot. We could not login to GoToMeeting.
- We were coincidentally in the middle of our AWS DR exercise so more than 60% of our apps had already failed over to AWS West. The remaining mission critical apps were able to fail over to on prem servers. We had little customer impact.
- This affected a vendor that uses Amazon Web Services. During the outage we were unable to monitor a large number of devices in the field that we are responsible for. Fortunately, there was no adverse impact beyond that.
- The impact was minor as only one of our software applications uses AWS - although we're in the process of evaluating two other applications that would run on the AWS platform.
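A theme in the impacts above is that failover must not depend on the provider's own tooling: the first respondent could not switch regions because the AWS load balancing service was down too. A minimal, provider-agnostic sketch (endpoint URLs and the `fetch` callable are hypothetical placeholders) is to attempt each region from the client side and fall back in order:

```python
# Client-side regional failover sketch. The endpoint URLs and the
# notion of a 'fetch' callable are hypothetical; in practice this
# would wrap an HTTP client with a timeout.
from typing import Callable, Sequence

def fetch_with_failover(endpoints: Sequence[str],
                        fetch: Callable[[str], bytes]) -> bytes:
    """Try each endpoint in order; raise only if all of them fail."""
    last_error = None
    for endpoint in endpoints:
        try:
            return fetch(endpoint)
        except Exception as err:  # real code would catch narrower errors
            last_error = err      # remember the failure and move on
    raise RuntimeError(f"all endpoints failed: {last_error}")

# Usage with a stub that simulates the primary region being down:
def stub_fetch(url: str) -> bytes:
    if "us-east-1" in url:
        raise ConnectionError("us-east-1 unavailable")
    return b"ok from " + url.encode()

data = fetch_with_failover(
    ["https://api.us-east-1.example.com",
     "https://api.us-west-2.example.com"],
    stub_fetch,
)
print(data)  # falls back to the us-west-2 endpoint
```

The key design point is that the failover decision runs on infrastructure the provider's outage cannot touch; DNS-based schemes that rely on the provider's own health checks do not have that property.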
Business continuity plan reviews
The survey asked ‘Will your organization be reviewing its business continuity plans in the light of the AWS outage?’ Interestingly, 40 percent of respondents said that their organization would be conducting a review of business continuity plans following the outage. 36 percent said that no review would take place; and 24 percent replied that they did not know whether a review would take place.
The final question in the survey asked respondents ‘What lessons do you think the business continuity profession can learn from the AWS incident?’ 62 percent of respondents took the time to answer the question and substantive responses were as follows (published verbatim except for spelling corrections):
- Not to rely solely on SLAs and guarantees for resiliency, need to diversify infrastructure resources.
- Never be smug or complacent ... no matter how robust you think your BCP is!
- I would say this applies to all Cloud providers. Everyone thinks that because it is in the cloud you are protected. That depends on what you pay for and how you maintain control of your own infrastructure through load balancing, etc., between two different geo-locations. No matter who the cloud provider is, they maintain the control.
- Frequent review and testing of your Plan are essential components to your survival.
- AWS and other cloud providers serve millions of hours of compute time every month, so a lot of 9's is deceiving; they could be out for hours and probably still meet SLAs.
- Obviously one of the key considerations besides price and the actual delivery of service is, what is the service provider's Business Continuity Plan?
- Cloud strategies should be re-looked and perhaps maintaining critical processes and data on site should be considered.
- If you buy something that is supposed to be almost 100% reliable it needs to be tested in the buyer’s environment to make sure that is true - clearly not in this case.
- Cloud is not as resilient as it's cracked up to be. It is fallible just like other solutions.
- This stresses the importance of planning properly. Strategic objectives change over time. Therefore, you always have to ask questions with regards to the relevance of your business continuity plans. Are they aligned with organisational objectives? The fact that this downtime took this long to sort out means there was a "belief" that this would never happen.
- That even systems that have been designed to be up all the time can still fail and they should be planned for.
- Nothing is for sure...
- Cloud services are not always all they're made out to be. When you're dependent on the provider's tools to fail over, if they experience an outage that affects not only their service but also those tools, you're dead in the water.
- Thorough supplier assessment is more important than many people think.
- Educate businesses on what AWS and others actually mean by 99.999999999........and how it may not actually be as resilient as they think due to the conditions that have to be met to reach that.
- Make sure that for any production critical objects held in S3, we can replicate them to a second AWS S3 region.
- RTO=0 in async replication is a MUST!
- Using a third party provider to host or manage your service does not entirely mitigate the risk of outage or transfer the ownership of ensuring continuity for that service, no matter the size, experience, or reputation of the provider.
- Another reminder that business continuity planning must be an ongoing process...identify and mitigate new or previously unidentified risks (internal and external) and test plans to ensure that we are prepared to respond and recover.
- Obviously even cloud technology has its flaws. Assure there are strong SLA penalties in contracts with cloud vendors and well-documented manual processes to continue to ‘process’ work until systems come back online.
- Assumptions about reliance on the uptime of third and fourth and fifth party providers need to be assessed.
- What is a backup scenario for using cloud services like AWS? You normally use many of their services, so there is no way to have everything installed at another cloud provider, because they are all different. So how do we get along with such a single point of failure?
- Failure is not an option.... It killed us in a critical time.
- I've spent most of a four-decade career working in high-demand, high-availability computing environments where ‘failure is not an option’ were words to live by. The hard-learned reality is that ‘failure is *always* an option,’ regardless of the time/money/energy invested in building so-called bullet-proof systems. There is not, has never been, and likely will never be any such thing. The lesson for business continuity planners? Simple: failures will happen. Prepare for that, and don't be deflected from the task by those who wear rose-colored glasses. At the end of the day, we must be the ‘dinosaurs’ who understand that man-made systems will suffer from man-made flaws.
- Claims about reliability numbers are void. The downtime clearly proves that AWS's claim was false - and not only questionable as you write. The whole BC architecture has to be analyzed instead of believing marketing numbers about availability.
- No matter how large the business providing a service, their information security and business continuity plans should be reviewed. Do not trust a one page audit summary.
- Diversity of cloud assets is important.
- Interdependences are critical.
- In our BIA we should analyze what happens when IT fails and what we can do. Redundancy is important.
- I have always been cautious about using public services for critical infrastructures. Over the past three years I have put five organisations over from their own to public data services. The results have been: a) not one has achieved a better availability rate; b) none have achieved a cost saving, in fact 3 out of the 5 are spending 15 percent plus more; c) 3 of the organisations are working to move back to their own data centre. The received wisdom is that Cloud Services are better and cheaper, but the actuality is that they are not for very many organisations, but nobody wants to admit it as their jobs are on the line!
- Business Continuity needs are changing, we need to work more and be prepared for running the business with workaround procedures without access to systems.
- You are missing the meaning of the AWS 99.9999 etc claim - it is about durability (i.e. prevention of data loss) rather than availability. Still looks like disposable marketing though.
- Not to be over reliant on a particular vendor.
- No cloud service is available to five 9's or more. No critical application - such as Salesforce or BCM tools like Fusion which are based on Salesforce - should be deployed on AWS. If it's critical, it should be SaaS hosted, not Cloud hosted.
- When using a service such as AWS, ensure that ‘multi-zoning’ is part of the service agreement, AT NO EXTRA COST!!!
- In terms of business continuity/resiliency, evaluate in depth the third-party service providers in the industry that offer cloud solutions such as recovery to cloud, and the redundancy infrastructure that the company has in place, before entering into any agreement.
- If it is a critical function you need to have processes in place to get through the unexpected outage of technology. Depending on the criticality you need to build your own high availability capacity independent of any single vendor.
- Eggs, basket, etc.; 99.9% reliability, etc. There is no cloud, only someone else's computer!
- You can never be too comfortable.
- The ‘cloud’ is not magic. Always configure an out of region recovery strategy. We already have a requirement that all applications configured in AWS must be configured in East and West.
- I am not sure there is a lesson here (just) for the business continuity profession. IT can, and does, misfire from time to time. The business continuity professional's role is to deploy cynicism & scepticism against any promised service delivery - such as asking the user, "So, what if AWS actually goes down?". Why does Continuity Central imply that AWS is beyond having a breakdown? I am not sure I understand the emotion driving the survey…
- First of all Business Continuity is about workforce and physical location... as in, what would you do during a pandemic. Disaster Recovery is the practice of moving service from a primary data center to an alternate data center. Business Resiliency is the combination of both. My hope is that my senior executives have learned that no matter what is promised, unless they prove that they have automatic failover AND a high level of fault tolerance built into their servers AND storage, you need to have your own DR plans. I am a seven-year veteran of Disaster Recovery and find that most companies are not as prepared as they need to be and lack the foresight to invest appropriately. All it takes is one major occurrence to ruin a company's reputation.
- Technology dependencies have high impact and disruptions will occur but it is how you manage that counts.
- We all need to pay closer attention to vendors who deliver critical applications/services in support of critical processes we perform.
- ‘Cloud’ is just another data center that the customer does not control and is subjected to exactly the same issues as if they did.
- The Cloud is not a panacea and when a big provider goes down, you will have no idea when it is coming back.
- That cloud solutions do not address all technology-related risks and that good preparation is critical.
- Technology is prone to failure, even the best technology. Never put all your eggs in one basket and have back-ups for your back-ups. This is something emergency management does well but not so much in BC. Make a plan and then make a plan for when the resources in your original plan are not available.
- Cloud is not the silver bullet many make it out to be therefore, ensure any contract with them is robust.
- Big is not always better.
- Carefully review service contracts to determine the level of redundancy and recovery.
- Don't believe what the salesman tells you. Do your own research.
- Do not depend on a single cloud service provider. Have at least two providers if budgets permit, and replicate mission-critical data at the secondary cloud service. Despite claims to the contrary, none of them is totally reliable, as clearly shown by AWS.
- That the cloud is not as invulnerable as most people thought. Perhaps cloud vendors are growing too fast, trying to get as much market share as possible, but at the risk of not maintaining the infrastructure needed to deliver on their promises. It may be wise to wait for a bit before committing to a cloud or hybrid cloud strategy.
- Always have a backup plan.
- The cloud hasn't been tested through real events to make any claim to its stability.
- There will be more hesitation to rely on the ‘Cloud’.
- Nothing is bulletproof. Vulnerabilities are more transparent than people realize and therefore require more robust analysis on a constant basis.
- Just confirmed that 11 9's of availability is a myth.
- Durability is not reliable! It shows how important Availability is!!!
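Several respondents recommend replicating critical S3 objects to a second region. At the time of the outage, S3 offered cross-region replication configured per bucket. The sketch below builds such a configuration; the bucket names and IAM role ARN are hypothetical placeholders, and the actual boto3 call is shown commented out since applying it requires live credentials and versioning enabled on both buckets:

```python
# Build an S3 cross-region replication configuration. The bucket names
# and IAM role ARN used below are hypothetical placeholders.

def replication_config(role_arn: str, dest_bucket_arn: str,
                       prefix: str = "") -> dict:
    """Replicate objects under `prefix` to the destination bucket."""
    return {
        "Role": role_arn,  # IAM role S3 assumes to copy objects
        "Rules": [{
            "ID": "replicate-critical-objects",
            "Prefix": prefix,  # "" replicates the whole bucket
            "Status": "Enabled",
            "Destination": {"Bucket": dest_bucket_arn},
        }],
    }

config = replication_config(
    "arn:aws:iam::123456789012:role/s3-replication-role",
    "arn:aws:s3:::my-backup-bucket-us-west-2",
)

# Applying it requires boto3, AWS credentials, and versioning enabled
# on both the source and destination buckets:
# import boto3
# boto3.client("s3").put_bucket_replication(
#     Bucket="my-primary-bucket-us-east-1",
#     ReplicationConfiguration=config,
# )
```

Note that replication of this kind protects the data, not the workload: during the February 28th outage the objects still existed, but reads against the affected region failed, so applications also need a way to serve from the replica.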