Business continuity lessons from the recent Azure outage
- Published: Thursday, 20 September 2018 08:29
On September 4th 2018, storms that hit southern Texas resulted in downtime at Microsoft Azure’s South Central US region, which had knock-on impacts for services outside of the region. Dave Bermingham looks at the incident and the business continuity lessons that can be learned from it.
Cloud services such as Azure are now mission critical for many organizations, so any downtime can cause business continuity problems. A major outage such as the one that occurred on September 4th serves as a reminder that however reliable a cloud service is, outages are possible and need to be planned for. This article considers the preparatory actions that organizations can take.
Microsoft explained the reasons for the outage as follows:
“In the early morning of September 4, 2018, high energy storms hit southern Texas in the vicinity of Microsoft Azure’s South Central US region. Multiple Azure datacenters in the region saw voltage sags and swells across the utility feeds. At 08:42 UTC, lightning caused electrical activity on the utility supply, which caused significant voltage swells. These swells triggered a portion of one Azure datacenter to transfer from utility power to generator power. Additionally, these power swells shutdown the datacenter’s mechanical cooling systems despite having surge suppressors in place. Initially, the datacenter was able to maintain its operational temperatures through a load dependent thermal buffer that was designed within the cooling system. However, once this thermal buffer was depleted the datacenter temperature exceeded safe operational thresholds, and an automated shutdown of devices was initiated. This shutdown mechanism is intended to preserve infrastructure and data integrity, but in this instance, temperatures increased so quickly in parts of the datacenter that some hardware was damaged before it could shut down. A significant number of storage servers were damaged, as well as a small number of network devices and power units.
“While storms were still active in the area, onsite teams took a series of actions to prevent further damage – including transferring the rest of the datacenter to generators thereby stabilizing the power supply. To initiate the recovery of infrastructure, the first step was to recover the Azure Software Load Balancers (SLBs) for storage scale units. SLB services are critical in the Azure networking stack, managing the routing of both customer and platform service traffic. The second step was to recover the storage servers and the data on these servers. This involved replacing failed infrastructure components, migrating customer data from the damaged servers to healthy servers, and validating that none of the recovered data was corrupted. This process took time due to the number of servers damaged, and the need to work carefully to maintain customer data integrity above all else. The decision was made to work towards recovery of data and not fail over to another datacenter, since a fail over would have resulted in limited data loss due to the asynchronous nature of geo replication.
“Despite onsite redundancies, there are scenarios in which a datacenter cooling failure can impact customer workloads in the affected datacenter. Unfortunately, this particular set of issues also caused a cascading impact to services outside of the region.”
Possible business continuity measures
In the light of the above, what could organizations have done to minimize the impact of this outage? No one can blame Microsoft for a natural disaster such as a lightning strike. But at the end of the day if your only disaster recovery plan is to call, tweet and email Microsoft until the issue is resolved, you just received a rude awakening. IT IS UP TO YOU to ensure you have covered all the bases when it comes to your business continuity and disaster recovery plans.
Here are some of my initial thoughts on the possible business continuity measures and the issues that this incident raises:
Availability Sets (Fault Domains/Update Domains) – in this scenario, even if you had built failover clusters or leveraged Azure Load Balancers with Availability Sets, the entire region went offline, so you would still have been out of luck. Availability Sets are still worth leveraging, especially for planned downtime, but they would not have helped here.
Availability Zones – while not yet available in the South Central US region, the concept of Availability Zones being rolled out in Azure could have minimized the impact of the outage. Assuming the lightning strike only impacted one data center, a data center in another Availability Zone should have remained operational. However, the outages of non-regional services such as Azure Active Directory (AAD) seem to have impacted multiple regions, so I don’t think Availability Zones would have isolated you completely.
Global Load Balancers, Cross-Region Failover Clusters, etc. – whether you are building SANless clusters that cross regions or using global load balancers to spread the load across multiple regions, you might have minimized the impact of the outage in South Central US, but you would still have been susceptible to the AAD outage.
Hybrid-Cloud, Cross-Cloud – about the only way you could guarantee resiliency in a cloud-wide failure scenario is to have a disaster recovery plan that includes real-time replication of data to a target outside of your primary cloud provider and a plan in place to bring applications online quickly in this other location. These two locations should be entirely independent and should not rely on services from your primary location, such as AAD, being available. The disaster recovery location could be another cloud provider – in this case AWS or Google Cloud Platform seem like logical alternatives – or it could be your own data center, but that kind of defeats the purpose of running in the cloud in the first place.
Software as a Service – while software as a service such as Azure Active Directory (AAD), Azure SQL Database (Database-as-a-Service) or one of the many SaaS offerings from any of the cloud providers can seem enticing, you really need to plan for the worst case scenario. Because you are trusting a business critical application to a single vendor, you may have very little control in terms of DR options that include recovery OUTSIDE of the current cloud service provider. I don’t have any words of wisdom here other than to investigate your DR options before implementing any SaaS service, and if recovery outside of the cloud is not an option, then think long and hard before you sign up for that service. At the least, make the business stakeholders aware that if the cloud service provider has a really bad day and that service is offline, there may be nothing you can do about it other than call and complain.
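The cross-cloud approach described above can be sketched as a simple failover controller: a monitor that probes the primary provider's health endpoint and directs traffic to an independent DR target in a different cloud after repeated failures. This is a minimal illustration under stated assumptions, not a production pattern; the endpoint names and URLs below are hypothetical, and a real deployment would update DNS or a global load balancer rather than an in-process flag.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Endpoint:
    name: str   # hypothetical labels, e.g. a region in your primary and DR clouds
    url: str

class FailoverController:
    """Route traffic to the primary endpoint until it fails `threshold`
    consecutive health checks, then fail over to the DR endpoint."""

    def __init__(self, primary: Endpoint, dr: Endpoint, threshold: int = 3):
        self.primary = primary
        self.dr = dr
        self.threshold = threshold
        self.failures = 0
        self.active = primary

    def observe(self, probe: Callable[[Endpoint], bool]) -> Endpoint:
        """Run one health check against the primary. Failback is deliberately
        not automatic: returning to the primary after an outage should be a
        planned, human decision."""
        if probe(self.primary):
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold and self.active is self.primary:
                self.active = self.dr
        return self.active

primary = Endpoint("primary-cloud-region", "https://app.example.com/health")
dr = Endpoint("dr-cloud-region", "https://dr.example.com/health")
ctl = FailoverController(primary, dr, threshold=3)

# Simulated probe results: the primary region goes dark after two good checks.
results = iter([True, True, False, False, False, False])
for _ in range(6):
    active = ctl.observe(lambda ep: next(results))
print(active.name)  # after three consecutive failures, traffic moves to DR
```

Note that the controller deliberately requires several consecutive failures before acting, so a transient blip does not trigger an unnecessary failover, and that the probe itself must not depend on any service hosted in the primary cloud (the AAD lesson from this incident).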
I think in the very near future you will start to hear more and more about cross-cloud availability, with people leveraging solutions to build robust HA and DR strategies that span cloud providers. Cross-cloud or hybrid-cloud models are the only way to truly insulate yourself from most conceivable cloud outages.
Dave Bermingham is technical evangelist and cloud MVP at SIOS Technology.