
Options for fully and quickly recovering from a major Azure cloud outage

Jonathan Meltzer examines four different options for ensuring application-level continuity through high availability and disaster recovery provisions in a hybrid or exclusively Azure cloud environment.

Cloud failures - both major and minor - are inevitable. What is not inevitable is extended periods of downtime or unacceptable data loss caused by any resulting service outages.

One particularly devastating outage occurred in the South Central US Region of Microsoft’s Azure cloud when it experienced a catastrophic failure on 4th September 2018. A severe thunderstorm triggered a series of problems that ultimately brought down an entire data center. Many customers were offline for a full day, and some for over two days. Microsoft has since addressed the problems that led to the outage, but the incident will not soon be forgotten by IT professionals tasked with ensuring application-level continuity.

The untold stories from that memorable day involve the many customers that were back up and running within minutes of the outage. Indeed, there are ways all Azure customers can prepare to survive virtually any outage with very little downtime, and minimal or no data loss, even when catastrophe strikes.

This article examines four different options for ensuring application-level continuity in a hybrid or exclusively Azure cloud environment. Two of the options are general-purpose, and two are unique to Microsoft’s SQL Server database, a popular application in the Azure cloud.

The four options, summarized in the table below, include:

  • The Azure Site Recovery Service
  • SQL Server Failover Cluster Instances with Storage Spaces Direct
  • SQL Server Always On Availability Groups
  • Third-party Failover Clustering Software.

These four options provide different forms of HA and/or DR protection, and can be used individually or in various combinations to meet the specific needs of different applications:

Availability in the Azure Cloud

Before discussing the four HA/DR options, it is helpful to understand some of the basic service availability provisions in the Azure cloud, which exist at three different levels: within a site, within a region and across multiple regions.

To the surprise of many Azure customers, having servers in different Availability Sets distributed across different Fault Domains offered no protection during the South Central US Incident described above. This limitation highlights an important difference between HA and DR. While each Azure Fault Domain is located in a different rack, the racks in an Azure Availability Set are all located in the same data center. Such a configuration does afford some HA protection from some potential causes of downtime, such as a server failing, but it affords no DR protection in a disaster affecting an entire data center.

To protect against outages affecting entire data centers, Azure offers Availability Zones (AZs), which were created in response to the South Central US Incident. Each region that supports AZs has a minimum of three data centers that are interconnected with a high-bandwidth and low-latency network to support HA and/or DR protection against single-site outages.

While Azure offers a 99.99 percent uptime guarantee for configurations using AZs, be forewarned that the guarantee's definition of downtime excludes many common causes of failure. Among the exclusions are customer and third-party software, and what could be called ‘human error’: the little (or big) mistakes that all administrators occasionally make when implementing and operating complex configurations. Had AZs existed during the South Central US Incident, most customers could have been back up and running within an hour.
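To put an uptime percentage in perspective, it helps to convert it into a downtime allowance. The short Python sketch below does only that arithmetic; the function name and the 365-day year are illustrative assumptions, and the calculation says nothing about how Azure itself measures or credits downtime.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored for simplicity


def allowed_downtime_minutes(uptime_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the maximum downtime an uptime percentage permits over a period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_minutes * (100 - uptime_percent) / 100


if __name__ == "__main__":
    # Compare a few commonly quoted uptime levels.
    for sla in (99.9, 99.95, 99.99, 99.999):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

On these assumptions, 99.99 percent works out to roughly 52.6 minutes a year, and five-nines to a little over five minutes, before any excluded causes of failure are taken into account.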

To protect against major disasters, Azure offers Region Pairs. Every region in the Azure cloud now gets paired with another region within the same geography, such as the US, Europe or Asia. The pairs are separated by at least 300 miles (480 km) and are strategically chosen to enable rapid recoveries during widespread power or network outages, or major natural disasters. The pairing also enables Microsoft to maintain service levels across the Azure cloud as it rolls out planned updates one region at a time.

Azure Site Recovery Service

This first option, Azure Site Recovery (ASR), is Azure’s DR-as-a-service (DRaaS) offering. ASR fully replicates both physical and virtual machines to other Azure sites, potentially in another region, or from on-premises instances to the Azure cloud. The service delivers a reasonably rapid, albeit manual, recovery from both system and site outages.

Like all DRaaS offerings, ASR has some limitations. The most serious one is the inability to automatically detect and fail over from many failures that cause downtime at the application level. This is why Azure characterizes the service as being for DR and not for HA, as the latter normally requires automatic failover.

ASR is capable of meeting a recovery time objective (RTO) in the 3-4 minute range depending, of course, on how quickly administrators are able to manually detect and respond to the failure. For applications with a low or zero recovery point objective (RPO), recovery times can be considerably longer. The reason is that, as a DR service, ASR must replicate data to a different site across the WAN. For transaction processing applications requiring high throughput performance, the data replication must be asynchronous; that is, the primary does not wait for the secondary to confirm the completion of transactions. The resulting ‘replication lag’ creates a trade-off between accommodating RTOs and RPOs that typically results in an increase in recovery times.

For critical applications with an RPO of zero or close to zero, manual processes are needed to ensure that all data, such as from a transaction log, has been fully replicated on the secondary before the failover can occur. This extra, manual, effort lengthens the recovery time, and is another reason why services like ASR are suitable for DR, but not HA.
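The trade-off described above can be made concrete with some simple arithmetic. The sketch below is an illustrative model only, not a description of how ASR works internally: it treats recovery time as the sum of detection, response, failover and any log catch-up, and treats worst-case data loss as replication lag multiplied by the write rate. All names and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryEstimate:
    """Hypothetical phases of a manual DR recovery, each in seconds."""
    detect_s: float          # time for staff to notice the outage
    respond_s: float         # time to decide on and initiate failover
    failover_s: float        # time for the secondary to come online
    catch_up_s: float = 0.0  # extra time to confirm logs are fully replicated

    @property
    def rto_s(self) -> float:
        """Total estimated recovery time: the sum of all phases."""
        return self.detect_s + self.respond_s + self.failover_s + self.catch_up_s


def worst_case_data_loss(replication_lag_s: float, writes_per_s: float) -> float:
    """Transactions at risk if the primary fails while the secondary lags behind."""
    return replication_lag_s * writes_per_s


if __name__ == "__main__":
    relaxed = RecoveryEstimate(detect_s=60, respond_s=120, failover_s=30)
    strict = RecoveryEstimate(detect_s=60, respond_s=120, failover_s=30, catch_up_s=90)
    print(f"Recovery without log catch-up: {relaxed.rto_s:.0f} s")
    print(f"Recovery with log catch-up:    {strict.rto_s:.0f} s")
    print(f"Transactions at risk: {worst_case_data_loss(5, 200):.0f}")
```

The point of the model is simply that every manual step taken to drive the RPO toward zero adds directly to the recovery time.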

SQL Server Failover Cluster Instance with Storage Spaces Direct

SQL Server offers two of its own options for high availability and disaster recovery: Failover Cluster Instances (FCIs) and Always On Availability Groups.

Failover Cluster Instances afford two notable advantages: the feature is available in the less expensive Standard Edition of SQL Server; and FCIs protect the entire SQL Server instance, including user and system databases. A major disadvantage for HA and/or DR needs is the requirement for cluster-aware shared storage, such as a storage area network (SAN), as a means to ‘replicate’ (or actually share) data between the primary and secondary. But shared storage has not historically been available in the Azure cloud, or in any other cloud service for that matter.

The lack of shared cloud storage was addressed in Windows Server 2016 with the introduction of Storage Spaces Direct (S2D). S2D is software-defined storage that creates a virtual SAN, enabling data to be shared between multiple instances. Support for S2D was also added in SQL Server 2016, and it quickly became a popular choice for use with FCIs. But S2D requires that the primary and secondary instances reside within the same data center, making this option viable for some HA needs, but not for DR. FCIs can still be used for DR purposes, but data replication across the WAN will need to be provided either by log shipping or a third-party failover clustering solution.

It is worth noting that unlike for DR, data replication for HA purposes within the low-latency LAN environment of a single data center can be fully synchronous; that is, the primary and secondary datasets are updated simultaneously. Synchronous replication makes it possible for full recoveries to occur automatically and in real-time to accommodate those applications with a low or zero RPO, and a low RTO of just a few seconds.

SQL Server Always On Availability Groups

Always On Availability Groups is SQL Server’s most capable option for both HA and DR, but its use requires licensing the more expensive Enterprise Edition. HA configurations using Always On Availability Groups are able to recover from failures within 5-10 seconds, including for applications with a low or zero RPO. This option also has readable secondaries (with appropriate licensing) for querying the databases.

An Always On Availability Groups configuration that provides both HA and DR protection consists of a three-node arrangement with two nodes in a single Availability Set and the third node in a separate Availability Zone or Region. One notable limitation is that only the user database is replicated, and not the entire SQL instance (including any system-generated databases), which must be protected by some other means.

In addition to being cost-prohibitive for many database applications, this approach has another disadvantage. Because this option is specific to SQL Server, IT departments will need to implement separate HA and/or DR provisions for all other applications, including those using a different database. Having multiple HA/DR solutions can substantially increase complexity and costs for licensing, training, implementation and ongoing operations, which is why organizations increasingly prefer to use general-purpose third-party solutions.

Third-party Failover Clustering Software

With its application- and platform-agnostic design, the failover clustering software option is able to provide a complete HA and DR solution for virtually all Windows and Linux applications in private, public and hybrid cloud environments.

Being application-agnostic eliminates the need to have separate HA/DR provisions for different applications. Being platform-agnostic makes it possible to leverage, while not depending on, various capabilities and services in the Azure cloud, including Fault Domains, Availability Sets and Zones, Region Pairs, and Azure Site Recovery.

These complete solutions include, at a minimum, real-time data replication, continuous monitoring capable of detecting failures at the application level, and configurable policies for failover and failback. Most failover clusters are able to satisfy RTOs as low as 20 seconds and RPOs under one second, and most also offer a variety of value-added capabilities, including the ability to provide both HA and DR protection for the entire SQL Server instance using FCIs in the less expensive Standard Edition.
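As a rough illustration of what a configurable failover policy driven by application-level monitoring might look like, the Python sketch below triggers a failover only after a set number of consecutive failed health checks. It is a minimal sketch under stated assumptions: the class name, the threshold and the idea of a Boolean health-check result are all hypothetical, and it does not represent any particular vendor's product.

```python
from typing import Iterable, Optional


class FailoverPolicy:
    """Trigger failover after N consecutive failed application health checks."""

    def __init__(self, failure_threshold: int = 3) -> None:
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def observe(self, healthy: bool) -> bool:
        """Record one health-check result; return True if failover should start."""
        if healthy:
            # A single success resets the count, so brief blips are tolerated.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


def first_failover_index(results: Iterable[bool],
                         policy: FailoverPolicy) -> Optional[int]:
    """Return the index of the check that triggers failover, or None if none does."""
    for index, healthy in enumerate(results):
        if policy.observe(healthy):
            return index
    return None


if __name__ == "__main__":
    checks = [True, False, False, True, False, False, False]
    print(first_failover_index(checks, FailoverPolicy(failure_threshold=3)))
```

A lower threshold shortens detection time at the risk of failing over on transient errors, so the setting is a per-application trade-off.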

One notable disadvantage is the inability to read secondary instances of SQL Server databases. But given the application-specific nature of two of these options, and the inability of basic availability-related provisions in the Azure cloud to detect common causes of failure at the application level, having a general-purpose failover clustering solution is becoming all but inevitable as mission-critical database and other applications migrate to the cloud.

Conclusion

All four of the above options can have a role to play, separately or in various combinations, in making HA and DR protection more effective and affordable for the full spectrum of enterprise applications - from those that can tolerate some data loss and extended periods of downtime, to those that require five-nines uptime with minimal or no data loss.

To survive the next Azure outage, including a major one like the South Central US Incident, make certain that your HA and/or DR provisions are configured with at least two and preferably three nodes spread across two regions, preferably in a Region Pair. Also be sure to understand any and all limitations in whatever services and options you choose, including the requirement for manual processes that might be needed to detect every possible failure and to trigger failovers in ways that ensure both application continuity and data integrity.
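The node-placement guidance above can be expressed as a simple check. The Python sketch below is illustrative only: the function, the node names and the small table of Region Pairs are assumptions made for the example, and the current pairings should be confirmed against Azure's own documentation before relying on them.

```python
from typing import Dict, FrozenSet, List, Set

# Illustrative sample only; confirm current pairings in Azure's documentation.
SAMPLE_REGION_PAIRS: Set[FrozenSet[str]] = {
    frozenset({"southcentralus", "northcentralus"}),
    frozenset({"eastus", "westus"}),
}


def check_topology(nodes: Dict[str, str],
                   region_pairs: Set[FrozenSet[str]] = SAMPLE_REGION_PAIRS) -> List[str]:
    """Compare a node-to-region mapping with the guidance; return any findings."""
    findings: List[str] = []
    regions = set(nodes.values())

    if len(nodes) < 2:
        findings.append("fewer than two nodes: no failover target")
    elif len(nodes) < 3:
        findings.append("only two nodes: a third is preferable")

    if len(regions) < 2:
        findings.append("all nodes are in one region: no protection from a regional outage")
    elif not any(pair <= regions for pair in region_pairs):
        findings.append("regions do not include a known Region Pair")

    return findings


if __name__ == "__main__":
    cluster = {
        "sql1": "southcentralus",
        "sql2": "southcentralus",
        "sql3": "northcentralus",
    }
    # An empty list means the layout matches the guidance.
    print(check_topology(cluster))
```

A check of this kind covers placement only; it cannot confirm that failover between the nodes actually works, which still requires regular testing.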

The author

Jonathan Meltzer is director, Product Management, at SIOS Technology. He has over 20 years of experience in product management and marketing for software and SaaS products that help customers manage, transform, and optimize their human capital and IT resources. Prior to SIOS Jonathan worked for both startups and established companies, including Compaq, Sun Microsystems, Kronos, Iron Mountain and EMC. He holds a BS in Computer Engineering from Syracuse and an MBA in Marketing from Columbia.


