The state of critical application availability in public cloud and hybrid cloud environments
Published: Wednesday, 04 September 2019 07:22
Frank Jablonski looks at how organizations can provide both high availability and disaster recovery for critical applications running in purely public and hybrid cloud environments.
Some executives may believe that enterprise applications running in the public cloud are guaranteed to have high availability. But no cloud service provider currently guarantees uptime at the application level. There are, of course, ways to assure high availability for critical applications running in private, public and hybrid cloud environments. But that responsibility lies with the enterprise, not with the cloud service provider.
This article highlights some important information business and IT executives need to know to provide both high availability (HA) and disaster recovery (DR) protections for critical applications running in purely public and hybrid cloud environments, beginning with an understanding of what is and is not guaranteed in the service level agreement (SLA).
Caveat emptor in the cloud
While every cloud service provider (CSP) defines ‘downtime’ somewhat differently, each covers only a limited subset of the possible causes of downtime at the application level. In effect, the SLAs guarantee only the equivalent of ‘dial tone’ at the system level, or specifically, that at least one instance will have connectivity to the external network.
Putting it another way: many of the most common causes of downtime are excluded. Here are just a few examples from actual SLAs:
- Factors beyond the CSP’s reasonable control (such as carrier network outages and natural disasters);
- The customer’s software, or third-party software or technology, including application software (such as SQL Server and SAP);
- Faulty input or instructions, or any lack of action when required (meaning the mistakes inevitably made by mere mortals).
It is quite reasonable for CSPs to exclude these and other causes of downtime that are beyond their control. But it would be irresponsible for the IT department to use these exclusions as excuses for not providing adequate HA and/or DR protections for critical applications.
High availability and/or disaster recovery
The ‘and/or’ in this topic has real significance. A detailed explanation is provided in Business continuity and disaster recovery planning for SQL Server, but here is a summary of the key take-aways.
The differences between HA and DR are rooted in the differences between ‘failures’ and ‘disasters’. Failures are short in duration and small in scale (such as a single server crashing), while disasters have widespread and enduring impacts (such as a severe storm that knocks out power and networks and closes roads for days).
Because the redundant resources needed to restore full operation after a local failure can also be local, the data replication can occur synchronously over a local area network. This enables the standby instance to be ‘hot’ and ready to take over immediately and automatically, which should be the goal of HA provisions. Four nines (99.99 percent) uptime is generally accepted by IT professionals as constituting mission-critical HA.
For DR, the redundant resources must be separated geographically across a wide area network, where the data replication will need to occur asynchronously to prevent the inherent latency from adversely affecting the performance of applications requiring high transactional throughput. This replication lag makes the standby instance at best ‘warm’ (out of sync with the active instance) and results in an unavoidable delay during what will need to be a manual recovery process.
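To make the trade-off concrete, here is a minimal sketch, in Python, of the two replication modes; the classes are illustrative stand-ins, not any vendor’s implementation:

```python
# Minimal sketch of synchronous (HA) vs. asynchronous (DR) replication.
# Standby, SyncPrimary and AsyncPrimary are hypothetical stand-ins.
import queue
import threading
import time

class Standby:
    def __init__(self):
        self.records = []
    def apply(self, record):
        time.sleep(0.001)            # stand-in for a network round trip
        self.records.append(record)

class SyncPrimary:
    """Synchronous (HA over a LAN): commit returns only after the
    standby acknowledges, so the standby is 'hot' and loses no data."""
    def __init__(self, standby):
        self.standby = standby
    def commit(self, record):
        self.standby.apply(record)   # blocks for the LAN round trip
        return True                  # client sees success only now

class AsyncPrimary:
    """Asynchronous (DR over a WAN): commit returns immediately and a
    background thread ships writes, so the standby lags ('warm')."""
    def __init__(self, standby):
        self.standby = standby
        self._backlog = queue.Queue()
        threading.Thread(target=self._ship, daemon=True).start()
    def commit(self, record):
        self._backlog.put(record)    # client is not held up by WAN latency
        return True                  # record may not be on the standby yet
    def _ship(self):
        while True:
            self.standby.apply(self._backlog.get())

SyncPrimary(Standby()).commit({"id": 1})   # no loss, but pays the latency
AsyncPrimary(Standby()).commit({"id": 1})  # fast, but the standby lags
```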
These fundamental differences also influence the different recovery point and recovery time objectives typically established for HA and DR protections. A recovery point objective (RPO) is the maximum period during which data loss can be tolerated; if no data loss is tolerable, the RPO is zero. Because most data has great value (otherwise why capture and store it?), low or zero RPOs are common for both HA and DR purposes.
There are normally significant differences, however, in HA and DR recovery time objectives (RTO), which is the maximum tolerable duration of an outage. Critical applications have low RTOs, usually on the order of a few seconds for HA purposes, and high-volume database applications generally have the lowest. For HA, synchronous data replication makes it relatively easy to satisfy both a low or zero RPO and a low RTO of a few seconds. For DR, RTOs of many minutes or even hours are common owing to the extraordinary cost of implementing provisions capable of fully recovering from a widespread disaster in just a few minutes.
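A short worked example may help; the objectives and measurements below are hypothetical numbers chosen only to illustrate the arithmetic:

```python
# Hypothetical objectives and measurements for illustration only.
RPO_SECONDS = 0          # zero tolerance for data loss (HA and DR)
HA_RTO_SECONDS = 5       # seconds-level recovery for a critical app
DR_RTO_SECONDS = 3600    # hour-level recovery after a widespread disaster

def meets_objectives(replication_lag_s, outage_duration_s, rto_s):
    """Returns (rpo_ok, rto_ok) for one observed failover event.
    Replication lag bounds the data lost; outage duration is the time
    from failure to the application accepting work again."""
    rpo_ok = replication_lag_s <= RPO_SECONDS
    rto_ok = outage_duration_s <= rto_s
    return rpo_ok, rto_ok

# A synchronous HA failover: no lag, recovered in 4 seconds.
print(meets_objectives(0, 4, HA_RTO_SECONDS))         # (True, True)
# An asynchronous DR recovery: 30 s of lag, back in 45 minutes.
print(meets_objectives(30, 45 * 60, DR_RTO_SECONDS))  # (False, True)
```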
HA/DR options in and for the cloud
The HA and DR options available in and for the cloud fall into four categories. The first is those available within the cloud from the CSP. For HA, these normally include services based on redundant resources deployed in data centers and zones. The latter, often called availability zones, enable synchronous replication between multiple data centers to protect against an outage at one of them. For DR, all CSPs support what could be called Do-It-Yourself or DIY DR, which is a viable option because, compared to HA, DR is relatively easy to implement with the data backups or snapshots and ‘warm’ standby instances that are available in every cloud. Some CSPs also now have DR-as-a-Service (DRaaS) offerings that are more turnkey, but these still require manual processes to effect a full recovery.
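As an illustration of the DIY approach, here is a minimal sketch assuming AWS and its boto3 SDK; the volume ID and regions are hypothetical placeholders:

```python
# A minimal DIY-DR sketch, assuming AWS and boto3; volume ID and
# regions are hypothetical placeholders, not a production recipe.
import boto3

SOURCE_REGION = "us-east-1"
DR_REGION = "us-west-2"
VOLUME_ID = "vol-0123456789abcdef0"   # hypothetical volume

# Snapshot the volume in the primary region and wait for completion...
src = boto3.client("ec2", region_name=SOURCE_REGION)
snap = src.create_snapshot(VolumeId=VOLUME_ID,
                           Description="DIY DR snapshot")
src.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# ...then copy it to a geographically separate region, from which a
# 'warm' standby instance could be rebuilt during a manual recovery.
dst = boto3.client("ec2", region_name=DR_REGION)
dst.copy_snapshot(SourceRegion=SOURCE_REGION,
                  SourceSnapshotId=snap["SnapshotId"],
                  Description="Cross-region copy for DR")
```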
The second category includes capabilities built into the operating system. Windows Server Failover Clustering (WSFC) is a popular option in private clouds, but it requires shared storage, which is not available in any public cloud. The Datacenter Edition of Windows Server 2016 addressed this problem with Storage Spaces Direct (S2D), software-defined storage capable of creating a virtual storage area network (SAN) to satisfy WSFC’s need for shared storage. But S2D requires that the servers be deployed within a single data center, making it incompatible with the availability zones preferred in HA configurations. For Linux, which lacks the equivalent of WSFC, administrators have two basic choices: create custom configurations based on open source software or use a commercial HA/DR solution (see category four).
The third category includes capabilities bundled with application software. SQL Server, for example, offers two such options: Failover Cluster Instances and Always On Availability Groups. The former has the advantages of being included in the Standard Edition and protecting the entire SQL Server instance. But its dependency on WSFC and shared storage makes it incompatible with the public cloud. The latter offers carrier-class protection, but requires licensing the substantially more expensive Enterprise Edition, which cannot be cost-justified for many if not most database applications.
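For readers running Availability Groups, the sketch below shows the kind of replica health check an operator would automate; it assumes Python with pyodbc and a hypothetical listener name, and queries standard SQL Server system views:

```python
# A minimal sketch, assuming pyodbc and an existing Always On Availability
# Group; the server name in the connection string is hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=ag-listener.example.local;Trusted_Connection=yes"
)

# sys.dm_hadr_availability_replica_states reports each replica's role and
# synchronization health, which is what a monitoring script would poll
# before trusting a replica as a failover target.
rows = conn.execute("""
    SELECT ar.replica_server_name,
           rs.role_desc,
           rs.synchronization_health_desc
    FROM sys.dm_hadr_availability_replica_states AS rs
    JOIN sys.availability_replicas AS ar
      ON rs.replica_id = ar.replica_id
""").fetchall()

for name, role, health in rows:
    print(f"{name}: {role}, {health}")
```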
Another disadvantage of using any application-specific option is the need to have different HA and/or DR provisions for different applications. Having multiple solutions can substantially increase complexity and costs for licensing, training, implementation and ongoing operations. This is yet another reason why administrators increasingly prefer to use purpose-built failover clustering solutions.
The fourth and final category is commercial failover clustering software purpose-built for providing a complete HA and DR solution for any application running on either Windows or Linux in public, private and hybrid clouds. These solutions combine, at a minimum, data replication, continuous application-level monitoring and configurable failover/failback recovery policies. These capabilities enable the software to detect any and all downtime at the application level, regardless of the cause(s), including those excluded in the SLA.
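The following sketch, with hypothetical endpoints and hooks, illustrates the generic monitor-and-failover loop that such products automate; the products themselves add the data replication, quorum handling and tested recovery policies around it:

```python
# Generic application-level monitoring loop; APP_HOST/APP_PORT and
# promote_standby() are hypothetical placeholders, not any vendor's API.
import socket
import time

APP_HOST, APP_PORT = "db.example.local", 1433   # hypothetical endpoint
FAILURE_THRESHOLD = 3     # consecutive failed probes before failover
CHECK_INTERVAL_S = 5

def check_application():
    """Application-level probe: a real product would run a test query;
    here we simply verify the service port accepts connections."""
    with socket.create_connection((APP_HOST, APP_PORT), timeout=2):
        pass

def promote_standby():
    """Hypothetical hook: redirect clients and bring up the standby."""
    print("failing over to standby")

def monitor():
    failures = 0
    while True:
        try:
            check_application()
            failures = 0                      # healthy: reset the count
        except OSError:
            failures += 1                     # any cause counts, including
        if failures >= FAILURE_THRESHOLD:     # those excluded by the SLA
            promote_standby()
            return
        time.sleep(CHECK_INTERVAL_S)
```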
More detailed information on these options is included in this article about the September 2018 failure of Azure’s South-Central US Region: Options for fully and quickly recovering from a major Azure cloud outage.
The purpose of being purpose-built
The aim of purpose-built failover clustering solutions is to make robust HA/DR protections more dependable and affordable, and these solutions are proven to fulfill this purpose in practice. Unlike cloud-based or application-specific options, these commercial solutions are designed to support all applications. Having a single solution (albeit with different versions for Windows Server and Linux) makes it easier to implement, test, operate, update and otherwise manage HA/DR provisions for all applications.
Simplified testing is representative of the many advantages offered by failover clustering solutions. Testing of HA/DR configurations is vitally important, but it has traditionally been disruptive and difficult, forcing administrators to take short-cuts that can result in failover provisions failing when needed. This advantage alone is why a growing number of administrators are choosing to use purpose-built solutions.
In addition, their ability to operate in both SAN-based and SANless environments gives administrators the flexibility to choose from among purely private, purely public or hybrid cloud configurations, whichever is the most cost-effective for each and every application, and then to monitor and manage all of them from a single pane of glass.
One cost-effective configuration consists of a two-node HA failover cluster spanning two availability zones in one region, along with a third instance deployed in a separate region to facilitate full recoveries from widespread disasters.
Confidence in the cloud
The cloud’s agility, scalability and affordability make a compelling case for migrating enterprise applications. Yet despite its state-of-the-art technology and money-back guarantees, many organizations remain reluctant to migrate their critical applications.
CSPs know this, which is why many now acknowledge the need for third-party failover clustering solutions through official certifications, inclusion in software marketplaces, how-to documentation and other ways to assist customers wanting to migrate critical applications to their clouds. Microsoft, for example, has such arrangements for SANless failover clustering software for Windows Server and SQL Server.
The cloud is eminently capable of running your critical applications, but only if you accept responsibility for providing HA and DR protections at the application level.
About the author
Frank Jablonski is VP of Global Marketing at SIOS Technology, where he leads marketing and communications activities worldwide. His career spans more than 20 years and includes driving worldwide go-to-market development and execution in senior leadership positions at Acronis, Syncsort, CA, FilesX, Genuity and EMC. Frank holds a Bachelor of Science degree in Mechanical Engineering from the University of Massachusetts Lowell.