Eliminating weak links in the chain of high availability
- Published: Friday, 29 October 2021 07:51
This article by Dave Bermingham will dig beneath the surface of application high availability, looking at the cloud servers, storage, network, and other related infrastructure components to help readers identify and eliminate weak links in the chain that ensures high availability.
Conversations about application high availability (HA) quickly devolve into a discussion of ‘the nines’. How best to configure a cluster of servers to ensure four nines of availability? That is, how can the servers be configured to be available 99.99 percent of the time? That 99.99 percent figure marks the generally accepted definition of what constitutes HA, but it can be distracting. Merely configuring a cluster of servers for 99.99 percent availability doesn’t guarantee that users will be able to access a critical application 99.99 percent of the time. If the storage system, the load balancers, or some other part of the application environment is rated for only three nines of availability (99.9 percent uptime), then your application infrastructure won’t be as highly available as you expect.
It’s important to look at the entirety of the infrastructure you are trying to maintain – the servers, storage, network, and other related infrastructure components – to determine where the weak links may lie and how best to eliminate them.
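The weak-link effect can be illustrated with a little arithmetic. The sketch below is a simplification, not a statement of how any provider calculates its SLA: it assumes every component is required for the application to work and that components fail independently, in which case the overall availability is the product of the individual availabilities. The component names and figures are illustrative only.

```python
from math import prod

def chain_availability(components):
    """Availability of a chain in which ALL components must be up.

    Assumes independent failures, so the individual availabilities
    (expressed as fractions, e.g. 0.9999) are simply multiplied.
    """
    return prod(components.values())

# Illustrative figures: a four-nines cluster sitting on three-nines storage.
stack = {"servers": 0.9999, "storage": 0.999, "load_balancer": 0.9999}

overall = chain_availability(stack)
print(f"Overall availability: {overall:.4%}")
print(f"Weakest single component: {min(stack.values()):.4%}")
```

Under these assumptions the chain as a whole comes out slightly below its weakest component, which is why a single three-nines service is enough to pull the whole application under four nines.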
Scoping out the servers
Let’s start with the servers. If you’re using cloud services from AWS, Azure, or Google, it’s relatively easy to configure a cluster of servers for HA. You’ll need at least two virtual machines (VMs), and they should be deployed in separate but geographically proximate data centers. Depending on the service provider you use, these may be called ‘zones’ or ‘availability zones’. The key point is that you want them to be in physically separate places in case a local catastrophe of some kind takes an entire data center offline. If that happens, your cluster can rely on failover management software to enable the VM in the other data center to take over.
AWS, Azure, and Google will all provide a service level agreement (SLA) guaranteeing that at least one of the VMs in your cluster will be available 99.99 percent of the time. Note, though, that 99.99 percent availability of a VM is not the same as 99.99 percent availability of your critical applications, and the SLAs are often thunderously silent on that point. It is up to you to deploy a failover clustering solution that ensures that each VM in your cluster has copies of the applications you want to run and synchronized copies of all active data. Some applications, including Microsoft SQL Server and SAP S/4HANA, have tools that can facilitate database synchronization, but there are also third-party SANless clustering tools that are application-agnostic and can perform block-level replication to keep your applications and data synchronized among server nodes.
Considering storage
The above raises questions about how and where you’ll store your data and how it will move between the cluster nodes. The type of storage system you select to support your cloud VMs can affect the overall availability of your cluster. If you are using Locally Redundant Storage (LRS), Zone Redundant Storage (ZRS), or Geo Redundant Storage (GRS) on Azure, for example, the SLA caps the access guarantee for read and write operations at 99.9 percent. You can gain a read access guarantee of 99.99 percent if you use Read Access-Geo Redundant Storage (RA-GRS), but you’re still limited to 99.9 percent write access availability. Google will guarantee access to standard storage “in a multi-region or dual-region location of Cloud Storage” ≥99.95 percent of the time. That guarantee drops to ≥99.9 percent of the time if you configure standard storage in a regional location of Cloud Storage.
The significance of those digits is not to be underestimated. If you are expecting your application to be available 99.99 percent of the time, you’re expecting no more than about 4.4 minutes of downtime per month (or roughly 52.6 minutes per year, if the SLA measure specifies an annual figure and you assume a 365-day year). If your storage system is available only 99.9 percent of the time, it could be down for about 44 minutes per month (or roughly 8.8 hours over the course of a year). And if your storage is down, so, effectively, is your application, even if the servers supporting it are up.
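The downtime figures above come from simple arithmetic, which a short sketch can reproduce for any SLA percentage. It assumes a 365-day year and treats a month as one twelfth of a year; a provider’s SLA may define its measurement period differently, so the results can differ slightly from rounded figures quoted elsewhere.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year

def downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Maximum downtime, in minutes, that an availability percentage allows."""
    return (1 - availability_percent / 100) * period_minutes

for pct in (99.99, 99.95, 99.9):
    per_year = downtime_minutes(pct)
    per_month = per_year / 12  # a month approximated as 1/12 of a year
    print(f"{pct}%: {per_year:.1f} min/year, {per_month:.2f} min/month")
```

Each lost nine multiplies the permitted downtime by ten, which is why the gap between 99.99 and 99.9 percent is far larger in practice than it looks on paper.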
Testing the network
Because the VMs in your HA cluster will sit in different zones or data centers, you need to think about the reliability of your links to the cloud, the networking appliances that may be orchestrating traffic within an individual cloud (such as load balancers), and the data interconnects between the data centers where your primary and secondary infrastructures reside. Although network availability between VMs in the same VNet is covered under the standard compute SLA, you may be using other network services whose availability can affect overall application availability.
Consider Azure ExpressRoute, an option that provides private connections between on-premises and Azure data centers. While the privacy may sound attractive, the Azure SLA for ExpressRoute guarantees only 99.95 percent availability. Are you looking at an Azure Basic Gateway for VPN? The SLA guarantees only 99.9 percent availability for the Basic Gateway (though that increases to 99.95 percent for all Gateway for VPN offerings except Basic). The AWS Direct Connect service, which links your network directly to AWS, can be backed by an SLA promising 99.99 percent availability, but only if you configure for maximum resilience (that is, with at least four dedicated connections across a minimum of two Direct Connect locations and with no fewer than two connections in a single location). If you configure the Direct Connect option with only two dedicated connections, the SLA guarantees only 99.9 percent availability.
Eliminating the weak links
A deeper examination of all the components that make up your infrastructure can quickly identify where the weak links exist. The good news is that every cloud provider offers services in each of these areas, from servers to storage to networking, that can ensure overall HA. You just need to be sure that the services you select at each level promise to meet the 99.99 percent availability goal you’re expecting.
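One way to make that examination routine is to list the SLA of every service an application depends on and flag anything that falls short of the target. The sketch below does this with the percentages quoted in this article; the dictionary, its labels, and the function name are illustrative, and the figures should be checked against each provider’s current SLA before anyone relies on them.

```python
TARGET = 99.99  # desired availability, in percent

# SLA percentages as quoted in this article (verify against current provider SLAs).
quoted_slas = {
    "Cluster VMs across zones": 99.99,
    "Azure LRS/ZRS/GRS storage": 99.9,
    "Azure ExpressRoute": 99.95,
    "Azure Basic Gateway for VPN": 99.9,
    "AWS Direct Connect (maximum resilience)": 99.99,
}

def weak_links(slas, target):
    """Return (service, sla) pairs whose SLA is below the target, weakest first."""
    below = [(name, sla) for name, sla in slas.items() if sla < target]
    return sorted(below, key=lambda item: item[1])

for name, sla in weak_links(quoted_slas, TARGET):
    print(f"{name}: {sla}% is below the {TARGET}% target")
```

Any service the check reports is a candidate for replacement with a higher-tier offering, or for a redundant configuration that restores the availability you need.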
The author
Dave Bermingham is the Senior Technical Evangelist at SIOS Technology. He is recognized within the technology community as a high availability expert and has been elected a Microsoft MVP for the past 12 years: 6 years as a Cluster MVP and 6 years as a Cloud and Datacenter Management MVP. Dave holds numerous technical certifications and has more than thirty years of IT experience, including in finance, healthcare, and education.