Business continuity and disaster recovery planning for SQL Server
- Published: Thursday, 13 June 2019 07:41
This article by Dave Bermingham provides some practical guidance to help system and database administrators tasked with creating business continuity and disaster recovery plans.
For those administrators who hate to plan, General George Patton offers this advice: “A good plan today is better than a perfect plan tomorrow.” No business continuity or disaster recovery plan can address every conceivable event or set of circumstances, which is why both the BC and DR plans should continually evolve as lessons learned inform improvements.
Providing the guidance needed to create a solid business continuity plan would fill a book. But because the business continuity plan forms the foundation for the disaster recovery plan, at least some discussion is warranted here. What follows is a summary of seven steps that have proven to be useful when creating and enhancing business continuity plans.
Step #1: Prepare to plan
This step mostly involves gathering pertinent information about key personnel, customers and suppliers, facilities, utilities, security provisions, records, operating procedures and processes, service and licensing agreements, applicable privacy regulations, etc.
Step #2: Establish plan objectives
The business continuity plan must support the organization’s core mission, and that requires establishing a set of objectives based on an assessment of potential disruptions. Of particular interest to IT are the recovery time and recovery point objectives (covered below), as well as the budget available before, during and after a disruption.
Step #3: Identify and prioritize potential threats and impacts
While it is not possible to foresee every way the business might be disrupted, there are likely threats based on the organization’s locations and circumstances. Every facility could lose power, but only some might experience a tornado, hurricane or earthquake. Use probabilities to determine priorities, and estimate the potential duration of each threat.
Step #4: Develop mitigation and business continuity strategies
This is the core of the business continuity plan, and should include ways to minimize business impacts before, during and after recovering from a disruption. For IT, the mission-criticality of each application will be used to determine its priority in the DR plan. For all departments, the ability to maintain communications will be key, especially in the event the plan fails and a contingency is urgently needed.
Step #5: Identify teams and tasks
This step could be included in Step #4, but it is kept separate here to convey its importance. After all, it is people who will implement the business continuity plan and people who will take action to compensate for any of the plan’s deficiencies, such as critical tasks not included in a checklist. This step should also establish a line of succession with alternate members or teams identified should the primary ones be unavailable.
Step #6: Test the plan
The best way to uncover holes in the plan and prepare teams for implementing it is to test it—thoroughly and regularly—by simulating business disruptions caused by the threats identified. Scheduled power outages can serve as ideal opportunities to conduct these tests, but some should also occur unannounced.
Step #7: Maintain/enhance the plan
This step is ongoing and serves as the feedback loop for adjusting, updating, enhancing and otherwise maintaining the plan based on lessons learned during the tests and actual disruptions. Anything new, such as a new facility, application or service, should also go through the planning process separately or as part of this ongoing step.
Disaster recovery planning
The disaster recovery plan for IT builds on the business continuity plan with specific strategies that protect the organization’s data and ensure critical applications can continue to run with minimal or no disruption. There are two aspects of the business continuity plan that are fundamental to DR planning: the threat assessment in Step #3 and the business impact analysis in Step #4. The former identifies those disasters the organization is most likely to experience, while the latter determines which applications are mission-critical.
The plan should recognize the difference between ‘failures’ and ‘disasters’ because that difference can result in the need to implement different provisions for high availability (HA) and disaster recovery. Failures are short in duration and small in scale, affecting a server, a rack, or the power or cooling in a data center. Disasters have enduring impacts and are more widespread, affecting entire data centers in ways that preclude rapid localized recovery. For example, a tornado, hurricane or earthquake might knock out power and networks and close roads, making the data center inaccessible for days.
Perhaps the biggest difference involves replication to redundant resources (systems, software and data), which can be local (on a local area network) to recover from a failure. By contrast, the redundancy required to recover from a disaster must span ‘long distance’ across a wide area network. For database applications that require high transactional throughput, the ability to replicate the active instance’s data synchronously across the LAN enables the standby instance to be ‘hot’ (in sync with the active instance), ready to take over immediately and automatically in the event of a failure. Such rapid response should be the goal of all HA provisions.
Because the latency inherent in the WAN would adversely impact throughput performance in the active instance when using synchronous replication, data is normally replicated asynchronously in DR configurations. This means that updates to the standby instance always lag behind updates being made to the active instance. This makes the standby instance ‘warm’ and results in an unavoidable delay during disaster recovery. The delay is tolerable because disasters are rare and usually also disrupt the ability of the affected users to work.
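The hot-versus-warm distinction can be sketched in a few lines of Python. This is a toy model, not SQL Server’s actual replication machinery: with synchronous replication the commit does not complete until the standby has the update, so the standby never lags; with asynchronous replication the commit returns immediately and the standby catches up later.

```python
class Standby:
    """A standby database instance receiving replicated writes."""
    def __init__(self):
        self.data = {}

class Primary:
    """Active instance replicating to a standby, sync or async."""
    def __init__(self, standby, synchronous):
        self.data = {}
        self.standby = standby
        self.synchronous = synchronous
        self.pending = []          # writes not yet applied on the standby

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Commit waits for the standby: it is always 'hot'
            # (in sync), at the cost of added commit latency.
            self.standby.data[key] = value
        else:
            # Commit returns immediately; replication happens later,
            # so the standby is only 'warm' and may lag behind.
            self.pending.append((key, value))

    def ship_log(self):
        """Asynchronous replication catching up in the background."""
        while self.pending:
            key, value = self.pending.pop(0)
            self.standby.data[key] = value

# Synchronous (HA across a LAN): the standby never lags.
hot = Standby()
p1 = Primary(hot, synchronous=True)
p1.write("balance", 100)
assert hot.data == p1.data

# Asynchronous (DR across a WAN): the standby lags until the log ships.
warm = Standby()
p2 = Primary(warm, synchronous=False)
p2.write("balance", 100)
assert "balance" not in warm.data   # standby is behind the primary
p2.ship_log()
assert warm.data == p2.data
```

In this sketch, any writes still sitting in `pending` when the primary is destroyed are exactly the data a disaster could lose, which is why the asynchronous standby is only ‘warm’.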
These differences lead to differences in the recovery time objectives (RTO) and recovery point objectives (RPO) established for HA and DR purposes. RTO is the maximum tolerable duration of an outage. Mission-critical applications have low RTOs, normally on the order of a few seconds for HA, and high-volume online transaction processing applications generally have the lowest. For DR, RTOs of many minutes or even hours are fairly common owing to the extraordinary cost of implementing provisions capable of fully recovering from a widespread disaster in mere minutes.
RPO is the maximum period during which data loss can be tolerated. If no data loss is tolerable, the RPO is zero. Because most data has great value (otherwise there would be no need to capture and store it), low RPOs are common for both HA and DR purposes. For HA, synchronous data replication makes it relatively easy to satisfy a low or zero RPO.
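As a rough rule of thumb (a simplification; real exposure also depends on in-flight transactions), the worst-case data loss under asynchronous replication is the replication lag: everything committed on the primary but not yet applied on the standby. Checking an RPO target then reduces to comparing lag against the objective:

```python
def meets_rpo(replication_lag_seconds, rpo_seconds):
    """Worst-case data loss under asynchronous replication is roughly
    the replication lag, so an RPO is satisfied when the lag never
    exceeds it (a simplified rule of thumb)."""
    return replication_lag_seconds <= rpo_seconds

# Synchronous replication has (in effect) zero lag, so it can satisfy RPO = 0.
assert meets_rpo(replication_lag_seconds=0, rpo_seconds=0)

# Async replication lagging 30 seconds misses an RPO of 5 seconds
# but satisfies an RPO of 5 minutes.
assert not meets_rpo(replication_lag_seconds=30, rpo_seconds=5)
assert meets_rpo(replication_lag_seconds=30, rpo_seconds=300)
```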
The situation for DR is substantially different, however: a low RPO creates a potential tradeoff with RTO. Here’s why: for applications with an RPO of zero, manual processes are required to ensure that all data (e.g. from a transaction log) has been fully replicated on the standby instance before the recovery (in the form of a failover) can occur. This additional effort increases recovery time.
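The tradeoff can be made concrete with a toy failover routine (illustrative only; the timings and log format are invented, not SQL Server’s): draining the pending log before promotion preserves every update but lengthens recovery, while promoting immediately recovers fast but loses whatever had not yet replicated.

```python
def failover(standby_data, pending_log, rpo_zero, seconds_per_entry=1):
    """Toy failover. With RPO = 0, the pending transaction log must be
    drained onto the standby before promotion, lengthening recovery
    (higher RTO). With a nonzero RPO, the standby is promoted at once
    and the pending entries are lost (data loss up to the lag)."""
    recovery_time = 0
    lost = []
    if rpo_zero:
        for key, value in pending_log:
            standby_data[key] = value           # apply remaining log entries
            recovery_time += seconds_per_entry  # each entry adds to the RTO
    else:
        lost = list(pending_log)                # unreplicated updates are lost
    return recovery_time, lost

standby = {"a": 1}
log = [("b", 2), ("c", 3)]

# RPO = 0: no data lost, but recovery takes longer.
t, lost = failover(dict(standby), list(log), rpo_zero=True)
assert (t, lost) == (2, [])

# Nonzero RPO: immediate recovery, but the pending updates are gone.
t, lost = failover(dict(standby), list(log), rpo_zero=False)
assert (t, lost) == (0, [("b", 2), ("c", 3)])
```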
The DR options
Recognizing that DR is different from HA, and that longer RTOs of many minutes or even hours are the norm when recovering from a disaster, system and database administrators have considerable latitude in choosing different DR provisions for different applications.
In the public cloud, most service providers have what could be called Do-It-Yourself DR guided by templates, cookbooks and other tools. DIY is viable because, unlike HA provisions, DR is relatively easy to implement by replicating data to warm standby instances in another availability zone or region. A few cloud service providers now offer managed DR-as-a-Service (DRaaS), which automatically replicates active instances (software and data) to one or more standby instances, also in another availability zone or region to enable recoveries from widespread disasters. For both DIY and DRaaS, the recovery process is manual.
While DR is different from HA, it is possible (and generally preferable) to add DR to an existing HA configuration. These multi-node, multi-site, combined HA/DR configurations have the advantage of being implemented using a single solution vs. requiring separate provisions for HA and DR, which can also be different for different applications.
There are two popular options for combining HA and DR provisions for SQL Server. One is SQL Server’s own Always On Availability Groups feature, which is capable of satisfying demanding RTOs and RPOs. But this approach requires the more expensive Enterprise Edition and protects only the user databases, not the entire SQL Server instance.
The other HA/DR combo option is third-party failover clustering solutions. These are purpose-built to support virtually all applications running on Windows Server and Linux in public, private and hybrid clouds. They are implemented entirely in software and usually include real-time data replication, continuous monitoring for detecting failures at the system and application levels, and configurable policies for failover and failback.
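The “continuous monitoring” and “configurable policies” mentioned above can be sketched as a heartbeat-based failure detector (a generic toy model; node names and the miss threshold are invented, and real clustering products add quorum, fencing and application-level checks):

```python
class FailoverMonitor:
    """Toy heartbeat monitor in the spirit of a failover clustering
    solution: declare the active node failed after a configurable
    number of consecutive missed health checks, then fail over."""
    def __init__(self, max_missed=3):
        self.max_missed = max_missed   # the configurable failover policy
        self.missed = 0
        self.active = "node-a"
        self.standby = "node-b"

    def health_check(self, heartbeat_received):
        if heartbeat_received:
            self.missed = 0            # healthy: reset the counter
            return "ok"
        self.missed += 1
        if self.missed >= self.max_missed:
            # Promote the standby; the old active node becomes the
            # failback target once it is healthy again.
            self.active, self.standby = self.standby, self.active
            self.missed = 0
            return "failover"
        return "ok"

mon = FailoverMonitor(max_missed=3)
assert mon.health_check(False) == "ok"        # one missed beat: tolerate
assert mon.health_check(False) == "ok"        # two missed beats: tolerate
assert mon.health_check(False) == "failover"  # third miss triggers failover
assert mon.active == "node-b"
```

Raising `max_missed` trades slower failure detection (longer RTO) for fewer false failovers on transient network glitches, which is exactly the kind of policy knob these solutions expose.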
Failover clustering solutions that integrate with Windows Server Failover Clustering enable the use of SQL Server Failover Cluster Instances (FCIs) spanning data centers or cloud regions. SQL Server FCIs are supported by both the Standard and Enterprise Editions of SQL Server for Windows. Because SQL Server 2017 for Linux lacks the equivalent of WSFC, the failover clustering solution must handle all data replication and other functionality directly.
Whatever option(s) you choose, keep in mind that the only thing harder than planning for how to recover from disasters is trying to explain why you didn’t!
David Bermingham is Technical Evangelist at SIOS Technology. He is recognized within the technology community as a high-availability expert and has been honored to be elected a Microsoft MVP for the past 8 years: 6 years as a Cluster MVP and 2 years as a Cloud and Datacenter Management MVP. David holds numerous technical certifications and has more than thirty years of IT experience, including in finance, healthcare and education.