Implementing a cloud disaster recovery solution: strategies and considerations
What should you consider before using the cloud for disaster recovery? Martin Welsh and Patricia Palacio provide some guidance.
Whatever the company size or industry, the truth is that your business can't afford downtime but traditional DR strategy investments have been difficult to justify. The majority of organizations attempt to protect only mission critical applications, leaving second-tier, but still valuable, systems vulnerable to extended outages. It's hard to justify improving your disaster recovery capabilities when you're under pressure to cut IT costs and when DR is seen as an expensive insurance policy.
The major challenges faced when planning your disaster recovery strategies are:
Cost: there has been a direct correlation between speed of recovery and cost - the faster you wanted to recover, the more it would cost you. But are cloud-based DR solutions a way to achieve the recovery capabilities of advanced DR services at a more affordable price? Leveraging the cloud for disaster recovery gives small and medium-sized businesses the same capabilities that larger companies have had for years, as well as offering all organizations a cost-saving solution that will meet business requirements while reducing the time, money, and resources expended.
Complexity: to accomplish strict recovery times with traditional disaster recovery, the DR site has to mirror the primary site. This means the server on the DR site has exactly the same hardware configuration, operating system, drivers, etc., as the physical server you are trying to recover at the production site. But with complex and ever-changing technologies and vendors, keeping the two sites perfectly in sync is difficult, which greatly increases the chances of DR failure. Is cloud-based DR the way to remove the differences between the primary and secondary sites so you can be freed from the synchronization handcuffs?
Difficult to manage: traditional disaster recovery procedures are manual and require keeping a run book that describes every recovery step. Creating and maintaining run books is labor intensive, and executing a run book can take hours, and that’s only if everything goes well. All this is counterproductive, as lost time means lost revenue. Cloud-based DR provides automated processes that help speed recovery and maintain accuracy. Some cloud-based DR vendors make the process of recovery as simple as dragging and dropping icons and having software perform the complex steps.
Difficult to scale: shared or dedicated disaster recovery sites provide little flexibility to respond to changing computing demands, but the cloud allows you to have environments that are ready to scale up at a moment’s notice. In a DR situation where your user base starts to use the environment more, cloud-based DR can scale up dynamically to meet this increased demand. After the event is over and usage decreases, the solution can scale back down to a minimum level of servers.
DR needs to be easier to manage, less complex and simple to scale. Many of these positive factors are why companies take advantage of the cloud and virtualization.
This article will help you to understand whether cloud disaster recovery is right for your organization. There are a number of alternative strategies for implementing DR in the cloud and we will discuss the key elements to consider.
What cloud model to choose?
Your ultimate goal is to build a cloud-based disaster recovery capability, but how do you select the cloud model? What is the best cloud model for your organization? Public or private cloud? Or maybe a hybrid or community cloud approach fits best? There are some key factors that you need to consider before selecting the correct deployment model.
Technological diversity and price are not the only considerations when deciding on your cloud strategy. There are other important issues that an organization should also factor in, such as the kind of data it might be sending to the cloud, performance, security, compliance, service-level agreements, and availability.
Public cloud: the ‘pay-as-you-use’ model is relatively inexpensive, and users can turn resources on and off as needed. The price of this model often looks very attractive; however, if you look at the total monthly cost, the picture changes. The monthly cost associated with cloud-based DR solutions combines a monthly subscription fee, the amount of Internet bandwidth used, the amount of storage space consumed and the number of virtual processors.
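To make the total monthly cost concrete, the four components named above can be added up in a few lines. The following is a minimal sketch only: the class names, the per-unit rates and the usage figures are hypothetical placeholders, not real provider prices, and actual billing models vary by vendor.

```python
from dataclasses import dataclass


@dataclass
class MonthlyUsage:
    subscription_fee: float  # flat monthly subscription fee
    bandwidth_gb: float      # Internet bandwidth used during the month
    storage_gb: float        # storage space consumed
    vcpus: int               # number of virtual processors


@dataclass
class Rates:
    per_gb_bandwidth: float  # hypothetical price per GB transferred
    per_gb_storage: float    # hypothetical price per GB stored
    per_vcpu: float          # hypothetical price per virtual processor


def monthly_cost(usage: MonthlyUsage, rates: Rates) -> float:
    """Add up the four cost components listed in the text."""
    return (
        usage.subscription_fee
        + usage.bandwidth_gb * rates.per_gb_bandwidth
        + usage.storage_gb * rates.per_gb_storage
        + usage.vcpus * rates.per_vcpu
    )


if __name__ == "__main__":
    # Made-up figures for illustration only.
    usage = MonthlyUsage(subscription_fee=500.0, bandwidth_gb=2000,
                         storage_gb=10000, vcpus=16)
    rates = Rates(per_gb_bandwidth=0.25, per_gb_storage=0.125, per_vcpu=25.0)
    print(f"Estimated monthly cost: {monthly_cost(usage, rates):.2f}")
```

Running the same calculation with your own usage forecast and a vendor's quoted rates shows quickly whether bandwidth or storage, rather than the subscription fee, dominates the bill.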
In general, from the technical point of view, we can say that public cloud service is best for organizations that have relatively homogeneous infrastructures - mostly, virtualized x86 servers, rather than a mix of UNIX and mainframe systems.
Another important consideration to factor in is to ensure that your public cloud provider can accommodate your disaster recovery requirements (data protection, security, performance, etc.): not every cloud provider is equipped to handle every situation. For a cloud DR service to provide true business continuity, it must also facilitate reconfiguring the network setup for an application after it is brought online in the backup site. Public Internet facing applications would require additional forms of network reconfiguration through either modifying DNS or updating routes to redirect traffic to the failover site. To support any of these features, cloud platforms need good coordination with network service providers.
Private cloud: private clouds can be external to a company’s data center; IT organizations with mixed platforms can reduce costs leveraging cloud models, such as shared resources delivered via a private cloud. The private cloud allows IT managers to have complete control over available assets, while adhering to the security standards required both within the cloud and in the data center.
The hybrid cloud is composed of two or more clouds (private, community, or public). Hybrid cloud disaster recovery reduces costs and provides high availability. Leveraging hybrid models, you can extend services from the data center into the cloud allowing applications to span both rather than being bound in either. These capabilities are what make the hybrid cloud models very attractive for organizations with legacy applications or where the virtualization level is still low.
Organizations that require warm stand-by replicas can leverage public cloud to cheaply maintain the state of an application using low cost resources under ordinary operating conditions and pay only for the more powerful–and expensive–resources after a disaster occurs.
In contrast, an organization using its own private resources for disaster recovery must always have servers available to meet the resource needs of the full disaster scenario, resulting in a much higher cost during normal operations.
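The difference between the two approaches is simple arithmetic, sketched below. The hourly rates and the number of hours spent in disaster mode are hypothetical assumptions chosen only to illustrate the comparison; the function names are ours.

```python
HOURS_PER_YEAR = 8760


def cloud_standby_annual_cost(standby_rate: float, full_rate: float,
                              disaster_hours: float) -> float:
    """Cloud warm standby: pay the low rate normally, the full rate only in a disaster."""
    if not 0 <= disaster_hours <= HOURS_PER_YEAR:
        raise ValueError("disaster_hours must be between 0 and 8760")
    normal_hours = HOURS_PER_YEAR - disaster_hours
    return normal_hours * standby_rate + disaster_hours * full_rate


def private_site_annual_cost(full_rate: float) -> float:
    """Private DR site: full-capacity resources are paid for all year round."""
    return HOURS_PER_YEAR * full_rate


if __name__ == "__main__":
    # Hypothetical rates per hour; 60 hours of the year spent running in DR mode.
    cloud = cloud_standby_annual_cost(standby_rate=2.0, full_rate=20.0,
                                      disaster_hours=60)
    private = private_site_annual_cost(full_rate=20.0)
    print(f"Cloud warm standby: {cloud:.0f}")
    print(f"Private always-on:  {private:.0f}")
```

The sketch ignores capital expenditure, staffing and data transfer charges, so it should be read as a way to frame the comparison rather than as a pricing model.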
What's best for you: a hot site, a cold site or something in between?
Disaster recovery means different things to different people. For some businesses a definition of DR might be simple data backups, while another might be referring to full standby server farms ready to take over production duties at a moment's notice. Cloud computing can be leveraged to meet any data protection requirements.
Business continuity and data protection requirements are measured in terms of RTO and RPO. The DR goal is to minimize data loss, downtime and cost.
The recovery point objective (RPO) represents the point in time, prior to unplanned disruption, to where data is restored and reflects the amount of tolerable data loss. The definition of the required RPO is a business decision in consultation with IT. In some cases, absolutely no data can be lost: a near zero RPO requires continuous synchronous replication. Other organizations can afford to implement asynchronous replication providing RPOs that could range from a few seconds to hours or even days. A common practice with cloud-based DR solutions is to replicate entire virtual machines (VMs) to the cloud so the systems can be spun up and hosted in the cloud if the primary data center becomes unavailable.
Recovery time objectives (RTO) are related to downtime. This metric refers to the amount of time it takes to recover a service from a disaster event, i.e., the acceptable system downtime.
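As an illustration of how an asynchronous replication schedule relates to the RPO, the sketch below assumes a simple periodic scheme in which a copy is taken at a fixed interval and only becomes usable at the DR site once it has been fully transferred. Under that assumption, the worst-case data loss is the interval plus the transfer time. The function names are hypothetical, and real replication products may behave differently.

```python
from datetime import timedelta


def worst_case_data_loss(replication_interval: timedelta,
                         transfer_time: timedelta) -> timedelta:
    """Worst case: the primary fails just before the newest copy finishes
    transferring, so recovery falls back to the copy taken one interval earlier."""
    return replication_interval + transfer_time


def meets_rpo(replication_interval: timedelta, transfer_time: timedelta,
              rpo: timedelta) -> bool:
    """True if the worst-case data loss stays within the agreed RPO."""
    return worst_case_data_loss(replication_interval, transfer_time) <= rpo


if __name__ == "__main__":
    interval = timedelta(minutes=15)
    transfer = timedelta(minutes=5)
    print(worst_case_data_loss(interval, transfer))
    print(meets_rpo(interval, transfer, rpo=timedelta(minutes=30)))
    print(meets_rpo(interval, transfer, rpo=timedelta(minutes=15)))
```

The point of the calculation is that a 15-minute replication interval does not by itself deliver a 15-minute RPO; the time needed to move the data over the network counts too.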
Ensuring that your cloud-based disaster recovery solution allows you to meet your RTO requirement is critical. In general cloud-based solutions can provide a faster time to recovery at a price point that even small businesses can afford.
The capability to meet the RPO and RTO requirements depends on the type of backup and data protection strategy in place between the primary and the DR site. There are three different types of DR sites that can be implemented; of course each of these DR strategies implies different recovery tradeoffs:
Cold site disaster recovery: in a cold site, also known as a standby location, everything required to restore a service must be procured and delivered to the site before the recovery process can begin and therefore the RTO is high. The delay going from a cold site to full operation can be significant. Data is often restored from tapes or only replicated on a periodic basis, leading to an RPO of hours or days. Cold sites are the least expensive sites; they can be acceptable for applications that do not require strong protection or availability guarantees.
Warm site disaster recovery: the cloud makes cold site disaster recovery antiquated. Leveraging cloud computing, a site in a ‘warm’ state can be easily provisioned with the equivalent hardware present in your primary data center. Because of this, the systems can be brought online within minutes. RTOs are still higher than those of a hot site but lower than those of a cold site. Data is kept at the warm site by either asynchronous or synchronous replication schemes, depending on the RPO requirements. Duplicated hardware exists, but active costs such as electricity and network bandwidth are lower during normal operation, which provides cost-benefit advantages.
Hot site disaster recovery: with storage array replication between sites, hot site DR becomes a much more attractive option. A hot site typically provides a set of stand-by servers that mirror the primary data center. Critical servers can be spun up in minutes on a shared or private cloud host platform. Synchronous replication is typically used for hot sites to prevent any data loss. This form of DR protection is the most expensive but provides minimal RTOs and RPOs.
Hot sites and warm sites can be implemented less expensively through cloud service providers than doing them in-house because of shared equipment. Ultimately it is a business decision whether to pay the extra premium for a hot site with reduced delay, or to find some other approach.
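The business decision described above can be framed as picking the least expensive site type whose recovery characteristics still satisfy the required RTO and RPO. The sketch below shows that selection logic. The RTO, RPO and relative cost figures assigned to each tier are illustrative placeholders only, not measurements or vendor commitments; substitute the values quoted for your own environment.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SiteTier:
    name: str
    rto_hours: float    # time to bring the service back online
    rpo_hours: float    # maximum data loss window
    relative_cost: int  # unitless ranking, higher means more expensive


# Placeholder figures for illustration only.
TIERS = (
    SiteTier("cold", rto_hours=72.0, rpo_hours=24.0, relative_cost=1),
    SiteTier("warm", rto_hours=4.0, rpo_hours=1.0, relative_cost=3),
    SiteTier("hot", rto_hours=0.25, rpo_hours=0.0, relative_cost=10),
)


def cheapest_tier(required_rto_hours: float, required_rpo_hours: float,
                  tiers: Sequence[SiteTier] = TIERS) -> Optional[SiteTier]:
    """Return the lowest-cost tier meeting both objectives, or None if none does."""
    candidates = [
        t for t in tiers
        if t.rto_hours <= required_rto_hours and t.rpo_hours <= required_rpo_hours
    ]
    return min(candidates, key=lambda t: t.relative_cost, default=None)


if __name__ == "__main__":
    for rto, rpo in [(96, 48), (8, 2), (1, 0)]:
        tier = cheapest_tier(rto, rpo)
        print(f"RTO {rto}h / RPO {rpo}h -> {tier.name if tier else 'no tier fits'}")
```

Applying this per application rather than per data center tends to show that only a subset of systems justifies the hot site premium.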
In traditional disaster recovery models (dedicated and shared) organizations are forced to make the tradeoff between cost and recovery time. In contrast, cloud-based DR solutions enable small recovery times at a fraction of the cost of traditional recovery, as illustrated in figure one.
Figure one: The tradeoff between cost and recovery time.
Application consistency and replication: how do you keep your data in sync with the cloud-based DR environment?
Cloud-based DR solutions are basically a combination of data being moved through long distances combined with the ability to quickly start up applications at the DR site after a disaster has impacted the primary site. Consequently we can see that replication and application consistency are two key factors to consider when taking DR to the cloud.
Replication: the amount and type of data that is sent to the cloud disaster recovery site (or replicated) varies depending on your RTO and RPO requirements. But what do you need to consider before deciding what replication processes to implement? Is software-based replication a better option than hardware-based replication? Should you implement real-time replication? In general, we can say that nothing beats synchronous replication for minimizing data loss. Nonetheless, synchronous replication introduces different levels of performance impact, and it’s recommended that you run performance benchmarks to determine whether your environment tolerates the impact. Some vendors provide synchronization and bandwidth estimator tools to assist with the assessment of network bandwidth requirements.
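Before reaching for a vendor's estimator tool, a first-order bandwidth figure can be worked out by hand. The sketch below assumes you know roughly how much data changes per day and the window in which it must be replicated; the function name and the optional protocol overhead factor are our own assumptions, and the result ignores compression, deduplication and link contention.

```python
def required_bandwidth_mbps(changed_gb_per_day: float,
                            replication_window_hours: float,
                            overhead: float = 0.0) -> float:
    """Average link speed (megabits per second) needed to ship one day's
    changed data within the replication window.

    overhead is a fractional allowance for protocol overhead, e.g. 0.1 for 10%.
    """
    if replication_window_hours <= 0:
        raise ValueError("replication_window_hours must be positive")
    bits = changed_gb_per_day * 1e9 * 8 * (1 + overhead)
    seconds = replication_window_hours * 3600
    return bits / seconds / 1e6


if __name__ == "__main__":
    # Hypothetical workload: 90 GB of changed data, replicated within 8 hours.
    print(f"{required_bandwidth_mbps(90, 8):.1f} Mbps")
```

This gives an average rate only. Synchronous replication has to keep up with peak write rates rather than the daily average, which is one reason benchmarking remains necessary.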
Another issue is understanding latency in the cloud and the unpredictable nature of the various network connections between your primary site and your cloud provider. Consider the following:
Application consistency: synchronous replication ensures zero data loss but does not always guarantee application consistency. Application consistency is critical for mission critical applications, and these may require the DR mechanism to be application specific, ensuring that all relevant states are properly replicated to the cloud DR site.
The selection of the data protection strategy is a compromise between cost and your organization’s business continuity requirements. Minimizing the recovery time and the data lost due to disaster allows your applications to rapidly come back online after a failure occurs.
For applications that require aggressive RTOs and RPOs, as well as application awareness, replication of the entire virtual machine is the option of choice. If you choose to simply replicate the VMs to the cloud to redirect users to a cloud-based VM in the event of a failure, your biggest challenge will be DNS. The VMs in the cloud will have IP addresses that are local to the cloud-based virtual network subnet. DNS record modifications are required so that the virtual server can be found when it’s running in the cloud. The major hypervisor vendors offer features for performing these tasks and redirecting the user workload, but some third-party backup vendors offer similar capabilities that can be used without admins having to configure IP address injections and DNS modifications.
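To illustrate the kind of address translation involved, the sketch below builds the set of DNS record changes needed when servers move from a primary subnet to a cloud subnet. It assumes, purely for illustration, that the cloud subnet is the same size as the primary one and that each VM keeps the same host offset within it. The hostnames and address ranges are hypothetical, and actually publishing the records is provider-specific and not shown.

```python
import ipaddress
from typing import Dict


def remap_to_cloud(records: Dict[str, str], primary_net: str,
                   cloud_net: str) -> Dict[str, str]:
    """Return hostname -> address records rewritten for the cloud subnet.

    Addresses outside primary_net are left unchanged.
    """
    src = ipaddress.ip_network(primary_net)
    dst = ipaddress.ip_network(cloud_net)
    if src.version != dst.version or src.prefixlen != dst.prefixlen:
        raise ValueError("this sketch assumes subnets of the same type and size")

    updated = {}
    for hostname, addr in records.items():
        ip = ipaddress.ip_address(addr)
        if ip not in src:
            updated[hostname] = addr  # not hosted at the primary site
            continue
        # Keep the same host offset inside the cloud subnet.
        offset = int(ip) - int(src.network_address)
        updated[hostname] = str(dst.network_address + offset)
    return updated


if __name__ == "__main__":
    primary_records = {
        "app.example.com": "10.1.0.15",
        "db.example.com": "10.1.0.20",
        "www.example.com": "203.0.113.7",
    }
    failover = remap_to_cloud(primary_records, "10.1.0.0/24", "172.16.5.0/24")
    for host, addr in failover.items():
        print(f"{host} -> {addr}")
```

Keeping DNS time-to-live values short for the affected records is worth considering as well, since cached entries otherwise continue to point users at the failed site.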
One-size-fits-all? Can all applications be recovered to the cloud?
There is no doubt that the primary vehicle for cloud-based disaster recovery is virtualization. The benefits of virtualization, while not necessarily specific to cloud platforms, still provide important features for disaster recovery:
All of these benefits are directly attributable to virtualizing a server. But what if your organization has physical servers that are a poor candidate for virtualization or haven’t been virtualized yet?
It’s possible to recover physical applications in a cloud DR environment leveraging tools already present in the virtual environment normally used for P2V conversions or P2V backups. In this case, physical servers will be converted to virtual systems for use during DR events only.
Physical-to-virtual (P2V) disaster recovery can take longer in a disaster scenario than recovery of a fully virtualized environment, but it is typically much faster and less costly than a dedicated physical-to-physical solution.
Cloud-based DR providers should support recoveries for virtualized (virtual-to-virtual: V2V), non-virtualized (physical-to-virtual: P2V) and physical mixed environments, including for multiple operating systems.
Security: isolation, privacy and regulatory concerns
Security has become the main concern when evaluating cloud-based disaster recovery solutions; after all, your data, applications, databases and other intellectual property are no longer located where you can physically protect them. Organizations often make the assumption that reliable protection of their data will be included in the service price, but some vendors add extra charges for encryption, for example. Encryption is the best way to protect your data, so ensure that your DR provider offers it.
There are numerous tools available to secure cloud servers and IT managers will have to carefully evaluate them. Evaluate your cloud vendor’s firewall protection (especially if you are using public cloud); ask for details of the intrusion detection and intrusion protection capabilities, antivirus capabilities, etc.
Private cloud keeps the cloud infrastructure on the premises, inside company firewalls, and under the direct control of the IT group. For security reasons, mission critical applications or those that hold classified data should remain in a private cloud or a shared government cloud. Less critical resources could be protected by public DR solutions. Nonetheless, private cloud isn’t impenetrable; vulnerabilities exist with any connection to the Internet.
Cloud DR solutions provide compliance headaches when used in highly regulated industries such as healthcare or finance. When you’re evaluating a public cloud-based DR vendor, inquire about specific laws governing the protection, storage, retrieval, and retention of data to ensure that your cloud-based DR solution will be compliant. The cloud service provider should allow auditing of security and compliance policies. Here, in the US, common regulations include the Health Insurance Portability and Accountability Act (HIPAA), Sarbanes-Oxley Act (SOX), the Patriot Act, and for European customers the EU Data Protection Directive (EUDPD).
Using encryption, obfuscation, virtual LANs and virtual data centers, cloud providers can deliver trusted security even from physically shared, multitenant environments, regardless of whether services are delivered in private, public or hybrid form.
After the disaster has passed...
The fail back step in the disaster recovery process is as critical as any other step in the process. While your business is running in DR mode at the remote site, your cloud-based DR solution must support bidirectional replication so that any data that was created at the DR site can be re-synchronized back to the primary site.
Documentation and testing: may be the most important consideration of all
Testing disaster recovery plans with traditional DR approaches is disruptive for those subject matter experts that must stop important tasks to be available during tests. DR planners must coordinate the work with multiple teams, analyze results manually and generate reports by hand. And all that work can quickly become obsolete.
The cloud greatly extends disaster recovery options; and can substantially reduce costs and recovery times. It does not, however, change the fundamentals of having to devise a solid disaster recovery plan, testing it periodically, and having users trained and prepared appropriately.
The cloud-based DR solution should be simple and automated, but be sure that your DR provider has an automated mechanism for testing your DR solution as well: you should run regular tests to validate your RTO and RPO requirements.
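What validating RTO and RPO in a test means can be shown with a small sketch. It assumes each test run records three timestamps: when the failure was declared, when service was restored at the DR site, and the time of the last data successfully replicated before the failure. The class and field names are hypothetical and not tied to any vendor's reporting format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrTestRun:
    failure_declared: datetime  # start of the simulated outage
    service_restored: datetime  # service available again at the DR site
    last_replicated: datetime   # newest data present at the DR site


def evaluate(run: DrTestRun, rto: timedelta, rpo: timedelta) -> dict:
    """Compare the recovery time and data loss achieved in a test with the targets."""
    achieved_rto = run.service_restored - run.failure_declared
    achieved_rpo = run.failure_declared - run.last_replicated
    return {
        "achieved_rto": achieved_rto,
        "achieved_rpo": achieved_rpo,
        "rto_met": achieved_rto <= rto,
        "rpo_met": achieved_rpo <= rpo,
    }


if __name__ == "__main__":
    run = DrTestRun(
        failure_declared=datetime(2024, 1, 10, 10, 0),
        service_restored=datetime(2024, 1, 10, 10, 45),
        last_replicated=datetime(2024, 1, 10, 9, 50),
    )
    result = evaluate(run, rto=timedelta(hours=1), rpo=timedelta(minutes=15))
    for key, value in result.items():
        print(f"{key}: {value}")
```

Recording these figures for every test gives a trend over time, which makes it easier to spot when changes in the production environment have quietly eroded recovery capability.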
In summary, cloud solutions are here to stay. It is up to each organization to find the best way to integrate a cloud offering into either its production or its disaster recovery strategy. Many organizations are starting small, the same way they did with managed hosting services, with one or two applications in the cloud or even doing backups in the cloud.
It is important that any DR centric solution be woven into the larger IT production roadmap. This approach is more likely to be successful. For example, as an application is being evaluated for a move to a virtual platform, consider the virtual/cloud DR options. DR projects for DR’s sake tend to struggle long-term, but integrated with long term IT goals, they tend to become entwined into long-term programs.
Cloud providers can offer production in the cloud, non-production in the cloud and DR cloud services. The solution gets complicated when it comes to designing a strategy that aligns with a company’s current needs and long-term goals. The upfront analysis is worth it in the long run.
About the authors
Martin Welsh, CBCP, MBCI, leads Cognizant’s Disaster Recovery and Business Continuity Practice. As the Practice Lead, he is responsible for the development and support of disaster recovery and business continuity methodologies. He has product management, business development and hands-on experience developing IT and business recovery programs, strategies and services for Fortune 500 clients. Marty can be reached at Martin.Welsh@cognizant.com.
Patricia Palacio, Senior Manager, Cognizant Disaster Recovery Practice. Patricia is an experienced technical architect with a diverse background in disaster recovery solutions; most recently specializing in virtual platforms and data protection.