
The RTO reality check

In an increasingly virtualized world, recovery time objectives (RTOs) are becoming more and more difficult to measure, as conventional disaster recovery testing is almost impossible to do in a live environment on operative systems. Peter Godden looks at the issue…

One of the most important duties of any IT department is the provisioning of mission critical applications without downtime. Entire business models are built on the ability to offer services 24/7, as customers simply do not accept any disruption to their subscribed services. Some industries even have recovery point objectives (RPOs) which must be met to stay legally compliant. To achieve the best possible uptime and stay compliant, businesses have often invested in stretched cluster technology to withstand the hardware failure of an entire data centre. On the software side, snapshots and regular backups are meant to shorten RPOs and RTOs to a minimum. The stretched cluster provides – in the best case – transparent failover in the event of a hardware failure, or even a complete disaster affecting an entire site. The downsides of the stretched cluster are its high cost and complexity, and the often-overlooked fact that it does not protect against logical errors or lost data.

The million-dollar question: what is your RTO?

With a job so important, and systems bought with huge budgets, the all-important question of how safe any system is boils down to: what is your RTO? An IT manager whose job it is to guarantee uptime at all times, at all costs, should be able to provide the answer. In reality, it just does not work that easily. Even with the most advanced and expensive storage systems based on stretched cluster technology, most IT admins are unable to state their system’s RTO. Of course, the exact RTO could be determined by conducting a comprehensive disaster recovery test, which should be done regularly to demonstrate that the system is working and ready for a real-world disaster.

Disaster recovery testing is very simple in theory: you just pull the plug and start the stopwatch. However, no IT manager in their right mind would ever pull the plug on a working system, even though they have paid a lot of money to be assured that they could. As systems become ever more complex, even planned disaster recovery testing can become close to impossible, especially in large organizations with a lot of data. Just planning the test can take days of preparation, as the many IT departments involved all have to be ready at the same time. Often, the only time of year when these large corporations can actually do their disaster recovery testing is between Christmas and New Year’s Day, simply because that window gives them enough time to run the test and to fix things if it goes wrong. Yet even with eight days’ time, it is not unheard of for these corporations to fail to complete a full disaster recovery test, because the amount of data to be moved is simply too big to do it in a way that is 100 percent safe.

What does that say about these organizations’ compliance when, in reality, they are unable to test disaster recovery effectively? Legal requirements in many countries state that an organization must be able to demonstrate, with reasonable effort, that its IT systems can be recovered. In reality this is not the case for most organizations, which means that many relying on basic stretched cluster technology are not even compliant – including large and well known organizations in the banking, medical and finance sectors.

Another problem impacting disaster recovery testing is that most applications today run constantly and cannot simply be stopped. A transparent failover at the hardware level just switches from one storage destination to the other without disruption; but that does not work for applications which, in the worst case, are connected to several databases. A disaster recovery test will stop the application and restart it on another machine, so a short interruption is inevitable. Since most environments are virtualized these days, an obvious question arises: why do so many organizations bet their money on old hardware-based disaster recovery strategies that cannot deliver zero RTO for the virtualized applications running on them, and that do not allow their disaster recovery strategy to be tested easily?

The Ransomware threat: a blessing in disguise?

The upside of the recent wave of ransomware attacks is that organizations are slowly coming to understand that a hardware-only approach does not protect them from disasters caused by software or by humans, and that a disaster recovery strategy is not complete with just a stretched cluster. Confronted with the obvious shortcomings of their disaster recovery setup, organizations need to look beyond hardware to protect themselves from disasters, and at the same time gain confidence in their ability to demonstrate, at any time, that their systems can recover.

Hypervisor-based replication

To solve the critical shortcomings of the current standard of disaster recovery strategies, organizations must understand a fundamental fact: their hardware-based strategy with snapshot technology was built for a physical data centre and does not match the requirements of the new virtualized world. The next logical step to guarantee disaster recovery in a virtual data centre is hypervisor-based replication, which not only addresses the threat of logical errors but also restores trust in an organization’s disaster recovery strategy. Daily disaster recovery testing, without planning or disruption to services, with just a few clicks and an auditable report afterwards stating the exact RTO, sounds like a dream to IT admins who are stressed about their disaster recovery testing. Hypervisor-based replication makes this dream a reality.
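The core idea can be sketched as a write journal: each VM write is intercepted at the hypervisor and appended to a time-ordered log at the recovery site, so any point in time within the journal window can be reconstructed – including a point just before a logical error such as a ransomware encryption run. The following is a toy model under that assumption; the class, timestamps and block names are all illustrative, not any product's implementation.

```python
class WriteJournal:
    """Toy model of the journalling idea behind hypervisor-based replication.

    Every intercepted VM write is appended to a time-ordered journal, so a
    disk image can be reconstructed as of any moment covered by the journal.
    """

    def __init__(self):
        self.entries = []  # (timestamp, block, data), appended in time order

    def record(self, timestamp, block, data):
        """Append one intercepted write to the journal."""
        self.entries.append((timestamp, block, data))

    def recover_to(self, point_in_time):
        """Replay all writes up to point_in_time into a fresh disk image."""
        disk = {}
        for ts, block, data in self.entries:
            if ts > point_in_time:
                break  # everything after this moment is ignored
            disk[block] = data
        return disk

journal = WriteJournal()
journal.record(100, "blk0", "payroll v1")
journal.record(200, "blk0", "payroll v2")
journal.record(300, "blk0", "ENCRYPTED")   # a logical error strikes at t=300

clean = journal.recover_to(250)  # roll back to just before the attack
print(clean["blk0"])             # payroll v2
```

Because recovery is a replay rather than a restore from a fixed snapshot, it can also be rehearsed against a copy at any time without touching the production VMs – which is what makes non-disruptive, frequently repeated disaster recovery testing feasible.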

The author

Peter Godden is VP EMEA, Zerto, responsible for driving a growth plan across the EMEA region to contribute to the success of the company. An accomplished executive with over 17 years’ experience in the storage industry, Peter has held senior roles at Isilon, HP, Lefthand Networks, Veritas and Backbone Software. Combined with his extensive sales expertise, his background as a qualified electrical engineer has helped customers craft solutions to some of the toughest data centre challenges.
