By Jerome M. Wendt
Synchronous replication is a technology that organizations often view as synonymous with the highest levels of application availability. In fact, a SearchStorage article entitled ‘WAN Mirroring and Replication’ written a little over a year ago even makes the assertion that organizations using synchronous replication can achieve recovery point objectives (RPOs) that remain near zero with recovery time objectives (RTOs) typically on the order of minutes. But is this assertion about synchronous replication really true and under what circumstances? And is it possible that asynchronous replication can actually deliver better RPOs and RTOs over a WAN for disaster recovery (DR) than synchronous replication?
Synchronous replication is often held out as the ‘Gold’ standard if organizations want to achieve the highest levels of availability and recoverability for their applications. It is true that synchronous replication applies writes to both the source and target sites in lockstep, ensuring that the data states at both sites are always in sync. The downside to this approach, however, is that it may adversely affect application performance at the source site.
The distance to the remote site - and therefore the laws of physics - determines how long it takes to apply the write at a remote site, with longer distances imposing higher latencies. At some point, this imposed latency will begin to adversely affect application performance at the source site. The viability of using synchronous replication for DR will depend upon the response time requirements of different application environments, but most organizations will not use synchronous replication when the distance to the remote site exceeds 50 kilometers.
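To put rough numbers on the distance penalty, the short Python sketch below estimates the minimum round-trip delay a synchronous write incurs purely from signal propagation. It assumes light travels through optical fiber at about 200,000 km per second and it ignores switching, protocol and storage array overhead, so real-world figures will be worse. The function names and the 0.5 ms local write time are illustrative assumptions, not measurements from any product.

```python
# Assumption: light in optical fiber covers roughly 200,000 km per second
# (about two-thirds of its speed in a vacuum).
FIBER_SPEED_KM_PER_S = 200_000.0


def round_trip_latency_ms(distance_km: float) -> float:
    """Minimum propagation delay (ms) before the remote site can acknowledge a write."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_S * 1000.0


def max_serial_writes_per_second(distance_km: float, local_write_ms: float = 0.5) -> float:
    """Upper bound on strictly serialized writes when each one waits for the remote ack.

    local_write_ms is a hypothetical local storage service time.
    """
    return 1000.0 / (local_write_ms + round_trip_latency_ms(distance_km))


if __name__ == "__main__":
    for km in (10, 50, 500, 2000):
        print(
            f"{km:>5} km: {round_trip_latency_ms(km):6.2f} ms round trip, "
            f"{max_serial_writes_per_second(km):8.1f} serialized writes/s at best"
        )
```

Under these assumptions, every synchronous write carries a delay that grows linearly with distance, which is why applications that issue many small, dependent writes feel the penalty first.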
Organizations generally assume that as long as they remain within the distance limitations of the application, they can successfully fail over and recover an application at a remote site. This assumption may lead organizations to select synchronous replication as their preferred replication option. But is the decision to select synchronous replication for remote DR based on fact or fiction?
With synchronous replication, organizations may assume that when the data in the two sites is in sync, they can fail the application over from the primary site to the remote DR site at any time, enjoying zero data loss and near real-time application recovery. However, this assumption may or may not be true depending on how the environment is configured. To achieve this, minimally the following must hold true (this is not meant to be a comprehensive list):
* The servers at the remote DR site are "aware" of the state of the application servers at the production site and can initiate a recovery. Data being synchronously replicated to the remote site is only part of the equation. If an application recovery is to occur at the remote site, there must be some sort of event that triggers the failover of the application itself from the production site to the DR site. If the software used to replicate the data has no awareness of the state of the production application, there is no way to successfully fail the application over to the DR site so that processing continues uninterrupted.
* The data must be in a consistent state at the remote site. Just because data is synchronously replicated from the production site to the remote DR site does not mean that the data at the DR site is in a usable condition. Any number of events can occur that make the state of the data at the remote site unusable and unrecoverable. The WAN link between the two sites can be interrupted or broken. A write transaction may not have completed at both the production and DR sites when a disaster happens. The application may not have flushed its buffers on the production server, so it is still holding some writes in its queue.
Unless all of these tasks are perfectly orchestrated such that the copies of data are in sync, organizations must assume the data at the remote DR site is not in a consistent, recoverable state even though the data is being synchronously replicated. In fact, organizations will likely find that the vast majority of the time they cannot do a real-time failover of a production application to a remote site, in part because the data at the remote site is not in a consistent state with the production site.
* There must be a mechanism to bring the application server in the remote DR site into a production state. Organizations may need to take any number of steps to recover the application at the remote site. They need to find a past checkpoint (these checkpoints tend to occur about once every 15 minutes for database applications) where the data at the remote site is in a recoverable state. Once they find this checkpoint, they need to apply the writes that have occurred since that checkpoint. Organizations also need to change the network settings so the application server at the DR site assumes the ‘primary’ role, notifying other applications in the network that this application is now operating from a remote DR site.
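The checkpoint step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular database or replication product works: the `LogEntry` structure, its `committed` flag and the function names are all hypothetical. The sketch finds the most recent checkpoint in the replicated log and collects the completed writes that follow it, stopping at the first write that never finished.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LogEntry:
    """Hypothetical record in the log replicated to the DR site."""
    sequence: int
    kind: str               # "checkpoint" or "write"
    committed: bool = True  # False if the write never completed


def find_last_checkpoint(log: List[LogEntry]) -> Optional[int]:
    """Return the index of the most recent checkpoint, or None if there is none."""
    for index in range(len(log) - 1, -1, -1):
        if log[index].kind == "checkpoint":
            return index
    return None


def writes_to_replay(log: List[LogEntry]) -> List[LogEntry]:
    """Collect completed writes after the last checkpoint, in order.

    Replay stops at the first incomplete write, because anything after it
    may depend on data that never reached the DR site.
    """
    checkpoint_index = find_last_checkpoint(log)
    if checkpoint_index is None:
        raise ValueError("no recoverable checkpoint found at the DR site")

    replay: List[LogEntry] = []
    for entry in log[checkpoint_index + 1:]:
        if entry.kind != "write":
            continue
        if not entry.committed:
            break
        replay.append(entry)
    return replay


if __name__ == "__main__":
    replicated_log = [
        LogEntry(1, "checkpoint"),
        LogEntry(2, "write"),
        LogEntry(3, "checkpoint"),
        LogEntry(4, "write"),
        LogEntry(5, "write"),
        LogEntry(6, "write", committed=False),  # in flight when the disaster hit
    ]
    print([entry.sequence for entry in writes_to_replay(replicated_log)])
```

Even in this simplified form, recovery is a multi-step procedure that takes time, and the write that was in flight at the moment of the disaster is lost regardless of how the data was replicated.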
As noted, this is not meant to be a comprehensive list of everything that an organization needs to consider when implementing synchronous replication. Rather, it is meant to point out that organizations cannot and should not assume that merely implementing synchronous replication will provide real-time application availability or failover. In fact, one IT director at a New York financial institution recently discovered that after implementing synchronous replication it still took him and his staff over 40 minutes to fail over an application server from a production site to his DR site.
Organizations need to exercise a great deal of caution when implementing synchronous replication and should not assume that synchronous replication is synonymous with instantaneous application recovery, especially when it comes to recovering applications at remote DR sites. If anything, by the time one looks at the costs and different steps that one has to go through to support synchronous replication, a solid case for using asynchronous replication in lieu of synchronous replication can be made for remote DR.
Author: Jerome M. Wendt is President and Lead Analyst at DCIG.

This article was first published by DCIG and is reproduced with permission.

•Date: 29th May 2009• Region:US/World •Type: Article •Topic: IT continuity