Don’t ignore the issue of planned downtime when making IT continuity plans, says Alan Arnold.
Computers have become so reliable that there is an increasing temptation for CIOs to use this as an excuse not to invest in solutions that ensure high availability. Their usual argument is: “My system never breaks. It is ultra-reliable, so why should I consider committing the additional cost of implementing a high availability solution?”
CIOs often regard costs associated with increasing system availability as a business overhead, rather than an investment required to facilitate a strategic business tactic. However, confusing the terms ‘reliability’ and ‘availability’ is hazardous to an organization’s health.
According to Dr. Frank Soltis, IBM’s iSeries Chief Scientist, “Computers are now so reliable and manage themselves so well that many businesses believe that they do not need to consider the advantages of high availability because their systems will not fail. That is a view based on reliability not availability, which makes it a view that is potentially dangerous for the business.”
“Just because it’s reliable does not mean it’s available. What happens while your system is off-line running a backup? How do you cope with batch processing or application upgrades? If your system is off-line, still being reliable but unavailable to your users, then you have a problem,” says Dr. Soltis.
As Dr. Soltis makes clear, there are a number of potential sources of downtime.
Hardware failure is among the rarest and, therefore, least significant of them. Downtime can also result from a wide variety of other unpredictable conditions, such as environmental factors, human error, malfeasance, power interruptions and application failure. That list, which includes only unplanned events, still grossly understates the downtime problem.
More than 80 percent of downtime results not from unplanned incidents, but rather from planned, unavoidable activities such as data backups, database reorganizations and hardware and software upgrades. The fact that they are planned may reduce the impact these events have on business operations, but it does not eliminate their cost.
Whatever the reason, if the system is not available, users cannot access the data and applications they need to do their jobs. Consequently, the business starts losing money with the very first minute of downtime, regardless of whether it was planned or unplanned.
In the past, organizations took advantage of “maintenance and batch windows,” hours when the business was closed and IT could shut systems down with minimal impact on the business. In today’s global e-business environment, critical systems must be available 24 hours a day, seven days a week. Planned or unplanned, downtime can hurt a business because information is not available, decisions are not made, orders are not shipped, funds are not transferred and customers cannot interact with the organization. In short, business stops.
Reliability is not availability
Reliability simply measures the mean time between failures for the subject piece of hardware. Availability measures the accessibility of data and applications, regardless of the state of the hardware. Whether or not organizations consciously recognize it, availability, not reliability, is the metric that matters to them. When systems are down, customers rarely know, and almost never care, what the reason is. They just know that you are no longer able to serve them. It doesn’t take a hardware failure to drive customers away in a competitive business environment. All it takes is not being there when customers want you to be.
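The distinction can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the sample figures (a server that never fails but is taken off-line for a two-hour maintenance window each week) are assumptions chosen for the example, not measurements.

```python
HOURS_PER_WEEK = 24 * 7


def availability(period_hours: float, planned_down: float, unplanned_down: float) -> float:
    """Return the fraction of the period in which users could reach the system."""
    downtime = planned_down + unplanned_down
    if not 0 <= downtime <= period_hours:
        raise ValueError("downtime must be between 0 and the period length")
    return (period_hours - downtime) / period_hours


# Hypothetical server: no failures all week (perfectly reliable),
# but off-line for a two-hour planned maintenance window.
weekly = availability(HOURS_PER_WEEK, planned_down=2.0, unplanned_down=0.0)
print(f"Weekly availability: {weekly:.2%}")
```

In this example the hardware never fails, yet availability is 166 hours out of 168, or roughly 98.8 percent. Planned downtime counts against availability just as much as a failure would.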
Planned and unplanned downtime
There are a number of possible reasons why an IT organization might schedule downtime for a particular server or application, but they all fit into one of four generic categories:
- Backup and recovery of the physical environment (approximately 10 percent of all downtime)
- Hardware, network, operating system and other systems software maintenance and upgrades (10 percent)
- Batch processing (10 percent)
- Application and database maintenance (50 percent)
In all, planned downtime comprises about 80 percent of all downtime.
In contrast, there is an almost unlimited number of sources of unplanned downtime. Fortunately, they are all rare, and only about 20 percent of all downtime is unplanned. Causes underlying this category include:
- Environmental factors (approximately 20 percent of unplanned downtime)
- Operator and user errors (40 percent)
- Application errors (40 percent)
Despite unplanned downtime accounting for a much smaller portion of the total downtime, it captures more headlines because it includes natural disasters such as hurricanes, fires, floods and earthquakes, along with human malfeasance such as terrorism, actions of disgruntled employees and attacks by hackers.
While unplanned downtime is rare relative to planned downtime, the former tends to be the much more expensive of the two when calculated on an hourly basis. Planned downtime can be scheduled for periods when the business is shut down or, in the case of 24/7 operations, during slow periods when system unavailability will have a lesser impact on the business. In contrast, unplanned downtime, by definition, cannot be scheduled. It’s just as likely to hit when the business can least afford it as during the slow periods.
The cost of downtime
Facilitating increased availability is not free, so the question becomes how do you cost-justify it? You justify it by looking at the hourly cost of downtime and multiplying that by the average hours of avoidable downtime historically incurred each year. Comparing that to the contemplated investment in the downtime avoidance technologies and processes allows you to calculate the payback period and potential return on investment. Organizations that undertake this analysis are typically surprised by how large the return can be because, before performing the calculations, they typically grossly underestimate the amount of downtime they incur and its hourly cost.
Dun & Bradstreet reports that 59 percent of Fortune 500 companies experience at least 1.6 hours of downtime per week. This includes downtime resulting from software failures, required system reboots and normal maintenance. This translates into at least 83.2 hours of downtime in a 52-week year.
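The cost-justification arithmetic described above can be sketched in a few lines. Only the 1.6 hours per week and the 52-week year come from the figures quoted here; the hourly cost, the share of downtime assumed to be avoidable and the size of the investment are hypothetical placeholders that each organization would replace with its own numbers.

```python
def annual_downtime_cost(hours_per_week: float, hourly_cost: float, weeks: int = 52) -> float:
    """Hourly cost of downtime multiplied by the hours of downtime in a year."""
    return hours_per_week * weeks * hourly_cost


def payback_years(investment: float, annual_savings: float) -> float:
    """Years needed for avoided downtime costs to repay the investment."""
    if annual_savings <= 0:
        raise ValueError("annual savings must be positive")
    return investment / annual_savings


# Hypothetical inputs, for illustration only.
HOURLY_COST = 10_000.0    # assumed cost of one hour of downtime
AVOIDABLE_SHARE = 0.8     # assumed fraction of downtime the solution avoids
INVESTMENT = 500_000.0    # assumed cost of the availability solution

annual_hours = 1.6 * 52   # downtime figure quoted in the text
annual_cost = annual_downtime_cost(1.6, HOURLY_COST)
savings = annual_cost * AVOIDABLE_SHARE

print(f"Downtime per year: {annual_hours:.1f} hours")
print(f"Annual downtime cost: ${annual_cost:,.0f}")
print(f"Payback period: {payback_years(INVESTMENT, savings):.2f} years")
```

The result is only as good as the hourly cost that goes into it, which is why the factors that make up that cost deserve careful attention.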
How much does each of those hours cost an organization? That obviously depends on the organization, but the answer is usually much larger than what it first imagines. Downtime costs accumulate as a result of a number of factors:
- Customers who can’t complete a purchase due to an unavailable system may go to a competitor rather than wait for the system to return to service, resulting in an immediate loss of revenue.
- Currently loyal customers, and prospects who would have gone on to become loyal customers, may be lost to the competition, resulting in the loss of what would otherwise be a future revenue stream.
- Damage to the company’s reputation can lead to a loss of sales to customers who weren’t directly affected by the downtime event.
- Deferred cash flows may lead to increased financing costs or decreased interest income.
- Employees who cannot carry on their jobs still have to be paid, creating a large drain on productivity.
- In order to recover from an outage and catch up on the backlog of work, employees may have to be paid overtime, part-time employees may have to be brought in, and extra equipment may have to be rented.
- Extra costs may be incurred to take advantage of expedited shipping so as to minimize customer dissatisfaction.
- Remedial marketing programs, possibly including sales discounts, may be required to win back disgruntled customers.
- If the outage results in the missing of any contract deadlines or regulated reporting deadlines, the organization may incur significant penalties and fees. Increasingly, availability is more than a matter of dollar costs; it is a matter of law. The need to comply with increasingly demanding regulations means that continuous data availability is mandatory across more and more industries, such as financial services, pharmaceuticals, food, manufacturing and healthcare.
Broadening the perspective
Until recently, many organizations’ idea of maintaining high availability was to have a good disaster recovery plan in place. Like tactics to avoid downtime from hardware failures, disaster recovery is only one very small component of a complete managed availability methodology because, unlike other causes of downtime, disasters are exceptionally rare. However, even in the very narrow realm of disaster recovery, traditional approaches are inadequate.
The way that disaster preparedness was typically done in the past—and is still done in many organizations today—was to take nightly tape backups of all data, ship the tapes offsite and make arrangements to recover that data at another company-owned site or at a third-party site, should the need arise.
There are a few serious problems with this approach. First, tape is slow. Retrieving the necessary tapes, getting the resources ready at the recovery site, loading from tape what, at most modern organizations today, are exceptionally large databases and applications, and configuring the systems to begin running the organization’s operations can take days. This is a problem because, historically, few companies have been able to survive for very long after such lengthy outages.
Another problem with recovering from tape backups is caused by the timing of backup operations. Tape backups are taken at a point in time, typically at night. The contents represent the organization’s data as at that point. During the day, normal business operations can add, delete and modify considerable data. These updates may be backed up locally to a journal, but, being local, that journal will likely be destroyed by any disaster that is significant enough to destroy the primary database. If the data center is destroyed just before a nightly backup tape is about to be made, all of the data updates applied that day will be lost. This lost data is referred to as orphan data.
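The size of the exposure is simply the time elapsed since the last backup left the building. The following sketch assumes a hypothetical nightly backup taken at 1:00 a.m. and a disaster late the same evening; the dates, times and function name are invented for illustration.

```python
from datetime import datetime, timedelta


def orphan_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Return the span of updates that exist only on the lost primary system."""
    if failure < last_backup:
        raise ValueError("failure cannot precede the last backup")
    return failure - last_backup


# Hypothetical timeline: backup at 01:00, data center lost at 23:30 the same day.
last_backup = datetime(2006, 6, 15, 1, 0)
failure = datetime(2006, 6, 15, 23, 30)

window = orphan_window(last_backup, failure)
print(f"Updates at risk: everything from the last {window}")
```

In this example 22.5 hours of updates would be orphaned, which is nearly the full business day.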
Tape backups are, therefore, inadequate for most organizations’ disaster recovery needs, but that’s only a problem on the rare occasions when a disaster occurs.
In addition, traditional backup approaches create an ongoing problem in that the very act of taking a backup typically creates downtime. Some software cannot run at all when a backup job is running. In other cases, databases offer a ‘backup while active’ option that allows most applications to run during the backup operation, but this is often not adequate for 24/7 operations. When a backup job runs, it tries to pull all data off a disk drive as fast as it can. This typically occupies much of a computer’s resources, including the channels that data passes through from the disk drives. Business applications, even though they can technically still be online, may be so severely impeded by the backup job’s monopolization of resources as to make them virtually unusable.
The solution to all downtime problems, planned and unplanned, is to adopt a comprehensive high availability solution. Such a solution consists of more than hardware and software; it is a way of doing business that ensures the availability of your systems. That way you can be certain that your customers can interact with you where they want, how they want, and when they want, which is usually now.
Disaster recovery, although very topical due to recent events, is an outdated approach that sets a business off on the wrong line of thinking. As suggested above, one of the problems with traditional disaster recovery thinking is that it addresses, although only partially, a problem that is almost never a problem, while completely ignoring issues that can frequently grind a business to a halt. The much bigger problems are non-disastrous unplanned downtime and, in particular, planned downtime.
The best tactics for addressing this broader downtime perspective offer the added benefit of simultaneously maintaining availability in all circumstances—planned and unplanned alike. To use an admitted oversimplification, it all comes down to redundancy.
Points of failure
Redundancy is not a new concept. RAID and disk mirroring technologies have been around for some time. RAID maintains sufficient redundancy to recreate data should a disk fail. One form of RAID is disk mirroring, which creates a completely redundant copy of all data on a disk. One of the problems with these technologies is that they can survive only single points of failure. If a disaster wipes out an entire data center, mirroring will not save the data because it usually requires that the redundant disks be in close proximity. In addition, RAID addresses only data. It does not help in the least if a server is unavailable.
To safeguard against all downtime, it’s necessary to create fully redundant systems, including hardware, software, data and system settings. Fortunately, there are high availability solutions on the market that will allow organizations to do that transparently, with little or no operator intervention beyond the initial installation. How to choose the solution that is most appropriate for your organization is the subject of the remainder of this article.
Recovery Time and Recovery Point Objectives
What redundancy you create, how you maintain it, and how you facilitate the switchover to redundant resources when necessary should be determined by your organization’s recovery needs. Those needs can best be met by establishing appropriate recovery point and recovery time objectives and putting in place the technologies, policies and procedures necessary to achieve those objectives.
A recovery point objective (RPO) defines how much data you are willing to lose in the event of a disaster or other data failure. Achieving an RPO of zero lost data requires the total elimination of the orphan data described above.
What should your organization’s RPO be? Because the cost of ensuring that an RPO can be met increases as you increase the stringency of your RPO, the intuitive answer, “zero lost data,” is not always the correct one. For example, if a financial institution irretrievably loses a single piece of data, that may represent a multi-thousand-dollar, or even multi-million-dollar debit or credit from or to a customer’s account. Therefore, the bank is likely to adopt an RPO of zero lost data for its banking systems. In contrast, while this situation is increasingly rare, a manufacturer that receives orders in the mail and that produces paper-based shipping documents is likely to have a very lax RPO because it can easily recreate the lost data and it can physically count inventory to confirm that it has properly accounted for all shipments.
The manufacturing example illustrates an important point when it comes to setting an RPO. An organization that has paper backup for all transactions may decide that it is not worth investing heavily to support a low RPO because data losses are rare and the organization feels that it would make more sense to spend the money necessary to recreate its data on those infrequent occasions when the electronic version is lost.
However, many organizations no longer have this option. Today’s e-business environment eliminates paper, and when there is no paper backup, RPO becomes critical because a loss of the electronic versions of data is irreversible.
The most appropriate RPO differs not only among organizations, but also among systems within a single enterprise. For instance, that financial institution mentioned previously may adopt an RPO of zero lost data for its online banking systems, but it may be willing to live with a considerably less strict RPO for any administrative systems where data still originates on paper forms. Thus you need to consider the appropriate RPO for each part of the organization because they are not all equal.
A recovery time objective (RTO) defines how much downtime an organization is willing to tolerate. Organizations that use traditional tape-based disaster recovery techniques have, whether intentionally or not, adopted an RTO that is usually measured in days or, at best, hours, but never minutes. Enabling an RTO of minutes or even seconds is possible, but it involves an investment in technologies, processes and procedures that go far beyond the old approaches.
As with RPO, a short RTO costs more to achieve than a long one, so the most appropriate RTO differs among organizations and even among systems within an organization. An organization that does not depend heavily on its computer systems for its continuing operations (something that is rare today), or one where operations do not generate a high hourly value, may be satisfied with a long RTO. On the other hand, a large online brokerage that incurs millions of dollars of downtime costs per hour will likely set its RTO very close to zero.
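One way to make these objectives operational is to record them per system and compare each candidate recovery approach against them. The sketch below is a minimal illustration; the system names, the objectives and the worst-case figures for each strategy are all hypothetical values chosen to mirror the examples in the text.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objectives:
    rpo: timedelta  # maximum tolerable data loss, expressed as time
    rto: timedelta  # maximum tolerable downtime


@dataclass(frozen=True)
class Strategy:
    name: str
    worst_case_loss: timedelta
    worst_case_recovery: timedelta


def meets(strategy: Strategy, objectives: Objectives) -> bool:
    """A strategy qualifies only if it satisfies both the RPO and the RTO."""
    return (strategy.worst_case_loss <= objectives.rpo
            and strategy.worst_case_recovery <= objectives.rto)


# Hypothetical objectives for two systems in the same enterprise.
systems = {
    "online banking": Objectives(rpo=timedelta(0), rto=timedelta(minutes=5)),
    "paper-based administration": Objectives(rpo=timedelta(days=1), rto=timedelta(days=3)),
}

# Hypothetical worst-case characteristics of two recovery approaches.
nightly_tape = Strategy("nightly tape", timedelta(days=1), timedelta(days=2))
sync_replication = Strategy("synchronous replication", timedelta(0), timedelta(minutes=1))

for system, objectives in systems.items():
    for strategy in (nightly_tape, sync_replication):
        verdict = "meets" if meets(strategy, objectives) else "misses"
        print(f"{strategy.name} {verdict} the objectives for {system}")
```

With these assumed figures, nightly tape is adequate for the paper-based system but not for online banking, which is the point of setting objectives system by system.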
Availability: a continuum, not a point
As suggested by the discussion above, availability is not a yes or no decision. Instead, there is a continuum of options for achieving the most appropriate RTO and RPO in the context of an organization’s economic environment.
The numbers are dwindling, but there are still some organizations for which traditional backup and recovery processes are appropriate, even considering the exceptionally lax RTO and RPO they imply. For these organizations, while waiting days to recover operations that may be as much as a day out of date is undesirable, the alternatives may not be cost-justifiable due to the low economic value of their operations. In contrast, some companies, such as large financial institutions and online retailers, can justify large investments in the technologies and methodologies that will allow them to achieve RTOs and RPOs that are extremely close to zero. Between these two extremes is a wide continuum of availability solutions that can best meet the unique needs of each organization.
Improving the RPO requires copying updates (additions, deletions and changes) to the primary system and moving them offsite more frequently than the traditional nightly backups. The more frequently this is done, the less orphan data there will be. Reducing orphan data to close to zero requires that you replicate data updates offsite in near-real time. To eliminate orphan data completely, the replication process must occur within each data update transaction on the primary system. Under this scenario, the transaction on the primary system is not considered complete until the data updates are also copied to the backup system. Advanced high availability solutions make this possible, without the need to modify applications, by taking advantage of a technology called “remote journaling,” which is beyond the scope of this article.
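The difference between near-real-time replication and replication inside the transaction can be shown with a toy model. This is a sketch of the general idea only, not of remote journaling or of any vendor’s product; the class and method names are invented for the example.

```python
class Replica:
    """Stand-in for the offsite backup system."""

    def __init__(self):
        self.log = []

    def apply(self, update: str) -> bool:
        self.log.append(update)
        return True  # acknowledge receipt to the primary


class Primary:
    def __init__(self, replica: Replica):
        self.replica = replica
        self.log = []
        self.pending = []  # updates committed locally but not yet offsite

    def commit_async(self, update: str) -> None:
        """Commit locally now and ship the update offsite later."""
        self.log.append(update)
        self.pending.append(update)

    def ship_pending(self) -> None:
        while self.pending:
            self.replica.apply(self.pending.pop(0))

    def commit_sync(self, update: str) -> bool:
        """Treat the transaction as complete only once the replica has it."""
        if not self.replica.apply(update):
            return False
        self.log.append(update)
        return True

    def orphaned(self) -> list:
        """Updates that would be lost if the primary site vanished right now."""
        return list(self.pending)


replica = Replica()
primary = Primary(replica)
primary.commit_async("order 1")
primary.commit_async("order 2")
print("At risk before shipping:", primary.orphaned())
primary.ship_pending()
primary.commit_sync("order 3")
print("At risk after synchronous commit:", primary.orphaned())
```

The more often `ship_pending` runs, the smaller the at-risk list; with `commit_sync` it is always empty, at the price of waiting for the backup system on every transaction.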
Improving RTO depends on two factors: the state of the systems in the backup location and the organization’s ability to switch users to them.
Using the traditional approach, tapes are sent offsite nightly, but are not loaded on the backup system. The vault containing these tapes may or may not be at the system recovery site. In addition, that recovery site may be owned by a third party that offers recovery services to a number of customers, depending on only a small number of its customers needing to use the backup servers at a time.
Under this traditional scenario, recovery times can be exceptionally long. Depending on the size of an organization’s databases and applications, loading the backup servers when the primary site becomes unavailable for an extended period can take many hours or even more than a day and, if the backup tapes are not at the recovery site, that process can only begin after a courier has delivered the tapes. Then the servers have to be configured and users must be given access to them. Because of the lengthy recovery time under this scenario, recovery won’t be attempted unless an outage is expected to be quite lengthy. Instead, the organization will live with the downtime.
The way to dramatically reduce the time required for this element of the recovery process is to eliminate tape. Instead, replicate data, application and system configuration changes from the primary data center to disks attached to an active server at the backup site in real- or near real-time over a network. Then, should disaster strike the primary data center, the backup site will contain a completely up-to-date, ready-to-run replica of the primary server.
The other factor that affects recovery time is the facility used to switch users to the backup server. Modern high availability solutions can reduce this time to close to zero. Many solutions can also monitor the health of the primary server and automatically switch users to the backup whenever the primary server becomes unavailable, thereby eliminating the time required for an operator to detect the problem and initiate a switch. In some cases, users may not even be aware that a problem occurred or realize that they are running off the backup server.
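The monitoring-and-switching logic can be sketched as follows. This is a simplified illustration under stated assumptions: the health check is a caller-supplied function, the threshold of three consecutive failures is arbitrary, and real products handle many cases (such as switching back) that are left out here.

```python
from typing import Callable


class FailoverMonitor:
    def __init__(self, check_primary: Callable[[], bool], max_failures: int = 3):
        self.check_primary = check_primary  # returns True when the primary is healthy
        self.max_failures = max_failures    # consecutive failures tolerated before switching
        self.failures = 0
        self.active = "primary"

    def poll(self) -> str:
        """Run one health check and return the server users should be routed to."""
        if self.active == "backup":
            return self.active  # switching back is a separate, deliberate step
        if self.check_primary():
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.active = "backup"
        return self.active


# Simulated health-check results: the primary stops responding after the first poll.
results = iter([True, False, False, False, True])
monitor = FailoverMonitor(lambda: next(results))
routes = [monitor.poll() for _ in range(5)]
print(routes)
```

Requiring several consecutive failures before switching is a design choice that trades a slightly longer detection time for protection against switching on a single missed check.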
The real reward: eliminating planned downtime
As stated above, disasters and hardware failure are rare. You still must prepare for them, but cost-justifying the technologies and other resources needed to do so can be difficult because of the events’ unpredictability and infrequency. Fortunately, the more thorough your disaster preparedness, the less need there is to cost-justify it.
That sounds like poor business sense, but it’s not, because a comprehensive solution to avoiding unplanned downtime, such as from a disaster or a hardware failure, has the added benefit of also, at no added cost, preventing planned downtime, which is much more predictable and frequent. And a solution that avoids both planned and unplanned downtime will typically generate a substantial return on investment based on planned downtime avoidance alone.
Comprehensive solutions can eliminate both unplanned and planned downtime because they maintain live, fully redundant servers at a remote location and offer rapid switching capabilities. Thus, users can be switched to the backup whenever maintenance is required on the primary server. This approach can also remove some maintenance and other operations from the primary server. For example, because the backup server contains an up-to-date replica of all data, nightly tape backups can be taken from that machine, thereby eliminating the impact on business operations. In addition, it might be possible to move some “read-only” data operations, such as querying and reporting, from the primary server to the backup, thus reducing the load on the primary server and potentially deferring the need for a costly upgrade.
What’s most important? Testing and practice
All it takes to frustrate your availability efforts is one thing. Unfortunately, there is no way to know what that one thing will be until it happens. The first time you attempt to use a backup system, it might take hours to find a missing configuration setting that thwarts operations on that system. Furthermore, the best time to find out that your data replicator failed to replicate most of your business data is not when the primary system fails during the busiest time of day. And the most complete and accurate backup system in the world is no benefit if users are unable to connect to it when the need arises.
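A switchover drill can be organized as a list of named checks that are run on a schedule. The sketch below is a hypothetical outline: the three checks mirror the failure modes just described, and the row counts and always-true checks are stand-ins for whatever probes an organization would actually write.

```python
from typing import Callable, Dict, List


def run_drill(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of those that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that cannot run is treated as a failure
        if not ok:
            failed.append(name)
    return failed


# Hypothetical stand-ins for real measurements taken from each system.
primary_rows = {"orders": 1200, "customers": 310}
backup_rows = {"orders": 1200, "customers": 295}

checks = {
    "configuration present on backup": lambda: True,
    "replica row counts match primary": lambda: primary_rows == backup_rows,
    "users can connect to backup": lambda: True,
}

failed = run_drill(checks)
print("Failed checks:", failed)
```

Here the drill reports the replication mismatch during a test, rather than leaving it to be discovered during a real outage.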
Murphy’s Law (if something can go wrong, it will) applies to availability technologies and processes as much as anything else. The only way to ensure that your backup systems are accurate, usable and accessible is to test them and practice switching over to them frequently. Only then can you be sure that you will be able to translate the theory of downtime avoidance into practice in the real world. After all, the issue is not reliability or availability; it’s downtime.
About the author: Alan Arnold is president and COO of Vision Solutions. Recognized as an expert in the field of managed availability technology, he has authored and co-authored five books on technology and business topics that have been published worldwide. He has written numerous articles for some of the leading publications in the industry, and has had his work presented to the United States Senate Technology subcommittee on the topic of Y2K. Previously Arnold was a senior technology executive in the management consulting practice at Cap Gemini Ernst & Young US LLC. He also served as the firm’s subject matter expert for IBM technology and e-commerce solutions, and was one of the founders and managers of the Cap Gemini Ernst & Young International Advanced Development Center (ADC).
http://www.visionsolutions.com/

•Date: 16th June 2006 • Region: N.America/World • Type: Article •Topic: IT continuity