Is 99.999 percent availability ever a practical or financially viable possibility? Andrew Hiles explores the question.
Five nines (99.999 percent) availability: why chase it? “Because it’s there”? Or for a sound business reason? Is it actually achievable? What’s the payback? Is it some goal we strive for, like ultimate truth or perfect beauty, that we know we are unlikely ever to attain? Or should we really be striving for six nines (99.9999 percent)?
Let us examine the math of it first. A definition of availability may help: “the percentage uptime achieved per year.” Given this definition, the maximum downtime permitted per year may be calculated as reflected in Table 1 below. Please do not debate with me leap years, leap seconds or even changes to the Gregorian calendar. Equally, let us not debate time travel! To quote a Cypriot saying: “I am from a village: I know nothing.” The figures below are sufficiently accurate to make the points this article is trying to get across.
Table 1: Uptime and Maximum Downtime

Uptime | Percentage | Maximum downtime per year
Six nines | 99.9999% | 31.5 seconds
Five nines | 99.999% | 5 minutes 15 seconds
Four nines | 99.99% | 52 minutes 34 seconds
Three nines | 99.9% | 8 hours 46 minutes
Two nines | 99.0% | 87 hours 36 minutes
One nine | 90.0% | 36 days 12 hours
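The arithmetic behind Table 1 is simple enough to script; here is a minimal Python sketch (using the same simplified 365-day year as the table):

```python
# Convert an availability percentage into the maximum downtime per year,
# using the same simplified 365-day year as Table 1.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000

def max_downtime_seconds(availability_percent: float) -> float:
    """Maximum permitted downtime per year, in seconds."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)

for label, pct in [("Six nines", 99.9999), ("Five nines", 99.999),
                   ("Four nines", 99.99), ("Three nines", 99.9),
                   ("Two nines", 99.0), ("One nine", 90.0)]:
    print(f"{label:11} {pct:8}%  {max_downtime_seconds(pct):12,.1f} seconds")
```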
If you really want to throw a spanner in the works, change the definition of availability to: “The percentage of scheduled uptime per year.” But let’s not go there. We are talking absolutes.
Figure 1 summarizes components (i.e. dependencies) just for an ICT facility, excluding ICT equipment, systems and software.
Figure 1: Calculating Availability: Facility (Source: Uptime Institute)

Specification item | Specification
Number of delivery paths | 2 active
Redundant components | 2 (N+1) or S+S
Support space to raised floor ratio | 100%
Initial watts/ft² | 50-80
Ultimate watts/ft² | 150+
Raised floor height | 30-36 inches
Floor loading | 150+ pounds/ft²
Utility voltage | 12-15kV
Construction cost ($/ft² of raised floor) | $1,100+
Annual IT downtime due to site | 0.4 hours
Site availability | 99.995%
All this implies alternate power sources: for example, mains power from separate sub-stations; dual UPS; back-up generators with automatic cut-in and the capacity to cover equipment and, where necessary, end-user environments, elevators and the like; and adequate fuel supplies.
We also need permanent (24x7x365) on-site support with appropriate skills, tools, etc. The resulting 99.995 percent site availability falls short of the fabled five nines, but it is as high as the Uptime Institute’s Tier classification goes - and only 10 percent of organisations achieve it.
Well, let us assume we can engineer the facility (including the personnel involved in running and supporting it) to deliver a 99.999 percent availability.
The next question is: “How do we measure the availability of our service?” Now we need to include ICT equipment, operating systems, diagnostic, performance measurement and management software, middleware, applications and anything else used in the delivery of the service. Then we need to calculate the availability of each of these components on which the service depends.
It is easy to assume that replicating components halves the downtime: but, in introducing more components, we are also introducing greater complexity and more possible points of failure. A relatively simple service is illustrated in Figure 2 below. Components have been replicated; however, they also need to be kept in sync and switchable between the two parallel configurations at any point of failure, so switches are introduced. If one configuration is active and the other passive, the switches detect component failure in the primary (active) configuration and switch the load to the matched component in the secondary (previously passive, now active) configuration, which assumes the role and identity of the failed primary component. The result is improved resilience - but also more complexity, more components and more than double the cost.
Figure 2: A simple service with replicated components and failover switches
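In code terms, the active/passive switching just described might be sketched as follows (a toy illustration; the class and method names are my own inventions, not any vendor’s API):

```python
# A toy sketch of active/passive failover: a switch detects failure of the
# active component and promotes its passive twin, which assumes the active
# role and identity -- as described for the switches in Figure 2.
class ComponentPair:
    def __init__(self, name: str):
        self.name = name
        self.active, self.passive = f"{name}-primary", f"{name}-standby"

    def heartbeat_ok(self) -> bool:
        """Placeholder health check; real switches probe heartbeats/links."""
        return False  # pretend the active component has just failed

    def fail_over(self) -> None:
        # The standby takes over the role and identity of the failed unit.
        print(f"{self.active} down; promoting {self.passive}")
        self.active, self.passive = self.passive, self.active

pair = ComponentPair("app-server")
if not pair.heartbeat_ok():
    pair.fail_over()   # app-server-primary down; promoting app-server-standby
```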
Where there is only a single configuration, and if each component has a 99.999 percent availability, the theoretical availability of the overall service is calculated by multiplying together the availabilities of all the components in the chain: 99.999 percent × 99.999 percent × 99.999 percent, and so on for every component in the configuration. If we arbitrarily say there are ten physical components in the configuration, then - if my trembling finger has hit the right keys the appropriate number of times - the theoretical availability works out at “only” 99.990 percent. Even if we replicate all these components to the extent that we increase the configuration availability overall to 99.999 percent, and even if we manage to get the physical infrastructure (see Figure 1) to deliver 99.999 percent, our overall theoretical capability will still “only” be 99.998 percent.
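The arithmetic is easy to verify; here is a short sketch (assuming, optimistically, that component failures are independent and that switchover is instant and perfectly reliable):

```python
# Availability in series: the service is up only if every dependency is up.
# Availability in parallel: a redundant pair is down only when both halves
# are down at the same time.
def series(*availabilities: float) -> float:
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(a: float, n: int = 2) -> float:
    return 1 - (1 - a) ** n

A = 0.99999                                        # five nines per component
print(f"10 in series:   {series(*[A] * 10):.5%}")  # ~99.990%
print(f"redundant pair: {parallel(A):.8%}")        # ~99.99999999%
config, facility = 0.99999, 0.99999
print(f"overall:        {series(config, facility):.5%}")  # ~99.998%
```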
So far this figure represents purely infrastructure and hardware failure: we should also include the possibility of loss of operating systems, middleware, application software, databases and data. And beyond these elements, the two replicated systems need to be geographically separated since, if they are in the same data center, a fire, bomb, geophysical, meteorological or common infrastructure incident or facility failure could impact both of them. On top of this is security: an outage can come from a security breach just as easily as from an equipment or software failure.
Just one other point: our availability figures typically derive from manufacturers’ (or maintenance companies’) statistics on Mean Time Between Failures (MTBF). MTBF is exactly that: a mean. It conceals variations in actual performance and presents a ‘normalized’ figure. That is, not every user of equipment rated for 99.999 percent availability gets 99.999 percent: some get better performance, some worse. Microsoft claimed that its top-end Windows 2000 servers were ‘designed to deliver 99.999 percent server uptime.’ This is not the same as delivering it! An Aberdeen study of Windows 2000 customers running production systems reported that, on average, customers were achieving only about 99.964 percent uptime - about 3.2 hours of downtime per year. (1)
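For reference, the standard steady-state formula behind such statistics is A = MTBF / (MTBF + MTTR), where MTTR is the Mean Time To Repair. A quick sketch with illustrative (not vendor) figures shows how fast the nines erode:

```python
# Steady-state availability from MTBF and MTTR:
#     A = MTBF / (MTBF + MTTR)
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Even a generous 10,000-hour MTBF with a 4-hour repair time
# yields barely better than three nines:
print(f"{availability(10_000, 4):.4%}")    # 99.9600%

# Five nines with the same MTBF requires repair in ~6 minutes:
print(f"{availability(10_000, 0.1):.5%}")  # 99.99900%
```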
Another consideration is fix time, when the equipment does fail. In the case of a single component failing once in a year, a 99.999 percent availability implies a time from fail to fix of less than five and a quarter minutes! OK, you have a redundant component. Have you ever had two flat tires on a single journey? I have! You could call it Murphy’s First Law of Availability.
So is 99.999 percent achievable? Yes. But over how long a period? A year? Two years? Five years? Seven years? Basically, unless we calculate the numbers, we are effectively walking into a casino and betting against the bank. Sometimes we win. But over time, sooner or later, the bank always wins: Murphy’s Second Law of Availability.
But, if you are lucky, the equipment will become obsolete before it fails. Over time, technology upgrades are necessary. Upgrades mean change. Change means danger. How good is your testing and your change and configuration management? How good are your checks and balances, quality processes and management of people? A 2002 survey found that 31 percent of network downtime was due to human error. (2) Chernobyl was caused by operator error.
Your equipment vendors may claim five nines or better - but can your service suppliers deliver five nines to support you? Even if they offer it, we know of no service vendor that will accept liability for consequential loss following a failure, so the vendor’s failure remains your risk. There is almost always a weak link in the chain: Murphy’s Third Law of Availability.
Another issue is: when does the availability have to happen? Even in the most demanding, time-sensitive environments, there are business cycles: some times of the day or days of the year may be more critical than others. The day the markets go mad... The end of the year... The day the new, mission-critical web or call center service goes live... The day the multi-million dollar advertising campaign hits the media... The billion-dollar transaction or deal... When availability is most needed, that is when it is most likely to fail: Murphy’s Fourth Law of Availability.
So to our next questions:
* Why do we want it?
* What’s the payback?
Undoubtedly there are situations where there is a financial case for five nines – or as close to perfection as is possible. We have clients who have successfully operated at that level (so far for over two years) – typically in real-time financial trading situations.
An interesting case study found that, while some telecommunications vendors offered five nines (3), three nines was adequate for retailing. The study (4) evaluated the costs and business benefits to retail operations of 99.999 percent scheduled availability for point-of-sale tills, as opposed to 99.9 percent. It identified a potential benefit of only $297 of increased revenue and $204 of reduced expense - a total of $501 - per store.
So, what does downtime cost? Partly it depends on what other channels customers have to access your services. If you have a branch network, a call center and a web service, an outage in one of them matters less than if the call center or the web site is your only channel. We hear banks talk of five nines for Automated Teller Machines: yet ATMs are taken down each night to be replenished with money, and as long as there is another nearby, this causes no problem. Downtime impact also depends on your customers’ loyalty and on the effectiveness of your competition. And, as we have seen earlier, it also depends on when the outage happens.
Some industry statistics may help to put a context to potential losses from downtime. The numbers differ, depending on the source, but they give some idea of possible impact.
Figure 3: Downtime losses

Industry sector | Revenue lost per hour ($000s)
Energy | 2,818
Telecom | 2,066
Manufacturing | 1,611
Finance | 1,495
IT | 1,344
Insurance | 1,202
Retail | 1,107

Source: Meta Group
Other surveys provide similar estimates for other industries and offer a cross-check. A 2004 survey (5), for instance, put losses for brokerage operations at $4,500,000/hour; banking at $2,100,000/hour; media at $1,150,000/hour; and e-commerce at $113,000/hour. Retail trailed at $90,000/hour. There is also a possible hit on share value (eBay’s outages in 1999 saw its shares drop by over 26 percent, while E*Trade’s similar problems saw a 22 percent drop) - but this may be only temporary.
If your industry is on the list, it is worthwhile working out the relative impact on your business of downtime at, say, 99.5 percent, 99.9 percent, 99.95 percent, 99.99 percent and 99.999 percent availability - and comparing it with the cost of implementing availability at each level. There is probably no additional cost below 99.95 percent. Does higher availability pay back? Some of the generic costs and losses that can be incurred as a result of service downtime are illustrated in Figure 4 below.
Figure 4: Potential Causes of Loss Due to Downtime

* Impact on stock price
* Cost of fixing / replacing equipment
* Cost of fixing / replacing software
* Salaries paid to staff unable to undertake productive work
* Salaries paid to staff to recover work backlog and maintain deadlines
* Cost of re-creation and recovery of lost data
* Loss of customers (lifetime value of each) and market share
* Loss of product
* Product recall costs
* Loss of cash flow from debtors
* Interest value on deferred billings
* Penalty clauses invoked for late delivery and failure to meet service levels
* Loss of profits
* Additional cost of credit through reduced credit rating
* Fines and penalties for non-compliance
* Liability claims
* Additional cost of advertising, PR and marketing to reassure customers and prospects and retain market share
* Additional cost of working: administrative costs, travel and subsistence, etc.
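Putting rough numbers on the comparison suggested above is straightforward; in the sketch below, the hourly loss figure echoes Figure 3’s finance sector, while the cost of each availability tier is a purely illustrative placeholder:

```python
# Compare expected annual downtime loss against the cost of achieving
# each availability level -- the exercise suggested above.
HOURS_PER_YEAR = 8760

def annual_downtime_loss(availability_pct: float, loss_per_hour: float) -> float:
    downtime_hours = HOURS_PER_YEAR * (1 - availability_pct / 100)
    return downtime_hours * loss_per_hour

LOSS_PER_HOUR = 1_500_000                       # finance-sector order of magnitude
TIER_COST = {99.5: 0, 99.9: 0, 99.95: 0,        # no extra cost below 99.95%
             99.99: 2_000_000,                  # illustrative uplift costs only
             99.999: 10_000_000}

for pct, cost in TIER_COST.items():
    loss = annual_downtime_loss(pct, LOSS_PER_HOUR)
    print(f"{pct}%: expected loss ${loss:,.0f}/yr vs uplift cost ${cost:,.0f}")
```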
Uptime is important. In high-value, high-transaction-volume operations, downtime can cost big bucks (but so can a wrong business call). In military systems, the decision time will often exceed five minutes. In life-safety related activities five nines may be crucial - but even in Air Traffic Control, we have seen systems down for hours without a disaster. Should we chase six nines? Do the math. But probably not. There are more dangerous threats that we can address more cheaply. A slow and poorly designed web site or an insensitively thought-out Interactive Voice Response system may have far greater, and longer lasting, effects on customer loyalty and market share than ten minutes of downtime a year. It was not downtime that pushed a number of airlines, Enron or WorldCom into Chapter 11. Prolonged outage has certainly caused bankruptcy. But I still await a case of a five-minute outage causing total corporate collapse.
Andrew Hiles is president of Kingswell International, a consulting company specializing in business risk management and service management.
ahiles@kingswell.net
www.kingswell.net
REFERENCES

1. Data Center Construction Costs, by Larry Smith, President, ABR Consulting Group, Inc., for the Uptime Institute. The Uptime Institute (http://upsite.com/TUIpages/whitepapers/tuitiers.html) has developed a classification approach to site infrastructure that includes measured availability figures ranging from 99.67% to more than 99.99%.
2. This excludes land and abnormal civil costs. It assumes a minimum of 15,000 ft² of raised floor in an architecturally plain one-story building, fitted out for initial capacity but with a backbone designed for ultimate capacity through installation of additional components. Make adjustments for high-cost areas.
3. Graeme Bennett (posted July 11, 2001; updated July 9, 2002): “five-nines uptime = 99.999% reliability, or one hour out of service every 11.4 years.” dansdata.com
4. Yankee Group 2002 Network Downtime Survey.
5. Business Week, 2003.
6. PDMA Toolbook II, “Establishing Quantitative Economic Value for Features and Functionality of New Products and New Services,” by Kevin Otto (Product Genesis Inc), Victor Tang and Warren Seering (Massachusetts Institute of Technology).
7. Yankee Group, 2004.

Date: 15th Nov 2005 • Region: UK/World • Type: Article • Topic: IT continuity