Calculating acceptable downtime

By John Robinson.

The ideas surrounding acceptable downtime have always generated lively discussion amongst business continuity practitioners.

In one form or another, acceptable downtime forms a central pillar of most business continuity management systems, looking to customers and other stakeholders to justify the timing and prioritisation of BCM activities and expenditure. Most of us would accept that getting it wrong can have risk and cost implications, potentially misinforming or misleading those who rely on the information.

A simple example illustrates this. We estimate tolerance to loss of a customer-facing service at 48 hours. This deadline is accepted by the business and everyone plans for it. After investing in a warm standby, the service is test-restored successfully in 36 hours and everyone is happy. Six months later we realise we overlooked a client contract that demands a tested 12 hour continuity response. We are now faced with either re-visiting our plans and spending considerably more or accepting the contractual risk. Getting it right is important.

BS 25999 defines the maximum tolerable period of disruption (MTPD) as “the duration after which an organization’s viability will be irreparably damaged if delivery of a particular product or service cannot be resumed”. It advises us to “…assess over time the impacts… if the activity is disrupted” and “…establish the MTPD of each activity”. It instructs us to identify the latest time by which an activity must be resumed, establish the minimum level to which resumption must be achieved, and set the time within which normal activity levels must be restored. It says we should “…identify any inter-dependent activities, assets, supporting infrastructure or resources that also have to be maintained”.

This is sound advice but, followed blindly, it risks error through over-simplification. The remainder of this article explains why.

Imagine for a moment that you are the business continuity manager of an enviably straightforward organization with just two departments and two customers. Each department delivers its service to both customers, and the mission is to establish an MTPD for service A. Armed with the BS 25999 definition, it should be easy to phrase the questions we need to ask the department heads.

[Figure: MTPD diagram]

We interview the manager of department A and ask “…in a disaster, after how long will the organization’s viability be irreparably damaged if delivery of your service cannot be resumed?” Manager A hesitates and admits he has a problem; he asks about service B’s impact status which he suspects may already be close to tolerance under the circumstances described. He explains that if this is the case, even an hour’s delay in restoring his service could take the organization over the edge, whereas instinct tells him his MTPD should be considerably longer. He can’t answer the question in isolation and needs more information.

We try prefixing the original question with “assuming B’s MTPD to be x…”. He points out that this can’t work either, since whatever we assume could be similarly incorrect. We propose taking the shortest MTPD across each service-customer dependency, assuming all others on that service will then be covered. Again this is invalid: although individually tolerable, these impacts may be material and additive, and the approach ignores the fact that other service impacts may already have taken the organisation close to the edge. All this suggests we may be inclined to simplify and so over-estimate acceptable downtime.

Circular argument
So is it a problem and won’t an expert or educated guess be enough? Take a slightly larger organization that can afford to lose £1M without lasting impact. Department A uses the standard definition and concludes that with current recovery capability it will exceed tolerance after two days. We find that within the same period department B estimates losses of £600k, department C £800k and so on. Each is individually acceptable but cumulatively tolerance is far exceeded. This basic reasoning also applies for other less tangible impacts such as reputation and for internal service levels.
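
The arithmetic can be made explicit with a minimal sketch. It uses only the £1M tolerance and the two department estimates quoted above; department A is left out because no figure is given for it, and the function name `aggregate_check` is purely illustrative.

```python
# Figures taken from the example above: a £1M tolerance and the loss
# estimates for departments B and C. Department A is omitted because the
# article gives no figure for it.
TOLERANCE_GBP = 1_000_000

estimated_losses = {
    "B": 600_000,
    "C": 800_000,
}


def aggregate_check(losses, tolerance):
    """Return (total, each_within_tolerance, total_within_tolerance)."""
    total = sum(losses.values())
    each_ok = all(loss <= tolerance for loss in losses.values())
    return total, each_ok, total <= tolerance


total, each_ok, total_ok = aggregate_check(estimated_losses, TOLERANCE_GBP)
print(f"Combined loss: £{total:,}")
print(f"Each department within tolerance: {each_ok}")
print(f"Combined loss within tolerance: {total_ok}")
```

Even with only two departments counted, each estimate passes on its own while the combined figure of £1.4M does not.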

Given this, it’s evident that we should know the impact contribution for all affected services before we set any service MTPD. However, we can’t do this without knowing their recovery proposition which in turn is part-defined by their MTPD. This circular condition reflects the fact that service MTPDs are inter-dependent rather than isolated, each needing to be set so it contributes acceptably to an overall impact total. It implies we can’t collect valid MTPD data in isolation and need some form of bid round, discussion or analysis to set reliable values.
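
One way to picture such an analysis is the sketch below. It is not a method taken from BS 25999 or from any particular tool; it assumes, purely for illustration, that each service's impact grows linearly with downtime, and the hourly loss rates and starting MTPDs are hypothetical. Starting from values set in isolation, it repeatedly shortens the MTPD of whichever service currently contributes most, until the combined impact fits the overall tolerance.

```python
def total_impact(loss_per_hour, mtpd_hours):
    """Combined impact if every service is down for its full MTPD."""
    return sum(loss_per_hour[s] * mtpd_hours[s] for s in mtpd_hours)


def balance_mtpds(loss_per_hour, mtpd_hours, tolerance, step=1):
    """Shorten MTPDs step by step until combined impact fits tolerance.

    Assumes impact is linear in downtime, which is a simplification.
    """
    mtpds = dict(mtpd_hours)  # work on a copy
    while total_impact(loss_per_hour, mtpds) > tolerance:
        # Only services that can still be shortened are candidates.
        candidates = [s for s in mtpds if mtpds[s] > step]
        if not candidates:
            raise ValueError("tolerance cannot be met by shortening MTPDs")
        # Shorten the service currently contributing the largest impact.
        worst = max(candidates, key=lambda s: loss_per_hour[s] * mtpds[s])
        mtpds[worst] -= step
    return mtpds


# Hypothetical inputs: hourly loss rates (GBP) and MTPDs set in isolation.
rates = {"A": 10_000, "B": 12_500, "C": 5_000}
initial = {"A": 48, "B": 48, "C": 48}
tolerance = 1_000_000

balanced = balance_mtpds(rates, initial, tolerance)
print(balanced, total_impact(rates, balanced))
```

A different selection rule, for example shortening the service that is cheapest to recover faster, would give a different but equally acceptable set of values.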

The reasoning goes a step further to imply (and common sense suggests) that even for a single worst-case disruption, there may be a number of equivalent ‘optimum’ recovery patterns and not just the one we have chosen to write down. This offers the possibility of reduced recovery cost if we can somehow pick the right one.

Missing link
BS 25999 says that “When assessing impacts, the organization should consider those that relate to its business aims and those of its stakeholders”. It lists seven impact types and emphasises the importance of documenting all that affect us - human, assets, regulation, reputation, financial, quality, environment. These have collective significance in the way we assess tolerance and set recovery time objectives. Again, the standard leads us to the edge but stops short of identifying an acceptable compliant approach. We need access to these missing links and as a precursor it makes sense to understand or model the value chain so we can answer at least the following questions:

* How will each pattern of service loss affect each stakeholder group?
* How will stakeholder groups react?
* How will the impacts arising from all affected stakeholders combine and accumulate?
* What different mixes and levels of impact will be tolerable for us?
* At what point(s) will we exceed this level?

These are far-reaching and imply that for even a very straightforward organization, we ought to understand our stakeholders’ tolerance, dependency and reaction characteristics before we begin to set our own recovery deadlines.

We should be able to show that the impact stakeholders inflict won’t leave lasting unacceptable damage when all our MTPDs are met. So, bearing in mind the scale of potential error illustrated in the simple example above, do we allow budget to dominate, or do we bite the bullet and present a more detailed case?

My own view is that the forces exerted by acceptable downtime, current exposure and available budget will always settle at a point of equilibrium. However, the first two components are intangible, so the quality of available information plays a vital role in preventing budget from dominating risk decisions. We should be able to provide a detailed and accurate picture of acceptable downtime so the business knows the risk it is taking when imposing budgetary constraints.

The interaction of budget and tolerance is reflected in the close relationship between MTPDs and recovery time objectives (RTOs). Whilst an RTO should not exceed a dependent MTPD for obvious reasons, it can drive multiple MTPDs out and take cumulative impact past acceptable tolerance for the reasons described in this article. For example, if a third-party work area contract is to be cancelled, resulting in a doubling of achievable recovery time, the entire web of RTOs and affected MTPDs should be reviewed. This can provide a persuasive argument.
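
The first part of such a review, checking that no RTO exceeds its MTPD, can be sketched as follows. All service names and hour values are hypothetical, and the doubling is applied to every service only to keep the example short; in practice only the services that depend on the cancelled contract would change.

```python
def rto_breaches(rto_hours, mtpd_hours):
    """Return the services whose RTO exceeds their MTPD, sorted by name."""
    return sorted(s for s, rto in rto_hours.items() if rto > mtpd_hours[s])


# Hypothetical values, in hours.
mtpd_hours = {"A": 48, "B": 24, "C": 72}
rto_hours = {"A": 36, "B": 12, "C": 24}

# Assumed effect of cancelling the contract: achievable recovery time doubles.
doubled_rto_hours = {s: 2 * h for s, h in rto_hours.items()}

print("Before:", rto_breaches(rto_hours, mtpd_hours))
print("After: ", rto_breaches(doubled_rto_hours, mtpd_hours))
```

This covers only the RTO-versus-MTPD test. The cumulative impact of the longer outages would still need to be compared against overall tolerance.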

Other practical checkpoints arising from this discussion are:

* If you use MTPDs, check they work together and that aggregate impact is acceptable;
* Invest time to understand the entire value chain, not just the supply side;
* If you don’t already have a practical statement of organizational impact tolerance, get one!

Balancing MTPDs against acceptable loss can be made easier using an iterative or experimental approach to test outcomes. INONI developed RES, a programmable web tool used to simulate outages and resolve the situation described in this article. Running the tool generates quantified impact profiles that reflect organisational variables and dependencies. It means we can identify sets of MTPDs and RTOs that deliver acceptable combined impacts for a set of specified scenarios.

Author: John Robinson MSc FBCI is MD of INONI Limited, a supplier of business continuity software.

• Date: 26th August 2010 • Region: UK/World • Type: Article • Topic: BC planning

Copyright 2010 Portal Publishing Ltd