Tolerance testing for operational resilience: what’s the scenario?
- Published: Friday, 10 December 2021 09:54
Luke Bird has been thinking around the subject of scenario testing within operational resilience. In this article he looks at the way forward in this area and considers whether current business continuity management practices can offer a starting point.
Following on from a few conversations about this recently I decided to give operational resilience scenario testing some more thought:
“How do I develop a severe but plausible scenario to test that I have the right tolerance threshold for a key business service and also to demonstrate that I can operate within that tolerance?”
This is a loaded question with plenty of rabbit holes to go down. For instance, what defines a key business service, never mind how to set a tolerance for it? But let’s work on the basis that people have those two bits already figured out and they are sat thinking – ‘okay, so how do I test this?’
My brain started firing up to think about how tolerance thresholds for key business services could be tested in a meaningful way. I worry, for example, that the business continuity professional might look at their exercise program and think ‘done’. However, I believe this reaches far more broadly across the business.
But we already test loads of stuff?
There are already so many ways that an organization can test to provide management with comfort about loads of stuff, such as:
- Liquidity stress testing: for example, Basel guidance
- Risk control testing: for example, SOX control testing
- Systems and software quality assurance testing: for example, ISO 25010
- Business continuity testing: for example, ISO 22398
These are the ones I can think of just off the top of my head but I'm sure there are even more. I don't claim to be an expert in all of the above but they have been part of my experiences in my career so far when I think about testing.
This list also doesn't include things like IT failover testing, although that may be picked up in the quality assurance testing for systems and software (such as stress loading, alerting and failover). Nevertheless, these tests already demonstrate that very robust and comprehensive methods are utilised across the business for different domains. So the question is: do we really need to create something new to test a defined tolerance?
Scenario testing - mature like a fine wine...
The March 2022 compliance deadline for the UK operational resilience regulations is getting closer and, as I said in a previous article, in simple terms the organization is expected to have:
- Identified its key business services
- Identified the tolerance thresholds for those key business services
- Begun to draw up how it might test those tolerances.
I put them simply because each one of those points can explode into a mammoth number of pages. However, I will say this:
Do we know why we exist (mostly)? The way an organization goes about deciding what's important is an odd one to consider, mainly because it’s harder to do than people might imagine – but still very much possible.
As an example, PwC has widely covered the development of operational resilience in recent years. They released a short but really useful paper on broadly how to define your key business services. It discusses how the identification process will be left to organizations as a ‘matter of judgement’. The paper also flags a potential trap one could fall into by not differentiating between an internal service, business service, and transversal process.
The business continuity manager, for example, would immediately point to the business impact analysis (BIA) because it lists all the functions that, if disrupted, would impact the organization and/or customer. I believe the BIA should be used as a key data point but it shouldn't be used exclusively to decide on a key business service. The key business service has to be a tangible output to an identifiable participant. I think the BIA on its own would confuse the identification process. I wrote about this briefly in my BCI article on the Bank of England paper for operational resilience.
One might think that deciding on what is a key business service should be fairly intuitive, no? I say this because surely they make up the key reasons why the organization exists. If we don't provide x to y then we are out of the game (eventually). Maybe I'm oversimplifying but let's assume the banks have figured this one out already.
We know our thresholds and pain points (mostly). Broadly speaking, and at a very simplistic level, the history of the organization, the strategic direction, the finances, and some of the incidents/near misses that may have occurred can help to give a vague idea of the tolerances that could be defined.
The Bank of England regulations are looking for data points, e.g. number of customers affected, volumes of interrupted transactions etc. Each organization is going to have to draw a line against each of these metrics to mark how much disruption it thinks it could tolerate before it becomes non-viable.
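To make the idea of 'drawing a line' a little more concrete, here is a minimal sketch in Python. The structure, the metric names and every number in it are invented for illustration and do not come from the Bank of England or any other regulator. The only point is that each tolerance becomes a named limit that observed (or exercised) figures can be compared against.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tolerance:
    """One tolerance for a key business service (hypothetical structure)."""
    metric: str    # name of the data point, e.g. "customers_affected"
    limit: float   # the line the organization has drawn for that metric


def breached(tolerances, observed):
    """Return the metrics whose observed value exceeds the tolerance limit."""
    return [t.metric for t in tolerances if observed.get(t.metric, 0) > t.limit]


# Illustrative limits only, assumed for this sketch.
payments_tolerances = [
    Tolerance("customers_affected", 10_000),
    Tolerance("interrupted_transactions", 50_000),
    Tolerance("hours_of_disruption", 4),
]

# Figures that might come out of an incident or an exercise (also assumed).
observed = {
    "customers_affected": 12_500,
    "interrupted_transactions": 30_000,
    "hours_of_disruption": 4,
}

print(breached(payments_tolerances, observed))
```

With these made-up figures only the customer count is over the line; sitting exactly on a limit is treated here as still within tolerance, which is itself a judgement each organization would need to make and document.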
Anyway, the one I'm having trouble with just now is how do you test those tolerances?
Okay, so your organization has defined its key business services and set tolerances against them. You have a sound and documented rationale to explain how you got to those decisions. Now you are looking to prove that the tolerances are correct by testing them against severe but plausible scenarios.
This is where I run the risk of getting overexcited and trying to develop some sort of hybrid, next-level, cross-domain, data-driven scenario machine! I think I need to learn to walk before I can run…
Key word - maturity
Let’s not get ahead of ourselves: the regulations want to see us making a start on this with a view to maturing it over time. The reality is, as I've already mentioned, we already test and exercise a lot of stuff.
Is this more of a mapping and unification exercise than a new requirement?
BC Managers - Assemble! *
*Said in my best Marvel voice!
I believe this is a space (and opportunity) for the business continuity manager’s testing and exercise program to really help to meet the needs of this new requirement. A desktop exercise, for example, is already an embedded event that leadership understands has to be done (whether they like it or not). Most organizations have it in their culture so the hard bit is done. You already have a vehicle in place to facilitate the requirements.
However, one might argue that the current approach to desktop exercises does not fully cover the exam question. Does your half-day or two-hour event with leadership once a year really cover specific data-driven failures such as transaction losses to a specific point of non-viability? If it does, then well done, but I haven’t seen this done to any mature extent.
For the purposes of this article I sat and tried to think what could be adapted within what I understand to be the business continuity manager’s testing and exercise program. From my experience, running an exercise is hard enough logistically, not to mention the issue of buy-in! That being said, I did manage to pull out a few thoughts:
- Tolerance data points need to be woven into the fabric of the exercise (and documented) so you can demonstrate that you are at least trying to meet the requirement. Broad notional responses to high-level incident scenarios just aren’t going to cut it. This for me is where the work to mature is going to take place.
- Narrow down the scenario to the context of the organization and the specific tolerance. Complex organizations need to account for sector, service arm, jurisdiction, strategic goals etc.
- As the regulations suggest, initially take a risk-based approach to the most severe but plausible scenarios that could take you to the highest-impact tolerances most quickly.
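As a rough illustration of that last point, the sketch below ranks a handful of scenarios by how quickly each would reach a single tolerance. Everything in it is an assumption made for the example: the scenario names, the hourly impact rates, the transaction limit, and the simplification that impact builds up at a constant rate. A real exercise would use the organization's own data and far messier impact curves.

```python
import math


def hours_to_tolerance(limit, impact_per_hour):
    """Hours until cumulative impact reaches the limit, assuming a constant hourly rate."""
    if impact_per_hour <= 0:
        return math.inf  # this scenario never reaches the tolerance
    return limit / impact_per_hour


# Hypothetical tolerance: interrupted transactions the service can absorb.
TRANSACTION_LIMIT = 50_000

# Hypothetical scenarios: name -> interrupted transactions per hour.
scenarios = {
    "third-party payment gateway failure": 10_000,
    "data centre outage": 25_000,
    "loss of key staff": 2_000,
}

# Risk-based ordering: the scenario that reaches the tolerance soonest comes first.
ranked = sorted(
    scenarios,
    key=lambda name: hours_to_tolerance(TRANSACTION_LIMIT, scenarios[name]),
)

for name in ranked:
    hours = hours_to_tolerance(TRANSACTION_LIMIT, scenarios[name])
    print(f"{name}: {hours:.1f} hours to tolerance")
```

The ordering gives a defensible starting point for which scenario to put in front of leadership first, and the hours figure is exactly the kind of data point that can be written into the exercise injects and the post-exercise report.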
Ultimately, this specific element of the requirement is still being figured out. The regulators are leaving it to organizations to make a start. I would argue we already do a lot of this and as I mentioned earlier:
- This feels more like a mapping and unification exercise for key data points to be woven into the fabric of the desktop exercise.
- The business continuity manager’s testing and exercise program is ideally placed to facilitate and mature against this requirement.
Luke Bird FBCI CRISC is a global award-winning continuity and resilience professional with 12 years’ experience of risk management in public sector and financial services. He is currently focusing on technology. Read Luke’s blog at https://resiliencerewire.co.uk