Doron Pinhas looks at the common factors behind various high-profile technology outages in 2018 and proposes a practical approach which will help organizations reduce unplanned downtime in 2019.
Flying these days is almost never a pleasure, but in 2018 it was a downright nightmare, with dozens of glitches and outages that kept planes grounded. 2018 wasn't a great year for other industry sectors either. Financial services customers also had a rough year accessing their funds and performing urgent financial transactions. In the UK, for example, banks experienced outage after outage. Three of Britain's biggest banks - HSBC, Barclays and TSB - all experienced outages on a single day, making online banking impossible, and there were dozens of other incidents peppered throughout the year.
And if your business lives on cloud platforms and SaaS, you might have found yourself run ragged at times trying to access your IT, as all of the major cloud platforms suffered outages throughout the year as well.
It may be 2019 now, but the fundamental gaps that led to those service disruptions haven't been resolved, so we can expect more such outages this year, and probably every year until companies figure it out – which, if you’re a business continuity or IT professional, raises the question: what should I do to avoid outages?
Simple and logical as it is, this question, quite surprisingly, wasn’t really asked until recently. Even when improving resilience was made a business priority, the discussion rarely received the attention it deserved. The dichotomy is especially puzzling when you weigh the staggering global losses due to downtime (1) against the fact that the majority of businesses admit their IT resiliency is not sufficient (2). But asking the right question is the first step in attacking any problem.
To prepare the ground for an effective discussion on outage avoidance, it may be useful to first linger on why and how outages occur in the first place.
Companies that comment on the reasons for their outages generally attribute them to something generic, such as an upgrade gone awry, a hardware failure, or the notorious ‘technical glitch’. While often factually correct, these are almost never the real reasons, but rather the mere triggers (and cynics might say ‘IT scapegoats’, since modern IT is designed and built with complete redundancy). The more realistic theme is that of loss of control over design, operations and process.
Resilience today is a very complicated matter. A large enterprise will use multiple technology layers (myriad hardware platforms, OS, virtualization, cloud, etc.), and have hundreds of changes a day (upgrades, expansion, patching, etc.) carried out by multiple teams (and humans are not perfect communicators). To add to this complexity, best practices are constantly changing, and each application or system has its own, so there’s always a serious knowledge gap. With such entropy, even a perfectly engineered environment (and very few are) would quickly become cluttered with an unknown quantity of misconfigurations and single points of failure. It is at that point that a failure of just one component, or a single miscalculated update, can bring the entire environment down.
So, the real questions, as I see them, have more to do with control: what (rather than if) are our technology misconfigurations, and how can we proactively address them? Do the resiliency plans that are ‘on the books’ effectively answer business requirements (minimum loss and minimum downtime)?
Building controls, and, for that matter, effective improvement processes, is not a foreign concept in business. It takes deliberate dedication, definition of KPIs, and a closed feedback-loop of measurement and improvement.
When it comes to improving IT resilience, there are a lot of details to consider and process. Ideally, we’d like to be able to form a unique resilience index, or scorecard, per business object, encapsulating more fine-grained quality and risk metrics. Tracking the resilience index over time will allow measuring improvement, as well as understanding the impact of each individual change (what’s working well for us, vs. what doesn’t, and why). As we continue breaking down the measurement process, we should start looking into defining more detailed Service Level Objectives for each business function (for example, does it need hardware redundancy? Does it need to withstand a site failure? What kind of data protection scheme is needed? How quickly should we be able to rebuild it after a successful cyber-attack? etc.). Finally, we’d like to be able to validate that critical vendor recommendations were successfully implemented in each technology layer.
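As a rough illustration of the scorecard idea, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Check` and `Scorecard` names, the pass/fail checks, the weights, and the 0 to 100 scale are hypothetical and not taken from any particular product or standard.

```python
from dataclasses import dataclass, field


@dataclass
class Check:
    """One pass/fail resilience check for a business service (hypothetical)."""
    name: str
    passed: bool
    weight: float = 1.0  # relative importance; values here are illustrative


@dataclass
class Scorecard:
    """Collects checks for one business service and derives a single index."""
    service: str
    checks: list[Check] = field(default_factory=list)

    def index(self) -> float:
        """Weighted share of passed checks, on a 0-100 scale."""
        total = sum(c.weight for c in self.checks)
        if total == 0:
            return 0.0
        earned = sum(c.weight for c in self.checks if c.passed)
        return 100.0 * earned / total

    def gaps(self) -> list[str]:
        """Names of the checks that currently fail."""
        return [c.name for c in self.checks if not c.passed]


# Example service with four assumed checks mirroring the questions above.
payments = Scorecard("payments", [
    Check("hardware redundancy", True, weight=2.0),
    Check("survives site failure", False, weight=3.0),
    Check("data protection meets objective", True, weight=3.0),
    Check("vendor recommendations applied", True, weight=2.0),
])

print(f"{payments.service}: {payments.index():.1f}")
print("gaps:", payments.gaps())
```

Recomputing the index after each change window and keeping the history is one simple way to see which changes improved resilience and which eroded it.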
This is by no means a complete blueprint, but rather a broad-brush sketch of an effective program. That’s not to say it’s just a theoretical concept. In fact, I’ve seen this very same approach successfully utilized by multiple organizations in recent years with dramatic results. Improvements in IT reliability translate directly into reduced downtime and lower associated business losses. Being proactive and predictive, of course, has many other benefits: improved agility, reduced firefighting, increased trust (of both customers and shareholders), and more.
One can only hope that increased awareness, and deliberate efforts to improve resilience, can make 2019 a year to anticipate instead of fear.
Doron Pinhas is CTO of Continuity Software
(1) Research indicates that in North America alone, the yearly cost exceeds $700B USD: https://news.ihsmarkit.com/press-release/technology/businesses-losing-700-billion-year-it-downtime-says-ihs
(2) According to data, only 7.4% of large companies rate themselves as mature for resilience, while nearly half have suffered an unrecoverable data event in the last three years. And 83.8% have experienced some type of business disruption because of an outage.