Initial lessons from the Eurocontrol IT failure
- Published: Friday, 06 April 2018 08:56
On April 3rd, millions of European air travellers were affected by an IT failure at the Eurocontrol air traffic control centre. In this article, Ian Aitchison looks at what happened and what lessons can be learned from the incident.
Unlike the BA failure back in 2017, this isn’t another case of ‘someone turned it off by mistake.’ The information provided so far by Eurocontrol tells us that:
“The trigger event for the outage has been identified and measures have been put in place to ensure no reoccurrence. The trigger event was an incorrect link between the testing of a new software release and the live operations system; this led to the deletion of all current flight plans on the live system. We are confident that there was no outside interference.”
In a way, the outage should be no big deal. Yes, the Enhanced Tactical Flow Management System (ETFMS) which failed is a vital system that ensures planes can be slotted neatly into tight time windows without risk or danger. But it is only an automated part of the overall process: automation makes decisions so that things happen faster, rather than waiting on slower humans. This is nothing new; automation powers our lives.
Since this tool is so vital to the smooth running of all flights, you would expect a standby failover system to be in place, so that if the live system went wrong there would be a seamless transition to a mirrored system. You would also expect a manual 'worst case' process, which almost certainly involves little cards with handwritten flight names being stacked into plastic slots.
But there is an important message to take away from this: apparently the standby failover system didn't work. Either the failover from system 1 to system 2 hadn't been recently tested and proven, or a sequence of data and events made failover impossible. That means either the IT operations team for ETFMS failed to test vital business continuity management processes, or the range of failure scenarios they tested against didn't include the one that actually brought the system down.
So: either “oops, we didn’t test recently” or “oops, we never expected that could happen”.
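The second failure mode, an untested scenario, is the subtler one: a continuity drill should sweep a wide range of fault conditions, not just a single happy-path switch-over. A toy sketch of that idea, with entirely invented scenario names (nothing here comes from ETFMS):

```python
# Toy continuity drill: check failover behaviour against MANY fault
# scenarios rather than one. Scenario names are hypothetical.

def failover_works(scenario: str) -> bool:
    # A real drill would inject each fault into a test environment and
    # observe whether the standby takes over. Here we stub the outcome:
    # the test-to-live link is the one case nobody anticipated.
    return scenario != "test system writes to live data"

SCENARIOS = [
    "primary host power loss",
    "network partition between sites",
    "corrupt flight-plan record",
    "test system writes to live data",
]

# The scenarios where failover would NOT save you are the real findings.
gaps = [s for s in SCENARIOS if not failover_works(s)]
print(gaps)  # → ['test system writes to live data']
```

The point of the loop is that a drill covering only the first three scenarios would have reported success while leaving the fatal gap undiscovered.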
Both are bad (and good) news. Bad because this shines a light on someone's planning mistakes; good because it reminds us how totally our world is powered by automated, high-speed technology, and how important business continuity management is.
The key takeaways from this incident are:
- Automation is everywhere;
- Failover / continuity systems are essential for any system that matters to productivity or safety;
- Failover / continuity plans are worthless if not tested regularly against a very wide range of possible causes;
- Always have a manual system available. When safety is important, working fast is best, but working slowly is better than not working at all!
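The layered fallback the takeaways describe can be sketched as a simple dispatcher: prefer the live system, fail over to the mirrored standby, and as a last resort drop to the slow manual process. All function names below are illustrative, not taken from any real air traffic system:

```python
# Illustrative sketch only: try each processing path in order of
# preference. A real failover involves state replication and health
# checks; here a failed path simply raises an exception.

def handle_flight_plan(plan, live, standby, manual):
    """Return the result of the first path that succeeds."""
    for system in (live, standby):
        try:
            return system(plan)
        except Exception:
            continue  # this path is down; fail over to the next one
    # Working slowly beats not working at all.
    return manual(plan)

def broken(plan):
    raise RuntimeError("live system lost all flight plans")

def standby_ok(plan):
    return f"standby processed {plan}"

def manual_cards(plan):
    return f"manually slotted {plan}"

print(handle_flight_plan("BA123", broken, standby_ok, manual_cards))
# → standby processed BA123
```

If both automated paths raise, the dispatcher still returns a (slow) manual result rather than nothing, which is exactly the property the last takeaway argues for.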
Ian Aitchison is senior product director, ITSM, at Ivanti.