Lessons that can be learned from Delta’s disaster recovery failure
- Published: Monday, 15 August 2016 07:08
We can learn a number of business continuity lessons from the recent system outage at US airline Delta, which caused more than 2,100 flights to be cancelled over a three-day period and delayed thousands of others, says Peter Groucutt, managing director of Databarracks.
Investigations have revealed that the outage occurred when a critical power control module at the airline's technology command centre malfunctioned, causing a surge to the transformer and a loss of power. While power was stabilised and restored quickly after the malfunction, some critical systems and network equipment didn't switch over to backups when others did.
Delta CEO Ed Bastian has said that the airline takes full responsibility for the failures and has suggested past tech investments may not ‘have been in the right place’.
“There are a number of continuity lessons we can all learn from the Delta incident. Power failure is one of the most common causes of disruption and one of the most common reasons for failing over to a secondary site, so the response to it should be tested regularly. Uninterruptible power supplies (UPS) and generator start-times need to be tested regularly, and the actual failover of IT systems must be tested at least annually.
“Looking at Delta, it’s been reported that some systems failed over but others didn’t. In itself this isn’t necessarily a bad thing. Recovery of a sub-set of systems is common. The problem comes when an incident like this forces a partial recovery of systems that has not been planned and tested, as we assume was the case here. The individual systems might be operating perfectly on their own, but data transfer bottlenecks, for instance, can limit the performance of the systems as a whole.”
Groucutt continued: “For air travel, a significant implication of downtime is the knock-on effect. The cancellation of a single flight affects availability of aircraft and crew for the remainder of the fleet. This may not be as pronounced for other businesses, but everyone needs to be prepared for the extra hard work needed to catch up on the lost hours.
“During a disaster invocation you have several competing demands. Firstly, you need to invoke disaster recovery in order to carry on servicing your customers. Secondly, you need to fix the problem that caused the invocation, in this case the power control module and transformer. Finally, you need to fail back from your secondary systems to your primary. In the case of Delta, couple this with the need to work around air traffic control schedules and crew availability, and it is easy to see how this escalated so quickly.
“The lesson for other businesses here is the need to pull in additional capacity to assist during an incident. It is critical to have pre-planned relationships with suppliers. That could be a supplier who can maintain and replace faulty power equipment, one who can provide additional support staff to help run the business when it is running over normal capacity, or recovery and continuity professionals who can help recover back to primary systems. If rehearsed and done correctly, the impact of this on your business and customers can be kept to a minimum,” Groucutt concluded.