IT disaster recovery, cloud computing and information security news

As cloud becomes a mainstream part of organizational infrastructure, any failure of a cloud service becomes a business continuity issue. Rob Strechay summarises the impacts of cloud downtime and what organizations can do to prepare for it, calling for a ‘resiliency-in-layers’ approach.

AWS. IBM Cloud. Microsoft Azure. What do they have in common? Yes, they’re all public clouds, but something else binds them: they’ve all experienced a major outage or service interruption with real consequences for customers.

This is not to say they should be called out and shamed over whose failure was worst. The fact remains that technology, in all its various forms and offerings, is inherently vulnerable given the massive complexity of virtualized IT environments.

While it’s possible to look back and calculate the effects – in lost revenue, lost productivity, customer churn, or other measures – one thing is certain: no business can afford an outage in today’s ultra-competitive environment. These outages highlight a stark reality: as cloud continues to evolve, serious challenges remain to its resilience against unforced errors.

What does this mean for customers? It means you must have ‘resiliency-in-layers’ in your cloud infrastructure. Think about how you can enable an organization to avoid disruptions quickly, minimizing the impact on end-users and customers to the point that they are unaware a disruption even occurred. Avoiding business impact needs to be the main goal of a layered approach.

The effects of downtime

The effects of the AWS S3, Microsoft Azure, and other outages have a shelf life and will fade from view for a time. But in a world where we increasingly rely on IT capabilities, much of them supported in the cloud, this shift is fraught with vulnerability. Some businesses are more vulnerable than others, and not having resiliency-in-layers creates exactly this type of vulnerability.

At the time of the AWS outage, some organizations, including the Harvard Business Review, found their entire website, and consequently their business, offline. Others such as Yahoo, Adobe, Netflix, Instagram, and even Apple’s cloud experienced problems with a direct effect on their business. The outage even affected smart homes and IoT services: parking meters and thermostats were reported to have issues directly linked to S3. ‘Born in the cloud’, Mode 2 business applications, with all their eggs in one basket, will find it hardest to add these multiple layers, because they are designed in place rather than for portability. Many customers are now looking at how to build a more hybrid cloud, leveraging a managed service provider or their own data centres.

The cost of downtime is something that many businesses cannot afford. For smaller businesses especially, the financial hit can be too great and the business can fail. For large companies, the reputational hit can be just as damaging. But with a plan for resiliency-in-layers, it is possible to avoid both.

Disaster recovery plans are good, but must be tested

Having a disaster recovery plan is not enough. It is imperative that businesses make resiliency-in-layers part of their business continuity stance. This means examining your vendors, locations, and technologies to understand how to make them heterogeneous. Why? Diversity adds to the resiliency-in-layers effect by preventing one action, activity, bug, or catastrophic event from impacting the rest of the business environment. You must also test the plan regularly, so that when an incident occurs you avoid downtime because the automation ‘muscles’ are already built into the plan. It is therefore critical that organizations implement disaster recovery strategies that are easy to test often, at all layers of the infrastructure stack.

A dirty but widely known secret within the IT industry is how often disaster recovery tests fail because of unscalable, complicated manual processes and incompatible technologies. Because of this shortcoming, many organizations either don’t conduct regular tests, fool themselves because a domain controller came up and they could ping it, or – worse – don’t test at all while maintaining the illusion of coverage through shelfware that is never actually deployed.
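To make that point concrete, a DR test should verify application-level behaviour, not just that a host answers a ping. A minimal sketch of such a check, in Python; the service names and checks here are hypothetical, not part of any particular product:

```python
def verify_failover(checks):
    """Run application-level checks against a recovered site.

    `checks` maps a service name to a callable that returns True only if
    the service can do real work (serve a page, answer a query) - a
    successful ping proves very little on its own.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A crashing check counts as a failed recovery, not a passed test.
            results[name] = False
    return results

# Illustrative checks; in practice these would hit real endpoints:
checks = {
    "web": lambda: True,       # e.g. HTTP 200 from the recovered front end
    "database": lambda: True,  # e.g. a test query against the replica
}
print(verify_failover(checks))  # → {'web': True, 'database': True}
```

Because every check is a plain callable, the same harness can be run after every automated failover test rather than only during an annual exercise.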

A successful disaster recovery infrastructure needs to be highly automated and must continuously replicate data, allowing applications to be quickly ‘rewound’ to the seconds just before an outage. It must meet the recovery point objectives defined by the business, with little or no data loss and minimal loss of application availability.
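As a simple illustration of the recovery point objective idea, one can check whether the newest replicated point falls within the RPO the business has defined. The timestamps and thresholds below are illustrative only:

```python
from datetime import datetime, timedelta

def rpo_met(last_replicated: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if the newest replicated recovery point is within the RPO."""
    return (now - last_replicated) <= rpo

now = datetime(2017, 3, 1, 12, 0, 0)

# Continuous replication typically keeps the journal only seconds behind:
print(rpo_met(now - timedelta(seconds=8), now, timedelta(seconds=30)))   # → True

# A nightly backup alone cannot meet a seconds-level RPO:
print(rpo_met(now - timedelta(hours=10), now, timedelta(seconds=30)))    # → False
```

The second case is why ‘rewinding’ to seconds before an outage requires continuous replication rather than periodic backups.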

IT professionals and companies need to build and adopt tools and platforms with redundant, scalable, simple, and testable disaster recovery processes. The quicker a company can recover data, the less effect an outage will have on the business, with significant cost and time savings realized.

IT resilience – the case for hybrid cloud

For many different reasons, IT leaders have their own preferences and organizational requirements. Some face compliance challenges; others may have data locality issues. For this reason, disaster recovery plans can seem as unique as a fingerprint in how they are built, maintained, and where they recover to. While IT is clearly moving towards cloud-based infrastructures, the focus of this trend must be the ability to survive every permutation of a disaster.

Though each element within hybrid cloud has its own strengths and weaknesses, this raises a larger question: what is the best way to manage against technology service disruptions? Here are three pillars that help enterprise-class organizations achieve IT resilience using cloud-based disaster recovery infrastructure software:

  • You must have resiliency-in-layers, meaning one or more geographically and meteorologically diverse, off-premises recovery data centers. This ensures that, should anything happen to your primary site, you always have a redundant location to reduce the risk of an extended outage. The potentially high capital cost of building or renting data center space needs to be weighed. Still, some larger enterprises with strict compliance mandates, such as those in financial services or healthcare, must have such a facility for regulatory reasons alone.

  • Using a managed service provider (MSP) or cloud service provider (CSP). This switches the financial model to OpEx and lets you leverage the provider’s experts and infrastructure, with a contractual obligation to deliver on the defined service level agreement (SLA). What you give up, in some cases, is granular day-to-day administration and control. Many of my customers ask themselves, “Is my company in the data center and IT business, or the business of healthcare?” (for example).

  • Dip your toe into public cloud infrastructure. Many of our customers roll their own or leverage MSP / CSP partners to ‘test drive’ public cloud as a second or third site. Businesses must understand and match their data and application priorities with the associated recovery targets and SLA requirements. In a ‘roll your own’ situation, you are still on the hook for the SLA, which is why we see many companies leveraging an MSP / CSP. A major advantage is the ability to leverage tremendous scalability.
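The priority-matching idea in the last point can be sketched as a simple tiering table. The tier names, targets, and numbers below are hypothetical; real values come from each business’s own impact analysis and SLAs:

```python
# Hypothetical tiers mapping application priority to a recovery target and SLA.
TIERS = {
    "tier-1": {"target": "secondary data centre", "rto_min": 15,  "rpo_sec": 10},
    "tier-2": {"target": "MSP / CSP",             "rto_min": 60,  "rpo_sec": 300},
    "tier-3": {"target": "public cloud",          "rto_min": 240, "rpo_sec": 3600},
}

def recovery_plan(app_tiers):
    """Map each application to the recovery target and SLA for its tier."""
    return {app: TIERS[tier] for app, tier in app_tiers.items()}

plan = recovery_plan({"billing": "tier-1", "intranet": "tier-3"})
print(plan["billing"]["target"])  # → secondary data centre
```

Making the tiers explicit forces the conversation about which applications justify a dedicated second site and which can recover more slowly to cheaper public cloud capacity.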

While the most public cloud outages demonstrate that cloud is not immune to catastrophes, looking at public cloud as part of your resiliency-in-layers plan can be a cost-effective way to add a third or further site. Add some geographical and meteorological diversity to your plan. Augment and leverage the expertise of a managed service provider to help achieve your SLAs. And have the right answer when the CEO asks, “What are we doing in the cloud?”

The author

Rob Strechay is VP of product at Zerto.
