Improving your IT resilience and disaster recovery capability
- Published: Thursday, 15 December 2016 08:53
In this detailed article, Bob Draper FBCI provides guidance on the effective implementation and maintenance of resilience and disaster recovery capability for IT systems. The guidance can be scaled to business organizations of all sizes. This paper is intended for readers who are familiar with business continuity management (BCM) and disaster recovery (DR) and the processes to develop and manage related policies, standards and procedures.
Business continuity management policy
Many organizations will have a business continuity management policy for ensuring that the business can recover from a severely disruptive incident. An effective policy should clearly define the roles and responsibilities of those charged with its oversight and implementation, and the minimum requirements for compliance, including levels of IT resilience to disruption and disaster recovery capabilities.
A well designed and well defined policy will be supported by effective and tested procedures and arrangements that are in place to enable the organization to respond to, continue through and come out of any incident that may severely disrupt its ability to provide the normal level of service to its customers and stakeholders. These procedures and arrangements are the basis of the organization’s IT resilience.
IT resilience is defined as an organization’s ability to maintain acceptable service levels through, and beyond, severe disruptions to its critical processes and the IT systems which support them.
By focusing on the areas of awareness, protection, discovery, preparedness, recovery, review, and improvement, an organization will minimize the potential impact of disruptions to its IT services. In today’s highly competitive business environment, such disruptions could be extremely costly, possibly to the point of complete business failure. These areas are key to effective IT resilience. None can be taken in isolation; they all overlap at some point in the overall process.
Awareness means knowing the normal business requirements for operational functionality; any dependencies that exist; the criticality of IT system components and elements; and the minimum acceptable operational levels. There must also be an awareness of the recovery requirements in terms of time, system capacity and performance in the event of severe disruption to, or failure of, the IT systems supporting the business processes. These should be identified by an effective business impact assessment / analysis (BIA).
Protection is more than having physical and system access security controls. It can also mean reducing the risk of system failure, e.g. removing single points of failure (SPoF) by having load balancing servers or redundant systems or components. Potential exposures to systems deemed to be critical to business processes should be identified and addressed as a priority.
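As a rough illustration of the single point of failure check described above, the sketch below flags any component role that is served by only one instance. The inventory format, the function name `find_spofs` and the sample roles and hosts are all hypothetical; a real review would draw on the organization's own asset records.

```python
from collections import Counter


def find_spofs(inventory):
    """Return the roles that are served by exactly one instance.

    `inventory` is a list of (role, instance_name) pairs; this format is
    an assumption made for illustration only.
    """
    counts = Counter(role for role, _ in inventory)
    return sorted(role for role, n in counts.items() if n == 1)


# Hypothetical inventory: two load-balanced web servers, one database.
inventory = [
    ("web", "web-01"),
    ("web", "web-02"),
    ("database", "db-01"),
]
print(find_spofs(inventory))
```

Counting instances is only a first pass: two servers that share one power feed or one network link are still exposed, so the same check would need to be repeated for each supporting layer.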
Discovery means that the quicker the IT team knows that a system has been disrupted, the sooner they can resolve the problem. Effective alerting enables the IT group to understand and address problems before they result in severe disruption.
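One simple form of discovery is a heartbeat check: each system reports in periodically, and any system that has been silent for too long raises an alert. The following is a minimal sketch under that assumption; the system names, timestamps and the two-minute threshold are invented for the example.

```python
from datetime import datetime, timedelta


def stale_systems(last_heartbeat, now, max_silence):
    """Return names of systems whose last heartbeat is older than max_silence."""
    return sorted(
        name for name, seen in last_heartbeat.items()
        if now - seen > max_silence
    )


# Hypothetical heartbeat records for two systems.
now = datetime(2016, 12, 15, 9, 0)
last_heartbeat = {
    "payroll": now - timedelta(seconds=30),
    "email": now - timedelta(minutes=10),
}
alerts = stale_systems(last_heartbeat, now, timedelta(minutes=2))
print(alerts)
```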
Preparedness means having detailed plans for addressing the effects of a disruption, such as having seamless failover of systems and components, enabling essential business processes to continue to function with no, or an acceptable minimum, break of service.
Recovery focuses on returning services and operations to business-as-usual levels within defined timescales, and with minimal acceptable data loss, following an event causing disruption or failure. This will only be achieved by having in place an effective and tested recovery plan which meets the business requirements.
Review is essential to every IT resilience programme, and includes post-incident reviews to identify the root causes of disruptions. It is a continual process which aims to enable the IT team and the business to understand potential issues and to assess and implement preventative actions to remove, or at least mitigate, the risk of severe disruption.
Improvement is the process of using the knowledge gained from all of the above to improve systems and increase resilience, and to continuously refine disaster recovery and business continuity plans.
It should be noted here that most, if not all, of the information required for the above to be achieved successfully will come from effective business impact assessment / analysis and risk assessment.
IT resilience and DR considerations
The ISO/IEC 27031:2011 standard recommends six main categories to be considered when formulating an IT DR strategy:
Key competencies and knowledge: What information is necessary to run critical IT services? Is it in-house or does it sit solely with a service supplier, or is it a mix of both? How can this information be incorporated into the organization’s ‘knowledge bank(s)’ and be made available in the event of a severely disruptive incident or event requiring IT disaster recovery processes to be activated?
Facilities: What are the criteria that installations and infrastructure should meet to minimize the risk of failure or severe disruption and eventual recovery? Where should such facilities be located?
Technology (systems): Which systems are most important to the organization’s business? Have recovery requirements been identified, e.g., RTO (recovery time objective), RPO (recovery point objective), or dependencies on other systems?
Data: Has the data required to restore / resume business activities, and the timescales within which it must be available, been identified? Note that there may be different RTOs and RPOs for IT services and data. The recovery, resumption or implementation of security controls to secure the data must also be considered.
Processes: Which processes are in place to deal with an incident or disaster, and how do they make the topics outlined above combine to deliver the required, and defined, business services?
Suppliers: Which service suppliers are critical to IT continuity, and how do they ensure that they can support the organization’s recovery and business continuity requirements? Are these service suppliers, in turn, dependent upon the effective responses from other third parties, internal or external to their organization?
IT recovery strategies
It makes good business sense to develop and maintain recovery strategies for IT systems, applications and data. Recovery strategies must address the above IT resilience and disaster recovery considerations and include all the elements that make up each system, e.g. networks, servers, desktops, laptops, wireless devices, data and connectivity.
Priorities for IT recovery must be consistent with the priorities for recovery of critical business functions and processes that have been identified by an effective BIA.
IT resources required to support critical business functions and processes must also be identified. The recovery time for an IT resource should be commensurate with the recovery time objective (RTO) for the business function or process that depends on that IT resource. The RTO is the time within which a business process must be restored, and a stated minimum level of service / functionality achieved following a disruption, to avoid unacceptable consequences associated with disruption to that service. For each system, this must be identified by the business area which is the ‘owner’/prime user of that system via the BIA.
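The requirement that each IT resource recovers within the RTO of the business process it supports can be checked mechanically once the BIA figures are recorded. The sketch below is one possible way to do so; the process and resource names, the figures and the function name `rto_gaps` are hypothetical.

```python
from datetime import timedelta


def rto_gaps(process_rto, resource_recovery, depends_on):
    """List (process, resource) pairs where the resource recovers too slowly."""
    gaps = []
    for process, resources in depends_on.items():
        for resource in resources:
            # A gap exists when the resource takes longer to recover
            # than the RTO of the process that depends on it.
            if resource_recovery[resource] > process_rto[process]:
                gaps.append((process, resource))
    return gaps


# Hypothetical BIA output: one process depending on two IT resources.
process_rto = {"order processing": timedelta(hours=4)}
resource_recovery = {
    "order database": timedelta(hours=2),
    "warehouse link": timedelta(hours=8),
}
depends_on = {"order processing": ["order database", "warehouse link"]}
print(rto_gaps(process_rto, resource_recovery, depends_on))
```

Here the warehouse link would be reported, because its eight-hour recovery time exceeds the four-hour RTO of the process that relies on it.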
The recovery point objective (RPO) should also be identified by the BIA process. The RPO is the maximum tolerable loss of data from an IT service due to a major incident. In multiple system environments, the overall recovery point must be that which returns collaborating systems to a consistent, synchronised, state.
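To illustrate the point about collaborating systems, the sketch below assumes that each system can only be restored to one of its own snapshot times, so the overall recovery point is the latest time at which every system has a snapshot. This is a simplification made for the example, and the system names, timestamps and helper functions are hypothetical.

```python
from datetime import datetime, timedelta


def consistent_recovery_point(snapshots):
    """Return the latest snapshot time shared by every system, or None."""
    if not snapshots:
        return None
    common = set.intersection(*(set(times) for times in snapshots.values()))
    return max(common) if common else None


def within_rpo(recovery_point, incident_time, rpo):
    """True if the data lost since the recovery point is within the RPO."""
    return incident_time - recovery_point <= rpo


def at(hour, minute=0):
    # Helper for readable sample timestamps on one hypothetical day.
    return datetime(2016, 12, 15, hour, minute)


snapshots = {
    "orders": [at(6), at(7), at(8)],
    "billing": [at(6), at(8)],
    "ledger": [at(6), at(7)],
}
point = consistent_recovery_point(snapshots)
print(point, within_rpo(point, at(8, 30), timedelta(hours=4)))
```

Although the orders system has a snapshot from 08:00, the ledger does not, so the consistent recovery point falls back to 06:00. It is this figure, not the most recent individual backup, that has to be compared with the RPO.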
In addition, it is always helpful to have identified the maximum tolerable outage (MTO) which is the maximum period of time that critical systems services can be unavailable or undeliverable following severe disruption or failure, after which the consequences are unacceptable or intolerable.
IT systems require hardware, software, data and connectivity. Without any one of these components, a system may not be able to properly support the business operations for which it has been designed.
Recovery strategies should be developed to anticipate the failure, or loss, of one or more of the following system components:
- Physical environment (data center building; computer rooms; facilities; utilities)
- Hardware (servers; desktop and laptop computers; wireless devices and peripherals)
- Connectivity (network links; equipment and services)
- Systems software (computer operating systems)
- Middleware (platform services, e.g., web servers or application services)
- Enabling software (shared central applications, such as electronic mail)
- Applications (data processing) software
IT disaster recovery planning (IT DRP)
Disaster recovery planning is the ongoing process of planning, developing, implementing, and testing disaster recovery management procedures and processes to ensure the efficient and effective resumption of critical functions in the event of an unscheduled interruption which might cause severe disruption.
A disaster recovery plan can only be effective if system dependencies have been identified and accounted for when developing the order of recovery, establishing recovery time and recovery point objectives and documenting the roles of required personnel. The source of this information should, again, be the BIA.
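Once system dependencies are known, an order of recovery can be derived from them by a topological sort, so that every system is restored only after the systems it depends on. The following is a minimal sketch using Python's standard `graphlib` module (Python 3.9 or later); the system names and dependencies are hypothetical, and a real plan would also weigh business priority and RTO.

```python
from graphlib import TopologicalSorter


def recovery_order(depends_on):
    """Return systems in an order where dependencies are recovered first.

    `depends_on` maps each system to the set of systems it requires.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical dependencies, as might be recorded during a BIA.
depends_on = {
    "network": set(),
    "database": {"network"},
    "application server": {"database", "network"},
    "web front end": {"application server"},
}
order = recovery_order(depends_on)
print(order)
```

A circular dependency makes the sort fail with an error, which is useful in itself: it exposes a pair of systems that cannot be recovered independently and needs to be resolved in the plan.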
Roles and responsibilities
Each area within the business, as ‘owners’ of each system used to support their processes and services, should ensure that suppliers and hosts of IT services, both internal and external, are aware of the priority of each system and its criticality within each business area’s processes.
The organization’s IT group should ensure that systems provided in-house, or hosted by IT service providers, comply with the organization’s resilience and DR capability standards and continue to meet business requirements.
IT service providers must be responsible for ensuring that their processes and procedures comply with the client’s requirements and standards, as specified in contracts and service level agreements (SLAs).
To ensure effective implementation of the BCM policy in terms of IT resilience and recovery and continuity of service to the required level, the organization should instigate a programme of reviews to establish the IT business continuity / disaster recovery (BC/DR) resilience of systems supporting its business operations, functions and processes. The reviews should be conducted at regular intervals by a group with the required professional knowledge and experience; to ensure objectivity, this group should not be the internal IT function.
Reviews should be structured to assess the capability and resilience of IT systems or services supporting business areas or processes, rather than individual IT systems. Reviews may highlight potential exposures which, in the event of an incident causing severe disruption to IT services, could delay or, in a worst-case scenario, prevent the recovery of critical business processes, functions or services, with potential financial and/or reputational damage to the organization.
The review process should include evaluation of the processes, policies and procedures related to preparing for recovery or continuation of technology infrastructure, systems and applications, following an incident that may cause severe disruption to, or failure of, essential services, from whatever cause.
The outcomes of each review should include agreement on corrective actions intended to improve the resilience of critical IT systems against severe disruption, with responsibilities for each. The purpose of corrective action should be to remove identified potential exposures, or at least reduce them to an acceptable level of risk, to improve IT resilience.
To provide a level of consistency for the review process, the organization should establish a set of standards and guidelines to which its IT systems must conform, and which will also provide information to system designers and developers as to the resilience and DR capabilities that are required.
These standards and guidelines will apply to the review of the resilience and recovery capability of critical IT infrastructure, systems and applications and other IT services which are deemed essential to the business. They must apply to IT services that may be operated in-house, or supplied by contracted and approved third parties.
Why standards and guidelines?
In simple terms, standards and guidelines provide a means to achieve order in a given context. Here, they define the criteria against which the resilience of IT systems supporting business processes and functions can be reviewed with consistency.
In a large and complex organization, standards can address, for example, the needs for interconnection and interoperability, especially where there is a mix of equipment and services.
Effective use of well thought out standards provides a solid foundation upon which to develop new, and enhance existing, practices, and helps to raise user confidence.
What is the difference between a standard and a guideline?
Standards, in this context, are the criteria with which a system must comply, when reviewed, to be considered resilient. Guidelines describe the level of information that should be available to measure the degree of resilience.
To answer the question, it's good to think in terms of ‘above and below the line’:
A standard is a ‘must have/must do’ item and is ‘above the line’. Requirements will be stated using terms such as must/will/shall. E.g. The organization must have an IT disaster recovery plan that is part of, or feeds into, the overall business continuity plan process. There are no grey areas for standards; if something is not as specified in the standard then it is deemed to be non-compliant for the IT DR resilience review purposes.
A guideline provides a degree of flexibility in compliance and is ‘below the line’. The requirements here will be stated using such terms as should/may/could. E.g. The following areas should be considered in the BIA review… (with detail).
This is not to say that guidelines can be ignored. They are advisory statements and non-compliance or non-inclusion must be justified when a review or audit is carried out.
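The ‘above and below the line’ distinction can be expressed as a simple review rule: an unmet standard is always a non-compliance, whereas an unmet guideline is only a finding if no justification has been recorded. The sketch below shows one way this might be encoded; the requirement identifiers, the data layout and the wording of the findings are hypothetical.

```python
def review_findings(requirements, evidence):
    """Return (requirement_id, finding) pairs for a resilience review."""
    findings = []
    for req in requirements:
        item = evidence.get(req["id"], {})
        if item.get("met"):
            continue
        if req["level"] == "standard":
            # Standards are 'must' items: no grey areas.
            findings.append((req["id"], "non-compliant"))
        elif not item.get("justification"):
            # Guidelines are 'should' items: non-compliance must be justified.
            findings.append((req["id"], "justification required"))
    return findings


# Hypothetical requirements and review evidence.
requirements = [
    {"id": "DR-1", "level": "standard"},
    {"id": "BIA-3", "level": "guideline"},
    {"id": "BIA-4", "level": "guideline"},
]
evidence = {
    "DR-1": {"met": False},
    "BIA-3": {"met": False, "justification": "No third-party hosting in scope"},
}
findings = review_findings(requirements, evidence)
print(findings)
```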
What are the recommended standards and guidelines?
The following is a list of the issues which are the minimum considerations for inclusion in any organization’s IT resilience and DR standards and guidelines.
The list below shows the topics only; for more information, and/or support in developing your organization’s standards and guidelines, contact Austin Risk Consultants.
IT disaster recovery
- IT disaster recovery plan
- System criticality and recovery objectives
- Testing and Review
- IT disaster recovery plan content
- IT service providers: disaster recovery plan information
- New systems
Business impact assessment
- Business impact assessment evidence
- Business area objectives/priorities
- Information Required
- IT systems planning
- Risk assessment evidence
- Risks and exposures
- Data backup
- Systems software backup
- Systems management & infrastructure services
- Network resilience
- IT processing locations/data centers
- Recovery site(s) location(s)
- Primary and secondary site security controls
- Recovery system management
- Recovery capabilities
- Service supplier internal governance
- Service supplier contracts/SLAs
- Single points of failure (SPoFs)
- System performance & capacity review
- Power supplies
- Resilience of other utilities
- System software controls
- Applications software controls
- Data security
- Compute and storage resilience
Although ISO/IEC 27031 is a very good place to start for anyone wanting to find out more about business continuity standards, a comprehensive source of information about international business continuity legislation, regulations and standards which may be relevant to your country is available from the Business Continuity Institute (registration required).
Having a business continuity management policy is not enough, even if you have supporting procedures and recovery and continuity plans. If your IT systems are important, and it would be a big surprise in this day and age if they were not, then due diligence dictates that you ensure that they are resilient. This means implementing review and risk mitigation programmes. As stated previously, reviews should be carried out by a group outside of the area directly responsible for IT systems, to retain objectivity. Risk mitigation activities are, of course, the responsibility of the area that owns the identified risk, and should be at the appropriate level to meet the organization’s risk appetite.
Principal author: Bob Draper FBCI, Associate Partner, Austin Risk Consultants. Assisted by Chris Childs MBCI, Darryl Paul MBCI, Mark Evanson
Chris Alvord, Founder Austin Risk Consultants, ISO 22301 Lead Auditor, ISO 27001 Inside Auditor, CBCP, MBCP, OCEG GRC
Paolo Cannone, Associate Partner Austin Risk Consultants, ISO 27001, ISO 22301, ISO 31000, ISO 2000, BS 10012, CSA STAR
Jayne Howe, Associate Partner Austin Risk Consultants, FBCI, MRP, CBRM, ISO 22301 Lead Implementer, ISO 22301 Lead Auditor
Don Stewart, Associate Partner Austin Risk Consultants, PMP, MBCP, MBCI
Bud Telchin, CBCP, Senior Consultant with Telchin Associates.
The original version of this article can be downloaded from www.austinriskconsultants.com/downloads