Abstract:
One of the main tasks of IT Business Continuity Planning (BCP) is to guarantee that incidents affecting the IT infrastructure do not disrupt the availability of IT-dependent business processes beyond a given acceptable extent. Carrying out BCP for information systems is particularly challenging, because it has to take into account the numerous interdependencies between the IT assets typically present in an organization. In this paper we present a model and a tool supporting BCP auditing by allowing IT personnel to estimate and validate the Recovery Time Objectives (to be) set on the various processes of the organization. Our tool can be integrated into COBIT-based (Control Objectives for Information and related Technology) risk assessment applications. Finally, we argue that our tool can be particularly useful for the dynamic auditing of the BCP.
Introduction:
Business Continuity (BC) is the discipline supporting an organization in coping with disruptive events that may affect its IT infrastructure. The goal of BC is to guarantee that, after an incident, the infrastructure recovers operations within a predefined time. This is achieved by carrying out a Business Continuity Plan (BCP), which is part of the Risk Mitigation phase of the Information Risk Management process. In general, Risk Mitigation (RM) consists of developing and implementing a strategy to manage potentially harmful threats to the information systems. Since risk cannot be completely avoided because of financial and practical limitations, RM (and BCP as well) includes the evaluation and the conscious acceptance of a residual risk.
BC is quickly becoming a best practice among enterprises and organizations, also due to recent legislation such as the Sarbanes-Oxley Act (SOX) of 2002 and the Basel II accord [2], which explicitly require it. Until recently, no widely agreed methodology was available for carrying out a BCP. The BS25999 standard [7], published in 2006 by the British Standards Institution, has changed this situation by providing guidelines to understand, develop and implement a BCP, and it aims to become the standard methodology. Notably, BS25999 requires an organization to (1) identify the activities/processes supporting the core services used by the organization, (2) identify the relationships/dependencies between activities/processes, and (3) evaluate the impact of the disruption of the core services/processes previously identified (during the Business Impact Analysis, BIA).
One of the main goals of any BCP is ensuring that crucial business processes recover from disruption within a predefined Maximum Tolerable Period of Disruption (MTPD). The MTPD expresses the maximum downtime acceptable to guarantee business continuity. As expected, the MTPD depends heavily on the organization's business goals; it is therefore defined on the business processes, and determined by the business unit.
Since business processes typically depend on a variety of underlying IT assets, the MTPD has a direct and indirect impact on the maximum downtime these assets may exhibit in practice. Indeed, the standard technical means to realize a given MTPD is to define Recovery Time Objectives (RTOs) on all IT assets supporting business activities for which the BIA has determined that continuity must be ensured; RTOs strongly depend on the technical and organizational measures the IT department implements to deal with incidents.
Nowadays, determining the RTOs that apply to the IT assets is done manually, and it is a subjective task that heavily depends on the experience of the IT personnel. This is not only error-prone, but also does not scale well (to the point that, often, RTOs are not even determined for all entities, despite this being required by the standard methodology). Moreover, it is inconvenient in case of changes in the IT infrastructure or in the business goals. In particular, new contracts and agreements can have an impact on the quality of service a business process should deliver, and ultimately on the MTPD associated with it. Likewise, changes in the IT infrastructure may affect dependencies and therefore the impact of the IT assets on the business MTPDs. In both cases, adapting the BCP to these changes usually requires a costly new analysis involving both the IT and business units of the organization.
We present a new model-based tool to support the analysis of temporal dependencies among IT assets and between IT assets and business processes. The primary goals of our model and tool are:
(1) to support the IT department in setting and validating the RTOs of the IT assets of the organization, and
(2) to evaluate assigned RTOs w.r.t. the given MTPD to find critical points in the IT infrastructure.
Ultimately, our model allows one to make explicit the fine-grained set of premises and assumptions under which a given MTPD will be achieved, thereby obtaining a more objective assessment of the behaviour of the IT infrastructure.
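To fix intuitions, goal (1) can be sketched as follows. In a deliberately simplified reading of the model (the tolerance times on dependency edges are ignored, and all entity names, MTPDs and dependencies below are invented for illustration), an asset's RTO may not exceed the tightest MTPD among the processes that transitively depend on it:

```python
# Hypothetical sketch: deriving upper bounds on asset RTOs from the
# MTPDs set by the business unit. Names and values are invented.

depends_on = {                 # entity -> entities it directly depends on
    "P1": ["A1"], "P2": ["A1", "A2"],
    "A1": ["M1"], "A2": ["M1"],
}
mtpd = {"P1": 4, "P2": 8}      # hours, defined on the business processes

def closure(node, acc=None):
    """All entities `node` transitively depends on."""
    acc = set() if acc is None else acc
    for d in depends_on.get(node, []):
        if d not in acc:
            acc.add(d)
            closure(d, acc)
    return acc

rto_bound = {}                 # asset -> loosest admissible RTO (hours)
for process, limit in mtpd.items():
    for asset in closure(process):
        rto_bound[asset] = min(limit, rto_bound.get(asset, limit))

print(sorted(rto_bound.items()))   # [('A1', 4), ('A2', 8), ('M1', 4)]
```

Note how the shared asset M1 inherits the tightest bound (4h, from P1), even though P2 alone would tolerate 8h; the full model refines these bounds with the tolerance times on the dependency edges.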
While achieving these goals, we argue that our model is particularly useful for dynamically auditing the BCP in various ways:
First, the tool allows one to visualize immediately how changes in the business goals or in the IT infrastructure affect compliance with given (or modified) MTPDs; in particular, it is possible to compute whether the measures already in place still provide sufficient guarantees after the changes.
Secondly, it allows one to validate the actual response of the IT infrastructure against the expected behaviour, promoting a continuous refinement of the model, which can thus adapt to new external circumstances and allow for early detection of new threats to the business continuity targets.
Technically, our model is an improvement of the one we presented in [18] for the optimization of countermeasures. The essential difference from the previous model lies in the modelling of the recovery time after disruption, which in the present setting has to be much more accurate. Notably, as we mention in Section 5, the data our model requires is collected anyhow during a BCP.
Time Dependency and Recovery Model:
[Figure: an example TDR model. Legend: Pn = Process, An = Application, DBn = Data, Mn = Machine, Nn = Network]
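As a rough illustration of how such a TDR graph can be evaluated, the sketch below encodes entities connected by time-annotated dependency edges and propagates the outage caused by a single incident. All names, tolerances and repair times are invented, and the propagation rule is a simplification of the full model of [18]:

```python
from collections import defaultdict

# (dependent, dependency, tolerance): the dependent survives `tolerance`
# hours without the dependency before becoming unavailable itself.
edges = [
    ("P1", "A1", 0),
    ("A1", "DB1", 1),   # e.g. A1 works for 1h from a local cache
    ("A1", "M1", 0),
    ("DB1", "M2", 0),
    ("M1", "N1", 2),
    ("M2", "N1", 2),
]

graph = defaultdict(list)
for dependent, dependency, tolerance in edges:
    graph[dependent].append((dependency, tolerance))

def induced_outage(entity, incident_target, repair_time):
    """Hours `entity` is unavailable if `incident_target` stays down
    for `repair_time` hours (single incident, simplified propagation)."""
    if entity == incident_target:
        return repair_time
    return max((max(0, induced_outage(dep, incident_target, repair_time) - tol)
                for dep, tol in graph[entity]), default=0)

# An incident takes network N1 down for 5 hours:
print(induced_outage("P1", "N1", 5))   # P1 is unavailable for 3 of them
```

The tolerances along each dependency chain absorb part of the repair time, which is precisely why RTOs on low-level assets can be looser than the MTPDs of the processes they ultimately support.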
Critical Points in the Traditional RTO Assignment:
Discussion:
Feasibility and Validation:
The main concern regarding the feasibility of our approach is whether the required set of data is easy to collect. If it were not, organizations would not be willing to accept the approach. Fortunately, the data it requires is typically available after RA and BCP:
First of all, an accurate map of the IT infrastructure is readily available after a BCP carried out following the BS25999 standard [7] (and also after a RA).
Secondly, an inventory of possible incidents, together with their frequencies, has to be compiled during the RA.
Finally, a BCP should provide (according to the BS25999 standard) a complete evaluation of the effectiveness of the chosen incident response strategies.
To further substantiate our argument, we note that this data is also collected by tools devised to assist the RA and RM processes. For instance, the Italian branch of KPMG [11] (a worldwide company that also delivers Information Risk Advisory services) has developed a customizable tool, KARISMA (Kpmg Advanced RISk MAnagement), to support their RA activities. Among the information KARISMA collects via a question-driven procedure are a map of the business process entities (together with their relationships) and the Business Impact Analysis values. KARISMA is based on COBIT, and it is very likely that other COBIT-based tools for RA would collect the same information. Our system can thus be regarded as an additional component for KARISMA, or for any other COBIT-based tool for RA, supporting in particular the Business Continuity Planning activity.
We also note that most of the information required to build the TDR model is also available when an architectural framework such as TOGAF [16], Zachman [17] or ArchiMate [1] is applied to an organization. Indeed, the layers defined in those frameworks are similar to the ones we adopt in our model, though they are used for different purposes (e.g. architectural support, impact evaluation of new components, etc.). Since those frameworks are widely employed (ArchiMate, for instance, is used by ABN Amro and the Dutch Tax Office) and are supported by several tools, they provide indirect confirmation of the feasibility of actually obtaining the data needed by our model.
Summarizing, our tool does not require organizations to acquire new information (i.e. to employ new resources); rather, it uses in a different way the information already available after RA and BCP.
Dynamic Auditing:
Finally, we argue that our framework is particularly useful to support a dynamic auditing process. The concept of dynamic auditing is well known among risk management strategies, particularly in the field of software engineering [13]. The goal of this process is to continuously assess what could go wrong in projects (i.e., what the risks are), determine which of these risks are most relevant, and implement strategies to deal with them. Even though many of the methodologies for risk management [5, 3, 8], as well as those for BC [7], include a monitoring and reviewing step, this process can be performed with different degrees of granularity, depending on how flexible the methodology is. For example, a change in the IT infrastructure involving the dismantling of a set of applications and machines and the introduction of new software and hardware components may require either the assessment of the new components only or that of the whole organization, depending on how much of the previous assessment results can be reused.
Thanks to the fine granularity and the high degree of independence of the information used (time dependencies, assessment of incidents, importance of processes to the business), our model and tool are particularly suitable to support a dynamic assessment process.
For instance, when dealing with a change in the organization, be it a rearrangement of the IT infrastructure or a new business strategy, after a simple update of the model the framework can be used to evaluate the new time constraints within which incidents must be repaired to preserve business continuity. If a new component is added to the information system, it is only necessary to add it to the TDR model and specify its functional and temporal relationships with the other components to evaluate its RTO. On the other hand, if a process becomes more important for the organization's business (due to changes in business strategy), it is possible to change its MTPD and automatically assess the IT infrastructure to verify whether it is still able to meet the new time constraints.
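Both kinds of incremental re-validation can be sketched as follows. This is a deliberately simplified model (all names, RTOs and MTPDs are invented, and the worst-case downtime of a process is approximated by the largest RTO in its dependency closure, ignoring tolerance times):

```python
# Hypothetical sketch: re-checking MTPD compliance after two changes,
# without re-assessing the whole organization. All values invented.

rto = {"A1": 2, "M1": 1}                 # asset -> assigned RTO (hours)
depends_on = {"P1": ["A1"], "A1": ["M1"]}
mtpd = {"P1": 4}                         # process -> MTPD (hours)

def worst_downtime(node):
    # Simplified worst case: the largest RTO in the dependency closure.
    return max([rto.get(node, 0)] +
               [worst_downtime(d) for d in depends_on.get(node, [])])

def violations():
    return [p for p, limit in mtpd.items() if worst_downtime(p) > limit]

# Change 1: a new application A2 (RTO 3h) is introduced under P1.
rto["A2"] = 3
depends_on["P1"] = depends_on["P1"] + ["A2"]
print(violations())   # prints [] -- still within the 4h MTPD

# Change 2: P1 becomes more critical; its MTPD is tightened to 2h.
mtpd["P1"] = 2
print(violations())   # prints ['P1'] -- A2's 3h RTO now breaks the bound
```

In the failing case, the tool would point the IT department at the critical asset (here A2) whose RTO must be tightened, rather than triggering a full new assessment.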
In addition, after the occurrence of an incident, our model allows us to verify whether the incident response propagation is compliant with the expected behaviour. It might happen that a time dependency between two applications, estimated to be one hour, is in fact one hour and a half. Furthermore, one might observe that the response time to an incident exceeds the forecasted RTO. In those cases, the model can easily be updated with the newly collected information, thereby allowing one to rapidly assess the new situation and develop new and more efficient BC strategies, if needed. This feature adds quality to our solution, since it enables the BC team to organically capitalize on practical experience to improve the accuracy of the model and of its outcomes over time.
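Such a refinement step can be as simple as replacing an estimated parameter with the measured one and re-running the evaluation. A minimal sketch, with invented entities and durations:

```python
# Hypothetical sketch: refining an estimated time dependency with an
# observed value. Entities and durations are invented.

# tolerance[x][y] = hours x survives without y before going down itself
tolerance = {"A1": {"A2": 1.0}}   # estimate: A1 tolerates 1h without A2

def outage_of_a1(a2_downtime):
    """Downtime induced on A1 by an A2 outage of `a2_downtime` hours."""
    return max(0.0, a2_downtime - tolerance["A1"]["A2"])

print(outage_of_a1(2.0))          # with the 1h estimate: 1.0h of downtime

# Monitoring shows A1 actually survives 1.5h without A2: update, re-run.
tolerance["A1"]["A2"] = 1.5
print(outage_of_a1(2.0))          # refined model: only 0.5h of downtime
```

Each incident thus feeds a measured value back into the model, and subsequent assessments are automatically computed against the refined parameters.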
In this perspective, the ability to easily refine the model helps improve the way organizations traditionally deal with incidents. Instead of simply solving the problem when it happens and then forgetting about it, our solution promotes the continuous monitoring of the performance of repair operations, collecting new information as incidents occur and then using it to improve the efficiency of the response to new occurrences.
Summarizing, our system allows one to:
(a) easily adjust the model to changes in the organization and/or its business targets, without the need for a completely new assessment, and
(b) refine the model (i.e., make it more precise) as soon as new and more accurate information about the actual behaviour of the organization becomes available.