IT Service Management thread

IT infrastructure assessment

IT infrastructures (ITIs for short) comprise the various components (hardware, software and network infrastructure) upon which IT services are provided. ITIs must be quickly adapted to support new technologies (e.g. grid services, web services, internet applications, and application integration) and new types of services (e.g. wireless, broadband media, and voice services), while enforcing stronger access control and auditing policies and retaining a high degree of flexibility and agility. In such a scenario, one of the major problems faced by ITIs is their increasing size and complexity, which may jeopardize the delivery of real business value. This size and complexity often result from ITIs being created, designed or adapted by non-ITI experts such as business decision makers, consultants, administrators, developers, software engineers, solution architects and other individuals (whose points of view sometimes conflict), without ITI design guidelines and, most of the time, with the sole purpose of responding to the requirements of a particular business application. I have used my M2DM approach to define a set of IT complexity metrics, which made it possible to determine the complexity of existing infrastructures. I then proposed a reengineering technique to discover the topology of a distributed IT infrastructure, based on a multinomial logistic regression model and a set of topology stereotypes, which uses those complexity metrics as regressors (independent variables). To demonstrate the feasibility of the approach I applied the model to several organizations with distributed ITIs and, among other results, found that the most recurring stereotypes were the centralized and backbone ones.
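
As an illustration of how a multinomial logistic regression model can map complexity metrics to a topology stereotype, here is a minimal Python sketch. It is not the fitted model from the study: the metric names (sites, links_per_site), the "other" reference category and every coefficient are hypothetical placeholders. Only the centralized and backbone stereotype names come from the text above.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(score - peak) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def predict_stereotype(metrics, coefficients, intercepts):
    """Compute one linear score per stereotype and return the most probable one."""
    scores = {
        label: intercepts[label]
        + sum(weight * metrics[name] for name, weight in weights.items())
        for label, weights in coefficients.items()
    }
    probabilities = softmax(scores)
    best = max(probabilities, key=probabilities.get)
    return best, probabilities


# Hypothetical placeholder values, NOT coefficients fitted in the study.
COEFFICIENTS = {
    "centralized": {"sites": -0.8, "links_per_site": 0.1},
    "backbone": {"sites": 0.3, "links_per_site": 0.9},
    "other": {"sites": 0.0, "links_per_site": 0.0},  # reference category
}
INTERCEPTS = {"centralized": 1.0, "backbone": -1.0, "other": 0.0}

# Hypothetical infrastructure described by two made-up complexity metrics.
label, probs = predict_stereotype(
    {"sites": 1, "links_per_site": 0}, COEFFICIENTS, INTERCEPTS
)
print(label, probs)
```

In a real setting the coefficients would be estimated from infrastructures whose stereotype is already known, and the complexity metrics would play the role of the regressors.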

IT infrastructure patterns

Designing ITIs for large organizations is a challenging task, since it requires knowledge of existing organizational processes, the views of different players, and the coordination of technical expertise in three ITI domains (hardware, networking and infrastructure software) that rarely reside in a single individual. In most engineering fields, solutions are designed by using appropriate abstractions. Several companies, such as IBM, Microsoft or HP, have proposed design “blueprints” embodying proprietary components to support ITI design based upon their own ITI building blocks. The idea of raising the level of abstraction to allow ITI designers to find, share and apply standardized solutions to recurrent problems is good, but we think it should not be constrained by a focus on a specific technology or vendor. We are improving the current state of the art by proposing a supplier-independent approach based on the idea of a pattern language. Our definition of a pattern language in this context is “an interconnected collection of ITI design patterns that come together to create a secure, reliable, available, performant and manageable IT infrastructure”. The use of ITI design patterns can be seen as a way to simplify the ITI design process, while reducing its risk and cost by using well-known solutions for recurrent problems. We have already proposed a set of ITI design patterns that were validated by peer researchers and published. We still have to go through a practitioner validation step, in a set of controlled experiments, to gather more evidence on the soundness of our approach, namely on whether it facilitates communication, the sharing of ideas, the building of complex and heterogeneous solutions and the identification of recurrent problems, and whether it provides a guided approach to solving those problems.

Assessing process models

Process metrics can be used to establish baselines, to predict the effort required to go from an “as-is” to a “to-be” scenario or to pinpoint problematic ITSM process models. Several metrics proposed in the literature for business process models can be used for ITSM process models as well. However, we could not find in the literature a systematic and replicable approach to measuring the complexity of process models. To mitigate this problem we formalized some of those metrics and proposed some new ones, using the M2DM approach, upon a lightweight BPMN metamodel that we have developed, which is much easier to handle than the OMG’s MOF-based “official” counterpart. Our metamodel was instantiated with several case studies, encompassing several thousand meta-instances. We analyzed the collinearity of the formalized metrics and were able to identify a smaller metrics set, which will be used to perform further research work on the complexity of ITSM processes. Our approach is generic, since ITSM process models are directed graphs, such as those underlying computer network diagrams or sequence flow diagrams representing source code. In the near future, we plan to adapt it for assessing software process models defined upon OMG’s SPEM metamodel. Other research topics that have now become easier to tackle due to our pioneering work are experimental studies on how ITSM process complexity affects process operation. In particular, we are interested in exploring how these process complexity metrics are related to operational metrics such as “mean time to restore service”, “calls to second-tier resolver teams” or “% of incidents/problems resolved within service targets”. We also plan to investigate whether the proposed process complexity metrics can be used in the formulation of Critical Success Factors (CSFs) and Key Performance Indicators (KPIs) for ITSM processes.
Another open topic is the formulation of effort estimation models for ITSM process reengineering actions, based on the distance between “as-is” and “to-be” scenarios. If we find a good solution to this problem, we will then use these process complexity metrics as input for identifying improvement opportunities for each process. A final interesting topic that we intend to investigate is how process maturity relates to process complexity. While maturity is usually defined in a finite number of levels (typically 5), process complexity can be arbitrarily large. Although we have not yet collected sufficient supporting evidence, we believe that the corresponding transfer function (process maturity versus process complexity) will be roughly trapezoidal or parabolic (with the convex side pointing upwards). In the beginning, as organizations move from ad-hoc through defined levels of process maturity, an increase in process maturity is reflected in an increase in process complexity, as we observed in a case study, but that increase tends to stabilize. As organizations move into high maturity settings, their process complexity decreases as they learn and improve their processes by applying continual innovation techniques.
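
To make the directed-graph view of process models and the collinearity check more concrete, the following Python sketch computes a few generic graph measures for a toy process model and the Pearson correlation between two metric series. The measures shown (node count, arc count, arcs per node and a McCabe-style cyclomatic number) are well-known graph metrics chosen for illustration, not necessarily the set formalized with M2DM, and the toy incident process is hypothetical.

```python
import math


def graph_metrics(nodes, arcs):
    """Basic size and connectivity measures for a process model seen as a directed graph."""
    n, e = len(nodes), len(arcs)
    return {
        "nodes": n,
        "arcs": e,
        "connectivity": e / n,    # arcs per node
        "cyclomatic": e - n + 2,  # McCabe-style number, assuming one connected graph
    }


def pearson(xs, ys):
    """Pearson correlation; values near +1 or -1 suggest two metrics are collinear.

    Assumes both series have the same length and are not constant.
    """
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy process: incidents are triaged, optionally escalated, then resolved.
NODES = ["start", "log", "triage", "escalate", "resolve", "end"]
ARCS = [
    ("start", "log"),
    ("log", "triage"),
    ("triage", "resolve"),
    ("triage", "escalate"),
    ("escalate", "resolve"),
    ("resolve", "end"),
]

metrics = graph_metrics(NODES, ARCS)
print(metrics)

# Two made-up metric series measured over three models; here they are perfectly collinear.
print(pearson([1, 2, 3], [2, 4, 6]))
```

When two metrics correlate this strongly across many models, one of them adds little information, which is the kind of reasoning that leads to a smaller metrics set.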

Service level management

Among the concerns of ITSM, namely within the service level management process, are the requirements for service availability, performance, accuracy, capacity and security, which are specified in terms of service-level agreements (SLAs). We identified several problems regarding SLA definition and monitoring: SLAs for ITSM are informally specified in natural language; SLA specifications are not grounded on models of ITSM processes; and SLA compliance verification in IT services is not performed at the same level of abstraction as service design, i.e., there is a gap between the business-oriented perspective of the customer and the implementation-oriented perspective of the service provider. To mitigate those problems we are developing a model-based approach for IT service SLA specification and compliance verification. We have already proposed the abstract syntax of a domain-specific language (DSL) named SLALOM (SLA Language for specificatiOn and Monitoring) to bridge the aforementioned gap. SLALOM’s abstract syntax is a composition of a BPMN (the language we use to model IT services) metamodel with that of the SLA life cycle, as described in ITIL. As such, it will be possible to ground SLA definition on the corresponding IT service model constructs. The next step will be to write concrete syntaxes targeting different aims, such as SLA representation in process models. We expect to improve the efficiency of building SLA contracts and to reach adequate accuracy to allow the automatic validation of SLAs and the monitoring of their compliance in real time at model level, the level that is understood by all stakeholders involved in service specification.
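
The idea of grounding an SLA on a process-model element and verifying compliance at that level can be sketched as follows. This is not SLALOM syntax, which is not shown here: the ServiceLevelTarget class, its fields, the element identifier and the durations are all hypothetical, chosen only to show a target attached to a model element and checked against observed executions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevelTarget:
    """Hypothetical SLA target attached to one element of a process model."""

    element_id: str        # identifier of the model element (e.g. a BPMN task)
    max_minutes: float     # longest acceptable duration for one execution
    min_compliance: float  # fraction of executions that must meet max_minutes


def compliance_ratio(target, durations):
    """Fraction of observed executions that met the target duration."""
    if not durations:
        raise ValueError("no executions observed for " + target.element_id)
    met = sum(1 for minutes in durations if minutes <= target.max_minutes)
    return met / len(durations)


def is_compliant(target, durations):
    """True when enough executions met the target to satisfy the SLA."""
    return compliance_ratio(target, durations) >= target.min_compliance


# Made-up target and observations for a hypothetical "resolve_incident" task.
target = ServiceLevelTarget("resolve_incident", max_minutes=240, min_compliance=0.75)
observed = [30, 200, 250, 100]

ratio = compliance_ratio(target, observed)
print(ratio, is_compliant(target, observed))
```

Because the target refers to a model element rather than to an implementation detail, the same check can be read by customers and providers alike, which is the gap the model-based approach aims to close.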

IT incident management

Understanding causal relationships in incident management can help software development organizations find the adequate level of resourcing, as well as improve the quality of the services they provide to their end users and/or customers. We have conducted an empirical study upon a sample of incident reports recorded during the operation of several hundred commercial software products, over a period of three years, in six countries in Europe and Latin America, to find out which factors affect the incident management lifecycle. Nonparametric analysis of variance procedures were used for testing hypotheses. We obtained statistically significant evidence that several independent variables (Impact, Priority, Country, Zone and Category) have an influence on the incident lifecycle, as characterized by three dependent variables (TimeToRespond, TimeToResolve and TimeToConfirm). We were not surprised by the influence of an incident’s business criticality (the Impact) and of the correction priority recorded by the support team (the Priority) on the incident lifecycle. After all, those incident descriptors were proposed with that very aim. Not so obvious was the observation that both the country and the geographical zone of the organization reporting an incident influenced all the variables that characterize the incident lifecycle. This means that organizations from different countries (or geographical zones) do not receive the same kind of support, although they are using the same products and, in principle, paying approximately the same for them. Although we have not explored it further yet, we believe this phenomenon may be due to national differences in the requirement to formalize SLAs and verify compliance with them, to cultural differences in the tolerance of end users to failure (e.g. not complaining about an incident that is yet to be solved) and/or to language differences that somehow influence the relationship between end users and the international support team that the software vendor provides worldwide. Another apparent surprise was the fact that the proportion of critical incidents is not the same across countries. In the UK and Spain, the actual number of critical incidents was above the expected one. This may indicate that end users in those countries are causing the support team to over-grade incident criticality. Sometimes, end users/customers tend to think that their incidents always have a higher impact simply because those incidents affect the way they do their work, rather than judging by the impact the incident has on the business. Again, this issue deserves further study before sensible conclusions can be drawn.
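
As an example of a nonparametric analysis of variance procedure, the Python sketch below computes the Kruskal-Wallis H statistic, which compares several independent groups through the ranks of their values. The text above does not name the specific procedure used in the study, so choosing Kruskal-Wallis is an assumption, and the country labels and TimeToResolve values are made up for illustration.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction.

    Assumes at least two non-empty groups and at least two distinct values overall.
    """
    pooled = [value for group in groups for value in group]
    n_total = len(pooled)
    ranks = average_ranks(pooled)

    weighted_rank_sums = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_rank_sums += rank_sum ** 2 / len(group)
        start += len(group)

    h = 12.0 / (n_total * (n_total + 1)) * weighted_rank_sums - 3 * (n_total + 1)
    ties = sum(count ** 3 - count for count in Counter(pooled).values())
    correction = 1 - ties / (n_total ** 3 - n_total)
    return h / correction


# Made-up TimeToResolve samples (in hours) for three hypothetical countries.
time_to_resolve = {
    "country_a": [1, 2, 3],
    "country_b": [4, 5, 6],
    "country_c": [7, 8, 9],
}

h_statistic = kruskal_wallis_h(list(time_to_resolve.values()))
print(h_statistic)
```

The statistic is usually compared against a chi-square distribution with one degree of freedom fewer than the number of groups; a large value suggests that at least one group (here, a country) differs from the others in the measured variable.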