ASPLOS 2014 Tutorial

System Analytics in the Cloud 

Canturk Isci and Vasanth Bala
IBM T.J. Watson Research Center, NY


With cloud computing, the proliferation of virtualized systems, and intercontinentally distributed, warehouse-scale computers, IT systems are pushing new frontiers in complexity, scale, application diversity and dynamism. A myriad of emerging techniques for cloud provisioning, scripted automation, formal validation, deployment, monitoring, anomaly and misconfiguration detection, software-defined infrastructure and more help organizations keep up with ever-increasing infrastructure complexity and shrinking development cycles. While these solutions dramatically mitigate the effects of complexity and scale, most IT system problems do not magically disappear. Servers still crash, systems can still be compromised and applications still misbehave, leading to profoundly obscure failures and anomalies in these highly complex, large-scale compute systems. Traditional practices of characterizing, diagnosing and remediating systems do not fare well in this environment of rapid change and growth. These complex IT systems encapsulate a wealth of operational data, and tapping into this potential opens up promising new opportunities. Similar to the recent growth of analytics over user data, there is an emerging parallel trend of designing analytics for system operational data.

In this tutorial we describe an alternative approach to systems management in the cloud that takes a data-centric, rather than the traditional system-centric, view of the IT operational environment. The core principle behind this approach is to treat systems in the cloud much as we treat documents on the Web: periodically "crawl" systems to extract relevant features into a central index, over which different systems management services can then be built. In essence, this allows a system to be viewed as a series of point-in-time snapshots for which crawled feature data has been collected and centrally indexed. We discuss how these system snapshots can be represented simply as data frames, similar to web documents, and how this data can be analyzed for insights. We present an entire monitoring and analytics framework built around this "systems as data" principle, which we showcase via both product-grade solutions and real-system prototypes that highlight some of the promising opportunities. In particular, we present: (i) novel, non-intrusive methods for "crawling" virtual systems in the cloud, both online and offline; (ii) a cloud knowledge base for across-time and across-the-cloud system analytics; (iii) techniques for enriching the semantic context of raw feature data through annotations; and (iv) existing and potential applications of system analytics, based on real industry use cases.


This tutorial will describe the basic principles of system analytics and methods for interpreting systems as data. It will include technical presentations, real-world use cases and demos for most of the topics covered. We will describe the data-centric representation of compute systems and potential approaches for accessing, ingesting and processing this data. This discussion will draw from our experiences over several iterations of a system analytics platform.
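
As a rough illustration of the "systems as data" idea, the sketch below treats a point-in-time snapshot as a flat set of named features and computes what changed between two snapshots of the same system. The `Frame` class, the `diff_frames` function and the feature keys are hypothetical names chosen for illustration only; they are not the format used by the platform described in this tutorial.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Frame:
    """A point-in-time snapshot of one system (hypothetical format)."""
    system_id: str
    timestamp: int  # seconds since the epoch
    features: Dict[str, str] = field(default_factory=dict)


def diff_frames(old: Frame, new: Frame) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {feature: (old_value, new_value)} for every feature that changed.

    A value of None means the feature was absent in that snapshot.
    """
    changes = {}
    for key in old.features.keys() | new.features.keys():
        before = old.features.get(key)
        after = new.features.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


# Two snapshots of the same (made-up) virtual machine.
t0 = Frame("vm-42", 1000, {
    "package/examplelib": "1.0",
    "config/sshd/PermitRootLogin": "no",
})
t1 = Frame("vm-42", 2000, {
    "package/examplelib": "1.1",
    "config/sshd/PermitRootLogin": "no",
    "process/webserver": "running",
})

changes = diff_frames(t0, t1)
for feature in sorted(changes):
    print(feature, changes[feature])
```

Because each snapshot is plain data, comparing a system against its own past needs no access to the running system, only to the stored frames.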


1. Methods for Crawling Systems: 
Existing and emerging methods for accessing and interpreting persistent and volatile system state. In particular, we describe how virtual systems in the cloud can be "crawled" in much the same way as documents on the web. We discuss the limitations of existing approaches and present new techniques for providing a complete system representation in a non-intrusive, scalable and maintainable way in the cloud, without requiring any end-user cooperation or causing any system interference.

2. The Cloud Knowledge Base: 
The cloud knowledge base plays one of the most critical roles in system analytics: it is where both realtime and forensic system information is managed. Here we first describe a generic system data format, the "system frame", which is generated as the result of crawling compute systems. We then discuss various frame representations, the design trade-offs among different knowledge base frameworks, and the pros and cons of different architectural choices in terms of efficiency, scale and provided services.

3. Annotating and Linking Systems Data: 
Semantic enrichment of raw system data is essential to improving its usability. Here we present key examples of (i) how systems data can be annotated to improve the semantic context for analysis; (ii) how linked-data concepts allow external information sources to be used as inputs to the annotation process; and (iii) example annotators for configuration analysis, application discovery and system vulnerabilities.

4. System Analytics Applications: 
This part of the tutorial showcases applications of system analytics drawn from enterprise use, such as image sprawl management and compliance analysis, as well as from recent research, such as realtime, out-of-band system monitoring and log and configuration analysis.
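
The four parts above can be read as a pipeline: crawl, index, annotate, query. The sketch below is a minimal in-memory stand-in for that pipeline, assuming frames are plain dictionaries of feature names to values. The `KnowledgeBase` class, the `vulnerability_annotator` function and the advisory data are all hypothetical and only illustrate the flow; a real knowledge base would sit on a scalable data store and real annotators would draw on external information sources.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Frame = Dict[str, str]  # feature name -> value (hypothetical format)
Annotator = Callable[[Frame], List[str]]

# Made-up advisory data standing in for an external information source.
VULNERABLE_PACKAGES = {("examplelib", "1.0")}


def vulnerability_annotator(frame: Frame) -> List[str]:
    """Tag a frame with every installed package found in the advisory set."""
    tags = []
    for key, value in frame.items():
        if key.startswith("package/"):
            name = key[len("package/"):]
            if (name, value) in VULNERABLE_PACKAGES:
                tags.append(f"vulnerable:{name}")
    return tags


class KnowledgeBase:
    """Stores frames per system and timestamp, annotating them on ingest."""

    def __init__(self, annotators: List[Annotator]) -> None:
        self._annotators = annotators
        self._frames = defaultdict(dict)  # system_id -> {timestamp: frame}
        self._tags = defaultdict(dict)    # system_id -> {timestamp: [tags]}

    def ingest(self, system_id: str, timestamp: int, frame: Frame) -> None:
        self._frames[system_id][timestamp] = frame
        self._tags[system_id][timestamp] = sorted(
            tag for annotate in self._annotators for tag in annotate(frame)
        )

    def systems_with_tag(self, tag: str) -> List[str]:
        """Return the systems whose most recent frame carries the tag."""
        result = []
        for system_id, by_time in self._tags.items():
            latest = max(by_time)
            if tag in by_time[latest]:
                result.append(system_id)
        return sorted(result)


kb = KnowledgeBase([vulnerability_annotator])
kb.ingest("vm-1", 100, {"package/examplelib": "1.0"})
kb.ingest("vm-1", 200, {"package/examplelib": "1.1"})  # upgraded later
kb.ingest("vm-2", 150, {"package/examplelib": "1.0"})

flagged = kb.systems_with_tag("vulnerable:examplelib")
print(flagged)
```

Keeping every timestamped frame, rather than only the latest, is what allows the same store to answer both "which systems are affected now" and "when was this system last affected".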


The objectives of this tutorial are to present the trends and opportunities in the growing system and operational analytics fields, and to describe tangible, concrete techniques for accessing and analyzing systems data. The tutorial covers the entire spectrum of system analytics processes (accessing and interpreting system state, building and extending a cloud knowledge base and designing analytics to infer and improve data center operations) to garner interest across different communities, from systems and architecture to machine learning and data analytics, and to underline concrete open problems and research directions in each layer of the system and analytics abstractions.


Vasanth Bala leads the Scalable Datacenter Analytics team at the IBM T.J. Watson Research Center, which, in collaboration with various groups in academia, has been exploring the many dimensions of large-scale systems monitoring and analytics. His current research interests focus on mining the configurations of (virtual) machines in a data center or cloud environment in order to learn configuration patterns that correlate strongly with performance problems and system outages. Vasanth's past work spans computer architecture, compiler optimization, parallel computing, virtualization, dynamic binary translation, and digital rights management. Prior to joining IBM Research, he co-founded Liquid Machines Inc., focusing on virtualization to enable document protection and security, which was later acquired by Check Point Software Technologies. Vasanth is an ACM Distinguished Scientist for seminal contributions to the field of dynamic binary translation, a member of the IBM Academy of Technology and an IBM Master Inventor. He was also recently nominated to the IBM Data Science team, which assists customers in solving analytics problems related to very large data sets. Vasanth received an M.S. and a Ph.D. in Computer Science from Rice University.

Canturk Isci is a Research Staff Member at the IBM T.J. Watson Research Center. His research interests include cloud computing, virtualization, data center energy and thermal management, and microarchitectural and system-level techniques for energy-efficient and adaptive computing. Prior to joining IBM Research, Canturk was a Senior Member of Technical Staff at VMware, where he worked on distributed resource and power management and on the performance and scalability of virtualized systems. He is the recipient of a best paper award at ICAC 2011, a best research poster award at VMworld 2008 and academic fellowships from the British Council, Princeton and Bilkent University. He serves as the industry chair of the IEEE Computer Society Special Technical Community on Sustainable Computing. Canturk has a B.S. in Electrical Engineering from Bilkent University, an M.Sc. with Distinction in VLSI System Design from the University of Westminster, and a Ph.D. in Electrical Engineering from Princeton University.