Motivation and Challenges

Connect the attack dots

Advanced Persistent Threat (APT) attacks have inflicted significant financial losses on many well-protected businesses (e.g., eBay, Target, Home Depot). Unlike conventional attacks, these advanced attacks are sophisticated (they involve many individual attack steps across many hosts and exploit different types of vulnerabilities) and stealthy (no single step is suspicious enough on its own to be flagged). To counter such attacks, enterprises need solutions that "connect the suspicious dots" across multiple system activities (this direction is also inspired by the DARPA Transparent Computing program). This requires (1) large-scale collection and storage of "the suspicious dots" (attack provenance) and (2) attack investigation that "connects the dots" to identify risky system behaviors.

In order for enterprises to investigate advanced attacks, it is crucial to understand the activities of hosts at a fine-grained level. Recent approaches based on ubiquitous system monitoring have emerged as an important solution for monitoring system activities and performing attack investigation. However, in order to build a fully operational query system to support security analysts in attack investigations, two challenges must be addressed:

Attack behavior specification: The system needs to provide a query language with specialized constructs for expressing various types of attack behaviors, including (1) multi-step attacks such as APT, (2) dependency tracking of attacks such as backward tracking and forward tracking, and (3) abnormal system behaviors such as network access spikes.
Big-data analytics: System monitoring produces a huge volume of logs every day (~50 GB per day for 100 hosts), and investigating advanced attacks typically requires enterprises to retain at least six months to one year of data. This big-data challenge requires the system to have an efficient data storage design as well as efficient query execution scheduling.
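
To make the scale concrete, the figures above can be turned into a rough retention estimate. The sketch below is a back-of-the-envelope calculation only: the function name is hypothetical, and it assumes a constant 50 GB per day for 100 hosts, a 365-day year, and decimal units (1 TB = 1000 GB).

```python
def retained_log_volume_gb(gb_per_day: float, retention_days: float) -> float:
    """Total raw log volume kept on disk, assuming a constant daily rate."""
    return gb_per_day * retention_days


# Figure quoted in the text: ~50 GB per day for 100 hosts.
DAILY_GB_100_HOSTS = 50.0

# Retention periods quoted in the text: six months to one year.
half_year_gb = retained_log_volume_gb(DAILY_GB_100_HOSTS, 0.5 * 365)
full_year_gb = retained_log_volume_gb(DAILY_GB_100_HOSTS, 1.0 * 365)

print(f"six months: {half_year_gb:.0f} GB, one year: {full_year_gb:.0f} GB")
```

Under these assumptions, 100 hosts alone accumulate roughly 9 to 18 TB of raw monitoring data, before any indexing or replication overhead.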

To support timely attack investigations, a query system should be (1) concise to use, (2) expressive enough to describe a wide variety of attack behaviors, and (3) efficient at retrieving the desired provenance from the massive monitoring data.

Limitations in existing query systems

Unfortunately, existing query systems fail to address these two inherent challenges in attack investigation.

Limited expressiveness: Relational databases based on SQL, graph databases such as Neo4j, and NoSQL databases such as MongoDB, Splunk, and Elasticsearch lack explicit constructs for high-level system concepts (e.g., system entities such as files and processes, system events, event attribute relationships and temporal relationships, event dependency paths, sliding time windows, history state accesses). Queries written with these tools therefore often involve a large number of low-level joins and constraints. The resulting queries are verbose, cumbersome to write, and inefficient to execute. Tuning their performance is tedious and time-consuming, which distracts security analysts from the attack investigation itself.
Semantics-agnostic design: Every system monitoring record is generated on a specific host in the enterprise at a specific time, so the data exhibits strong spatial and temporal properties. However, existing general-purpose database query systems do not provide semantics-based optimizations that exploit these domain-specific properties, and so they miss opportunities to speed up storage and query execution.