Research: Cyber Security

Botnet Detection via Machine Learning
Botnets are groups of machines that have been infected with a virus or trojan, are controlled by a botnet operator, and are commandeered into launching most of the malicious attacks on the Internet, including spam, DDoS attacks, and phishing. Detecting botnets by inspecting network traffic is challenging because botnets use a wide variety of protocols (HTTP, IRC, P2P) to communicate with their command-and-control server and keep changing the protocol itself. Moreover, recent botnets have begun using common protocols such as HTTP, which makes it even harder to distinguish their communication patterns from those of legitimate clients browsing well-behaved web sites. The goal of this research project is to distinguish botnets’ command-and-control traffic from legitimate traffic by taking a generalized view: while botnet owners will continuously come up with cleverer ways to hide their bots from anomaly detection systems and anti-virus products, there is one pattern that they can never hide, namely the coordination inherently present in the commands that bots exchange with a command-and-control server. I am working on an end-to-end system that uses machine learning algorithms to automatically learn new behavior patterns exhibited by botnets in contrast to legitimate clients. The challenging research problems include the trade-off between accuracy and scalability, the rate at which new models must be learned, and the impact of non-linearity in the model space.

Detecting DNS Domain-Flux Botnets
In the latest episode of the constant cat-and-mouse game between botnets and security professionals, botnets such as Conficker, Kraken and Torpig came up with a highly sophisticated way of controlling their bots. Every day, each bot generates random domain names using a pseudo-random string generator, while the botnet owner has to register only one such domain name at a DNS registrar. While studying this problem, I realized that a novel way of detecting such botnets is to look at the distribution of characters in the domain names they generate, since algorithmically generated domain names are usually unpronounceable and not well formed. I developed a rigorous approach that combines machine learning and information theory. In this approach, domain names are classified as legitimate or domain-flux depending on the distribution of their unigrams (alphanumeric characters) and n-grams, under the assumption that domain-flux names have different characteristics than well-formed words. This work has received a lot of interest from the media (Dark Reading, The Register) because it is generic and can be applied to detect future domain-flux botnets in a zero-day fashion. In fact, our methodology was one of the first to observe that the latest version of the Storm botnet uses a combination of two words and registers them with DNS. For more details please refer to our ACM Internet Measurement Conference (IMC) 2010 paper, "Detecting Algorithmically Generated Malicious Domain Names".
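The character-distribution idea can be illustrated with a minimal sketch. It is not the method from the paper: it assumes a symmetric Kullback-Leibler divergence over smoothed unigram distributions as the distance, and the function names, the smoothing constant, and the tiny reference lists are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"


def unigram_distribution(domains, smoothing=1.0):
    """Smoothed probability of each alphanumeric character over a set of domain labels."""
    counts = Counter(ch for d in domains for ch in d.lower() if ch in ALPHABET)
    total = sum(counts.values()) + smoothing * len(ALPHABET)
    return {ch: (counts[ch] + smoothing) / total for ch in ALPHABET}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); smoothing keeps every term finite."""
    return sum(p[ch] * math.log(p[ch] / q[ch]) for ch in ALPHABET)


def symmetric_kl(p, q):
    """Symmetrised divergence (an assumed metric, chosen for simplicity)."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def classify(test_domains, legit_reference, flux_reference):
    """Label a group of domain labels by the closer reference distribution."""
    p = unigram_distribution(test_domains)
    d_legit = symmetric_kl(p, unigram_distribution(legit_reference))
    d_flux = symmetric_kl(p, unigram_distribution(flux_reference))
    return "domain-flux" if d_flux < d_legit else "legitimate"


# Hypothetical reference sets; a real deployment would need far larger ones.
LEGIT = ["google", "facebook", "wikipedia", "amazon",
         "weather", "network", "research", "university"]
FLUX = ["xq7zk2vw9j", "qzx8vjk3w7", "k9wzq2xj7v", "zxq3k8jw2v"]

if __name__ == "__main__":
    print(classify(["wq8zxk7j2v", "jx9kq3zw8v"], LEGIT, FLUX))
    print(classify(["internet", "security"], LEGIT, FLUX))
```

Classifying a group of names together, rather than one name at a time, gives the character distribution enough samples to be meaningful; the same structure extends to n-grams by counting character pairs or triples instead of single characters.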

Behavioral Anomaly Detection
Behavioral anomaly detection systems that are based on large matrix computations require a large amount of computing resources. In this research project, I developed a streaming algorithm that detects behavioral anomalies such as DDoS attacks, port and network scans, and zero-day worms. It summarizes the structure of traffic by its information entropy and uses time-series trending to track changes in that entropy. The algorithm scales to large deployments at Tier-1 ISPs and has detected many high-profile cyber attacks (the cyber-war against Estonia in 2007 and Georgia in 2008, the Chinese attacks on CNN in 2008, and the DDoS attack that derailed Amazon's service in 2008). For more details please refer to our IEEE/ACM INFOCOM 2007 paper titled "Dowitcher: Effective Worm Detection and Containment in the Internet Core".
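As a rough illustration of entropy summarization with time-series trending, the sketch below computes the Shannon entropy of one traffic feature (for example, destination ports) per time window and compares it against an exponentially weighted moving baseline. It is not the deployed algorithm: the class name, the EWMA baseline, and every parameter value are assumptions made for this example.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


class EntropyTrendDetector:
    """Flags windows whose entropy departs from an EWMA baseline (assumed design)."""

    def __init__(self, alpha=0.2, threshold=3.0, warmup=5, min_std=0.05):
        self.alpha = alpha          # EWMA smoothing factor
        self.threshold = threshold  # allowed deviation, in standard deviations
        self.warmup = warmup        # windows to observe before alerting
        self.min_std = min_std      # floor so a perfectly flat baseline still works
        self.mean = None
        self.var = 0.0
        self.seen = 0

    def update(self, window_values):
        """Consume one window of feature values; return True if it looks anomalous."""
        h = shannon_entropy(window_values)
        if self.mean is None:
            self.mean = h
            self.seen = 1
            return False
        deviation = h - self.mean
        std = max(math.sqrt(self.var), self.min_std)
        anomalous = (self.seen >= self.warmup
                     and abs(deviation) > self.threshold * std)
        if not anomalous:
            # Fold only normal windows into the baseline so an attack does not skew it.
            self.mean += self.alpha * deviation
            self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        self.seen += 1
        return anomalous


if __name__ == "__main__":
    detector = EntropyTrendDetector()
    normal_ports = [80, 443, 80, 53, 443, 80, 25, 80]
    for _ in range(6):
        detector.update(normal_ports)
    # A port scan touches many distinct ports once each, so entropy jumps.
    print(detector.update(list(range(1, 201))))
```

Only the running mean and variance are kept per feature, so memory stays constant regardless of traffic volume, which is the property that makes this style of detector attractive compared with large matrix computations.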

IP-address based Traffic Classification
When a network security analyst finds network traffic corresponding to a trojan, worm or DDoS attack, a common first question is who is behind the IP-address. In particular, is the attack coming from a network infamous for being behind other attacks? I frequently found myself searching for the attackers’ IP-addresses on search engines, and that is when I realized the value of building a semantics-driven search that classifies IP-addresses and network traffic by leveraging publicly available information. I built a vertical search engine that searches for IP-addresses and classifies them into various categories (e.g., DNS, HTTP, SIP, SMTP, or FTP server) on the basis of the keywords that occur next to an IP-address on the web. We were also the first to propose using IP-address labels to classify network traffic into application protocol categories. This work has spawned a new field, with a variety of follow-up work on traffic classification using IP-address labels. For more details please refer to our ACM SIGCOMM 2008 paper titled "Googling the Internet: Unconstrained Endpoint Profiling" or the journal version: ACM ToN 2010 paper "Googling the Internet: Profiling Internet Endpoints via the World Wide Web".
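The keyword-proximity idea can be sketched as follows. This is only an illustration under stated assumptions: the keyword table, the `label_ip` function, and the five-token window are hypothetical, and the snippets stand in for text that a real system would collect from web search results.

```python
import re
from collections import Counter

# Hypothetical keyword table; a real system would learn or curate a much larger one.
CATEGORY_KEYWORDS = {
    "DNS": {"dns", "nameserver", "resolver"},
    "HTTP": {"http", "web", "apache", "proxy"},
    "SMTP": {"smtp", "mail", "spam", "mx"},
    "SIP": {"sip", "voip"},
    "FTP": {"ftp"},
}


def label_ip(ip, snippets, window=5):
    """Vote for a category using keywords found within `window` tokens of `ip`."""
    votes = Counter()
    for text in snippets:
        # Keep dots inside tokens so an IP-address stays a single token.
        tokens = [t.strip(".") for t in re.findall(r"[a-z0-9.]+", text.lower())]
        for i, token in enumerate(tokens):
            if token != ip:
                continue
            nearby = tokens[max(0, i - window): i + window + 1]
            for category, keywords in CATEGORY_KEYWORDS.items():
                votes[category] += sum(1 for t in nearby if t in keywords)
    if not votes or max(votes.values()) == 0:
        return None  # the IP was never mentioned, or no keyword appeared near it
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    snippets = [
        "Blocked SMTP spam relay 192.0.2.25 sending mail to our MX",
        "Public DNS resolver at 198.51.100.53 answers recursive queries",
    ]
    print(label_ip("192.0.2.25", snippets))
    print(label_ip("198.51.100.53", snippets))
```

The resulting label can then be attached to flows to or from that address, which is how an endpoint label turns into a traffic class.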

Routing Security: Detecting BGP Prefix Hijacking
The Border Gateway Protocol (BGP) is the de facto inter-domain routing protocol of the Internet. However, BGP was built on implicit trust among individual administrative domains, and no countermeasure prevents bogus routes from being injected and propagated through the system. Attackers can exploit bogus routes to gain control of IP prefixes, either to hijack the corresponding traffic or to launch stealthy attacks. They can directly originate bogus routes for the prefixes or, more stealthily, spoof the AS paths of the routes so that they appear to be originated by others. Although several secure extensions of BGP, such as S-BGP and soBGP, have been proposed, their comprehensive deployment is not foreseeable. In this research project, we designed a system to detect prefix hijacking and path spoofing attacks in real time. The system extracts and learns route information objects from historical BGP routing data. It then applies heuristics based on the relationships between ASes (peer, customer-provider) and on correlation with data traffic to improve the detection rate. The deployed system has correctly detected and alerted on many prefix hijacking attacks in real time, e.g., the hijacking of YouTube’s prefix by Pakistan Telecom in 2008. More details can be found in our IEEE SECURECOMM 2007 paper on "Detecting Bogus BGP Routes - Going Beyond Prefix Hijacking".
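A minimal sketch of the "learn from history, then check new announcements" step is shown below. It covers only origin changes and more-specific (sub-prefix) announcements, and it does not attempt path-spoofing detection, AS-relationship heuristics, or traffic correlation. The `OriginMonitor` class and its return labels are hypothetical, and the prefixes and AS numbers come from ranges reserved for documentation.

```python
import ipaddress
from collections import defaultdict


class OriginMonitor:
    """Remembers which origin ASes announced each prefix and checks new routes."""

    def __init__(self):
        # prefix (ip_network) -> set of origin AS numbers seen in historical data
        self.origins = defaultdict(set)

    def learn(self, prefix, as_path):
        """Record a historical route; the origin AS is the last hop of the AS path."""
        self.origins[ipaddress.ip_network(prefix)].add(as_path[-1])

    def check(self, prefix, as_path):
        """Classify a new announcement against the learned prefix-origin pairs."""
        net = ipaddress.ip_network(prefix)
        origin = as_path[-1]
        known = self.origins.get(net)
        if known is not None:
            return "ok" if origin in known else "origin-change"
        # Exact prefix never seen: look for learned prefixes that cover it.
        covering = [owners for learned, owners in self.origins.items()
                    if learned.version == net.version and net.subnet_of(learned)]
        if not covering:
            return "unknown-prefix"
        if any(origin in owners for owners in covering):
            return "ok"
        return "sub-prefix-hijack"


if __name__ == "__main__":
    monitor = OriginMonitor()
    monitor.learn("198.51.100.0/24", [64496, 64500])
    print(monitor.check("198.51.100.0/24", [64497, 64500]))    # same origin
    print(monitor.check("198.51.100.0/24", [64497, 64501]))    # different origin
    print(monitor.check("198.51.100.128/25", [64497, 64501]))  # more-specific prefix
```

An origin-only check like this cannot catch an attacker who keeps the legitimate origin AS at the end of a forged path, which is why heuristics beyond origin matching are needed for path spoofing.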