I enjoy building distributed systems that analyze streaming, real-time data such as network and web traffic, and developing innovative algorithms that apply data mining, graph theory, and advanced statistics to the resulting huge data sets. An underlying theme in my research is uncovering correlations and patterns in network traffic by applying unsupervised or supervised machine learning algorithms. An overarching goal is to distinguish Botnet Command-and-Control traffic from legitimate traffic by taking a generalized view: while Botnet owners will continuously come up with more clever ways to hide their Bots from anomaly-detection systems and AV vendors, there is one pattern they can never hide, namely the coordination inherently present in the commands exchanged between the Command-and-Control server and its Bots. A related problem that the network-security community has a better handle on is detecting DDoS attacks, port/network scans, and zero-day worms via behavioral anomaly detection (refer to: DoWitcher: Effective Worm Detection and Containment in the Internet Core). However, we are still miles away from a zero-day behavioral Botnet-detection solution.
Speaking of zero-day solutions, one related effort that I recently concluded detects DNS "domain flux" based Botnets such as Conficker, Kraken and Torpig in real-time by examining DNS query names and their corresponding reply IP-addresses in network traffic. The hypothesis for detecting domain-flux queries rests on the observation that Botnet owners are more likely to use unpronounceable words when registering domain names, for the simple reason that most pronounceable, English-like words are already registered (by legitimate domain owners or domain-parking web sites). (refer to: Detecting algorithmically generated malicious domain names)

A common reaction of a network-security analyst on finding traffic corresponding to a trojan, worm or DDoS attack is to ask: who is behind that IP-address? In particular, is this attack coming from a network infamous for being behind other attacks? I frequently found myself searching for attackers' IP-addresses on search engines, and that is when we realized the value of building a semantics-driven search to classify IP-addresses and network traffic by leveraging publicly available information (refer to: "Googling the Internet: Unconstrained Endpoint Profiling")
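The unpronounceability hypothesis behind domain-flux detection can be quantified in several ways; a minimal sketch (the example labels are illustrative, not taken from the paper) uses the Shannon entropy of the character distribution within a domain label:

```python
import math
from collections import Counter

def char_entropy(label: str) -> float:
    """Shannon entropy (bits per character) of a domain label's character distribution."""
    counts = Counter(label)
    n = len(label)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Pronounceable, human-chosen labels reuse common letters (lower entropy);
# algorithmically generated labels are closer to uniform (higher entropy).
low = char_entropy("google")
high = char_entropy("xq7fzk9w3b")
```

In practice such a per-label score would be aggregated over a group of queries (e.g., all domains resolving to the same IP-address) before flagging anything.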
In another interesting research project, I correlated users' mobility patterns with their application usage in a 3G network using unsupervised learning techniques such as association rule mining and spectral clustering (paper presented at IMC 2009). The findings range from intuitive ones, e.g., people are likely to stay stationary in a location for the next 6 hours after looking up that location's weather on their mobile device, to a few that were interesting for being non-intuitive, e.g., there exist physical locations where people are markedly more interested in one particular application than in others: some hot spots where people mostly download music, and others where people are more likely to use social-networking applications on their mobile devices. (refer to: Measuring Serendipity: Connecting People, Locations and Interests in a Mobile 3G Network)
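As a toy illustration of the association-rule component (the transactions, item names, and rule below are made up for the example, not drawn from the study), the support and confidence of a rule such as "downtown → weather_app" are computed as:

```python
# Toy transactions: each is the set of (location, app) items seen for one user-hour.
transactions = [
    {"downtown", "weather_app"},
    {"downtown", "weather_app"},
    {"downtown", "music_app"},
    {"suburb", "social_app"},
    {"suburb", "social_app"},
]

def rule_stats(antecedent, consequent, transactions):
    """Support and confidence of the rule antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent in t and consequent in t)
    ante = sum(1 for t in transactions if antecedent in t)
    return both / n, (both / ante if ante else 0.0)

support, confidence = rule_stats("downtown", "weather_app", transactions)
```

Rule mining then amounts to enumerating candidate rules and keeping those above minimum support and confidence thresholds.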
A detailed list of my past Cyber Security projects appears below:
Mobile Cellular Networks (2.5G, 3G)
Back-haul of 3G networks:
Current 3G base-station backhaul still relies on T1/T3 lines, which are quickly overwhelmed by the recent explosion in smartphone data usage. In this project, we designed a novel resource-provisioning approach that reduces bandwidth consumption at the back-haul of 3G base-stations by exploiting user mobility and data-usage patterns.
Location Based Services:
Led a large measurement study of network traffic at a major 3G carrier (terabytes/day, 300K users) to derive user movement patterns (via association rule mining), correlate mobility patterns with application usage, and predict location-hotspots (via spectral clustering).
Designed graph partitioning algorithms to prevent the spread of worms in a 3G cellular network.
Botnet Detection via Machine Learning:
Designed an end-to-end architecture for detecting Botnets via the traffic patterns that distinguish them from legitimate clients. The architecture comprises supervised models (Logistic Regression, Logistic Model Trees) learnt offline, with classification performed in real-time. Proposed several features (e.g., PageRank, neighborhood metrics) and conducted feature selection (LASSO).
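A minimal sketch of the offline/online split (the feature names, weights, and bias below are hypothetical, not the learnt model): the linear model is fit offline, and only a sigmoid over a handful of surviving features is evaluated per host in real-time:

```python
import math

# Hypothetical weights learnt offline; an L1/LASSO penalty during training
# zeroes out uninformative features, keeping the online model small.
WEIGHTS = {"pagerank": 3.2, "fanout": 1.1, "bytes_per_flow": -0.8}
BIAS = -2.0

def botnet_score(features: dict) -> float:
    """Real-time scoring: sigmoid of the offline-learnt linear model."""
    z = BIAS + sum(WEIGHTS.get(k, 0.0) * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# A host central to the communication graph with high fan-out scores near 1.
score = botnet_score({"pagerank": 1.5, "fanout": 2.0, "bytes_per_flow": 0.3})
```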
Behavioral Anomaly Detection:
Designed and implemented behavioral anomaly-detection algorithms to detect DDoS attacks, network and port scans, and zero-day Worms using time-series forecasting and information theory (Entropy). Developed and delivered as the integral engine in the Narus Cyber Protection product, which has detected many high-profile cyber attacks (the cyber-war against Estonia and Georgia, Chinese attacks on CNN, Amazon's service derailed by DDoS).
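The entropy part of this approach can be sketched as follows (the forecast value, threshold, and example windows are illustrative assumptions, not the product's tuning): track the entropy of a traffic feature per time window and flag windows that deviate from the forecast.

```python
import math
from collections import Counter

def entropy(items) -> float:
    """Shannon entropy (bits) of the empirical distribution over the items."""
    counts = Counter(items)
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def is_anomalous(window, forecast, threshold=1.0) -> bool:
    """Flag a window whose feature entropy deviates from the time-series
    forecast by more than `threshold` bits."""
    return abs(entropy(window) - forecast) > threshold

# Destination ports per window: a DDoS on one service collapses entropy
# toward zero, while a port scan pushes it far above the forecast.
normal = [80, 443, 80, 53, 443, 80, 22, 443]
ddos = [80] * 8
```

In the real system the forecast itself comes from time-series modeling of past windows rather than a fixed constant.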
Designed and implemented algorithms to detect DNS domain-fluxing as used by Botnets such as Conficker, Kraken and Torpig using information theory (Entropy), Natural Language Processing (Edit distance) and data mining (Logistic Regression).
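The edit-distance component can be sketched as a standard Levenshtein computation (the example strings are illustrative): successive algorithmically generated labels tend to be far apart, while human-chosen names within a group stay close.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```

A high mean pairwise distance among domains mapping to the same IP-addresses is then one signal of domain-fluxing, alongside the entropy and regression features.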
Led a large measurement study of E-mail Spam collected at a Honeypot to discover and classify Spam Campaigns.
VoIP Anomaly Detection:
Developed and implemented a system to detect VoIP anomalies via analysis of SIP traffic (call fraud, call spam, DDoS on SIP proxy or registrar servers).
Searching IP-addresses on the Web:
Designed a vertical search engine to search for and classify IP-addresses into various categories (e.g., DNS-, HTTP-, SIP-, SMTP-, FTP-server) on the basis of the keywords that occur next to an IP-address on the web. Leveraged this engine to classify network traffic into application-protocol categories.
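A minimal sketch of the keyword-based classification (the category-to-keyword map below is an assumption for illustration; in practice it would be derived from search-result snippets around an IP-address):

```python
# Hypothetical keyword sets per category.
CATEGORY_KEYWORDS = {
    "dns-server": {"nameserver", "ns1", "resolver", "bind"},
    "smtp-server": {"mx", "mail", "smtp", "postfix"},
    "http-server": {"apache", "nginx", "webserver", "http"},
}

def classify_snippet(words):
    """Pick the category whose keyword set overlaps the snippet the most."""
    scores = {cat: len(kw & set(words)) for cat, kw in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

label = classify_snippet(["reverse", "resolver", "ns1", "example"])
```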
BGP Prefix Hijacking:
Designed and implemented a system that utilizes BGP routing tables and updates to detect prefix-hijacking and path-spoofing attacks in real-time. Delivered as an integral engine component for Cyber Protection product. The deployed system has correctly detected and alerted on many prefix hijacking attacks in real-time, e.g., YouTube’s prefix hijacking by Pakistan Telecom.
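One core check in hijack detection can be sketched as a baseline origin-AS comparison (a simplified sketch; the baseline table below is illustrative, and the real system consumes live BGP tables and updates): an update announcing a known prefix, or a more-specific subnet of it, from an unexpected origin AS is flagged. This is the signature of the Pakistan Telecom incident, where a /24 inside YouTube's /22 was announced by a foreign AS.

```python
import ipaddress

# Baseline: prefix -> set of legitimate origin ASes, built from stable history.
BASELINE = {ipaddress.ip_network("208.65.152.0/22"): {36561}}

def check_announcement(prefix: str, origin_as: int) -> str:
    """Flag an update whose prefix falls inside a baselined prefix but whose
    origin AS is not among the known legitimate origins."""
    net = ipaddress.ip_network(prefix)
    for known, origins in BASELINE.items():
        if net.subnet_of(known) and origin_as not in origins:
            return "possible-hijack"
    return "ok"
```

Path-spoofing detection additionally inspects the AS-path, not just the origin.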
BGP Traffic Engineering:
Designed and implemented a system to detect BGP traffic engineering issues (sudden loss/gain of prefix reachability) and routing failures (route flaps, convergence issues). Designed a novel way to assign event severity by correlating the traffic engineering alert with change in traffic volumes.
Compressing Network Traffic:
Designed real-time and batch compression-planning algorithms to compress network traffic. Developed schemes to compress traffic-flow meta-data as well as full pcap captures, including headers and payloads.
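One simple building block for compressing flow meta-data is delta encoding of near-monotonic fields (a sketch; the deployed schemes are more involved): timestamps and sequence numbers change by small amounts between records, so storing differences yields small integers that a back-end compressor handles far better than the raw values.

```python
def delta_encode(values):
    """Store the first value, then successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Invert delta_encode by running-sum reconstruction."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

timestamps = [1700000000, 1700000003, 1700000004, 1700000010]
encoded = delta_encode(timestamps)
```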
Internet Data Centers, Server Migration, Content Distribution, and Layer-7 Security:
Wide-Area Dispatcher of Database Queries:
Implemented an event-driven wide-area database dispatcher, written in C. The dispatcher receives database queries from the Apache/PHP front-end and forwards them to a MySQL server. The dispatcher collects network- and server-load measurements from the various MySQL servers and uses them to estimate the "best" server to forward each query to.
Utility Computing Enhancement at Database-Tier:
Implemented a utility-computing module into the database-dispatcher described above. The module collects measurements of client workloads and MySQL server loads and uses a novel resource-allocation algorithm to calculate the number of MySQL servers needed. If more servers are active than needed, then the dispatcher stops sending queries to the surplus servers, and removes them from the database tier. If additional servers are needed, then the dispatcher selects them from the last-removed MySQL servers and starts sending queries to them, after making them consistent.
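The sizing step can be sketched as follows (the capacity model and headroom parameter are assumptions for illustration, not the actual resource-allocation algorithm): compute the smallest server count that keeps per-server load below capacity with a safety margin, then add or retire servers to match.

```python
import math

def servers_needed(query_rate: float, per_server_capacity: float,
                   headroom: float = 0.2) -> int:
    """Smallest number of MySQL servers keeping utilisation below capacity
    with a safety headroom; always keep at least one server active."""
    return max(1, math.ceil(query_rate / (per_server_capacity * (1 - headroom))))
```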
Web-proxy based Detection of Bots vs. Humans:
Developed a reverse web-proxy server (in C) using event-driven programming, non-blocking I/O and sockets, that distinguishes between bots and humans via behavioral profiling and then rate-limits bots automatically. This eliminates the need for CAPTCHAs, which are known to deter legitimate users from signing in.
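A minimal sketch of one behavioral signal, request rate over a sliding window (the window length and threshold are illustrative; the deployed profiler uses richer features than rate alone):

```python
from collections import deque

class BehaviorProfiler:
    """Per-client sliding-window request counter: clients whose request rate
    exceeds a human-plausible bound get rate-limited."""

    def __init__(self, window: float = 10.0, max_requests: int = 20):
        self.window = window            # seconds of history to keep
        self.max_requests = max_requests
        self.history = {}               # client_ip -> deque of timestamps

    def allow(self, client_ip: str, now: float) -> bool:
        q = self.history.setdefault(client_ip, deque())
        while q and now - q[0] > self.window:
            q.popleft()                 # drop timestamps outside the window
        q.append(now)
        return len(q) <= self.max_requests
```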