Some links:
The project's GitHub page: https://github.com/oberljn/iot_botnet_detection.
A related, ongoing project of mine on replaying and stream processing packet capture: https://github.com/oberljn/pcap_streaming. Eventually, the streaming project will be implemented with the analytics project in this course.
The Internet of Things is a domain of various networking and messaging protocols across edge devices, gateways, and data systems, with applications from remote weather stations to smart cities. IoT's promise of near-pervasive knowledge gives rise to self-aware environments, in which artificial intelligence and machine learning have a central role. That role covers not only the obvious in-context decision making but also the detection of cyber attacks. The IoT architecture has three not-always-so-distinct levels: devices at the edge, gateways that at minimum manage device messaging, and data systems for storage and heavy compute. The farther "out" toward the edge, the more device power, compute, data rate, and connection reliability become limiting factors. The hardware constraints limit the use of resource-dependent intelligent systems at the edge, while the network constraints affect how quickly a decision from a cloud-based intelligent system reaches the IoT device layer (Chaabouni, Mosbah, Zemmari, Sauvignac, and Faruki, 2019). Latency is an important factor for semi-autonomous vehicles and other IoT applications in which the sensor-compute-actuator loop requires low latency for safety (Malik, 2019).
Aside from physical threats such as man-made and natural disasters, attacks in the cyber realm exploit vulnerabilities in the protocols and operating systems of devices. As in other IT domains, IoT systems can fall victim to denial of service (DoS), ransomware, and man-in-the-middle attacks. They are also targeted in false data injection attacks, which may trigger undesirable or even lethal effects in the environment, such as in industrial sites where sensor devices control actuators. Distributed IoT networks are also targeted in order to infect groups of devices for use in distributed DoS (DDoS) attacks against other networks. This paper investigates the detection of the initial phase of establishing and growing a botnet: host discovery and credential collection via brute-force dictionary attacks. The second phase is employing the botnet in DDoS attacks on targets, the detection of which is outside the scope of this paper. Although the initial phase could exploit various protocols for establishing a botnet, malware such as Mirai and BASHLITE exploits Telnet on IoT devices (Antonakakis, April, Bailey, Bernhard et al., 2017; Kumar and Lim, 2019).
One of the primary data sources for detection is packet capture, often stored as PCAP files. The primary features of packet capture include timestamp, source IP, destination IP, protocol, packet length (size), and packet payload data. While packet capture can include every packet that crosses a network switch, netflow is a meta-level summary that aggregates packets into sessions between source and destination. Detecting attacks by analyzing traffic is a network-based approach. A host-based approach instead uses data streams such as device system calls and system stats such as CPU and memory usage. A combination of data sources may be desirable, but this must be balanced against IoT's more constrained storage.
Analysis of system logs and packet capture is a streaming big data problem. Kumar and Lim (2019) estimated that a 10,000-device IoT system transmitting at 10% of the peak data rate of 250 kbit/s for five minutes would need more than 9 GB of storage. (Also see the IEEE 802.15.4 standard for low-rate wireless personal area networks.) Thus, sampling and filtering must be considered in a detection solution. There must be a balance between keeping storage costs down and retaining enough data for cyber analysts and forensic teams to investigate attacks. Data sampled for processing must have an upper bound that matches the compute resources available.
The sampling of traffic should be distributed across devices. Priority may be given to particular devices, however, by first filtering out non-IoT device types via the User-Agent field of the HTTP header, as done by Kumar and Lim (2019) when applying detection in large IoT networks. For example, Chrome on an iPhone sends the following User-Agent string:
Mozilla/5.0 (iPhone; CPU iPhone OS 13_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/83.0.4103.88 Mobile/15E148 Safari/604.1
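The User-Agent filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the marker substrings, the record layout, and the function names are hypothetical and are not taken from Kumar and Lim (2019). Records with no User-Agent are kept, on the assumption that many IoT devices never send one.

```python
# Hypothetical marker list: substrings that suggest a phone, tablet, or
# desktop browser rather than an IoT device. Chosen for illustration only.
NON_IOT_MARKERS = ("iPhone", "iPad", "Android", "Windows NT", "Macintosh", "X11")


def looks_like_non_iot(user_agent):
    """Return True when the User-Agent contains a known non-IoT marker."""
    return any(marker in user_agent for marker in NON_IOT_MARKERS)


def keep_candidate_iot(records):
    """Keep records whose User-Agent is missing or has no non-IoT marker."""
    return [r for r in records
            if not looks_like_non_iot(r.get("user_agent") or "")]


IPHONE_UA = ("Mozilla/5.0 (iPhone; CPU iPhone OS 13_5 like Mac OS X) "
             "AppleWebKit/605.1.15 (KHTML, like Gecko) "
             "CriOS/83.0.4103.88 Mobile/15E148 Safari/604.1")

# Made-up records; the IPs come from a documentation-only address range.
records = [
    {"src_ip": "192.0.2.10", "user_agent": IPHONE_UA},
    {"src_ip": "192.0.2.11", "user_agent": None},
]
candidates = keep_candidate_iot(records)
print([r["src_ip"] for r in candidates])
```

A substring list like this is easy to evade and to outgrow, so it is best treated as a cheap pre-filter ahead of the sampling step rather than as a device classifier.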
Sivanathan, Gharakheili, and Sivaraman (2020) applied inference models on netflow to distinguish IoT devices from non-IoT devices and even to identify individual types of IoT devices. As Kumar and Lim (2019) point out, filtering also helps reduce false positives.
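The storage estimate cited earlier from Kumar and Lim (2019) can be checked with simple arithmetic. The sketch below assumes that every transmitted bit is stored, that 1 kbit is 1,000 bits, and that 1 GB is 10^9 bytes. The function name and the sample_fraction parameter are illustrative additions, not part of the cited work.

```python
def storage_bytes(devices, peak_rate_kbit_s, utilization, seconds,
                  sample_fraction=1.0):
    """Bytes needed to store the traffic, optionally after sub-sampling."""
    bits = devices * peak_rate_kbit_s * 1000 * utilization * seconds
    return bits * sample_fraction / 8


# 10,000 devices at 10% of 250 kbit/s for five minutes.
full_capture_gb = storage_bytes(10_000, 250, 0.10, 5 * 60) / 1e9
print(f"full capture: {full_capture_gb:.3f} GB")

# Keeping one packet in ten (a hypothetical sampling rate) scales linearly.
sampled_gb = storage_bytes(10_000, 250, 0.10, 5 * 60, sample_fraction=0.1) / 1e9
print(f"10% sample:   {sampled_gb:.3f} GB")
```

Under these assumptions the full capture comes to about 9.4 GB, consistent with the "more than 9 GB" figure, and the linear scaling shows why a sampling rate is the most direct lever on storage cost.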
Network-based attack detection can use whitelist or blacklist signature approaches. Whitelist signatures are models of how a protocol is expected to perform. For example, a whitelist signature can model a successful TCP three-way handshake. Anomalous behavior, such as hundreds of open, hanging handshakes, may be evidence of a SYN (synchronize) packet flood DoS attack. It would be of interest to research and build whitelist signatures for IoT protocols, such as LoRaWAN, MQTT, and CoAP, that can be employed in concert with blacklist signatures. Blacklist signatures model known malware. This paper investigates the signatures of the initial phases of the botnet malware class that includes Mirai, BASHLITE, Remaiten, Hajime, Okiru, and Torii. These malware families use Telnet, a protocol used to log into and configure devices remotely (Herwig, Harvey, Hughey, Roberts et al., 2019). First, a quick overview of the initial, cyclical stage of a botnet is required (Antonakakis, April, Bailey, Bernhard et al., 2017; Kumar and Lim, 2019):
A device (or machine) becomes infected with the malware, either through a phishing campaign that tricks a user into downloading the malware or via another infected device.
The infected device connects to the command and control (CnC) server.
The device scans the network and probes random IP addresses by sending TCP SYN packets.
If it finds another device with an open Telnet port, it attempts to create a connection and log in by trying a dictionary of default or common passwords (brute-force attack).
If it gains access, it sends the device IP and credentials to a server, which logs into the newly discovered vulnerable device, uploads the botnet malware binary, deletes the file, and obscures the system process name.
The newly infected device begins the same process of discovery and credential collection.
Meanwhile, all infected devices wait for the daily config file, through which the CnC can direct the network to launch attacks.
Two main network-based angles exist for detecting botnets: identifying peer-to-peer (P2P) botnet activity by looking at malicious signatures in the traffic between devices, and identifying activity between a device and the CnC server. With this in mind, botnet signatures include:
TCP SYN packets between devices used in the scanning stage.
Telnet communications between devices would include SYN packets with destination port 23, the standard Telnet port (Internet Assigned Numbers Authority, 2020). Port 2323 has also been observed in Mirai (Alvarez, 2016). And Herwig et al. (2019) list 5258 as a Telnet alternative.
Keep-alive exchanges using PSH/ACK (push and acknowledgement) packets between the bot and the CnC server.
Kumar and Lim (2019) claim the scanning signature is enough to provide certainty, but the keep-alive signature is helpful for malware that does not use Telnet port scanning.
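The scanning signature above can be expressed as a simple rule over parsed packets. The following is a minimal sketch, not the method of any cited paper: the record layout, the function names, and the threshold of three distinct targets are hypothetical choices for illustration. The port set holds the two Telnet ports named above (23 and 2323) and can be extended with other alternatives.

```python
from collections import defaultdict

TELNET_PORTS = {23, 2323}   # ports named in the text; extend as needed
MIN_DISTINCT_TARGETS = 3    # hypothetical threshold, for illustration only


def is_telnet_syn_probe(pkt):
    """True for a bare SYN (no ACK) addressed to a Telnet port."""
    return pkt["flags"] == {"SYN"} and pkt["dst_port"] in TELNET_PORTS


def suspected_scanners(packets, min_targets=MIN_DISTINCT_TARGETS):
    """Return sources that sent Telnet SYN probes to many distinct hosts."""
    targets = defaultdict(set)
    for pkt in packets:
        if is_telnet_syn_probe(pkt):
            targets[pkt["src"]].add(pkt["dst"])
    return {src for src, dsts in targets.items() if len(dsts) >= min_targets}


# Made-up packets; all addresses come from documentation-only ranges.
packets = [
    {"src": "192.0.2.5", "dst": "198.51.100.1", "dst_port": 23, "flags": {"SYN"}},
    {"src": "192.0.2.5", "dst": "198.51.100.2", "dst_port": 2323, "flags": {"SYN"}},
    {"src": "192.0.2.5", "dst": "198.51.100.3", "dst_port": 23, "flags": {"SYN"}},
    {"src": "192.0.2.9", "dst": "198.51.100.1", "dst_port": 23, "flags": {"SYN"}},
    {"src": "192.0.2.9", "dst": "198.51.100.1", "dst_port": 443, "flags": {"SYN", "ACK"}},
]
print(suspected_scanners(packets))
```

Counting distinct destinations rather than raw SYN packets keeps a single retried Telnet session from being mistaken for a scan.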
The research on AI/ML approaches to attack detection is diverse, and some solutions do not require ML. For example, Joshua Saxe provides examples of traffic pattern visualizations that aid the cyber analyst in identifying anomalous behavior (Saxe, 2015). Machine learning solutions continuously learn traffic patterns and can identify variations in attacks, whereas heuristic detection can be evaded by altering attack patterns (Chaabouni et al., 2019). Nguyen, Ngo, Nguyen, and Le (2020) lay out a number of detection approaches and describe their novel approach: a subgraph classifier. They point out that while deep learning has shown success in detection, its modeling stage may be too compute-intensive for the edge, often needing GPUs. However, other researchers, projects (uTensor), and services (AWS IoT Greengrass) have explored keeping the compute-heavy training at the data center level (cloud or local) and pushing inference models optimized for smaller devices to the device gateway or even the sensor devices themselves. The uTensor project takes models trained in TensorFlow and generates a C++ file for inferencing. The minimum device requirements are a meager 128 KB of RAM and 512 KB of flash memory. Researchers have also explored quantization of deep learning models, achieving nearly the same results in deep learning applications while reducing the precision of weights from 32 bits to 16 or even 8 bits (Qi and Liu, 2018). Natural language processing (NLP) has also been applied to packets, treating detection as an unstructured language problem.
This paper will first look at using decision tree algorithms for detecting initial-stage botnet activity, that is, determining whether an IP (a device) is infected. The J4.8 classifier produced the best detection results when compared against other ML approaches (Mckay, Pendleton, Britt and Nakhavanit, 2019).
The J4.8 algorithm is preferred over random forest, as the former has human-readable output, which is important for the cyber analyst to understand how the detection system is making decisions.
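To illustrate why a decision tree is readable, the sketch below computes the gain ratio, the split criterion of the C4.5 family of trees, on a tiny made-up table. It assumes that J4.8 refers to Weka's C4.5 implementation. The rows, feature names, and label name are invented for illustration and are not drawn from any of the datasets discussed here.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def gain_ratio(rows, feature, label="infected"):
    """Information gain of a split on `feature`, divided by its split info."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[label])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    info_gain = entropy([row[label] for row in rows]) - remainder
    split_info = entropy([row[feature] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


# Made-up training rows: one row per observed source IP.
rows = [
    {"dst_port_23": True, "proto": "TCP", "infected": True},
    {"dst_port_23": True, "proto": "TCP", "infected": True},
    {"dst_port_23": False, "proto": "TCP", "infected": False},
    {"dst_port_23": False, "proto": "UDP", "infected": False},
]
best = max(["dst_port_23", "proto"], key=lambda f: gain_ratio(rows, f))
print(best, round(gain_ratio(rows, best), 3))
```

On this toy table the root split reads as a plain rule ("sent traffic to port 23"), which is the kind of output an analyst can audit directly.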
Security data science requires a few lab tools. PCAP files can be analyzed manually (point-and-click) via Wireshark, or via its command line version, tshark. tcpdump is a common command line tool for packet capture and filtering. Because this is in reality a streaming data problem, network traffic can be replayed with a tool called tcpreplay. It replays the packets on the interface while one of the aforementioned tools captures the traffic and sends it on for processing, for example, via Kafka topics. The datasets collected for this paper so far that include benign and labeled malicious IoT traffic are:
IoT POT
VirusShare
IoT Network Intrusion Dataset at IEEE DataPort
Aposemat IoT-23
Each claims to include simulated botnet activity. At first look, the Mirai example in the IoT-23 dataset does not seem to use Telnet. It does, however, have examples of heartbeat communications, which a CnC server uses to keep track of an infected host (Parmisano, Garcia, and Erquiaga, 2020). This paper will also attempt to apply a decision tree classifier to multiple botnet datasets to see how well it performs.
The best packet features for detection are expected to be those mentioned in the signatures: port, protocol, and TCP flags. However, this paper will attempt to collect a broader array of features and examine them via a covariance matrix and principal component analysis. The IP address itself is not expected to be useful as a feature, as it functions more as an identifier of the device/machine. And in the case of the IoT-23 Mirai botnet dataset, there is only one infected device/IP. While a Wireshark or tshark column export has separate fields for source IP, destination IP, protocol, and packet length, the port and flag values appear only inside the "Info" field of that export and must be extracted using regular expressions (tshark can alternatively emit them directly as fields such as tcp.srcport, tcp.dstport, and tcp.flags). An Info field example from Wireshark:
443 → 54811 [SYN, ACK] Seq=0 Ack=1 Win=8192 Len=0 MSS=1440...
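A regular expression for this extraction might look like the sketch below. The pattern and the function name are illustrative and only cover TCP-style Info strings of the form shown above. Other protocols format the Info field differently and return None here.

```python
import re

# Source port, an arrow (Unicode or ASCII variants), destination port,
# then the bracketed flag list. Illustrative pattern, TCP Info strings only.
INFO_PATTERN = re.compile(
    r"^(?P<src_port>\d+)\s*(?:\u2192|->|>)\s*(?P<dst_port>\d+)\s*"
    r"\[(?P<flags>[^\]]*)\]"
)


def parse_info(info):
    """Extract ports and TCP flags from an Info string, or return None."""
    match = INFO_PATTERN.match(info.strip())
    if match is None:
        return None
    flags = [f.strip() for f in match.group("flags").split(",") if f.strip()]
    return {
        "src_port": int(match.group("src_port")),
        "dst_port": int(match.group("dst_port")),
        "flags": flags,
    }


sample = "443 \u2192 54811 [SYN, ACK] Seq=0 Ack=1 Win=8192 Len=0 MSS=1440"
print(parse_info(sample))
```

Returning None for non-matching rows lets the caller decide whether to drop them or handle other protocols with separate patterns.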
A hex stream representation of the packet can also be output, which may reduce processing time but is not human readable:
d053492f5cf6b02a43cb20b4080045000034647140007306d75d34728046c0a8569401bbd61b0957193970399ecc8012200079b20000020405a00103030801010402
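Although the hex stream is not human readable, the same features can be recovered from fixed header offsets. The sketch below decodes the example above, assuming an untagged Ethernet II frame carrying IPv4 and TCP. The function name and the returned field names are illustrative.

```python
import socket
import struct

TCP_FLAG_BITS = [("FIN", 0x01), ("SYN", 0x02), ("RST", 0x04),
                 ("PSH", 0x08), ("ACK", 0x10), ("URG", 0x20)]


def decode_ipv4_tcp(hex_stream):
    """Decode addresses, ports, flags, and window from an Ethernet frame.

    Returns None when the frame is not IPv4 over Ethernet II or not TCP.
    """
    raw = bytes.fromhex(hex_stream)
    (ether_type,) = struct.unpack("!H", raw[12:14])
    if ether_type != 0x0800:                # not IPv4
        return None
    ip = 14                                 # Ethernet II header is 14 bytes
    ip_header_len = (raw[ip] & 0x0F) * 4    # IHL is counted in 32-bit words
    if raw[ip + 9] != 6:                    # IP protocol 6 is TCP
        return None
    tcp = ip + ip_header_len
    src_port, dst_port = struct.unpack("!HH", raw[tcp:tcp + 4])
    (window,) = struct.unpack("!H", raw[tcp + 14:tcp + 16])
    flag_byte = raw[tcp + 13]
    return {
        "src_ip": socket.inet_ntoa(raw[ip + 12:ip + 16]),
        "dst_ip": socket.inet_ntoa(raw[ip + 16:ip + 20]),
        "src_port": src_port,
        "dst_port": dst_port,
        "flags": [name for name, bit in TCP_FLAG_BITS if flag_byte & bit],
        "window": window,
    }


HEX_STREAM = ("d053492f5cf6b02a43cb20b4080045000034647140007306d75d"
              "34728046c0a8569401bbd61b0957193970399ecc8012200079b2"
              "0000020405a00103030801010402")
print(decode_ipv4_tcp(HEX_STREAM))
```

For this frame the decoded ports, flags, and window (443, 54811, SYN and ACK, 8192) match the Info field example shown earlier, so the two representations describe the same packet.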
Above: First 25 rows of one of the IoT-23 Mirai PCAP files converted to CSV and imported into Python Pandas.
Below: The same set filtered to the infected IP as source and destination, including counts of packets per protocol.
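The filter and per-protocol count described in the caption can be sketched with the standard library alone. The column names follow Wireshark's default CSV export. The rows and the IP addresses below are made up for illustration and do not come from the IoT-23 dataset.

```python
import csv
import io
from collections import Counter

# Made-up rows in the shape of a Wireshark CSV export.
SAMPLE_CSV = '''"No.","Time","Source","Destination","Protocol","Length","Info"
"1","0.000000","192.0.2.5","198.51.100.7","TCP","74","40000 > 23 [SYN] Seq=0"
"2","0.000210","198.51.100.7","192.0.2.5","TCP","74","23 > 40000 [SYN, ACK] Seq=0 Ack=1"
"3","0.100000","192.0.2.8","192.0.2.1","DNS","82","Standard query"
"4","0.200000","192.0.2.5","192.0.2.1","DNS","82","Standard query"
'''


def protocol_counts(csv_text, host_ip):
    """Count packets per protocol where host_ip is the source or destination."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(
        row["Protocol"]
        for row in reader
        if host_ip in (row["Source"], row["Destination"])
    )


counts = protocol_counts(SAMPLE_CSV, "192.0.2.5")  # hypothetical infected host
print(counts.most_common())
```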
Alvarez, Michelle. "Consequences of IoT and Telnet: Foresight Is Better Than Hindsight." Security Intelligence. (November, 2016). https://securityintelligence.com/consequences-of-iot-and-telnet-foresight-is-better-than-hindsight
Antonakakis, Manos, Tim April, Michael Bailey, Matt Bernhard, Elie Bursztein, Jaime Cochran, Zakir Durumeric, J. Alex Halderman, Luca Invernizzi, Michalis Kallitsis, Deepak Kumar, Chaz Lever, Zane Ma, Joshua Mason, Damian Menscher, Chad Seaman, Nick Sullivan, Kurt Thomas and Yi Zhou. “Understanding the Mirai Botnet.” USENIX Security Symposium (2017).
Chaabouni, Nadia, Mohamed Mosbah, Akka Zemmari, Cyrille Sauvignac and Parvez Faruki. “Network Intrusion Detection for IoT Security Based on Learning Techniques.” IEEE Communications Surveys & Tutorials 21 (2019): 2671-2701.
Herwig, Stephen, Katura Harvey, George Hughey, Richard Roberts and Dave Levin. “Measurement and Analysis of Hajime, a Peer-to-peer IoT Botnet.” NDSS (2019).
Internet Assigned Numbers Authority. "Service Name and Transport Protocol Port Number Registry". (June, 2020). https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml
Kumar, Ayush, and Teng Joon Lim. 2019. "Early Detection Of Mirai-Like IoT Bots In Large-Scale Networks Through Sub-Sampled Packet Traffic Analysis." Department of Electrical and Computer Engineering, National University of Singapore.
Malik, Bill. "Attack Vectors in Orbit: The Need for IoT and Satellite Security." RSA Conference 2019. (May 2019). https://youtu.be/HxBhTXN2LcQ
Mckay, Rob, Brian Pendleton, James O Britt and Ben Nakhavanit. “Machine Learning Algorithms on Botnet Traffic: Ensemble and Simple Algorithms.” Proceedings of the 2019 3rd International Conference on Compute and Data Analysis. (2019).
Nguyen, Huy-Trung, Quoc-Dung Ngo, Doan-Hieu Nguyen, and Van-Hoang Le. 2020. “PSI-Rooted Subgraph: A Novel Feature for IoT Botnet Detection Using Classifier Algorithms.” ICT Express 6 (2): 128–38. doi:10.1016/j.icte.2019.12.001.
Parmisano, Agustin, Sebastian Garcia, and Maria Jose Erquiaga. "A labeled dataset with malicious and benign IoT network traffic." Stratosphere Laboratory. (January 2020). https://www.stratosphereips.org/datasets-iot23
Qi, Xuan, and Chen Liu. 2018. "Enabling Deep Learning on IoT Edge: Approaches and Evaluation." 2018 IEEE/ACM Symposium on Edge Computing (SEC), January. doi:10.1109/SEC.2018.00047.
Saxe, Joshua. "Why Security Data Science Matters And How It's Different." Black Hat. (December 2015). https://youtu.be/PQSuIfcu-So
Sivanathan, A., H. Habibi Gharakheili, and V. Sivaraman. 2020. “Managing IoT Cyber-Security Using Programmable Telemetry and Machine Learning.” IEEE Transactions on Network and Service Management 17 (1): 60–74. doi:10.1109/TNSM.2020.2971213.