Team members
This is our video presentation:
If the video does not play, please check the cookie settings in your browser and set them to "Allow cookies".
While working on classifying IoT devices and using federated learning to improve the model, we found that federated learning aggregates all the client models in a centralized way. This led us to an idea: why not try a decentralized way? In this project, we explore how a p2p network affects the efficiency of federated learning in the aggregation process.
Federated learning is widely used in mobile data processing because of its strong security and its performance on distributed systems. Meanwhile, considerable effort has gone into improving it, mainly by overcoming statistical challenges, improving security, and personalizing the model. We try to improve it in a different way: using a p2p network to aggregate the models rather than gathering them centrally in the cloud.
As mentioned above, most work on improving federated learning concerns statistical challenges, security, and personalization. Each direction has its own pros and cons, but the network framework may be a not-yet-fully-explored area for federated learning. Related papers exist in medical imaging, healthcare informatics, generic settings, etc., and some attempts have been made in IoT. However, one focuses on a specific setting such as industrial IoT devices, and another improves the learning only partially. Our project focuses on the overall performance of decentralized federated learning in the classification of IoT data flows.
The final deliverable will be a report with charts and analysis. If necessary, we will also submit our code.
Mar 8: finish and submit the project update (updated project proposal)
Mar 22: implement the code and run the experiments
Apr 5: finish the code and show the demo
Apr 12: submit the project report
Finished the project report on Apr 12.
[1] E. Lear, R. Droms, and D. Romascanu, "Manufacturer usage description specification," IETF RFC 8520, Mar. 2019.
[2] S. Marchal, M. Miettinen, T. D. Nguyen, A. Sadeghi, and N. Asokan, "AuDI: Towards autonomous IoT device-type identification using periodic communication," IEEE Journal on Selected Areas in Communications, pp. 1–1, 2019.
[3] A. G. Roy, S. Siddiqui, S. Pölsterl, N. Navab, and C. Wachinger, "BrainTorrent: A peer-to-peer environment for decentralized federated learning," arXiv preprint arXiv:1905.06731, May 2019.
[4] S. Savazzi, M. Nicoli and V. Rampa, "Federated Learning With Cooperating Devices: A Consensus Approach for Massive IoT Networks," in IEEE Internet of Things Journal, vol. 7, no. 5, pp. 4641-4654, May 2020, doi: 10.1109/JIOT.2020.2964162.
[5] Y. Lu, X. Huang, Y. Dai, S. Maharjan and Y. Zhang, "Blockchain and Federated Learning for Privacy-Preserved Data Sharing in Industrial IoT," in IEEE Transactions on Industrial Informatics, vol. 16, no. 6, pp. 4177-4186, June 2020, doi: 10.1109/TII.2019.2942190.
Links to the reference papers:
The questions below come from the Brightspace Discussion board.
(updated April 20, 2021)
Question: I have a few questions. You mentioned "security standard"... which confuses me a bit. Will you do any survey of the standard? A standard is "a published specification that establishes a common language, and contains a technical specification or other precise criteria and is designed to be used consistently, as a rule, a guideline, or a definition".
Or you would only focus on training a detection model?
The model you talked about seems to be based on the communication pattern. May I ask what your security goal is? What are the risks and attacks in your research, i.e., what would you like to detect?
Do you have any background references (some related works and existing models)?
Answer: 1. By "security standard", we mean the hardware configuration (CPU) and the protocols of IoT devices, such as Modbus, CoAP, MQTT, and XMPP.
2. Right now, we are focusing on training a model.
3. After updating the proposal, we narrowed the topic down to classification. Classification is an early step of anomaly detection.
4. There are five background references and related papers in the updated proposal.
Question: What type of parameters are you going to base your evaluation on?
Answer: The parameters could be those of a trained model.
Question: I wonder if you are going to use some test machine and software to simulate all the models in order to retrieve the results to compare, or do you want to apply a different simulation technique?
Answer: At this point, our plan is to use pre-collected online data and some VMs or a simulator. The approach might change slightly, depending on the situation once we start coding.
Question: What extra steps would be taken to implement anomaly detection once classification is up and running?
Answer: Anomaly detection is the final goal. After classification, we can compare the data of different devices within the same device type, based on the classified data flow; classification helps to improve the accuracy. After classification, the extra steps are federated learning, training the model, and collecting IoT device data to do the anomaly detection.
Question: What parameters are you planning to consider for training the model? Regarding the model refining, how will a node trust that the parameters it is learning from are better than its own?
Answer: The parameters could be the data flow produced by the IoT devices. Model refining will not be considered at this point, because this project only considers classification; the model will be shared randomly among the IoT devices, and there will be a counter for the number of transmissions.
Question: I am a little bit confused about your goals. Based on "An idea comes to us that why not try a decentralized way", you are going to design a new classification method in a decentralized way? Then you said "we will explore how p2p network effects (should be "affects") the efficiency of federated learning in the aggregation process." So you will focus on the efficiency of the classification rather than the accuracy, right?
"we will compare the efficiency of centralized federated learning and decentralized one" -- does that mean you will adopt an existing centralized algorithm from others' work, and then achieve it in a decentralized way?
Besides, federated learning is not just decentralized; it is for protecting the privacy of the data. It seems that your proposal doesn't describe any privacy issues, so I feel confused here. Can you please explain a bit about it?
Answer: We are not going to design a new classification method in a decentralized way, because similar ideas have already been published.
Our current plan is to focus on both efficiency and accuracy; we are going to compare these two aspects in the decentralized and centralized approaches.
The decentralized and centralized approaches are independent of each other.
As for the privacy issues, this project is actually an early stage of a bigger topic, "Anomaly Detection". Accordingly, this project's goal is to protect IoT devices and users' privacy. In more detail, the parameters passed among the nodes will be models, not user information, so users' privacy will not be violated during the classification process.
We initially aim to find a more efficient and accurate method to detect anomalies. In the whole project, the classification of IoT devices is the basis of anomaly detection. Meanwhile, these two processes share the same improvement algorithm, federated learning, which helps protect users' privacy and aggregates information to produce a more robust model. Therefore, we narrowed the scope down to improving the classification process, as it is more feasible than anomaly detection.
After studying p2p networks, we found that the original federated learning is centralized, as shown in the left part of the graph. This gave us the idea of decentralizing federated learning, shown in the right part of the graph. This project aims to compare the two aggregation methods of federated learning: the centralized way and the decentralized way.
As for the privacy issue, federated learning is chosen exactly for privacy protection. In the big picture, the raw data from IoT devices come from users' homes or offices and are privacy-sensitive. So the data are trained locally to produce a local model, in our case the classification model, and only the model parameters are gathered by a centralized host or sent to the neighbors.
Question: Does a typical IoT device have enough resources to run the processes required in a decentralized system AND do its "normal" job at the same time? If so, how close is it to maxing out its resources (cpu, memory, etc) when running in a decentralized system?
Answer: For your first question, this is a good article about the new IoT devices:
https://www.vxchnge.com/blog/iot-statistics
According to that article, millions of new IoT devices are added around the world every day.
For your second question, we have not yet found prior work applying federated learning to a decentralized system of (low-end) IoT devices. Based on our experiments and implementation, we believe the workload depends on the amount of data flow. No doubt a security camera or a webcam will have a higher data flow than a temperature sensor.
Therefore, as we mentioned in the presentation, the differences among IoT devices are very large. For low-volume data-flow devices, if the algorithm and code are simplified and lightweight enough, the detection and the "normal job" can run together.
For high-volume data-flow devices, high-end hardware is required; after all, powerful hardware is the foundation of everything. In recent years, many well-known companies have built IoT devices with high-end components, which are also expensive.
For your third question, it depends on the hardware of the IoT devices. The size, function, and performance of the IoT devices will not differ greatly. The system will try not to max out the resources; if it affects the "normal job", the system will reduce its own performance or shut down.
Question: I have a question regarding your suggested approach: is the communication between nodes in the decentralized setting one-way or two-way? The picture in the presentation depicts one-way. One more thing: how is accuracy measured, i.e., which parameters are used to measure it? I could only see time efficiency in the graph.
Answer: For your first question, it is one-way, but the model will loop through all the nodes multiple times to do the machine learning.
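To make the loop concrete, here is a minimal sketch (not our exact experiment code) of a one-way ring, assuming each node holds a local data slice and the single shared model is updated in place with sklearn's partial_fit; the synthetic data and round counts are only illustrative.

```python
import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.default_rng(1)
# Each ring node holds its own local slice of (features, labels).
nodes = [(rng.random((200, 8)), rng.integers(0, 2, 200)) for _ in range(5)]

model = Perceptron()
classes = np.array([0, 1])

# One-way ring: the model visits node 0 -> 1 -> ... -> 4, then loops again.
for _ in range(10):                      # number of full trips around the ring
    for X_local, y_local in nodes:
        model.partial_fit(X_local, y_local, classes=classes)
```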
For your second question, in theory most of it would be accurate under the conditions we set. However, real-world conditions would be much more complicated. In the future, if we had more resources and equipment, we would do some real-world tests, which would be much more accurate than this.
Question: Can you give a bit more info about the dataset, so others can understand results and impacts better?
Answer: Yes, we also noticed that the dataset part is missing from the presentation; more details will be added to the report.
In short, we chose a dataset containing data-flow characteristics and labels and trained the classifier with a perceptron. Federated learning then aggregates the parameters from the models trained by the local nodes. With different aggregation methods, the final models have different performances; these are the result graphs in the presentation.
Question: I would like to know about the decentralized network topology. Why did you select a ring topology rather than a mesh topology? And if a link fails, how will the model handle that condition?
Answer: For your first question, we just wanted to test a decentralized situation, and a ring topology is easy to implement. There is also a huge number of topologies that could be tested, but this project does not aim to compare the performance of different topologies; it tests the difference between the centralized and decentralized approaches.
For the second question, that is true; the ring topology is just a simple example of a decentralized system. But you give us a good idea for future work: we could apply different topologies to the system.
Question: I am wondering why federated learning is traditionally done with a central server? Could it have to do with security? What are the chances that a decentralized approach is more vulnerable to security risks or tampering? For example, if an IoT device in the pool is told to do malicious things and mess with the process? Would a centralized server be more resistant to these risks or no? Or am I just way off the mark?
Answer: A centralized server has its advantages: it is easy to manage and control each node connected to the center, and the center takes on most of the computation, leaving less workload on the nodes.
There could also be some security considerations.
As you mentioned, the decentralized approach is just an approach; it depends on how people use it. It can have drawbacks, such as being hard to manage.
In this project, the decentralized approach is used to classify IoT devices by their data flow. This process is a precondition for malware detection, which is a larger-scale topic.
This project tests the centralized and decentralized approaches to classifying IoT devices by their data traffic. As the professor taught us in class, a centralized server has its drawbacks; for example, if it goes down, the whole system goes down.
Question: In your video, Tianming is saying that there are privacy issues because of centralized learning, however, Zheng’an in his part of the video said that privacy was protected because the model was sent, rather than the data. Can you explain how the privacy is worse with centralized learning? In my mind, pooling updates at a central server would be great for diluting use patterns (even single interactions) that may be captured in the update.
Answer: The privacy is protected by federated learning, which is a machine learning technique. The point of federated learning is to protect privacy: it shares/exchanges trained models, but not data. Federated learning is a very broad topic, and there are many variants of it. To help you understand federated learning, this paper might be helpful:
https://arxiv.org/abs/1907.09693
As you mentioned, the privacy issue is caused by the centralized system. Compared to a decentralized system, the centralized system offers less privacy; after all, the center can hold all the models. Just like in the lectures, the decentralized approach has better privacy.
Question: Could you please describe your contributions more clearly? Based on the Introduction and Result sections, you are "practicing" control-group, centralized, and decentralized federated learning algorithms on classification systems. What are the exact names of the algorithms? (You see, there are so many different federated learning algorithms in publications.)
Is the code all written by your team? If some parts are reused, which parts?
Answer: More specifically, the contributions in this project are:
1. Researching and investigating the development of IoT, the structure and attack pattern of the Mirai malware, and machine learning algorithms. For the machine learning algorithms, we used the concept of federated learning and the "Perceptron" for classification.
2. Practicing the machine learning algorithm and the concept of IoT device classification based on the devices' network data traffic. We use real network traffic, classify the data with the "Perceptron" algorithm, and then compare the algorithm's results with the real data's labels, which gives the accuracy.
3. Practicing the centralized and decentralized topologies and comparing them in accuracy and time efficiency.
The algorithm we use for classification is the "Perceptron", which can be imported from "sklearn.linear_model" in Python. "Federated learning" is a bigger, higher-level concept; you can think of it as a "framework". The main idea of federated learning is to train a machine learning model across multiple devices without sharing their local data, for security and efficiency reasons. People can use different machine learning algorithms or different topologies to realize federated learning, which is why there are so many different applications of federated learning in publications. We use the "Perceptron" in this project because it is a good algorithm for the classification job. It is also easy to use: we can import it from "sklearn.linear_model" in Python and it is ready to go, and there is a lot of documentation online. The "Perceptron" also produces "predict" results for us to compare with the pre-labeled data, which is convenient.
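As an illustration of the workflow described above, here is a minimal sketch using sklearn's Perceptron; the synthetic features stand in for the real data-flow records, and the variable names are ours.

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

# Stand-in for the real flow features (ports, duration, packet sizes, ...).
rng = np.random.default_rng(0)
X = rng.random((1000, 8))                   # 1000 flow records, 8 numeric features
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)   # synthetic binary device label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = Perceptron(max_iter=1000, tol=1e-3, random_state=0)
clf.fit(X_train, y_train)

# Compare the "predict" results with the pre-labeled data to get the accuracy.
print("accuracy:", clf.score(X_test, y_test))
```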
In the code, there are a few reused functions for loading data and data pre-processing. The experiment and the remaining parts were implemented by us, originally for this project. The few reused pieces of code are labeled: comments in the code show which parts are "reused code".
Thank you for your questions and comments.
We have now updated the project report and code on the project website.
The federated learning algorithm is a general framework for distributed systems. Here is the paper where federated learning was first introduced: Communication-efficient learning of deep networks from decentralized data, http://proceedings.mlr.press/v54/mcmahan17a.html. Although there are many federated learning algorithms, they are all derivatives of this paper for different uses.
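For reference, the aggregation step of that paper's FedAvg algorithm can be written as

$$w_{t+1} \;\leftarrow\; \sum_{k=1}^{K} \frac{n_k}{n}\, w_{t+1}^{k},$$

where $K$ is the number of clients, $n_k$ is the number of samples held by client $k$, $n = \sum_k n_k$, and $w_{t+1}^{k}$ are client $k$'s locally updated weights.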
In this project, we use a perceptron to train the model and obtain the parameters, and we use federated learning to gather the parameters and generate a new model. The perceptron works as in this picture: it adjusts its parameters while training on the labelled data.
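As a sketch of that aggregation step, assuming each node trains its own sklearn Perceptron and the exchanged parameters are the model's coef_ and intercept_ arrays (the helper name federated_average and the equal per-client weighting are our simplifications):

```python
import numpy as np
from sklearn.linear_model import Perceptron

def federated_average(local_models):
    """Build a global Perceptron whose weights and biases are the plain
    average of the locally trained models' parameters."""
    global_model = Perceptron()
    global_model.coef_ = np.mean([m.coef_ for m in local_models], axis=0)
    global_model.intercept_ = np.mean([m.intercept_ for m in local_models], axis=0)
    global_model.classes_ = local_models[0].classes_  # all nodes share one label set
    return global_model
```

The returned model can then be used for prediction, since the Perceptron's predictions depend only on coef_, intercept_, and classes_.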
Most of the code is original, while three functions come from my previous project's code. I have commented on them and uploaded the new version to Google Drive. I hope this answers your questions.
Question: Nice website. Some writing and formatting issues in the report. Can you give a bit more info about the dataset (e.g., columns/features) than the URL, and explain the learning results (which features are dominating, etc.)? Similarly, the reason for the speedup of distributed learning. Trust you will push further after the class.
Answer: Thank you for your questions and comments.
In short, the dataset has many columns, such as source port, destination port, flow duration, total forward packets, total backward packets, packet length, packet size, etc. The dataset contents are integer and decimal numbers, which is also what the "Perceptron" algorithm requires as input. For classification, the packet size and duration are slightly more dominant than the ports.
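As a hedged illustration of loading such a flow dataset for the Perceptron (the file name and exact column headers below are placeholders, not the real dataset's headers):

```python
import pandas as pd

# Placeholder file and column names; the real dataset's headers differ.
df = pd.read_csv("flows.csv")

feature_cols = [
    "source_port", "destination_port", "flow_duration",
    "total_fwd_packets", "total_bwd_packets", "packet_size",
]
X = df[feature_cols].to_numpy(dtype=float)  # Perceptron needs numeric input
y = df["label"].to_numpy()
```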
In short, the speedup of distributed learning comes from the fact that each node does not have to wait for the central server. Therefore, distributed learning has higher time efficiency than centralized learning.
Sure, we will push further based on this project and try to achieve better work.
The project report and website have now been modified based on your questions and comments.
This is our experiment code and dataset:
In the code, the "reused" parts are labeled.