Learning Machine Learning
Problem solving in data analysis and machine learning (6 minute read)
Thuy Do is an assistant professor of Computer Science.
How would you summarize your research for curious non-experts?
My research currently has two directions: optimization problems in decentralized computing environments (e.g., Federated Learning and Edge Computing) and machine learning applications.
Federated Learning (FL) is a recent Machine Learning method for training on private data that is stored separately on local machines, without gathering it into one place for central learning. It was born to address the following challenges when applying Machine Learning in practice: (1) Communication cost: Most real-world data that can be useful for training is collected locally; bringing all the data to one place for central learning can be expensive, especially in real-time learning applications when time is of the essence, for example, predicting the next word when texting on a smartphone; and (2) Privacy protection: Many applications must protect data privacy, such as those in the healthcare field; the private data can only be seen by its local owner, and as such the learning may only use a content-hiding representation of this data, which is much less informative. To fulfill FL’s promise, I address two important problems regarding the need for good training data and system scalability:
1. The effectiveness of FL depends critically on the quality of the local training data. We should not only incentivize participants who have good training data, but also minimize the effect of bad training data on the overall learning procedure. Here, my research is to determine a score to value a participant’s contribution.
Photograph by Dominique Stringer.
2. On scalability, FL depends on a central server for repeated aggregation of local training models, which is prone to become a performance bottleneck. A reasonable approach is to combine FL with Edge Computing: introduce a layer of edge servers, each serving as a regional aggregator, to offload the main server. Scalability is thus improved, but at the cost of learning accuracy. In this topic, my research is to optimize this tradeoff. That is, the cost can be alleviated with a proper choice of edge server assignment: which edge servers should aggregate the training models from which local machines.
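The two-tier aggregation described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming each local machine reports a parameter vector together with its sample count, and that aggregation is a sample-weighted average (the common federated averaging scheme). The function names, client names, and numbers are hypothetical and are not taken from the research described in this interview.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Model = Tuple[Vector, int]  # (parameter vector, number of training samples)


def weighted_average(models: Sequence[Model]) -> Model:
    """Average parameter vectors, weighting each by its sample count."""
    total = sum(n for _, n in models)
    dim = len(models[0][0])
    avg = [sum(w[i] * n for w, n in models) / total for i in range(dim)]
    return avg, total


def hierarchical_aggregate(client_models: Dict[str, Model],
                           assignment: Dict[str, str]) -> Vector:
    """Two-tier aggregation: local machines -> edge servers -> central server."""
    by_edge: Dict[str, List[Model]] = {}
    for client, edge in assignment.items():
        by_edge.setdefault(edge, []).append(client_models[client])
    # Each edge server aggregates only the machines assigned to it.
    edge_models = [weighted_average(group) for group in by_edge.values()]
    # The central server then aggregates the (far fewer) edge models.
    global_model, _ = weighted_average(edge_models)
    return global_model


# Hypothetical toy input: only model parameters leave each machine, never raw data.
client_models = {
    "phone_a": ([1.0, 2.0], 10),
    "phone_b": ([3.0, 4.0], 30),
    "phone_c": ([5.0, 6.0], 60),
}
# Hypothetical edge assignment: which edge server aggregates which machine.
assignment = {"phone_a": "edge_1", "phone_b": "edge_1", "phone_c": "edge_2"}
global_model = hierarchical_aggregate(client_models, assignment)
```

In this single-round sketch the central server receives two edge models instead of three local ones, which is the offloading effect. The result here equals a flat average over all machines, so the sketch does not model the accuracy cost; that cost typically appears when edge servers run several aggregation rounds on their own region before synchronizing with the central server.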
For machine learning applications, one project I have done in this direction is to understand public opinion on using hydroxychloroquine (HCQ) for COVID-19 treatment (H4C) via social media. I studied the reactions of social network users to H4C by analyzing the reaction patterns and sentiment of tweets posted in 2020. Data analysis and visualization tools are used to explore the reaction patterns, while machine learning algorithms are used to predict the sentiment of each tweet.
How did you get started on this project?
When the COVID-19 pandemic hit us, there were some discussions on the effectiveness of using HCQ in treating COVID-19 in some cases, but no clinical trials with sufficiently large cohorts had provided concrete evidence of that effectiveness. Nevertheless, using HCQ for COVID-19 treatment quickly became a hot topic dominating social media and news. This misleading information may put pressure on healthcare systems and society. For example, demand for the drug may escalate, making it unavailable to the patients for whom it is prescribed. Understanding the viewpoints of the community on H4C would help public health policymakers develop preventive measures and policies to guide and protect society. This is therefore the focus of the project.
What are the ethical implications and impacts of your work?
To address the need for good training data, I proposed an efficient method, using Shapley Values, to compute the rewards for data providers. My method is 100 times faster than state-of-the-art methods in the literature, and it resulted in a paper that is being revised for submission.
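To make the idea of a Shapley Value concrete, here is a minimal sketch of the textbook definition applied to three data providers. It is the brute-force computation, not the efficient method mentioned above, and the provider names and accuracy numbers are made up purely for illustration. The utility of a group of providers is assumed to be the accuracy of a model trained on their combined data.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(players: Sequence[str],
                   utility: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley value: a player's marginal contribution, averaged
    over all coalitions of the other players with the standard weights."""
    n = len(players)
    values: Dict[str, float] = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # k = size of the coalition that p joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values


# Hypothetical accuracies for every coalition of providers A, B and C.
# Provider C's data is assumed to be of lower quality than A's and B's.
TOY_ACCURACY = {
    frozenset(): 0.0,
    frozenset("A"): 0.6,
    frozenset("B"): 0.6,
    frozenset("C"): 0.2,
    frozenset("AB"): 0.8,
    frozenset("AC"): 0.7,
    frozenset("BC"): 0.7,
    frozenset("ABC"): 0.9,
}

scores = shapley_values(["A", "B", "C"], lambda s: TOY_ACCURACY[s])
```

The scores always add up to the utility of the full group (0.9 in this toy table), so they can be read as a split of the overall result among providers; here A and B receive equal scores and C a smaller one. Each score requires evaluating every coalition of the other providers, and in FL each evaluation means training a model, so the cost grows exponentially with the number of providers. That is why faster methods matter.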
On scalability in FL, I proposed an edge assignment that improves the accuracy of the global model by 10-15% compared with state-of-the-art methods. This work resulted in a paper submitted to IEEE Transactions on Emerging Topics in Computing, which is under review.
In the H4C project, we collected 164,016 tweets in 2020 and used a text mining approach to identify social reaction patterns and opinion change over time. Our descriptive analysis identified an irregularity in the users’ reaction patterns that was tightly associated with the social and news feeds on the development of HCQ and COVID-19 treatment. Further, our tweet sentiment analysis reveals that public opinion on the recommendation of using HCQ changed significantly over time: support was high in the early dates but declined significantly in October. This work also resulted in a paper, published at the 15th International Conference on Health Informatics in February 2022.
What advice would you give to people interested in computer and data science research?
CS is not only about programming languages but also problem solving. As a computer scientist, you work as a mathematician when you computationally formulate a real-world problem in CS terms, as an engineer when you design and model a system, and as a scientist when you observe the world, form a hypothesis and perform experiments. These three roles bring opportunities for you to leverage your intelligence and work with talented people. However, the key to success in CS isn't intelligence, but patience and hard work.
Data science lies at the intersection of mathematics, computer science, statistics and business knowledge. You will learn how to extract insights from data and how to teach a computer to learn from examples. Training a machine learning model is not easy, and it is even harder in the federated learning scheme. It is both science and art. Sometimes it is painful when you have to try different model architectures and parameters, but it is still interesting, since you come to understand the data (the world) better and eventually learn new things.
"CS is not only about programming languages but also problem solving...the key to success in CS isn't intelligence, but patience and hard work."
I would also like to share my story of majoring in Computer Science. I have been very passionate about mathematics since I was in high school. Every time I got problems asking me to prove something, I always felt like there would be interesting insights in them. I was therefore first admitted as a mathematics major. When I was in my third semester at my university, I started to learn algorithms. I recognized that CS was so useful and practical. I didn’t feel the same way about my major of mathematics, so I decided to switch to CS. I was so happy with my decision and continued on to do my PhD in CS. However, I actually had to study more mathematics than CS to carry out the research in my PhD program. I realized the beauty of mathematics when I approached the end of the program. I am now happier, since mathematics and CS have brought me to my passion for Data Science.
What are your upcoming research projects?
I continue working in the two directions above. The work on valuing clients’ data contributions can be extended to the edge federated learning scheme. To the best of my knowledge, this problem has not been studied yet. Because of the edge assignment, data valuation in edge federated learning is more challenging than in the standard federated learning scheme: different assignments would result in different contribution rewards for the same data provider. An upcoming project is about text sentiment analysis in the federated learning scheme. This is a machine learning application for the students joining my research.
Published on: February 2, 2023
By: Andy Hageman