Algorithms, systems, and theories for exploiting data dependencies in crowdsourcing 

NSF-2007941

Principal Investigator

Qi Li, Iowa State University

Students

Nasim Sabetpour. PhD student (graduated 2022)

Adithya Kulkarni. PhD student

Qiao Qiao. PhD student

Mohna Chakraborty. PhD student

Yuepei Li. PhD student

Kang Zhou. PhD student

Yonas Sium. PhD student


Award Information

This website is based upon work supported by the National Science Foundation under Grant No. IIS-2007941, in collaboration with NSF IIS-2008155. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Project Summary

Data are abundantly available to encode knowledge in many domains, such as biomedical research, online commerce, open government, education, and public health. Machine learning is a powerful tool for discovering novel knowledge from data and for helping individuals and organizations make informed decisions. However, machine learning needs to be bootstrapped with human-annotated knowledge, which can be expensive to obtain and may contain human errors. The research team discovers and exploits dependencies in the data, using novel methodologies to significantly reduce the cost and noise involved in providing critical knowledge for machine learning. The research outputs, including algorithms, systems, and theories, are sufficiently generic to benefit many domains where machine learning is applicable. Through this fundamental research, the team will train undergraduate and graduate students for the nation's STEM workforce.

The researchers will collaborate to develop algorithms, systems, and theories for reducing cost and noise when annotating dependent data, termed “structured annotations”, to provide supervision knowledge for machine learning. While dependencies can make data annotation costly and error-prone, the researchers view them as a useful inductive bias for selective and accurate annotation. In particular, they propose a human-in-the-loop system to aid the construction of probabilistic graphical models that encode the dependencies. They combine contextual and multi-armed bandits with scalable graph inference algorithms to reduce labeling costs. Building on these graphical bandits, the team addresses budget allocation when the label of the same data point is queried repeatedly for robustness. Given noisy human annotations, the team formulates optimization problems and algorithms to jointly infer annotator competence and the ground-truth labels of the data. From the theoretical perspective, the project will advance active learning in crowdsourcing settings with more realistic noise distributions and will analyze regret in structured annotations. The project will result in datasets, algorithms, and a testbed system that benefit not only the core machine learning research community but also the many domains that use machine learning.
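As a rough illustration of what jointly inferring annotator competence and ground-truth labels can look like, the sketch below alternates between a weighted vote and a re-weighting of annotators. It is a generic iterative weighted-voting scheme for independent categorical labels, not the method of any publication listed on this page; the function name `infer_truths`, the `(annotator, item, label)` input format, and the log-based weight update are all assumptions chosen for illustration, and the sketch ignores the data dependencies that the project targets.

```python
import math
from collections import defaultdict


def infer_truths(annotations, n_iters=10, eps=1e-6):
    """Alternate between estimating labels and annotator weights.

    annotations: list of (annotator, item, label) triples (hypothetical format).
    Returns (truths, weights) as two dicts.
    """
    annotators = sorted({a for a, _, _ in annotations})
    weights = {a: 1.0 for a in annotators}  # start with equal trust
    truths = {}
    for _ in range(n_iters):
        # Step 1: weighted vote per item under the current annotator weights.
        votes = defaultdict(lambda: defaultdict(float))
        for a, item, label in annotations:
            votes[item][label] += weights[a]
        # Ties are broken by sorted label order so the result is deterministic.
        truths = {item: max(sorted(c), key=c.get) for item, c in votes.items()}

        # Step 2: re-estimate weights from each annotator's disagreement rate.
        errors = defaultdict(float)
        counts = defaultdict(int)
        for a, item, label in annotations:
            errors[a] += float(label != truths[item])
            counts[a] += 1
        rates = {a: errors[a] / counts[a] for a in annotators}
        total = sum(rates.values()) + eps * len(annotators)
        # Lower disagreement share gives a larger (non-negative) weight.
        weights = {a: -math.log((rates[a] + eps) / total) for a in annotators}
    return truths, weights


# Usage with made-up labels from three hypothetical annotators.
example = [
    ("a1", "i1", "A"), ("a2", "i1", "A"), ("a3", "i1", "B"),
    ("a1", "i2", "B"), ("a2", "i2", "B"), ("a3", "i2", "A"),
]
example_truths, example_weights = infer_truths(example)
```

Annotators who disagree with the current consensus more often receive smaller weights, so their votes count for less in the next round. Extending this idea to sequences, trees, or graphs requires modeling the dependencies between labels, which this sketch does not attempt.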

Publications

[UAI23] Adithya Kulkarni, Mohna Chakraborty, Sihong Xie, and Qi Li. Optimal Budget Allocation for Crowdsourcing Labels for Graphs. Proc. of Uncertainty in Artificial Intelligence (UAI'23), 2023.

[ICASSP23] Yonas Sium, Georgios Kollias, Tsuyoshi Idé, Payel Das, Naoki Abe, Aurelie Lozano, and Qi Li. Direction Aware Positional and Structural Encoding for Directed Graph Neural Networks. Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'23), 2023.

[PAKDD23] Qiao Qiao, Yuepei Li, Kang Zhou, and Qi Li. Relation-Aware Network with Attention-Based Loss for Few-Shot Knowledge Graph Completion. Proc. of Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'23), 2023.

[SDM22] Adithya Kulkarni, Nasim Sabetpour, Alexey Markin, Oliver Eulenstein, and Qi Li. CPTAM: Constituency Parse Tree Aggregation Method. Proc. of 2022 SIAM Int. Conf. on Data Mining (SDM'22), 2022.

[ICDM21] Nasim Sabetpour, Adithya Kulkarni, Sihong Xie, and Qi Li. Truth Discovery in Sequence Labels from Crowds. Proc. of 2021 IEEE Int. Conf. on Data Mining (ICDM'21), 2021.

[WebConf21] Minghong Fang, Minghao Sun, Qi Li, Neil Zhenqiang Gong, Jin Tian, and Jia Liu. Data Poisoning Attacks and Defenses to Crowdsourcing Systems. Proc. of the Web Conference 2021.

[EMNLP20] Nasim Sabetpour, Adithya Kulkarni, and Qi Li. OptSLA: an Optimization-Based Approach for Sequential Label Aggregation. EMNLP'20 Findings, 2020. [code]


Courses

COM S 661: Advanced Database Systems

COM S 561: Database Design, Management, and Research