Studying the Impact of Automatic Indexing on Semantic Search Performance
Trained multi-class classification models based on Logistic Regression and BERT to assign 183 MeSH (Medical Subject Heading) terms to paper abstracts based on semantics.
Applied the models to index 1.6 million PubMed articles and evaluated the performance of different semantic indexing models on 27 real-world Boolean queries.
Demonstrated the negative impact of imperfect automatic indexing and proposed a human-machine collaborative indexing strategy that achieved 95% precision and recall with 21.33% human indexing effort.
Published in HealthNLP 2023. [pdf]
Explainable Prediction of Text Complexity
Trained and evaluated various machine learning models, including Naive Bayes, SVM, and Random Forest, for text complexity classification using bag-of-words, lexical, and syntactic features.
Generated explanations of the models' complexity predictions using LIME.
Published in ACL-IJCNLP 2021. [pdf]
ProblemExplorer: Human-Machine Collaboration System for Modeling Analysis
Identified the multi-objective nature of the machine learning problem formulation process through interviews with data scientists and a literature review.
Proposed leveraging AutoML to automatically evaluate and recommend problem formulations.
Interface Development for Structured Audio Data Analysis
Conducted formative interview studies to understand audio analysts’ workflows and design requirements.
Developed a web-based interactive text editor using JavaScript, Bootstrap, and quill.js supporting advanced text annotation and analysis, integrating a SQLite database for storing audio analysis data.
GRAFS: Graphical Faceted Search System
Extracted and summarized informative concepts from search results to support sensemaking and learning in exploratory search.
Published in ACM TiiS. [pdf]
Using Geospatial Data To Predict PFAS Contamination
Performed geospatial analysis, e.g., spatial autocorrelation and clustering, to uncover PFAS detection patterns.
Extracted geospatial features, e.g., proximity to PFAS emission points, land cover, and hydrology, from multiple data sources.
Trained Random Forest models to predict PFAS detection across the US, achieving 80% recall and 73% precision.