Research Experience

NLP and Information Retrieval

Studying the Impact of Automatic Indexing on Semantic Search Performance

Trained multi-class classification models based on Logistic Regression and BERT to assign 183 MeSH (Medical Subject Headings) terms to paper abstracts based on semantics.
Applied the models to index 1.6 million PubMed articles and evaluated the performance of different semantic indexing models on 27 real-world Boolean queries.
Demonstrated the negative impact of imperfect automatic indexing and proposed a human-machine collaborative indexing strategy that achieved 95% precision and recall with 21.33% human indexing effort.
Published in HealthNLP 2023. [pdf]

Explainable Prediction of Text Complexity

Trained and evaluated various machine learning models, including Naive Bayes, SVM, and Random Forest, for text complexity classification using bag-of-words features, lexical features, and syntactic features.

ProblemExplorer: Human-Machine Collaboration System for Modeling Analysis

Identified the multi-objective nature of the machine learning problem formulation process through interviews with data scientists and a literature review.
Proposed leveraging AutoML to automatically evaluate and recommend problem formulations.

Interface Development for Structured Audio Data Analysis

Conducted formative interview studies to understand audio analysts’ workflows and design requirements.
Developed a web-based interactive text editor using JavaScript, Bootstrap, and Quill.js that supports advanced text annotation and analysis and integrates a SQLite database for storing audio analysis data.

GRAFS: Graphical Faceted Search System

Extracted and summarized informative concepts from search results to support sensemaking and learning in exploratory search.

Published in ACM TiiS. [pdf]

Using Geospatial Data To Predict PFAS Contamination

Performed geospatial analysis, e.g., spatial autocorrelation and clustering, to uncover PFAS detection patterns.
Extracted geospatial features, e.g., proximity to PFAS emission points, land cover, and hydrology, from multiple data sources.
Trained Random Forest models to predict PFAS detection across the US, achieving 80% recall and 73% precision.
