Research

Here are the research projects I published early in my career. My previous work at Duke University is summarized on a separate page.

Comparison of Text Preprocessing Methods

Published paper in Natural Language Engineering 2023

(Direct quote from the paper abstract)

Text preprocessing is not only an essential step to prepare the corpus for modeling but also a key area that directly affects the natural language processing (NLP) application results. For instance, precise tokenization increases the accuracy of part-of-speech (POS) tagging, and retaining multiword expressions improves reasoning and machine translation. The text corpus needs to be appropriately preprocessed before it is ready to serve as the input to computer models. The preprocessing requirements depend on both the nature of the corpus and the NLP application itself, that is, what researchers would like to achieve from analyzing the data. Conventional text preprocessing practices generally suffice, but there exist situations where the text preprocessing needs to be customized for better analysis results. Hence, we discuss the pros and cons of several common text preprocessing methods: removing formatting, tokenization, text normalization, handling punctuation, removing stopwords, stemming and lemmatization, n-gramming, and identifying multiword expressions. Then, we provide examples of text datasets which require special preprocessing and how previous researchers handled the challenge. We expect this article to be a starting guideline on how to select and fine-tune text preprocessing methods.
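
To make the discussion concrete, here is a minimal sketch of a conventional preprocessing pipeline in R with the tm package. It only illustrates the steps named above (normalization, punctuation handling, stopword removal, stemming); it is not the code evaluated in the paper.

```r
# A minimal sketch of a conventional preprocessing pipeline (illustration only).
library(tm)

docs <- c("The striped bats are hanging on their feet.",
          "Bats hung upside-down, resting their feet!")
corpus <- VCorpus(VectorSource(docs))

corpus <- tm_map(corpus, content_transformer(tolower))       # text normalization
corpus <- tm_map(corpus, removePunctuation)                  # handling punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # removing stopwords
corpus <- tm_map(corpus, stemDocument)                       # stemming (needs SnowballC)
corpus <- tm_map(corpus, stripWhitespace)

content(corpus[[1]])   # roughly "stripe bat hang feet"
```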

Robust Bayesian Nonnegative Matrix Factorization with Implicit Regularizers 

Authors: Jun Lu and Christine P. Chai

arXiv preprint arXiv:2208.10053 in 2022

(Direct quote from the paper abstract)

We introduce a probabilistic model with implicit norm regularization for learning nonnegative matrix factorization (NMF) that is commonly used for predicting missing values and finding hidden patterns in the data, in which the matrix factors are latent variables associated with each data dimension. The nonnegativity constraint for the latent factors is handled by choosing priors with support on the nonnegative subspace, e.g., exponential density or distribution based on exponential function. Bayesian inference procedure based on Gibbs sampling is employed. We evaluate the model on several real-world datasets including Genomics of Drug Sensitivity in Cancer (GDSC IC50) and Gene body methylation with different sizes and dimensions, and show that the proposed Bayesian NMF GL22 and GL22,∞ models lead to robust predictions for different data values and avoid overfitting compared with competitive Bayesian NMF approaches. 
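
For readers who prefer code to notation, the sketch below simulates data from the kind of generative model the abstract describes: a factorization X ≈ WH with exponential priors keeping W and H nonnegative. The rate and noise parameters are assumptions for illustration; this is not the authors' implementation.

```r
# A hedged sketch of the generative model implied by the abstract (not the authors' code):
# X ~ Normal(W %*% H, sigma^2) entrywise, with exponential priors so W and H stay nonnegative.
set.seed(1)
n <- 100; p <- 40; K <- 5
lambda <- 1      # assumed rate for the exponential priors
sigma  <- 0.1    # assumed noise standard deviation

W <- matrix(rexp(n * K, rate = lambda), n, K)   # latent factors for the rows
H <- matrix(rexp(K * p, rate = lambda), K, p)   # latent factors for the columns
X <- W %*% H + matrix(rnorm(n * p, sd = sigma), n, p)

# Inference then proceeds by Gibbs sampling: with a Gaussian likelihood and exponential
# priors, each entry of W and H has a truncated-normal full conditional on [0, Inf).
```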

Word Distinctivity – Quantifying Improvement of Topic Modeling Results from N-Gramming 

Published paper in REVSTAT-Statistical Journal 2022

(Direct quote from the paper abstract)

Text data cleaning is an important but often overlooked step in text mining because it is difficult to quantify the contribution. Therefore, we propose the word distinctivity to measure the improvement of topic modeling results from n-gramming, which preserves special phrases in a corpus. The word distinctivity evaluates the signal strength of a word’s topic assignments, and a high distinctivity means a high posterior probability for the word to come from a certain topic. We implemented the latent Dirichlet allocation for topic modeling, and discovered that some special phrases show an increase in word distinctivity, reducing uncertainty in topic identification.
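
As a rough illustration (not the paper's exact definition), one can fit LDA with the R package topicmodels and summarize each word's distinctivity as its largest posterior topic probability, computed here with a uniform prior over topics for simplicity:

```r
# An illustrative sketch, assuming distinctivity is summarized by the word's
# largest posterior topic probability; this is not the paper's exact formula.
library(topicmodels)
data("AssociatedPress", package = "topicmodels")

fit  <- LDA(AssociatedPress[1:200, ], k = 5, method = "Gibbs", control = list(seed = 1))
beta <- posterior(fit)$terms                        # K x V matrix of P(word | topic)

p_topic_given_word <- t(beta) / colSums(beta)       # V x K matrix of P(topic | word)
distinctivity <- apply(p_topic_given_word, 1, max)  # near 1 => word strongly tied to one topic
head(sort(distinctivity, decreasing = TRUE), 10)
```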

Guidelines in Selecting Appropriate Text Preprocessing Methods

Concurrent session presentation at CSP 2021, Virtual Conference

Slides on GitHub

(Direct quote from the abstract)

Statisticians and data scientists spend a large amount of time preprocessing the data for analysis, and unstructured text data are no exception. Many text preprocessing methods are available, but there is not a one-size-fits-all procedure in preparing text corpora for the model. The appropriate steps depend on not only the application goals, but also the nature of the corpus. For instance, separating each sentence in a document may not be important in topic modeling, but essential in end-user applications like machine translation and question answering.

Therefore, we provide some guidelines on how to select the appropriate text preprocessing methods for a new dataset. We evaluate the pros and cons of methods such as removing punctuation, removing stopwords, stemming and lemmatization, and n-gramming to retain word order. We also review examples of text analysis to demonstrate the need of particular text preprocessing methods, empowering statistical practitioners to make better preprocessing decisions. This talk assumes the audience has a basic understanding of natural language processing, and probably has performed simple analysis of text data before.
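
For example, the n-gramming step can be sketched with the quanteda package as below; the sentence and settings are made up for illustration and are not taken from the talk.

```r
# A small illustration of n-gramming: bigrams preserve some word order that
# plain bag-of-words preprocessing would discard.
library(quanteda)

toks <- tokens("The White House issued a statement on health care reform.",
               remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))
tokens_ngrams(toks, n = 2)   # e.g. "white_house", "health_care", "care_reform"
```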

Improving Accessibility in Data Visualizations Created by ggplot2

Poster at useR! 2020 (The R User Conference), Virtual Conference

Code on GitHub, and poster video on YouTube

To improve accessibility, we demonstrated how to customize the line types and the point shapes in a graph using the R package ggplot2. We can assign a different line type to each trend with the scale_linetype_manual() function, and a different point shape with the scale_shape_manual() function, so that each trend remains distinguishable in the absence of color. Color can still be used, but it should not be the only visual cue; otherwise the graph is unreadable to people with color blindness, roughly 300 million people, or 4.5% of the global population. Including them in the audience of our data visualizations enlarges the pool of potential readers.
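
A minimal version of the idea with made-up data (the full code from the poster is on GitHub):

```r
# A minimal sketch with fictitious data: distinguish trends by line type and
# point shape in addition to color.
library(ggplot2)

df <- data.frame(
  year  = rep(2010:2015, times = 3),
  value = c(1:6, 2:7, 3:8),
  group = rep(c("A", "B", "C"), each = 6)
)

ggplot(df, aes(x = year, y = value,
               color = group, linetype = group, shape = group)) +
  geom_line() +
  geom_point(size = 3) +
  scale_linetype_manual(values = c("solid", "dashed", "dotted")) +  # one line type per trend
  scale_shape_manual(values = c(16, 17, 15))                        # one point shape per trend
```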

Data Visualization and Accessibility

Contributed refereed presentation at SDSS 2020, Virtual Conference

Code on GitHub, and speaker video on YouTube

(Direct quote from the abstract)

The term “accessibility” is often associated with assistive technology, but the data visualizations created by technology also need to be accessible. Graphs are powerful tools to communicate the message to the audience, but the audience has to be able to read the graph first, including people with disabilities. Therefore, we demonstrate two examples of improving accessibility in data visualizations generated by the R package ggplot2, and the concepts are technology agnostic because they apply to other software as well. The first example shows three trends of different colors in the same plot. However, people with color blindness cannot see color, so they would have difficulty identifying which trend belongs to which group. A solution is to change the line types and the point shapes, so that each trend can be distinguished even in the absence of color. The second example contains two aligned barplots that compare the precipitation of Seattle and Phoenix. Nevertheless, some tiny precipitation bars for Phoenix can hardly be seen, especially for people with low vision. As a result, they may be concerned about missing data. We can fix the issue by adding the exact numbers on top of the tiny bars, so people know that the data exist. In addition, the default unit for precipitation is inches, but some people outside the US are not familiar with this measurement scale. By adding a secondary y-axis in millimeters (mm), we reduce their mental efforts of unit conversion. These revisions to the graph improve not only accessibility, but also readability. Thus, accessibility benefits not only people with disabilities, but also improves the overall user experience. Accessibility in data visualizations increases the size of the audience pool, translating to a greater business impact.
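
The second example above can be sketched as follows with made-up precipitation numbers (the talk's actual code is on GitHub): text labels reveal the tiny bars, and a secondary axis converts inches to millimeters (1 inch = 25.4 mm).

```r
# A sketch of the second example with fictitious numbers: label tiny bars so they
# are not mistaken for missing data, and add a secondary y-axis in millimeters.
library(ggplot2)

phoenix <- data.frame(
  month  = factor(month.abb, levels = month.abb),
  precip = c(0.8, 0.8, 0.9, 0.3, 0.1, 0.0, 1.0, 0.9, 0.6, 0.6, 0.6, 0.9)
)

ggplot(phoenix, aes(x = month, y = precip)) +
  geom_col() +
  geom_text(aes(label = precip), vjust = -0.5, size = 3) +   # show exact values above bars
  scale_y_continuous(
    name = "Precipitation (inches)",
    sec.axis = sec_axis(~ . * 25.4, name = "Precipitation (mm)")
  )
```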

The Importance of Data Cleaning: Three Visualization Examples

Published paper in CHANCE 2020 (Taylor & Francis online)

This article is included in the 2020 Most Read Collection from the American Statistical Association (ASA).

(Direct quote from the introduction)

This article provides three examples of common data issues and explains how to identify and fix them quickly. Visualizations compare data quality before and after cleaning. These demonstrations make it easier for data scientists to justify their data processing time to stakeholders without using too much jargon. The first example, “Missing Values Encoded as 99,” shows how missing data encoded as invalid values would affect a regression. The second example, “Dollars vs. Thousands of Dollars,” describes how an unreasonably small dollar amount can be due to respondents confusing “dollars” with “thousands of dollars.” The third example, “Invalid Dates in Records,” explains how invalid dates can be corrected by additional records for the same person. These examples do not represent all possible methods of data cleaning, but adequate exploratory data analysis can uncover many data issues before any modeling is performed. In other words, identification of data problems is often supported by data explorations, not just by advanced statistical methods. (Note that all data used in this article are fictitious and for demonstration purposes only.)
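
A small sketch of the first example with fictitious numbers, in the same spirit as the article: leaving the placeholder 99 in the data distorts a regression, and recoding it to NA restores a sensible fit.

```r
# Fictitious data for illustration (the article's data are also fictitious):
# missing responses encoded as 99 distort the fitted line until recoded to NA.
set.seed(3)
x <- runif(100, 1, 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 0.5)
y[sample(100, 5)] <- 99            # missing values encoded as 99

coef(lm(y ~ x))                    # the fake 99s distort the fitted coefficients
y[y == 99] <- NA                   # recode the placeholder as genuinely missing
coef(lm(y ~ x))                    # the fit now reflects the real relationship
```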

Automated Survey Text Analysis – Supervised Latent Dirichlet Allocation (sLDA)

E-Poster presented at SDSS 2019, Bellevue WA

(Direct quote from the SDSS abstracts)

Open-ended questions are becoming more common in surveys, due to the diverse responses they can capture. However, the analysis of survey text is often conducted manually, which can be expensive and prone to subjectivity. Therefore, we would like to automatically analyze text and numerical data using the supervised latent Dirichlet Allocation (sLDA), a topic modeling approach that assigns each word a probability distribution of topics. The example we used is an employee satisfaction survey, and each record contains a numerical rating along with a free text response as the reason. Then the sLDA algorithm selects key words of each rating as a topic, and outputs the corresponding credible intervals. Since the R package lda is available for this approach, using sLDA to identify topics for each rating is a start for automated survey text analysis, with little technical knowledge required for implementation.
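
The snippet below is a rough sketch of how slda.em() from the R package lda is typically called; the toy responses, ratings, and settings are assumptions for illustration, not the employee survey analyzed in the poster.

```r
# A rough sketch of fitting sLDA with the R package "lda" on toy data
# (made-up responses and ratings, not the survey data).
library(lda)

responses <- c("great team and flexible hours",
               "low pay and long hours",
               "supportive manager great benefits",
               "poor communication low morale")
ratings <- c(5, 2, 5, 1)

corpus <- lexicalize(responses)
K <- 2                                    # number of topics, kept small for the toy data

fit <- slda.em(documents = corpus$documents,
               K = K,
               vocab = corpus$vocab,
               num.e.iterations = 10,
               num.m.iterations = 4,
               alpha = 1.0, eta = 0.1,
               annotations = ratings,
               params = rep(0, K),        # initial regression coefficients
               variance = 0.25,
               lambda = 1.0,
               logistic = FALSE,
               method = "sLDA")

top.topic.words(fit$topics, num.words = 3, by.score = TRUE)  # key words per topic
coef(summary(fit$model))                                     # estimated topic effects on the rating
```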

Text Mining in Survey Data

Published paper in Survey Practice 2019

(Direct quote from the paper abstract)

Free text responses in surveys contain important information and should be analyzed by researchers. However, human coding of survey text is not only expensive, but also vulnerable to subjectivity. An automated text mining approach can solve these problems. Therefore, we demonstrate using the supervised latent Dirichlet allocation (sLDA) to jointly analyze text and numerical data in an employee satisfaction survey. For each rating, the algorithm outputs selected words as the "topic" and estimates the credible interval. Finally, we discuss future applications and advantages of utilizing survey text.  

Modeling Community Structure and Topics in Dynamic Text Networks

Published paper in Journal of Classification 2019

Authors: Teague R. Henry, David Banks, Derek Owens-Oas, and Christine Chai 

(Direct quote from the paper abstract)

The last decade has seen great progress in both dynamic network modeling and topic modeling. This paper draws upon both areas to create a bespoke Bayesian model applied to a dataset consisting of the top 467 US political blogs in 2012, their posts over the year, and their links to one another. Our model allows dynamic topic discovery to inform the latent network model and the network structure to facilitate topic identification. Our results find complex community structure within this set of blogs, where community membership depends strongly upon the set of topics in which the blogger is interested. We examine the time varying nature of the Sensational Crime topic, as well as the network properties of the Election News topic, as notable and easily interpretable empirical examples.