Research at Duke University

From 2013 to 2017, I conducted text mining and Bayesian statistics research during my PhD in statistical science at Duke University, advised by Dr. David L. Banks.

PhD Dissertation: Statistical Issues in Quantifying Text Mining Performance

Available on DukeSpace, Duke University Libraries

(Direct quote from the abstract)

Text mining is an emerging field in data science because text information is ubiquitous, but analyzing text data is much more complicated than analyzing numerical data. Topic modeling is a commonly used approach to classify text documents into topics and identify key words, distilling the text information of interest from the vast corpus.

In this dissertation, I investigate various statistical issues in quantifying text mining performance; Chapter 1 is a brief introduction. Chapter 2 addresses adequate pre-processing for text data. For example, words of the same stem (e.g. “study” and “studied”) should be assigned the same token because they share the same core meaning. In addition, specific phrases such as “New York” and “White House” should be retained because many topic classification models focus exclusively on single words. Statistical methods, such as conditional probability and p-values, are used as an objective approach to discover these phrases.
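The phrase-discovery idea can be sketched with a simple count-based criterion: keep a word pair when the second word follows the first far more often than its overall frequency would predict. This is an illustrative stand-in for the conditional-probability and p-value methods in the dissertation, and the thresholds here are arbitrary:

```python
from collections import Counter

def find_phrases(docs, min_count=5, threshold=0.5, lift=3.0):
    """Flag candidate bi-grams where P(w2 | w1) is high both in absolute
    terms and relative to the marginal P(w2). Illustrative criterion only."""
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        unigrams.update(doc)
        bigrams.update(zip(doc, doc[1:]))
    total = sum(unigrams.values())
    phrases = []
    for (w1, w2), n12 in bigrams.items():
        if n12 < min_count:
            continue
        p_w2_given_w1 = n12 / unigrams[w1]   # conditional probability
        p_w2 = unigrams[w2] / total          # marginal probability
        if p_w2_given_w1 > threshold and p_w2_given_w1 > lift * p_w2:
            phrases.append((w1, w2))
    return phrases

docs = [["new", "york", "city"], ["visit", "new", "york"],
        ["white", "house", "press"], ["new", "york", "times"],
        ["the", "white", "house"], ["new", "york", "state"],
        ["white", "house", "staff"]]
print(find_phrases(docs, min_count=3))
```

On this toy corpus, "new york" and "white house" are flagged because "york" and "house" almost always follow "new" and "white", while incidental pairs like "york city" fall below the count threshold.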

Chapter 3 begins the quantification of text mining performance by measuring the improvement that text pre-processing brings to topic modeling results. Retaining specific phrases increases their distinctivity: the “signal” of the most probable topic becomes stronger (i.e., the maximum probability is higher) than the “signal” generated by either of the two words separately. Therefore, text pre-processing helps recover semantic information at the word level.

Chapter 4 quantifies the uncertainty of a widely used topic model – latent Dirichlet allocation (LDA). A synthetic text dataset was created with known topic proportions, and I tried several methods to determine the appropriate number of topics from the data. The pre-set number of topics strongly affects the topic model results, because LDA tends to utilize all topics allotted, so that each topic receives roughly equal representation.
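The synthetic-data setup can be sketched as follows: a corpus generated from the LDA generative process, so that the document-topic proportions are known in advance. The topic count, vocabulary size, and prior values below are arbitrary toy choices, not the dissertation's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy configuration: 3 true topics over a 12-word vocabulary.
n_topics, vocab_size, n_docs, doc_len = 3, 12, 50, 40
alpha = np.full(n_topics, 0.1)      # sparse document-topic prior
beta = np.full(vocab_size, 0.1)     # sparse topic-word prior

# Known topic-word distributions, one row per topic.
topic_word = rng.dirichlet(beta, size=n_topics)

docs, true_theta = [], []
for _ in range(n_docs):
    theta = rng.dirichlet(alpha)                     # known topic proportions
    z = rng.choice(n_topics, size=doc_len, p=theta)  # topic of each word slot
    words = [rng.choice(vocab_size, p=topic_word[k]) for k in z]
    docs.append(words)
    true_theta.append(theta)
```

Fitting LDA to such a corpus with more topics than the true three makes the "spreading out" behavior visible, since the ground-truth proportions are available for comparison.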

Last but not least, Chapter 5 explores a few selected extensions, such as supervised latent Dirichlet allocation (sLDA), an application to survey data, sentiment analysis, and the infinite Gaussian mixture model.

Quantifying Uncertainty in Latent Dirichlet Allocation

Partially funded by NSF SES 11-31897

Speed Presentation and E-Poster at JSM 2017, Baltimore MD

(Direct quote from the JSM abstracts)

In statistics, measuring uncertainty is as important as obtaining the point estimate. For text datasets, latent Dirichlet allocation (LDA) is one of the most commonly used topic modeling algorithms. I discovered that keeping special phrases during text cleaning improves topic distinctivity at the word level. I also used a synthetic dataset with known topic proportions to test how LDA performs under different settings. No matter what the number of topics is pre-set to, LDA tends to "spread out" the topic assignments, making it difficult to remove excess topics.

Combined Analysis of Numerical and Text Data in Surveys -- Supervised Latent Dirichlet Allocation

Poster at ISBIS 2017, Yorktown Heights NY

(Direct quote from the ISBIS poster abstracts)

Many surveys contain both numerical and text data, so it is important to analyze both parts to utilize the whole dataset. The numerical data are rating scores, and the text data are answers to free-response questions, which often contain useful information. Supervised latent Dirichlet allocation (sLDA) is a commonly used method for topic modeling on such hybrid datasets, and the model determines which topics are associated with which ratings. For example, in an employee satisfaction dataset, I discovered that higher scores are associated with positive words (e.g. "opportunity" and "challenge"), while responses that mention work-life balance are usually associated with lower scores. In addition, a higher number of "nots" per comment also indicates a lower rating, because people say they are "not happy" but they do not say they are "not sad". Last but not least, sLDA can also be used to predict the rating score given the text comment, and one application is error correction -- some people may get confused by the scale and provide a low score to indicate a positive response. By analyzing the text answers, this kind of error can be discovered and corrected.
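The "nots per comment" feature can be computed with a simple count. This is an illustrative sketch with made-up example comments (the actual analysis used sLDA on the survey corpus):

```python
import re

def count_nots(comment):
    """Count negations ("not", "n't") in a free-text comment."""
    return len(re.findall(r"\bnot\b|n't", comment.lower()))

# Hypothetical (comment, rating) pairs echoing the patterns described above.
comments = [
    ("Great opportunity and challenge here.", 9),
    ("I am not happy with management, not at all.", 2),
    ("Work-life balance is not respected.", 3),
]
for text, rating in comments:
    print(rating, count_nots(text))
```

Even this crude count separates the high-rating comment (zero negations) from the low-rating ones, matching the observation that more "nots" indicate a lower score.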

Three Perspectives on Error Correction in Data

Poster at WiSE 2017, Durham NC

I presented three perspectives on error correction in data, because data errors are inevitable. First, exploratory data analysis reveals unreasonable values, which should be flagged and corrected. In a mortgage survey, if the loan amount is recorded as 102 and the purchase price is 99,000 dollars, then the loan amount should be corrected to 102,000 dollars. Second, numerical and text data should be combined in statistical modeling. In an employee satisfaction survey, one employee wrote “Love my work - very varied.” but rated his/her company 1 (least satisfied); the rating should be corrected to 10 (most satisfied). If the text comments had not been included in the model, I would not have known that this employee actually loves his/her work. Last but not least, record linkage also helps with error correction. When one record in a matched pair contains an invalid value (such as a birth month greater than 12) and the corresponding value in the other record is valid, the invalid value can be corrected to the valid one.
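The first and third perspectives can be sketched as simple correction rules. The helper functions and the magnitude threshold below are hypothetical illustrations, not the methods used in the poster:

```python
def fix_loan_amount(loan, price):
    """If the loan looks orders of magnitude too small relative to the
    purchase price, assume the thousands were dropped. Heuristic sketch."""
    if loan > 0 and price / loan > 100:
        return loan * 1000
    return loan

def fix_linked_field(value_a, value_b, valid):
    """For a matched record pair, replace an invalid field value with the
    valid one from the linked record."""
    if not valid(value_a) and valid(value_b):
        return value_b, value_b
    if not valid(value_b) and valid(value_a):
        return value_a, value_a
    return value_a, value_b

print(fix_loan_amount(102, 99_000))                     # 102 -> 102000
is_month = lambda m: 1 <= m <= 12
print(fix_linked_field(14, 7, is_month))                # (7, 7)
```

The mortgage example from the poster resolves exactly this way: 102 against a 99,000-dollar price triggers the magnitude rule, and a birth month of 14 is overwritten by the valid linked value.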

Quantifying Improvement of Topic Modeling Results by N-Gramming

Presentation at AISC 2016, Greensboro NC

(Direct quote from my AISC 2016 abstract)

The improvement of topic modeling results by n-gramming can be quantified using the increase in word distinctivity. When two words form a bi-gram with a special meaning, the maximum posterior probability of being in a certain topic given the bi-gram is higher than the probability for each word separately. For example, the word “black” has a maximum posterior probability of 48.4% of being in one of the five topics, and the word “panther” has 52.5%. Interestingly, the bi-gram “black_panther” has probability 90.1%, because the phrase refers to the Black Panther Party.
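The posterior behind these numbers follows Bayes' rule: P(topic | token) is proportional to P(token | topic) times the topic prior. Below is a toy five-topic illustration with made-up probabilities (not the estimates from the talk); the point is that the bi-gram's maximum posterior exceeds that of either word alone:

```python
import numpy as np

def topic_posterior(word_idx, topic_word, topic_prior):
    """P(topic | word) via Bayes' rule from a topic-word matrix."""
    joint = topic_word[:, word_idx] * topic_prior
    return joint / joint.sum()

# Hypothetical P(token | topic) values for 5 topics.
# Columns: 0 = "black", 1 = "panther", 2 = "black_panther".
topic_word = np.array([
    [0.030, 0.001, 0.000],
    [0.025, 0.002, 0.000],
    [0.040, 0.030, 0.020],   # the civil-rights-like topic
    [0.010, 0.025, 0.001],
    [0.005, 0.001, 0.000],
])
prior = np.full(5, 0.2)      # uniform topic prior

for j, token in enumerate(["black", "panther", "black_panther"]):
    post = topic_posterior(j, topic_word, prior)
    print(token, round(float(post.max()), 3))
```

Each single word is spread across several topics, so its maximum posterior stays modest, while the bi-gram concentrates almost all of its mass in one topic.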

Topic Modeling of Employee Satisfaction Data

Joint work with David Banks, Min Jung Park (presenter), and Jessica Wang

Poster at WiSE 2015, Durham NC

A typical employee satisfaction survey dataset contains overall ratings between 1 and 10 and free-text responses. The goal is to reveal the association between certain words in the free-response question and the corresponding rating. We used topic modeling to identify 10 topics from the text response corpus, and also developed a regression model using sLDA (supervised latent Dirichlet allocation) to predict the score intervals associated with each topic.

Text Data Pre-Processing

Poster at WiSE 2015, Durham NC

Text mining is the process of obtaining high-quality information from large amounts of text, and it is a type of statistical pattern learning. Christine’s contribution is mainly in text data pre-processing, i.e., preparing the text data for statistical analysis. The first step of text pre-processing is tokenization – words of the same root, such as “work” and “worked”, are combined and identified by a unique token. Next, words with low-variance tf-idf scores, such as “the” and “for”, are removed. To retain word order as part of the semantic information, phrases (e.g. “New York” and “White House”) are handled by n-gramming – each sequence is replaced with a token, and phrases starting with a negation are replaced with the corresponding antonym. The polysemy problem, i.e. a word with multiple meanings, remains open; we have explored the use of Latent Semantic Indexing, with mixed success.
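A minimal end-to-end sketch of these steps is below, with a crude suffix-stripping stemmer and hypothetical stopword, phrase, and antonym lists; a real pipeline would use NLTK or a similar library:

```python
import re

# Hypothetical word lists for illustration only.
STOPWORDS = {"the", "for", "a", "of"}
PHRASES = {("new", "york"): "new_york", ("white", "house"): "white_house"}
ANTONYMS = {"happy": "sad", "good": "bad"}

def stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    return re.sub(r"(ed|ing|s)$", "", word)

def preprocess(text):
    # Tokenize and reduce words to a shared root form.
    tokens = [stem(w) for w in re.findall(r"[a-z']+", text.lower())]
    # Drop common low-information words.
    tokens = [t for t in tokens if t not in STOPWORDS]
    out, i = [], 0
    while i < len(tokens):
        pair = (tokens[i], tokens[i + 1] if i + 1 < len(tokens) else None)
        if pair in PHRASES:
            out.append(PHRASES[pair])          # n-gram "new york" -> "new_york"
            i += 2
        elif tokens[i] == "not" and i + 1 < len(tokens) and tokens[i + 1] in ANTONYMS:
            out.append(ANTONYMS[tokens[i + 1]])  # "not happy" -> "sad"
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(preprocess("Worked in New York, not happy for the White House"))
# -> ['work', 'in', 'new_york', 'sad', 'white_house']
```

The same ordering as in the poster is kept: stemming first, then stopword removal, then n-gramming and negation handling over the cleaned token stream.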

Networks of Text -- Pre-Analysis of Text Data

Presentation at AISC 2014, Greensboro NC

(Direct quote from my AISC 2014 abstract)

It is now becoming common to analyze text as data, but text requires different kinds of preparation than numerical data does. This talk describes issues in scraping data from on-line sources, then tokenizing it, and finally n-gramming it. Popular methods for text analysis rely upon bag-of-words models, which lose semantic information, especially negation, but proper processing can recover some of it. We also describe methods for reducing the number of tokens to expedite computation. These ideas are illustrated in the context of mining a corpus of political blogs.