Research

Xiaoyi Yang

Khoury College of Computer Sciences

Northeastern University

The Big Bang Theory network: What we can learn from TV drama scripts

The Big Bang Theory is one of the most successful TV drama series, and the scripts of its 12 seasons are worth studying. For example, we may be interested in how to extract the storyline from the scripts and visualize the corresponding changes in sentiment and social relations. We also ask whether those changes are associated with different script writers and with online review scores. Most similar studies require researchers to read the scripts manually and draw a subjective visualization. In our work, we want to use NLP techniques, graphical models, and dynamic networks to create more objective visualizations, which may also reveal what makes a TV drama successful.

Thesis: Learning social networks from text data (Co-advisors: Rebecca Nugent, Nynke Niezink)

Describing and characterizing the impact of historical figures can be challenging, but unraveling their social structures is perhaps even more so. Historical social network analysis methods can help, and may also illuminate people who have been overlooked by historians but turn out to be influential social connection points. Text data, such as biographies, can be a useful source of information for learning the structure of historical social networks, but can also introduce challenges in identifying links. One approach is a Local Poisson Graphical Lasso model, which incorporates a conditional independence structure and uses the number of co-mentions in the text to measure relationships between people. This structure reduces the tendency to overstate the relationship between "friends of friends", but given the historically high frequency of common names, we can still introduce incorrect links without additional distinguishing information.
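As a rough illustration of the input to such a model, the sketch below tabulates mentions and co-mentions from a few documents. It assumes each document is a plain string and that names are matched by exact substring. The function names `mention_counts` and `co_mentions` and the toy sentences are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from itertools import combinations


def mention_counts(documents, names):
    """Return one Counter per document mapping name -> number of mentions."""
    return [Counter({name: doc.count(name) for name in names}) for doc in documents]


def co_mentions(documents, names):
    """Count, for each pair of names, the documents that mention both."""
    pair_counts = Counter()
    for counts in mention_counts(documents, names):
        present = sorted(name for name in names if counts[name] > 0)
        for a, b in combinations(present, 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Toy, made-up text; not taken from any real biography.
documents = [
    "Alice wrote to Bob. Bob replied to Alice.",
    "Bob met Carol in London.",
    "Alice, Bob and Carol dined together.",
]
names = ["Alice", "Bob", "Carol"]
pairs = co_mentions(documents, names)
```

Raw counts like these treat every shared document as evidence of a tie, which is where the "friends of friends" overstatement comes from; a graphical model instead asks whether two people remain associated after accounting for everyone else.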

In this work, we extend the Local Poisson Graphical Lasso model with a (multiple) penalty structure that incorporates covariates giving increased link probabilities for people with shared covariate information. We propose both Greedy and Bayesian approaches to estimate the penalty parameters for each potential covariate. We present results on data sets simulated using historical information and characteristics from the Six Degrees of Francis Bacon project (SDFB) and show that this type of penalty structure can improve network recovery as measured by precision/recall. We also illustrate this approach on a subset of the SDFB network targeting 1500 to 1575.
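Network recovery measured by precision/recall can be sketched as follows. This is a minimal example that assumes both networks are given as lists of undirected name pairs. The helper names `edge_set` and `precision_recall` and the toy edges are hypothetical and are not SDFB data.

```python
def edge_set(pairs):
    """Normalize undirected edges so that (a, b) and (b, a) are the same edge."""
    return {frozenset(pair) for pair in pairs}


def precision_recall(estimated, truth):
    """Precision: share of estimated edges that are true.
    Recall: share of true edges that were recovered."""
    est, true = edge_set(estimated), edge_set(truth)
    hits = len(est & true)
    precision = hits / len(est) if est else 0.0
    recall = hits / len(true) if true else 0.0
    return precision, recall


# Toy example: two of the four estimated edges are real,
# and two of the three real edges are recovered.
truth = [("A", "B"), ("B", "C"), ("C", "D")]
estimated = [("B", "A"), ("C", "B"), ("A", "D"), ("A", "C")]
precision, recall = precision_recall(estimated, truth)
```

In a simulation study the true network is known, so these two numbers can be compared across penalty settings.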

Proposal link

Paper: DOI: 10.1007/s10260-021-00586-2

Historical Record Linkage in Ohio during the early 20th century, in collaboration with the University of Michigan LIFE-M Project (PI: Martha Bailey, UM Economics)

As part of the LIFE-M project, my work focuses on creating links between Ohio birth certificates from the 1920s and Ohio 1940 census records. We developed a new model called the "Highlander Probability Model (HPM)", which divides the traditional record linkage problem into two steps and focuses more on distinguishing between highly similar entities. A similar methodology has also been applied to North Carolina data from around the same period. The related paper is in progress. Here is a Shiny App demo for the data and the model.
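The page does not spell out the two steps of HPM, so the following is only a generic sketch of a two-step linkage rule, not the HPM itself. It assumes names are compared with the standard library's `difflib.SequenceMatcher`, and the thresholds `min_score` and `min_gap` are made-up illustrative values. Step 1 shortlists plausible candidates. Step 2 accepts the best one only if it is clearly separated from the runner-up.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1]; a stand-in score, not the HPM probability."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_record(query, candidates, min_score=0.8, min_gap=0.1):
    """Toy two-step linkage.

    Step 1: keep candidates whose similarity to the query is at least min_score.
    Step 2: accept the best candidate only if it beats the runner-up by min_gap.
    Returns the matched candidate, or None if there is no match or it is ambiguous.
    """
    scored = sorted(((similarity(query, c), c) for c in candidates), reverse=True)
    shortlist = [(score, c) for score, c in scored if score >= min_score]
    if not shortlist:
        return None
    if len(shortlist) > 1 and shortlist[0][0] - shortlist[1][0] < min_gap:
        return None  # highly similar rivals: refuse to pick one
    return shortlist[0][1]


# Made-up names for illustration only.
clear = link_record("John Smith", ["John Smith", "Mary Jones"])
ambiguous = link_record("John Smith", ["Jon Smith", "John Smyth"])
```

Here `clear` resolves to a single record, while `ambiguous` is left unlinked because two near-identical candidates compete for the same query.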

Conferences and Workshops:

The 3rd North American Social Networks Conference (NASN), 2021. Jan 25th-28th, 2021
- - Contributed Talk: Learning Social Networks from Text Data with Covariate Information (Oral Presentation Award Finalist)
The Network Science Society 2020. Sep 17th-25th, 2020
- - Poster: Learning Social Networks from Text data
Statistical Inference for Network Models (SINM): Sep 20th, 2020
- - Contributed Talk: Learning Social Networks from Text data
JSM: Aug 1st - 6th, 2020
- - Poster on Aug 4th: Assessing the resources and requirements of statistics education in forensic science
2020 Symposium on Data Science & Statistics: Jun 3rd-6th, 2020
- - Invited talk: Learning a social network from text data
LIFE-M Project Broad Meeting: June 1st-2nd, 2017 and Oct 25th - 26th, 2018
- - Met with the collaborators and discussed current progress on the LIFE-M project.
Working Group on Model-Based Clustering Summer Session: Ann Arbor, July 15th - 20th, 2018
- - Poster: Historical Record Linkage with Highlander Probability Model

Previous Undergraduate Research (Details):

The performance of different initialization in social network data clustering (Instructors: Karl Rohe, Norbert Binkiewicz, 2015)
Detecting origin of expansion in the biased voter model (Instructor: Wai-Tong (Louis) Fan, Collaborator: Craig Knuth, 2016)
