Mining Personal Traits in Social Media

SIAM SDM 2016 Tutorial

Aron Culotta  (Illinois Inst. of Technology) and Dongwon Lee (Penn State)




There has been rapid growth in applications informed by social media data in domains such as public health, marketing, and political science. A critical component in such applications is the need to “understand social users better.” For example, knowing the demographic information of a set of users allows for improved user experiences, more personalized content, and more robust analyses of population-level trends.

However, for many reasons, social media users generally do not willingly share their demographic information with others. It is known that only a small fraction of users in popular social networks fill out their personal profiles upon registration.

To address this challenge, the data mining and machine learning communities have in recent years proposed many novel solutions. These cover diverse personal traits such as gender, age, personality, occupation, location, and religion, and range from supervised to semi-supervised and unsupervised methods. Despite these active developments, however, there have been few attempts to give a themed and cohesive presentation of this important topic.

In this tutorial, therefore, we attempt to present the following:

  • Introduction of the personal traits mining problem in social media
  • Landscape of recent developments toward the problem
  • Presentation of representative supervised vs. semi/unsupervised solutions
  • An IPython-based hands-on demonstration or exercise
  • Implications and open issues
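As a rough illustration of what a supervised solution to this problem can look like, the sketch below trains a small multinomial naive Bayes classifier that guesses a trait label from the text of a post. It is not taken from the tutorial materials: the choice of naive Bayes, the function names `train_naive_bayes` and `predict`, the "occupation" labels, and the toy posts are all assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(posts, labels):
    """Fit a multinomial naive Bayes model on whitespace-tokenized posts."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter(labels)      # label -> number of training posts
    vocab = set()
    for text, label in zip(posts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log posterior (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_posts = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_posts in label_counts.items():
        score = math.log(n_posts / total_posts)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            smoothed = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy data: posts labeled with a hypothetical "occupation" trait.
toy_posts = [
    "midterm exam tomorrow and homework due tonight",
    "studying in the library before my lecture",
    "night shift at the hospital with many patients",
    "checking on patients before the end of my shift",
]
toy_labels = ["student", "student", "nurse", "nurse"]

model = train_naive_bayes(toy_posts, toy_labels)
prediction = predict(model, "long shift with patients today")
print(prediction)
```

A realistic system would need far more labeled users, better tokenization, and held-out evaluation. The sketch only shows the basic shape of the supervised setting: labeled examples in, a trait prediction for unseen text out.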


The tutorial consists of the following parts:

  • Prelude
    • Problem introduction
    • Overall landscape of solutions
  • Part 1: Supervised Solutions
    • Taxonomy
    • Representative solutions
    • Summary
  • Part 2: Semi-supervised and Unsupervised Solutions
    • Taxonomy
    • Representative solutions
    • Summary
  • Part 3: Hands-on Demo
  • Postlude
    • Implications and opportunities
    • Open issues

Speaker bios

Aron Culotta, Assistant Professor of Computer Science, Illinois Institute of Technology, USA (culotta@cs.iit.edu)

Aron Culotta is an Assistant Professor of Computer Science at the Illinois Institute of Technology in Chicago, where he leads the Text Analysis in the Public Interest lab (http://tapilab.github.io/). He obtained his Ph.D. in Computer Science from the University of Massachusetts, Amherst in 2008, advised by Dr. Andrew McCallum, where he developed machine learning algorithms for natural language processing. He was a Microsoft Live Labs Fellow from 2006 to 2008, and completed research internships at IBM, Google, and Microsoft Research. His work has received best paper awards at AAAI and CSCW.

Dongwon Lee, Associate Professor, Penn State University, USA (dongwon@psu.edu)

Dongwon Lee is an associate professor in the College of Information Sciences and Technology at the Pennsylvania State University, USA. From Jan. 2015 to Dec. 2016, he is also serving as a rotating program director at the National Science Foundation (NSF). He obtained his Ph.D. in Computer Science from UCLA in 2002. Since joining Penn State in 2002, working mostly on issues arising in the management and mining of data, he has (co-)authored over 130 scholarly articles in selective publication outlets in the database and data mining fields.