Text as Data

An Introduction to Natural Language Processing for Social Scientists


  • This seminar introduces social science researchers to natural language processing (NLP) using the user-friendly open-source software Orange.

  • In this endeavor, we are supported by two influential papers: Gentzkow et al.'s (2019) Text as Data and Jackson et al.'s (2021) From Text to Thought: How Analyzing Language Can Advance Psychological Science.

  • As Gentzkow et al. (2019) use formal notation, I am centering the workshop around Jackson et al. (2021), whose paper is much more user-friendly (note that they also accompany it with R scripts).

  • However, you are also invited to eventually consult the formalized versions of the concepts we discuss today in the Text as Data paper, as it is very useful for writing the methods section of your paper.

  • I would also recommend reading Salganik's (2017) book Bit by Bit: Social Research in the Digital Age, which provides an overview of almost all the relevant literature and also outlines possible use cases.


This workshop has five main parts:

  1. High-level overview of NLP, brief history, introduction to Orange and Twitter

  2. Methods for handling text (the journey from unstructured to structured data), descriptive analysis of text (distilling linguistic information)

  3. Tracking sentiment and emotions in text

  4. Finding topics of text (topic modeling)

  5. Finding meaning of words (word embeddings)

As an extra, I also include text classification and text clustering.

Presentations are available in the corresponding blocks below. This workshop has been recorded, and the recording is available to participants (contact me if you have trouble finding it).

Part 1: High-level overview of NLP, install Orange, add-ons, example workflows for NLP, Twitter

Presentation for the high-level overview of NLP can be found here.

Orange is freely available on its website. It also comes as part of the Anaconda distribution, but the standalone version from the website is usually more stable.

After you install Orange, open it, click Options - Add-ons, and install Text, Textable, and Timeseries (the latter to visualize the evolution of sentiment over time).

In this workshop, we will work with book texts that are available in Orange and with Twitter data, which are now freely available to researchers under Twitter Academic Access. Note that this means access to all Tweets published since March 2006. You have to write a short application to get the access; you can consult the application with me via e-mail. For now we will use the free access to Tweets published in the last 7 days, as I can't share my academic credentials. A list of other textual resources is in Jackson et al.'s paper; for news you can use, for example, the Guardian API. Everything we do here today can also be done in Python; you can find all the code in the GitHub repository NLP town notebooks.

Credentials for the free Twitter access:

key: m3XqVGgCUGHSMZwMit4tVIUt8

secret: zp6KG1SgGz0i57UxJClFEoNp6X2r3GMTDVXwNxJPhWhmwhHiGo

(I will revoke these keys after the workshop; afterwards you will put in your own credentials, which you can obtain at the Twitter Developer Portal.)

Orange provides some example workflows for NLP; you can find a repository of the examples here.

You can naturally also import your own text into Orange (Word, PDF, or plain text). The instructions are here, and there is also a video.

Part 2: Methods for handling text (the journey from unstructured to structured data), descriptive analysis of text (distilling linguistic information)

Information about preprocessing is in presentation 1.

Everything starts with tokenization: that's how you cut the text into pieces (usually words, sometimes sentences).

Then we remove stop words. Afterwards, if we don't have large text data, we usually perform normalization of words, typically in the form of lemmatization. Lemmatization means that we put all the words into their basic (dictionary) form, e.g. doing and doable become do.

We are using this Text Preprocessing workflow; a video that explains all the preprocessing steps (tokenization, lemmatization, ...) is here.

We have also seen how meaningless the results are if we don't perform stop-word removal.
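The preprocessing steps above (tokenization, stop-word removal, lemmatization) can be sketched in plain Python. Note that the stop-word list and lemma dictionary below are tiny illustrative stand-ins; real pipelines, including Orange's Preprocess Text widget, rely on full linguistic resources.

```python
import re

# Tiny illustrative resources (real pipelines use full lists/dictionaries)
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "we"}
LEMMAS = {"doing": "do", "doable": "do", "words": "word", "pieces": "piece"}

def preprocess(text):
    # 1. Tokenization: cut the text into word tokens
    tokens = re.findall(r"[a-z']+", text.lower())
    # 2. Stop-word removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3. Normalization (lemmatization): map each word to its dictionary form
    return [LEMMAS.get(t, t) for t in tokens]

print(preprocess("We are doing the doable: cutting words into pieces"))
# → ['do', 'do', 'cutting', 'word', 'into', 'piece']
```

The output is the structured representation (a list of normalized tokens) that all later steps, from word counts to topic models, build on.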

Part 3: Tracking Sentiment and Emotions in Text

Presentation for part 3 is available here

In this exercise, we are using the Story Arcs (evolution of sentiment) workflow; there is also a blog post about this exercise. Please note that to look at the cluster of sentences with negative (or positive) sentiment in the Corpus Viewer, you need to click on the blue cluster in the Heat Map widget. Then you will see the subset of negative sentences in the Corpus Viewer.

We can also filter out other words by creating a custom stop-word list (to filter out, for example, the verbs could and would).

We also apply sentiment analysis to Twitter data.
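To see what sentiment scoring does under the hood, here is a minimal lexicon-based sketch in Python. The word lists are tiny illustrative stand-ins for the real lexicons (such as VADER or Liu & Hu) that sentiment tools build on.

```python
# Tiny illustrative lexicons (real tools use validated word lists)
POSITIVE = {"good", "happy", "great", "love", "wonderful"}
NEGATIVE = {"bad", "sad", "terrible", "hate", "awful"}

def sentiment_score(sentence):
    words = sentence.lower().split()
    # Count positive minus negative words
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    # Normalize by sentence length so long and short sentences are comparable
    return score / max(len(words), 1)

# Scoring a sequence of sentences gives the story arc we plot in Orange
arc = [sentiment_score(s) for s in [
    "What a wonderful happy day",
    "Then everything went terrible and sad",
]]
print(arc)
```

Plotting such scores sentence by sentence (or tweet by tweet over time) is exactly the "evolution of sentiment" idea from the Story Arcs exercise.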

Part 4: Finding Topics of Text (Topic Modeling)

Presentation for part 4 is available here

A comprehensive guide on how to use topic models in management research can be found in Hannigan et al.'s (2019) paper Topic Modeling in Management Research: Rendering New Theory from Textual Data.

In this exercise, we are using the Orange workflow Twitter Data Analysis (topic modelling); there is also a video on how to use this exercise.

Part 5: Finding the Meaning of Words (Word Embeddings)

Presentation for part 5 is available here

As word embeddings are not yet implemented in Orange, I demonstrate word analogies using this Python notebook.

You can also explore this Python notebook to look deeper into the detection of gender bias in text using word embeddings.
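The core idea behind word analogies (king - man + woman ≈ queen) is just vector arithmetic plus nearest-neighbour search by cosine similarity. Here is a self-contained sketch on hand-made 3-dimensional toy vectors; real embeddings such as word2vec or GloVe have hundreds of dimensions and are trained on large corpora.

```python
import math

# Hand-made toy vectors: dimension 2 loosely encodes "male",
# dimension 3 loosely encodes "female" (illustrative only)
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.2, 0.9, 0.1],
    "woman": [0.2, 0.1, 0.9],
}

def cosine(a, b):
    # Cosine similarity: angle between two vectors, ignoring magnitude
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    # Compute vector(a) - vector(b) + vector(c) ...
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    # ... then return the closest word that isn't one of the inputs
    candidates = {w: v for w, v in VECTORS.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))

print(analogy("king", "man", "woman"))  # prints: queen
```

Gender-bias studies apply the same arithmetic to occupation words (e.g. comparing their similarity to he versus she) to measure stereotypical associations learned from text.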

Extra: Text Classification and Text Clustering

  • Please note that text classification uses supervised learning: you need to know the labels, and we try to predict the labels based on the words in the text. Here is a Text Classification workflow where we predict whether a tale is about animals or magic; video here. You can check the tale label (animal vs. magic) in the Corpus Viewer.

    • But we already know the labels of the tales in our corpus! So this exercise alone is not too useful. What is all this actually for? Now we can do out-of-sample prediction for tales that the model hasn't seen yet (andersen.tab). You see the probabilities for each tale type in the Predictions widget (0.99 vs. 0.01, etc.).

    • You can use the Nomogram to see which words have the highest impact on the classifier (this is called feature importance).

  • Text clustering groups documents into clusters based on their similarity; video here.

    • This is not topic modeling, as topic modeling looks inside the documents for topics.

    • Plug the Corpus Viewer in after Hierarchical Clustering and look at the clusters that are mixed: both mention animals!
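The supervised-learning logic of the tale exercise (train on labelled texts, then predict labels and probabilities for unseen documents) can be sketched in a few lines of scikit-learn. The four toy tales below are illustrative stand-ins for the labelled corpus used in the workshop.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: tales with known labels (animal vs. magic)
train_texts = [
    "the fox and the crow talked in the forest",
    "the wolf chased the rabbit over the hill",
    "the wizard cast a spell with his magic wand",
    "a fairy granted three wishes with her magic",
]
train_labels = ["animal", "animal", "magic", "magic"]

# Bag of words + Naive Bayes classifier in one pipeline
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Out-of-sample prediction for a tale the model has not seen,
# as with andersen.tab in the workshop
new_tale = ["a magic spell turned the prince to stone"]
print(model.predict(new_tale))        # predicted class label
print(model.predict_proba(new_tale))  # class probabilities
```

Inspecting the per-word weights of such a model is the code analogue of reading the Nomogram: it shows which words push a document toward each class.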