We originally proposed topic modeling with Latent Dirichlet Allocation (LDA), in which “each topic is a distribution of all observed words in the texts such that words that are strongly associated with the text’s dominant topics have a higher chance of being included” (Ignatow & Mihalcea, 2018, p. 210). Topic modeling is used to discover abstract topics from a collection of documents. We conducted LDA in our preliminary analyses of these data; however, the resulting topics consisted almost entirely of procedural words and did not provide the more substantive phrases that might speak to signaling. We therefore employed a related technique, based on trigrams, that did produce more substantively interpretable phrases.
Within the text classification methods implemented in scikit-learn, we moved beyond single-word representations to explore predictive phrases in the text: two-word phrases (bigrams) and three-word phrases (trigrams). Trigrams were the most informative, as they provided the most contextual information. Given the highly procedural way police reports are written, we removed trigrams that appeared in more than 50% of the texts, as these are procedural phrases that provide little signaling information. We then trained several supervised machine-learning methods to classify the police reports.
Method 1
The best-fitting method is logistic regression (Method 1). Logistic regression is a linear model for classification, also known as logit regression, maximum-entropy classification (MaxEnt), or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
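A minimal sketch of fitting such a classifier on trigram features follows; the labeled snippets are hypothetical, not drawn from the data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labels: 1 = Investigation Stalled, 0 = otherwise.
reports = [
    "prosecutor ruled no papers issued",
    "case would be forwarded to juvenile court",
    "victim come forward with new statement",
    "detective in charge received an assignment",
]
labels = [1, 1, 0, 0]

# Trigram counts feed a logistic (MaxEnt) classifier, which models the
# probability of each outcome of a single trial with a logistic function.
model = make_pipeline(
    CountVectorizer(ngram_range=(3, 3)),
    LogisticRegression(),
)
model.fit(reports, labels)
```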
Method 2
The second-best-fitting method (Method 2) uses the Complement Naive Bayes algorithm (ComplementNB, or CNB), a Bayesian learning approach commonly used in natural language processing (NLP). The classifier predicts the tag of a text, such as an email or a newspaper story, using Bayes’ theorem. CNB is an adaptation of the standard multinomial naive Bayes (MNB) algorithm that is particularly suited to imbalanced data sets. Specifically, CNB uses statistics from the complement of each class to compute the model’s weights. The parameter estimates for CNB are more stable than those for MNB, and CNB regularly outperforms MNB (often by a considerable margin) on text classification tasks.
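As a sketch, CNB can be fit with scikit-learn in the same pipeline style; the snippets and the deliberately imbalanced labels below are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Hypothetical, imbalanced labels (three 0s, one 1): the situation
# CNB's complement-based weighting is designed for.
reports = [
    "victim come forward with new statement",
    "no further investigative leads at this time",
    "detective in charge received an assignment",
    "prosecutor ruled no papers issued",
]
labels = [0, 0, 0, 1]

# CNB estimates each class's weights from the counts of the *complement*
# of that class (all documents outside it), stabilizing the estimates.
model = make_pipeline(CountVectorizer(ngram_range=(3, 3)), ComplementNB())
model.fit(reports, labels)
```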
Human-Detected Sentiment and Themes
Given that the sentiment lexicon in these analyses is based on noncriminal justice text, and text classifications frequently require a qualitative exploration and interpretation (Ignatow & Mihalcea, 2018), we also conduct human-detected sentiment and thematic analyses on portions of the data. These human-detected analyses provide a contextual understanding and validation of the machine-detected findings. For example, what does it mean to have “negative” text? What is being described? How is it being described? And who is being described negatively (if applicable)?
Case Outcome: Investigation Stalled
The logistic regression method (Method 1) generates both a positively predictive and a negatively predictive cluster of trigrams. For the outcome Investigation Stalled, trigrams with predictive value (positive, 1) indicate heavy prosecutorial involvement (“prosecutor [person name],” “ruled no papers,” “no papers issued”) (see below, Raw Results). While not formally forwarded for review, a sizable portion of the stalled cases include consultation with a prosecutor or mention that a case would be forwarded (versus was forwarded) to a prosecutor, juvenile court, or child welfare: an informal involvement of a prosecutor. These results suggest the prosecutor’s pre-screening and/or consultation serve as a gatekeeping process for these cases.
The inverse (negative, 0) of Investigation Stalled cases highlights the importance placed on victim involvement during investigations. The actions of the victim and the assigned sex crimes officer are in focus. Most trigrams involve the victim as subject (“victim come forward,” “victim has not,” “until the victim”) or sex crimes officer names or assignments (“received an assignment,” “lieutenant [name] officer,” “in charge detective”). The next group of trigrams contains phrases referring to leads, or the lack thereof (“investigative leads at,” “leads at this,” “no further investigative”). (See below for trigram categories.)
Text classification Method 2 (CNB) highlights “door,” “one,” and “operable” as the top three predictive words for Investigation Stalled cases, with the number of times each word appears in the predictive (positive, 1) class provided in parentheses. The top three words for the inverse (negative, 0) class are “crime,” “narrative,” and “sex.” (See below for the top three most frequent unigrams resulting from Method 2.)
Reference
Ignatow, G., & Mihalcea, R. (2018). An introduction to text mining: Research design, data collection, and analysis. SAGE Publications, Inc.