Topic modelling for problem solving
Data and files used in this guide available to download from following link:
https://www.dropbox.com/s/0x7k8h7gy2qojvl/boston_fio.zip?dl=0
1. Text mining
Vast amounts of the records generated within databases, phones, surveys, e-mails and other communications are made up of unstructured data (including qualitative free text). Text mining describes processes that enable us to extract phrases or keywords from texts so that we can organise their content and uncover key concepts. We can leverage text mining to quickly discover patterns, emerging topics or links and group them together without having to read entire collections of crime reports and investigation details, for example. This can significantly reduce (or in some instances eliminate) the time an analyst or researcher may spend manually processing, reading, coding, theming and categorising text information. Below are some examples of when text mining processes might be used in public safety:
Criminal patterns and detection, crime linkage using modus operandi information,
Developing profiles from modus operandi, keywords and outcomes using decision trees,
Gauging public perception from survey data (sentiment analysis),
Understanding why reporting victims choose not to support formal investigations (summarisation and categorisation),
Capturing threats from the web (news, chatrooms, social media, forums etc).
The examples above sit mainly within the remit of tactical, investigative and strategic crime analysis. Text mining techniques may be less likely to find use in some areas of real-time intelligence threat and risk analysis. Expertise in that area is often driven by practical experience and focused through a narrower lens on specific problems where intelligence volumes may be sparse enough to process manually, or which require specific knowledge and experience (e.g. organised immigration crime).
2. Topic modelling
As a beginner to text mining processes, my interest was first drawn to topic modelling. Topic modelling is a process used to find similar or related themes across documents to enable grouping. This can enhance our ability to perform problem-solving analysis, which often requires us to break down something broad (e.g., knife crime) into something more specific that is suitable for problem-solving (e.g., knife-enabled robberies of teenagers in the after-school period, or knife-carrying among young adults in specific geographical areas where group violence persists). Traditionally, analysts and researchers have performed this through manual reading, coding and theming of records, a practice that can be extremely resource-intensive and prone to error and inconsistency. In a bid to save time, it's common for analysts or researchers to prioritise searches for keywords or specific problems that are already known. This introduces bias and risks missing new or emerging trends, particularly when dealing with large-volume problems. Where many unknowns exist it can also force superficial analysis, which limits our understanding of a problem and, subsequently, our options for responding.
One example is physical violence. Anyone who has tried to manually derive topics from physical violence records will no doubt have extracted, broadly, topics of intimate partner violence, familial violence, night-time economy and weapon-enabled violence. Then they will have been left with a very large 'other' category that becomes extremely difficult to manually code and theme. It might contain topics like vehicle-related violence (e.g. road rage resulting from poor driving or parking), bullying among school pupils that has escalated to fistfights, rude encounters escalating over trivial matters, retaliatory disputes and so on. Due to data volume, it's not feasible to continue the task of determining whether there are other problems that require our attention and might be addressable. Similarly, we may not delve further into sub-topics such as familial violence, which could encompass adult children inflicting violence on elderly relatives, chastising of young children by parents, or violence among siblings.
3. Getting started
In my search for openly accessible unstructured policing text data, I stumbled upon Boston Police Field, Interrogation and Observation (FIO) reports, available at Analyze Boston. I made some transformations to this data: joining the person/subject details alongside the FIO details, merging data covering 2015-2018 and applying basic geocoding (Google Earth) to records where address data were available. This transformed version is available from the link at the top of this page.
To give an idea of the type of text entries the dataset contains, a few examples are shown below:
BOTH INDIVIDUALS OBSERVED ON THE SIDEWALK FACING THE DOORS OF THE BOSTON POLICE ACADEMY ON THE WILLIAMS AVENUE SIDE. THE INDIVIDUAL IN THE RED T-SHIRT WAS TAKING PHOTOGRAPHS OF THE ACADEMY DOORS WITH WHAT APPEARED TO BE A SMARTPHONE. BOTH INDIVIDUALS WALKED UP WILLIAMS AVENUE AND THEN TOOK A LEFT ON SUMMIT STREET.
LARGE GROUP OF ORCHARD PARK LENOX AND VNF ASSOCIATES SMOKING WEED IN FRONT OF XXX ON SIDEWALK AND INSIDE MA/XXX (XXX'S CAR) AND MA/XXX (XXXS CAR). NEITHER MALE HAS A LICENSE BUT CARS WERE PARKED ON THE SIDE OF THE ROAD.
PROSTITUTION INVEST - OFFICERS OBSERVED XXX PICK UP XXX IN MA REG XXX AT DORCHESTER AVE AND ORCHARDFIELD STREET. OFFICERS PULLED BEHIND VEHICLE WHICH STOPPED AND XXX EXITED THE VEHICLE. BOTH XXX AND XXX STATED THAT XXX WAS GIVING XXX A RIDE TO INTERVALE STREET. XXX HAS MULTIPLE SEX FOR FEE CONVICTIONS. FIOD AND RELEASED.
We're now going to jump straight in with the sequence of code chunks and outputs for performing topic modelling in R using what is known as Latent Dirichlet Allocation (LDA). LDA is a probability-based approach that identifies clusters by observing word frequencies and distributions across all documents (in our case, each FIO text record) to define topics. It is considered latent because we're identifying concealed topics that have not been explicitly predefined. Dirichlet distributions are used here to model the mixture of topics within each document and the distribution of words within each topic.
The goal of this guide is to analyse Boston Police FIO records to understand which types of police activities and encounters make up FIO demand. A specific problem could be: what types of activity or problem create the largest FIO demand in Boston?
4. Loading libraries and data
Using RStudio the first requirement is to retrieve the packages and read in the data being used.
# Libraries
library(tidyverse)
library(tidytext) # convert and tidy text data
library(tm) # document-term matrices
library(topicmodels) # topic modelling (LDA)
library(reshape2) # reshape matrices
# Read in data
# download, extract and read in csv from dropbox zip file
temp <- tempfile()
download.file("https://www.dropbox.com/s/0x7k8h7gy2qojvl/boston_fio.zip?dl=1",
              temp, mode = "wb") # dl=1 requests a direct download from Dropbox
bpdfio <- read.csv(unz(temp, "boston_fio.csv"), header = TRUE)
# data has duplicate records where multiple persons fio'd, create a distinct event version
bpdfio_distinct <- bpdfio %>%
dplyr::select(fc_num, date, timestamp, streetaddr, zip,
searchperson, stop_duration, contact_reason,
xcoord, ycoord) %>%
distinct()
The bpdfio_distinct dataset will look like this (image left), containing a unique ID (fc_num), date, timestamp, street address (streetaddr), zip code, search person flag, stop duration, contact reason and geographic coordinates for where the FIO record occurred. This data is now ready to use.
5. Text organisation and tokenisation
Every separate body of text (here an FIO contact reason) is known as a document, every unique word is known as a term and every occurrence of a term is a token. Creating a 'bag-of-words' from the documents is known as tokenising.
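To make these terms concrete, here is a tiny toy example using two invented sentences (not from the FIO data):

```r
library(tibble)
library(dplyr)
library(tidytext)

# two invented 'documents'
toy <- tibble(doc_id = 1:2,
              text = c("Officer observed male on sidewalk",
                       "Male observed near parked vehicle"))

# tokenise: one row per token, lower-cased by default
toy_tokens <- toy %>% unnest_tokens(word, text)

nrow(toy_tokens)             # 10 tokens in total
n_distinct(toy_tokens$word)  # 8 unique terms ('male' and 'observed' repeat)
```

Here each sentence is a document, each of the 8 distinct words is a term, and each of the 10 rows in the output is a token.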
Disaggregating the documents into terms uses the unnest_tokens() function. We can then begin to look at the word frequencies, as demonstrated in the code below.
# unnest_tokens assigns each document term to a row
tidy_fio <- bpdfio_distinct %>%
unnest_tokens(word, contact_reason)
# view unnested
tidy_fio
# view wordcount
tidy_fio %>%
count(word) %>%
arrange(desc(n))
The outputs are shown below - an un-nested dataset where each document is now represented by one row per word, and a word frequency table of all terms. At this stage it's not particularly useful, as we need to remove stop words (the, and, a etc.) and there may be other words we choose to remove. An example visible here is XXX, which denotes redacted personal details.
Un-nested creates one word per row for each document, see far right column, "while on random patrol officers..."
Word frequencies across all documents, not particularly helpful at this stage as they include stop words
We can create a table of custom words we would like to remove using the code below.
# custom stop words: use tribble() to create a data frame with columns
# word and lexicon, then enter our custom words to be removed
custom_stop <- tribble(
~word, ~lexicon,
"x", "CUSTOM",
"xx", "CUSTOM",
"xxx", "CUSTOM",
"xxxx", "CUSTOM",
"xxxxx", "CUSTOM",
"xxxxxx", "CUSTOM",
"xxxxxxx", "CUSTOM")
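As an alternative to enumerating every redaction length, a regular expression can catch any run of the letter x in one step. A small sketch using stringr (the tokens vector here is invented for illustration):

```r
library(stringr)

# invented tokens: three redaction strings of different lengths plus a real word
tokens <- c("xx", "xxx", "xxxxxxxx", "patrol")

# keep only tokens that are NOT just the letter x repeated
kept <- tokens[!str_detect(tokens, "^x+$")]
kept  # "patrol"
```

In the tidy pipeline this becomes filter(!str_detect(word, "^x+$")), which avoids having to guess the maximum redaction length in advance.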
It is then possible to remove the stop words and our custom stop words using the code below, followed by a recount of the word frequencies. They look a little more useful but still might not be suitable for our purpose.
# remove stop words and custom stop words
tidy_fiov2 <- bpdfio_distinct %>%
unnest_tokens(word, contact_reason) %>%
filter(!word %in% stop_words$word) %>%
filter(!word %in% custom_stop$word)
# count again
tidy_fiov2 %>%
count(word) %>%
arrange(desc(n))
6. N-grams
Sometimes we might want to try breaking our documents down into n-grams: sequences of words rather than single words. The code below shows how to tokenise n-grams specifying n = 2 (bigrams), and then how we can separate those to remove stop words before reuniting them. The output is a different set of high-frequency terms, this time with the first two relating to vehicle-linked FIOs (traffic stop, and ma reg, which I presume refers to a Massachusetts vehicle registration plate).
# n-grams
ngram_fio <- bpdfio_distinct %>%
unnest_tokens(bigram, contact_reason, token = "ngrams", n = 2)
# tidy - separate and remove stop words, then reunite
tidy_ngram <- ngram_fio %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!word1 %in% custom_stop$word) %>%
filter(!word2 %in% custom_stop$word) %>%
unite(bigram, c("word1", "word2"), sep=" ") %>%
filter(bigram != 'NA' & bigram != 'NA NA')
# count bigrams
tidy_ngram %>%
count(bigram) %>%
arrange(desc(n))
More useful terms using bigrams; we can identify potential topics relating to traffic stops, firearms and gang violence.
7. Extract features and model/analyse
Before we can run Latent Dirichlet Allocation (LDA) models on our bigram data, we first need to cast a Document Term Matrix (DTM). This is a large matrix where each row represents a document and each column represents a term (here, a bigram). It is too large for us to view: for the Boston FIO bigram data, it is over 30,000 rows and 106,000 columns. The code for creating the DTM is shown below; the DTM is what we pass to the LDA model.
# cast a document term matrix
dtm <- tidy_ngram %>%
count(fc_num, bigram) %>%
cast_dtm(fc_num, bigram, n)
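If a matrix of this size proves unwieldy, tm can drop very sparse terms before modelling. A sketch on a tiny invented corpus (the 0.6 threshold is illustrative; tune it to your own data):

```r
library(tm)

# four invented mini-documents
docs <- Corpus(VectorSource(c("traffic stop", "traffic stop reg",
                              "shots fired", "traffic reg")))
dtm_toy <- DocumentTermMatrix(docs)

# remove terms whose sparsity (share of documents missing the term)
# exceeds 0.6 - here this drops 'shots' and 'fired'
dtm_small <- removeSparseTerms(dtm_toy, sparse = 0.6)
Terms(dtm_small)
```

Trimming rare terms shrinks the matrix substantially, at the cost of losing very infrequent bigrams that might define small topics.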
To run the model we pass the document term matrix (which we've called dtm) to LDA() and set the parameters; the fitted model is assigned to the name mod. k is the number of clusters or topics we want, set to six in the instance below.
# assign your dtm to an LDA model and set parameters; k = number of topics we want to set
mod <- LDA(dtm,
k = 6,
method = "Gibbs",
control = list(alpha = 0.5,
iter = 500,
seed = 1234))
We can run multiple models, each time changing the value of k. With each run we can make a note of aspects of model performance, including coherence and perplexity. We can then compare the perplexity scores across models to help determine how many topic clusters to set for k.
The chart left shows the perplexity score for each model run on the Boston FIO data, covering k = 2 through 10 and k = 20. A lower perplexity score is preferred. The chart line forms what is known as an "elbow", which helps us select the optimal number of clusters, k. If the line were an arm, the elbow is where the point of inflection lies on the curve: where the improvement in perplexity scores becomes marginal, we select k from there.
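The perplexity sweep described above can be sketched as a small helper that fits a Gibbs model per candidate k and scores it against the same matrix. perplexity_by_k is a hypothetical helper name, and this assumes the dtm cast in the previous step:

```r
library(topicmodels)

# hypothetical helper: fit a Gibbs LDA for each candidate k and record
# the perplexity of the fitted model on the document-term matrix
perplexity_by_k <- function(dtm, ks, seed = 1234) {
  sapply(ks, function(k) {
    m <- LDA(dtm, k = k, method = "Gibbs",
             control = list(alpha = 0.5, iter = 200, seed = seed))
    perplexity(m, dtm)
  })
}

# usage with the FIO matrix from earlier (slow on the full data):
# scores <- perplexity_by_k(dtm, 2:10)
# plot(2:10, scores, type = "b")  # lower is better; look for the elbow
```

Note this scores each model on its own training matrix, which is the quick version of the elbow check; held-out perplexity would be more rigorous.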
We can also visualise the top bigrams in each topic and begin to make sense of them. The charts below show our topics using k = 6. Higher-scoring terms are more probable within that topic. As an example, if we take topic number 6 we can see that 'traffic stop' and 'ma reg' are the most probable terms in documents found within this topic.
Topic 2 looks to be drug-related offenders and investigations, including multiple terms containing the words drug and crack. Topic 6 looks to be largely vehicle-related FIO records, including traffic stop, motor vehicle, stop vals and front passenger. Topic 5 appears to be gang-related violence, mentioning gang names/neighbourhoods in Boston including Heath Street, Orchard Park and Lenox Street, the term gang associate, and yvsf, which refers to the Boston Youth Violence Strikeforce.
The code below demonstrates how to assign our LDA model output to a matrix, tidy it up and then visualise the terms most associated with each topic as a chart, as shown above.
# assign your lda model output as a matrix
mod_matrix <- mod %>% tidy(matrix = "beta")
# create a data frame capturing the top 10 most probable words for each topic
wordprob <- mod_matrix %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
mutate(term2 = fct_reorder(term, beta))
# visualise this information as a horizontal bar chart
ggplot(wordprob, aes(term2, beta, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
8. Using your findings for analysis
Once you have completed your LDA modelling task there are a few options on where to go next. You may wish to use the key terms and words you've uncovered to perform some pattern matching of your free text data in order to extract topics you wish to analyse further. Say you were interested in exploring 'shots fired' and 'parking lots', uncovered under topic 3, as a particularly specific combination. You would be able to search those terms and retrieve the records that relate to them. This uses your findings in a traditional way: searching keywords, strings and terms that we've identified as belonging to a more narrowly defined problem.
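A minimal sketch of that keyword retrieval, using a tiny invented stand-in for the FIO data (in the guide this would be bpdfio_distinct and its contact_reason column):

```r
library(dplyr)
library(stringr)

# invented stand-in records
fio <- tibble::tibble(
  fc_num = 1:3,
  contact_reason = c("SHOTS FIRED REPORTED IN PARKING LOT",
                     "TRAFFIC STOP MA REG",
                     "GROUP OBSERVED IN PARKING LOT"))

# case-insensitive search for either term from the topic
hits <- fio %>%
  filter(str_detect(contact_reason,
                    regex("shots fired|parking lot", ignore_case = TRUE)))

nrow(hits)  # 2 - records 1 and 3 match
```

The regex alternation keeps this to one filter; add further terms from the topic with additional `|` separators.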
If this were burglary data, it might be that our topics relate to different combinations of modus operandi, entry points, household and building types (see Birks et al, 2020).
We can also use our LDA model to assign the topic number probability to each term, and then assign a topic to each of our original documents based on the maximum probability. We can use this to view how our topics are distributed, statistically or geographically for example, or begin to use the topics for problem analysis. The code steps for achieving this are shown below.
# assign probability of topics to each document using your mod and dtm
post_probs <- topicmodels::posterior(mod, dtm)
#classify documents by finding topic with max prob per doc
top_topic_per_doc <- apply(post_probs$topics, 1, which.max)
# create a matrix and add the classified topic to the original data
# turn to a matrix
new <- as.matrix(top_topic_per_doc)
# then to a data frame
copy <- as.data.frame(new)
# bring in the id number - fc_num - as a column, and name the topic column
copy <- tibble::rownames_to_column(copy, "fc_num")
colnames(copy) <- c("fc_num", "topic")
# join the assigned topic to your original data; this will automatically join on fc_num
# (if the join fails, check that fc_num has the same type in both data frames)
topics <- inner_join(bpdfio_distinct, copy)
# continue to use topics for further analysis in R, or export as csv to explore elsewhere
write.csv(topics, "topics.csv", row.names = FALSE)
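Once the topics are joined back on, checking how documents distribute across them is a one-liner. A sketch on an invented classified data frame (in the guide this would be the topics object, assuming its assigned-topic column is named topic):

```r
library(dplyr)

# invented stand-in for the classified output
topics_toy <- tibble::tibble(fc_num = 1:6,
                             topic = c(1, 1, 2, 3, 3, 3))

# documents per topic, most common first
topic_counts <- topics_toy %>% count(topic, sort = TRUE)
topic_counts  # topic 3 accounts for half the documents
```

The same count, grouped additionally by month or area, is a quick way to see how topics distribute over time or geography before deeper analysis.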
Due to a personal interest in the gang violence topic found within the FIO documents, I chose to begin exploring it alongside Boston PD shootings/homicide data and a gang territories layer found at an unofficial open source (Map of Boston Gangs).
I created some basic outputs visualising hotspots where Boston PD field, interrogation and observation reports in the gang violence topic were concentrated. These were particularly concentrated around three areas labelled as the gang areas of Heath Street, Orchard Park and Lenox. A second visual shows statistically significant volume hotspots for shootings in the city of Boston, and a final visual shows arbitrary grid areas with high homicide volumes (all data covers the period 2015-2018).
These maps show just two neighbourhoods of Boston, Roxbury and Dorchester, where these problems are most concentrated.
I will add to this page as my learning and knowledge of topic modelling develops. I hope this is a useful guide to pique newcomers' interest and help them get started.
References
Birks, D., Coleman, A. and Jackson, D. (2020). Unsupervised identification of crime problems from police free-text data. Crime Science, 9(1). https://crimesciencejournal.biomedcentral.com/articles/10.1186/s40163-020-00127-4
Silge, J. and Robinson, D. (2017). Text Mining with R: A Tidy Approach. Beijing; Boston: O'Reilly. Available online: https://www.tidytextmining.com/topicmodeling.html
Tang, F. (2019). Beginner’s Guide to LDA Topic Modelling with R. [online] Medium. Available at: https://towardsdatascience.com/beginners-guide-to-lda-topic-modelling-with-r-e57a5a8e7a25.