Topic modelling for problem solving

The data and files used in this guide are available to download from the following link:

https://www.dropbox.com/s/0x7k8h7gy2qojvl/boston_fio.zip?dl=0 

1. Text mining 

Vast amounts of the records generated within databases, phones, surveys, e-mails and other communications are made up of unstructured data (including qualitative free text). Text mining describes processes that enable us to extract phrases or keywords from texts so that we can organise their content and uncover key concepts. We can leverage text mining to quickly discover patterns, emerging topics or links and group them together without having to read, for example, entire collections of crime reports and investigation details. This can significantly reduce (or in some instances eliminate) the time an analyst or researcher spends manually processing, reading, coding, theming and categorising text information. Below are some examples of when text mining processes might be used in public safety:

The examples above sit mainly within the remit of tactical, investigative and strategic crime analysis. Text mining techniques may be less likely to find use in some areas of real-time intelligence, threat and risk analysis. Expertise in these areas is often driven by practical experience and focuses through a narrower lens on specific problems where intelligence volumes are sparse enough to process manually, or where specific knowledge and experience is required (e.g. organised immigration crime).

2. Topic modelling

As a beginner to text mining processes, my interest was first drawn to topic modelling. Topic modelling is a process used to find similar or related themes across documents so that they can be grouped. This can enhance our ability to perform problem-solving analysis, which often requires us to break down something broad (e.g. knife crime) into something more specific that is suitable for problem-solving (e.g. knife-enabled robberies of teenagers in the after-school period, or knife-carrying among young adults in specific geographical areas where group violence persists). Traditionally, analysts and researchers have done this through manual reading, coding and theming of records, a practice that can be extremely resource-intensive and prone to errors and inconsistencies. In a bid to save time, it is common for analysts or researchers to prioritise searches for keywords or specific problems that are already known. This introduces bias and risks missing new or emerging trends, particularly when dealing with large-volume problems. Where many unknowns exist, it can also force superficial analysis, which limits our understanding of a problem and, subsequently, our options for responding.

One example is physical violence - anyone who has ever tried to manually derive topics from physical violence records will no doubt have extracted, broadly, topics of intimate partner violence, familial violence, night-time economy violence and weapon-enabled violence. They will then have been left with a very large 'other' category that becomes extremely difficult to manually code and theme. It might contain topics like vehicle-related violence (e.g. road rage resulting from poor driving or parking), bullying among school pupils that has escalated into fistfights, rude encounters escalating over very trivial matters, retaliatory disputes and so on. Due to data volume, it is rarely feasible to continue the task to determine whether there are other problems that require our attention and might be addressable. Similarly, we may not delve further into sub-topics such as familial violence, which could encompass adult children inflicting violence on elderly relatives, chastising of young children by parents, or violence among siblings.

3. Getting started

In my search for openly accessible unstructured policing text data, I stumbled upon Boston Police Field, Interrogation and Observation (FIO) reports, available at Analyze Boston. I made some transformations to this data: joining the person/subject details alongside the FIO details, merging data covering 2015-2018 and applying basic geocoding (Google Earth) to records where address data were available. This transformed version is available from the link at the top of this page.

To give an idea of the type of text entries the dataset contains, a few examples are shown below:

We're now going to jump straight in with the sequence of code chunks and outputs for performing topic modelling in R using Latent Dirichlet Allocation (LDA). LDA is a probability-based approach that identifies clusters by observing word frequencies and distributions across all documents (in our case, each FIO text record) to define topics. It is considered latent because we are identifying concealed topics that have not been explicitly predefined. Dirichlet distributions are used as priors for the distribution of topics within each document and of words within each topic.
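For readers who like notation, the generative process LDA assumes can be sketched as follows (this is the standard textbook formulation, not anything specific to the FIO data; $\alpha$ and $\beta$ are the Dirichlet hyperparameters):

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture for document } d\\
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution for topic } k\\
z_{d,n} &\sim \mathrm{Multinomial}(\theta_d) && \text{topic assigned to the } n\text{th token of } d\\
w_{d,n} &\sim \mathrm{Multinomial}(\phi_{z_{d,n}}) && \text{observed word}
\end{align*}
```

Fitting the model works backwards from the observed words $w_{d,n}$ to estimate the hidden (latent) $\theta$ and $\phi$ distributions.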

The goal of this guide is to analyse Boston Police FIO records to understand which types of police activities and encounters make up FIO work. A specific question could be: what types of activities or problems create the largest FIO demand in Boston?

4. Loading libraries and data

Using RStudio the first requirement is to retrieve the packages and read in the data being used.

# Libraries

library(tidyverse)

library(tidytext) # convert and tidy text data

library(tm)          # term matrices

library(topicmodels)    # topic modelling lda

library(reshape2)    # reshape matrices


# Read in data

# download, extract and read in csv from dropbox zip file

temp <- tempfile()

# dl=1 requests a direct file download from Dropbox (dl=0 returns a preview page)

download.file("https://www.dropbox.com/s/0x7k8h7gy2qojvl/boston_fio.zip?dl=1", temp, mode = "wb")

# read the csv directly from inside the downloaded zip

bpdfio <- read.csv(unz(temp, "boston_fio.csv"), sep = ",", header = TRUE)


# data has duplicate records where multiple persons fio'd, create a distinct event version

bpdfio_distinct <- bpdfio %>% 

  dplyr::select(fc_num, date, timestamp, streetaddr, zip, 

                searchperson, stop_duration, contact_reason,

                xcoord, ycoord) %>%

  distinct()

The bpdfio_distinct dataset will look like this (image left), containing a unique ID (fc_num), date, timestamp, street address (streetaddr), zip code, search person flag, stop duration, contact reason and geographic coordinates for where the FIO record occurred. This data is now ready to use.
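As a quick sanity check before tokenising, we can inspect the structure and confirm the de-duplication behaved as expected (a small sketch; the column names follow the select() above):

```r
# inspect column types and a preview of the distinct events table
glimpse(bpdfio_distinct)

# each fc_num should now represent one distinct event;
# if this returns FALSE, some events differ on other selected columns
n_distinct(bpdfio_distinct$fc_num) == nrow(bpdfio_distinct)
```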

5. Text organisation and tokenisation

Every separate body of text (here an FIO contact reason) is known as a document, every unique word is known as a term and every occurrence of a term is a token. Creating a 'bag-of-words' from the documents is known as tokenising.

Disaggregating each document into terms uses the function unnest_tokens(). We can then begin to look at the word frequencies, as demonstrated in the code below.

# unnest_tokens assigns each document term to a row

tidy_fio <- bpdfio_distinct %>%

  unnest_tokens(word, contact_reason)

# view unnested

tidy_fio

# view wordcount

tidy_fio %>%

  count(word) %>%

  arrange(desc(n))

The outputs are shown below - an un-nested dataset where each document is now represented by one row per word, and a word frequency table of all terms. At this stage it's not particularly useful, as we need to remove stop words (the, and, a, etc.) and there may be other words we choose to remove. An example visible here is the use of XXX, which denotes redacted personal details.

Un-nesting creates one word per row for each document; see the far-right column, "while on random patrol officers..."

Word frequencies across all documents, not particularly helpful at this stage as they include stop words

We can create a table of custom words we would like to remove using the code below. 

# custom stop words, use tribble() to create a data frame, create column headers word and lexicon, then enter our custom words to be removed

custom_stop <- tribble(

  ~word, ~lexicon,

  "x", "CUSTOM",

  "xx", "CUSTOM",

  "xxx", "CUSTOM",

  "xxxx", "CUSTOM",

  "xxxxx", "CUSTOM",

  "xxxxxx", "CUSTOM",

  "xxxxxxx", "CUSTOM")

It is then possible to remove the stop words and our custom stop words using the code below, followed by a recount of the word frequencies. They look a little more useful but still might not be suitable for our purpose. 

# remove stop words and custom stop words

tidy_fiov2 <- bpdfio_distinct %>%

  unnest_tokens(word, contact_reason) %>%

  filter(!word %in% stop_words$word) %>%

  filter(!word %in% custom_stop$word)


# count again

tidy_fiov2 %>%

  count(word) %>%

  arrange(desc(n))

6. N-grams

Sometimes we might want to try breaking our documents down into n-grams: sequences of words rather than single words. The code below shows how to tokenise n-grams specifying n = 2 (bigrams), and then how to separate those to remove stop words before reuniting them. The output is a different set of high-frequency terms, this time with the first two relating to vehicle-linked FIOs (traffic stop, and ma reg, which I presume refers to a Massachusetts vehicle registration plate).

# n-grams

ngram_fio <- bpdfio_distinct %>%

  unnest_tokens(bigram, contact_reason, token = "ngrams", n = 2)

# tidy - separate and remove stop words, then reunite

tidy_ngram <- ngram_fio %>%

  separate(bigram, c("word1", "word2"), sep = " ") %>%

  filter(!word1 %in% stop_words$word) %>%

  filter(!word2 %in% stop_words$word) %>%

  filter(!word1 %in% custom_stop$word) %>%

  filter(!word2 %in% custom_stop$word) %>%

  unite(bigram, c("word1", "word2"), sep=" ") %>% 

  filter(bigram != 'NA' & bigram != 'NA NA')

# count bigrams

tidy_ngram %>%

  count(bigram) %>%

  arrange(desc(n))

Bigrams give more useful terms; we can identify potential topics relating to traffic stops, firearms and gang violence.

7. Extract features and model/analyse

Before we can run Latent Dirichlet Allocation (LDA) models on our bigram data, we first need to cast a Document Term Matrix (DTM). This is a large matrix where each row represents a document and each column represents a term in the dataset. It is too large for us to view: for the Boston FIO bigram data it is over 30,000 rows and 106,000 columns. The code for creating the DTM is shown below; the DTM is the input used to run an LDA model.

# cast a document term matrix 

dtm <- tidy_ngram %>%

  count(fc_num, bigram) %>%

  cast_dtm(fc_num, bigram, n)

To run the model, we pass the document term matrix (which we've called dtm) to LDA() and set the parameters; the result has been assigned the name mod. k is the number of clusters or topics we have set, in this instance six.

# assign your dtm to an LDA model and set parameters; k = number of topics we want to set

mod <- LDA(dtm,

            k = 6,

            method = "Gibbs",

            control = list(alpha = 0.5,

                           iter = 500,

                           seed = 1234))

We can run multiple models, each time changing the value of k. With each run we can note aspects of model performance, including coherence and perplexity. We can then compare the perplexity scores across models to help determine how many topic clusters to set for k.

The chart left shows the perplexity score for each model run on the Boston FIO data, for k = 2 to 10 and k = 20. A lower perplexity score is preferred. The shape of the line gives us what is known as an "elbow", which helps us select the optimal number of clusters, k. If we imagine the line as an arm, the elbow is the point of inflection on the curve: where the improvement in perplexity scores becomes marginal, we select k.
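A minimal sketch of that comparison, assuming the dtm object created earlier (topicmodels provides a perplexity() method for fitted LDA models; note that fitting ten models with 500 Gibbs iterations each can take some time):

```r
# fit a model for each candidate k and record its perplexity
ks <- c(2:10, 20)
perp <- sapply(ks, function(k) {
  m <- LDA(dtm, k = k, method = "Gibbs",
           control = list(alpha = 0.5, iter = 500, seed = 1234))
  perplexity(m, dtm)
})

# plot perplexity against k and look for the 'elbow' where improvement flattens
plot(ks, perp, type = "b", xlab = "k (number of topics)", ylab = "perplexity")
```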

We can also visualise the top bigrams in each topic and begin to make sense of them. The charts below show our topics using k = 6. Higher-scoring terms are more probable within that topic. As an example, if we take topic 6 we can see that 'traffic stop' and 'ma reg' are the most probable terms in documents assigned to this topic.

Topic 2 looks to be drug-related offenders and investigations, including multiple terms with the words drug and crack. Topic 6 looks to be largely vehicle related FIO records, including traffic stop, motor vehicle, stop vals and front passenger. Topic 5 appears to be gang-related violence, mentioning gang names/neighbourhoods in Boston including Heath Street, Orchard Park and Lenox Street, the term gang associate and yvsf which refers to Boston Youth Violence Strikeforce. 

The code below demonstrates how to assign our LDA model output to a matrix, tidy it up and then visualise the terms most associated with each topic as a chart, as shown above.

# assign your lda model output as a matrix

mod_matrix <- mod %>% tidy(matrix = "beta")

# create a data frame capturing the top 10 most probable words for each topic

wordprob <- mod_matrix %>%

  group_by(topic) %>%

  top_n(10, beta) %>%

  ungroup() %>%

  mutate(term2 = fct_reorder(term, beta))

# visualise this information as a horizontal bar chart

ggplot(wordprob, aes(term2, beta, fill = as.factor(topic)))+

  geom_col(show.legend = FALSE) +

  facet_wrap(~ topic, scales="free") +

  coord_flip()

8. Using your findings for analysis

Once you have completed your LDA modelling task, there are a few options for where to go next. You may wish to use the key terms and words you've uncovered to perform pattern matching on your free-text data, in order to extract topics you wish to analyse further. Say you were interested in exploring 'shots fired' and 'parking lots' uncovered under topic 3, as it is particularly specific: you would be able to search those terms and retrieve the records that relate to them. This uses your findings in a traditional way, via searching keywords, strings and terms that we've identified as belonging to a more narrowly defined problem.
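A minimal sketch of that kind of retrieval, using stringr (loaded with the tidyverse); the search pattern here is illustrative and should be adapted to whatever terms your own topics surface:

```r
# retrieve original records whose contact reason mentions terms from a topic
shots_fired_records <- bpdfio_distinct %>%
  filter(str_detect(str_to_lower(contact_reason), "shots fired|parking lot"))
```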

If this were burglary data, it might be that our topics relate to different combinations of modus operandi, entry points, household and building types (see Birks et al, 2020).

We can also use our LDA model to assign topic probabilities to each document, and then assign each of our original documents a topic based on the maximum probability. We can use this to view how our topics are distributed, statistically or geographically for example, or begin to use the topics for problem analysis. The code steps for achieving this are shown below.

# assign probability of topics to each document using your mod and dtm 

post_probs <- topicmodels::posterior(mod, dtm)

#classify documents by finding topic with max prob per doc

top_topic_per_doc <- apply(post_probs$topics, 1, which.max)

#create a matrix and add to original data

# turn to a matrix

new <- as.matrix(top_topic_per_doc)

# then to a data frame

copy <- as.data.frame(new)

# bring in the id number - fc_num, as a column

copy <- tibble::rownames_to_column(copy, "fc_num")

# join the topic assignment to your original data; this will automatically join on fc_num (ensure fc_num is the same type, e.g. character, in both data frames)

topics <- inner_join(bpdfio_distinct, copy)

# continue to use topics for further analysis in R, or export as csv to explore elsewhere

write.csv(topics, "topics.csv")

Due to a personal interest in the gang violence topic found within the FIO documents, I chose to begin exploring it alongside Boston PD shootings/homicide data and a gang territories layer found at an unofficial open source (Map of Boston Gangs).

I created some basic outputs visualising hotspots where Boston PD field, interrogation and observation reports in the gang violence topic were concentrated. They were particularly concentrated around three areas labelled as gang areas: Heath Street, Orchard Park and Lenox. A second visual shows statistically significant volume hotspots for shootings in the city of Boston, and a final visual shows arbitrary grid areas with high homicide volumes (all data cover the period 2015-2018).

These maps show just two neighbourhoods of Boston where these problems are most concentrated: Roxbury and Dorchester.

I will add to this page as my learning and knowledge of topic modelling develops. I hope this is a useful guide to pique newcomers' interest and help them get started.

References