AI for Research
Using machine learning and natural language processing to analyze body-worn camera footage at scale
Body-worn camera footage has long been used as evidence in individual cases. This study uses it as data.
By applying machine learning and natural language processing to thousands of NYPD recordings, the research team identifies key indicators of constitutional compliance in what officers say and how they say it. These methods can systematically assess aspects of police-civilian interactions that are captured in officer and civilian language but challenging to evaluate encounter by encounter.
How It Works
The analysis begins with automatic speech recognition tools that convert body-worn camera audio into text transcripts of officer and civilian speech. From there, machine learning models analyze those transcripts to identify patterns in officer language. These models do not apply preset rules about what officers should or should not say. Instead, they are trained to recognize patterns on their own, learning from thousands of transcripts paired with expert judgments.
For example, to build a model that can distinguish a Level 3 stop from a lower-level encounter, the team provided the model with transcripts from interactions that retired judges in the ISLG study and members of the NYPD's Independent Monitor team had already reviewed and classified.
The model then learns which words, phrases, and linguistic features are most strongly associated with each category. Once trained, it can apply what it has learned to new transcripts, assigning each a probability that the encounter reflects a stop rather than a lower-level encounter.
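The general idea of learning from labeled transcripts and then scoring new ones can be sketched in a few lines. This is a minimal illustration only: the source does not name the model family used, so a naive Bayes classifier over unigrams and bigrams stands in here, and the transcripts, labels, and function names (`ngrams`, `train`, `prob_stop`) are all hypothetical.

```python
import math
from collections import Counter

def ngrams(text, n_max=2):
    """Return all word n-grams up to length n_max from a lowercased transcript."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, n_max + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats

def train(transcripts, labels):
    """Count n-grams per class (1 = stop, 0 = lower-level encounter)."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter(labels)
    for text, y in zip(transcripts, labels):
        counts[y].update(ngrams(text))
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab

def prob_stop(text, model):
    """Probability that a transcript reflects a stop, under naive Bayes."""
    counts, docs, vocab = model
    log_score = {}
    for y in (0, 1):
        total = sum(counts[y].values())
        score = math.log(docs[y] / sum(docs.values()))  # class prior
        for g in ngrams(text):
            if g in vocab:  # n-grams never seen in training are ignored
                # Laplace (add-one) smoothing avoids zero probabilities
                score += math.log((counts[y][g] + 1) / (total + len(vocab)))
        log_score[y] = score
    # Normalize the two log scores into a probability for class 1
    m = max(log_score.values())
    e0 = math.exp(log_score[0] - m)
    e1 = math.exp(log_score[1] - m)
    return e1 / (e0 + e1)

# Invented toy transcripts standing in for expert-labeled encounters
transcripts = [
    "stop right there you are not free to leave",
    "you are not free to leave put your hands up",
    "hi can i ask you a question you are free to go",
    "you can leave whenever you want have a good night",
]
labels = [1, 1, 0, 0]
model = train(transcripts, labels)
p_new = prob_stop("you are not free to leave", model)
```

With only four invented examples the resulting probability is not meaningful in itself; the point is the workflow of fitting on expert-labeled text and assigning a score to an unseen transcript.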
FIGURE Precision-recall curve for Monitor-Assessed Sample using De Bour Classification model
Model-targeted sampling (the curve) identifies undocumented stops at substantially higher rates than the NYPD Monitor team's current random sampling approach (the horizontal line at 18.7%). Reviewing the 300 recordings the model flagged as most likely to be misdocumented yielded a hit rate of 36.3%, finding half of all undocumented stops the Monitor team ultimately identified with only one-quarter of the review effort.
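The comparison in the figure amounts to measuring precision among the top-scored recordings against the base rate that random sampling would yield. A rough sketch of that calculation follows; the function names and the toy scores are hypothetical, and only the 36.3% and 18.7% figures come from the text above.

```python
def precision_at_k(scores, labels, k):
    """Share of true positives among the k highest-scored recordings."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)

def base_rate(labels):
    """Expected hit rate when recordings are reviewed in random order."""
    return sum(labels) / len(labels)

# Hit rates reported in the text
model_hit_rate = 0.363   # top 300 model-flagged recordings
random_hit_rate = 0.187  # random sampling baseline
lift = model_hit_rate / random_hit_rate  # roughly 1.94

# Invented toy data: model scores and whether review confirmed an undocumented stop
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]
toy_top4 = precision_at_k(toy_scores, toy_labels, 4)
toy_base = base_rate(toy_labels)
```

On the reported figures, model-targeted review finds undocumented stops at a little under twice the rate of random review, which is consistent with recovering half of the identified stops from about a quarter of the recordings.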
The team used two different modeling approaches in this study. The first is traditional statistical machine learning, in which models identify word and phrase features (called "n-grams") that distinguish one category from another. These models are valued for their interpretability, meaning researchers can directly examine which terms drive a given prediction, and their calibration, meaning the probabilities they produce are meaningful and can be relied upon for auditing.
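Calibration can be checked by grouping predictions into bins and comparing the average predicted probability in each bin with the share of cases that turned out positive. The sketch below is one common way to do this, not necessarily the procedure used in the study; `calibration_table` and the sample values are hypothetical.

```python
def calibration_table(probs, labels, n_bins=5):
    """Bin predictions by probability and compare, per bin, the mean
    predicted probability with the observed positive rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if not members:
            continue  # skip empty bins
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        rows.append((len(members), mean_pred, observed))
    return rows

# Invented example: ten low-scored and ten high-scored recordings
probs = [0.1] * 10 + [0.9] * 10
labels = [1] + [0] * 9 + [1] * 9 + [0]
rows = calibration_table(probs, labels)
```

In a well-calibrated model the two values in each row are close: of the recordings scored near 0.9, roughly 90% should be confirmed on review.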
The second approach involves fine-tuning pre-trained large language models, which can capture more nuanced linguistic patterns but are less transparent in how they reach a given classification. Performance is measured using cross-validation, a technique in which models are repeatedly trained and tested on different subsets of the data to ensure their predictions generalize beyond the specific examples they learned from.
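Cross-validation can be illustrated with a simple k-fold split, in which every transcript is held out for testing exactly once. The sketch assumes five folds and a fixed shuffle seed, neither of which is stated in the source, and the `fit` and `score` callables are hypothetical placeholders for whatever model and metric are being evaluated.

```python
import random

def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each item is tested exactly once."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def cross_validate(items, labels, fit, score, k=5):
    """Average a score over k held-out folds.
    `fit(items, labels)` returns a model; `score(model, items, labels)`
    returns a number. Both are caller-supplied placeholders."""
    results = []
    for train, test in k_fold_indices(len(items), k):
        model = fit([items[i] for i in train], [labels[i] for i in train])
        results.append(
            score(model, [items[i] for i in test], [labels[i] for i in test])
        )
    return sum(results) / len(results)

# Trivial stand-ins: predict the majority training label, score by accuracy
def fit_majority(items, labels):
    return max(set(labels), key=labels.count)

def accuracy(model, items, labels):
    return sum(1 for y in labels if y == model) / len(labels)

cv_accuracy = cross_validate(list(range(10)), [1] * 10, fit_majority, accuracy)
```

Because the model never sees the held-out fold during training, the averaged score estimates how well predictions carry over to transcripts outside the training data.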
FIGURE Explicit mentions of “consent” and “search” in any recording associated with a consent search interaction, by civilian race.
A Research Foundation
The methods applied in this report build on nearly a decade of research using body-worn camera footage as a source of data rather than only as evidence.
A 2017 study applied computational linguistics methods to nearly 1,000 traffic stops and found that officers spoke less respectfully to Black community members even after controlling for multiple contextual factors. Subsequent research has extended these methods to detect whether officers state reasons for stops, ask for consent to search, or offer reassurance to drivers. It has also shown that linguistic patterns in an officer's first 45 words, roughly the first 27 seconds of a stop, can predict whether the stop will escalate to a search, handcuffing, or arrest.
A 2024 study comparing footage before and after a police department's procedural justice training found that officers used more of the recommended techniques after the training than before, demonstrating that these methods can also measure the effectiveness of reform efforts. This line of research by the Stanford team provides the foundation for the methods applied here.
Beyond This Study
Computational methods complement rather than replace human review. When paired with legal expertise and supervisory oversight, these tools can help focus attention on encounters that warrant closer scrutiny, assess whether training is changing officer behavior in practice, and track compliance trends over time. Rather than reviewing BWC footage through random sampling alone, supervisors could use model probability scores to prioritize the recordings most likely to contain compliance issues, directing expert time to the interactions that would benefit most from review.
The same approach could extend to other aspects of police-civilian encounters. Similar models might be able to identify encounters that share linguistic or procedural features with those that have generated civilian complaints, supporting earlier intervention before misconduct escalates.
Computational analysis also offers a way to evaluate the effectiveness of training efforts by comparing officer language before and after training to assess whether the desired changes are actually taking place in real encounters. The Stanford team has applied this approach in two large cities in California, analyzing over 1.3 million videos representing more than 300,000 hours of interactions. That work demonstrates the potential for computational analysis to support compliance assessment at department-wide scale.