QGIS for CRIME analysis

In the United Kingdom (at least) many jurisdictions employing crime analysts, either in policing or in related roles (i.e. Community Safety Partnerships/CSPs), began to do away with commercial licences for GIS software. This was certainly the case for many community safety teams across England and Wales as austerity measures saw councils reduce overheads.

Unlike the US, whereby ArcGIS seems to have cornered the law enforcement market, the UK was a patchwork of MapInfo, Arc and Northgate in policing. In some areas, CSP analysts moved to lower-cost desktop and web mapping options such as Cadcorp, Maptitude and Earthlight/StatMap. Others ventured towards open-source desktop software like QGIS.

Between 2017-2019, I found myself having to move to QGIS after having spent over a decade using commercial GIS desktop software. I learned to use QGIS for strategic, tactical and other forms of crime and public safety analytics in government and policing roles. To date, as far as I am aware, there is little in the way of guidance about what QGIS offers to police and public safety analytical roles. This article aims to highlight the functions available along with brief examples of how to apply them with crime analysis in mind.

This is not intended to be an exhaustive guide.

I will explore a variety of techniques detailed in the excellent book Understanding Crime: Analyzing the Geography of Crime, by Spencer Chainey, and how you can apply them in QGIS. This book is the perfect resource for anyone wishing to develop skills in geographical crime analysis. Some of the content can also be found online in various open-access publications from Spencer Chainey, including Examining the extent to which hotspot analysis can support spatial predictions of crime. I signpost to both the book and open-source articles throughout this page. I will also highlight some additional functions available in QGIS that are useful.

Front cover of book Understanding Crime by Spencer Chainey

2. Getting started with QGIS

2.1 Download and Install

To download and install QGIS, I'd recommend choosing the most recent Long Term Stable Release (LTR). The download page states what the current LTR release version is called:

https://www.qgis.org/en/site/forusers/download.html#.

2.2 Plugins

Whilst QGIS has a large number of processing toolboxes and functions built-in (core features), there are a few additional plugins that are required for using QGIS as crime analysis software. These can be downloaded by navigating the dropdown menu 'Plugins > Manage and Install Plugins...'. Plugins that are mentioned on this page are listed below:

H3 Toolkit - Nested Uber hexagons (see Section 9, installation of dependencies required)
Multi Ring Buffer
QuickOSM - download Open Street Map data with query builder
Spatial Deviational Ellipses
Visualist - spatial point pattern functions for crime analysis

2.3 User Manuals

See documentation and manuals for QGIS here https://docs.qgis.org/3.28/en/docs/index.html

2.4 Online courses

There are many online courses to get you started with QGIS. I have completed and can recommend the following as being extremely informative:

Map Academy: get mapping quickly with QGIS https://www.udemy.com/course/mapacademy/
Map Academy: taking QGIS to the next level https://www.udemy.com/course/map-academy-taking-qgis-to-the-next-level/
QGIS 3.XX LTR for GIS Professionals https://www.udemy.com/course/qgis-for-gis-professionals/

Alasdair Rae who produces the Map Academy courses on Udemy also regularly blogs guides on mapping techniques on his website http://www.statsmapsnpix.com/ and shares techniques via Twitter @undertheraedar and Youtube @automaticknowledge.

3. Sample Dataset

I have provided sample datasets below:

Jack the Ripper sample datasets, link to DropBox. This data is for sections 4.1 and 4.4.
Cambridge UK built-up area data, link to DropBox. This data is for sections 4.2, 4.3, 5.1-5.7, and 6.1-6.4.
Cambridge solutions, outputs with saved symbology, link to DropBox. These are map outputs for sections 4.2, 4.3, 5.1-5.7, and 6.1-6.4.
New York Police Department dataset, link to DropBox. This data is for sections 7.1 and 7.2.

Other open datasets are listed towards the bottom of the Online Resources page.

Function names and drop-down menu option names in QGIS throughout this page are highlighted in purple bold italic: 'Vector > Analysis Tools > Mean coordinate(s)...'

4. Descriptive spatial statistics

First, we will look at QGIS functions that can be used to examine general spatial patterns in your data. These consider:

Spatial behaviour of offenders - how an individual offender operates within relatively short distances
Spatial autocorrelation - measuring how much of something (i.e., crime) in one area is most similar to areas nearby
Testing whether or not hotspots are present - are the crime points in your data clustered?

Chainey, S. 2021 pp21-36.

4.1 Spatial mean and standard deviational ellipse (SDE)

Spatial mean (SM) provides an indication of how an offence series changes and its consistency. The SM is a measure of central tendency, also referred to as a centroid or mean centre, of your point distribution. To use, navigate the dropdown menu to 'Vector > Analysis Tools > Mean coordinate(s)...'. The example shown is the spatial mean of Jack the Ripper offence locations, and how those changed over time with each offence.

Standard Deviational Ellipse (SDE) indicates the dispersal and direction of a group of offences around the spatial mean. This function can be added as a Plugin Tool. Once added, it can be accessed via the 'Vector > Standard deviational ellipse' drop-down menu. Read more about SM/SDE here: Chainey, 2014, p.54.

Map of spatial mean and standard deviational ellipse of Jack the Ripper offences

Spatial mean and standard deviational ellipse, Jack the Ripper offences (QGIS)

4.2 Spatial autocorrelation

Moran's I statistic, a measure of spatial-autocorrelation, observes similarities between values at each location with its neighbours. Where high values are surrounded by similarly high values (or low by low), positive spatial autocorrelation exists. Where areas of low values are surrounded by high values or vice versa, this is negative spatial autocorrelation. This method can be helpful in identifying significant and contiguous areas of 'hot' and 'cold'.

A Moran's I result varies between -1.0 and +1.0, where higher values indicate positive and lower values indicate negative spatial autocorrelation. Read more about Moran I and spatial-autocorrelation here: Chainey, 2014, p.115.

Calculating in QGIS requires point data to be aggregated into polygons. An example is shown using criminal damage offences within Cambridge UK Built-Up Area (BUA).

Moran I's spatial autocorrelation map, Cambridge UK

Moran's I Cambridge criminal damage offences (QGIS)

In this example, point data was aggregated to hexagons using Uber's Hexagonal Hierarchical Spatial Index resolution 9. The Moran's I value for the distribution is 0.49, indicating positive spatial autocorrelation in the data.

The first step in achieving this output is to generate a polygon layer of Uber hexagons within a study area boundary (here Cambridge BUA). This requires installing the H3 Toolkit plugin found from the 'Processing > Toolbox' dropdown menu. Search the Processing Toolbox once installed selecting 'H3 > Create H3 grid inside polygons', specifying the boundary layer and desired resolution (see information on hexagon sizes by resolution). The second step is to establish the number of points in each hexagon. This can be calculated using either the core plugin 'Vector > Analysis Tools > Count points in polygon(s)...', or using the Visualist plugin. Functions from the Visualist plugin are accessed via the processing toolbox. A count of points in polygons can be achieved using the Visualist function 'Choropleth Map'. The Choropleth Map method processes this calculation faster than the core tool. Whichever method you opt for, the next step is to search for 'Spatial Autocorrelation Map' (a Visualist plugin function) from the processing toolbox.

The Cambridge criminal damage example is displaying Moran's I categories. The hexagons are filtered to only display where the p-value was less than or equal to 0.05 (see here for MapAcademy playlist including spatial filtering options).

Efficiency and reproducibility tip: If you were going to replicate this process numerous times, there are options to generate automated workflows combining multiple functions in QGIS. One option is joining the steps together in sequence using the Graphic Modeler (similar to ArcGIS ModelBuilder). The second is to generate a script using the Python Console. If you create a Graphic Model, you can also use Batch Processing and run multiple iterations at the same time (i.e., if you wanted to replicate a process for every district in your police force area). Graphic models built can also be exported as Python scripts.

4.3 Nearest neighbour index

Nearest Neighbour Index (NNI) is a test used to compare the spatial distribution of point data against random variation. NNI tests produces a value between 0 and 2.15, where lower values indicate evidence of clustering (less than 1), mid-values indicate random patterning (1) and higher values indicate uniform distribution (2). Read more about the nearest neighbour index here: Chainey, 2014, p.116.

Above: Example of point distribution at NNI intervals

NNI can be used as a preliminary test to determine whether or not there are enough data points to produce a hotspot map. An exploration of NNI values by Spencer Chainey found that the number of days of data required, before the clustering of offences became significant, varied between crime types and study areas. For example, in Camden (London, UK) three days of data (66 offences) were needed before Theft from Motor Vehicle offence clustering became evident, but for Theft of Motor Vehicles, 10 days of data (64 offences) was needed. In Newcastle upon Tyne, evidence of clustering in Theft of Motor Vehicle offences was not found until 17 weeks had passed!

In practice, if you were responsible for producing tactical assessments (TAs) then there would be little value in producing 2, 4, 6 or 12-week hotspots (time periods typically selected for TAs) for the problem of Theft of Motor Vehicles for Newcastle.

In QGIS the NNI test can be navigated to using 'Vector > Analysis Tools > Nearest neighbour analysis...' from the dropdown menu. The outputs can be viewed from a link appearing in the Results Viewer panel as an HTML (see image). The example here shows that our Cambridge criminal damage data has an NNI of 0.14, and is highly clustered. Unlike the equivalent Nearest Neighbor function in ArcGIS Pro, the QGIS version does not allow the user to choose a distance method (euclidean, Manhattan, network distance), and does not provide a p-value in the output.

Above: NNI output Cambridge criminal damage offences (QGIS)

4.4 Journey to crime

Crime opportunities seized by offenders on average tend to be more frequent within shorter distance journeys. The likelihood that journeys to crime are short is an important principle - for instance, if there is a crime increase in a small area, it is more likely that those involved live locally. Read more about the journey to crime here: Chainey, 2014, p.290.).

QGIS example, Euclidean and Network Distance between Jack Ripper anchor point and offences

Data containing start and end points of journeys (point data), for example, an offender's home address (or another anchor point, such as their workplace) and the location of their offences, can be used to calculate the distances travelled. This distance is often referred to as the journey to crime (JTC). We can aggregate this information to understand the distribution of JTC distances for all known offenders across different categories of crime. It is recommended that you familiarise yourself with the literature and observe the data limitations that may be present (i.e., the assumption that the home address is the anchor point before going off to offend).

There are three measurement methods used when calculating distances:

Euclidean Distance - the straight line distance between two points
Manhattan Distance - the shortest distance between two points constrained by horizontal and/or vertical directions
Network Distance - the shortest path following a street network

QGIS offers several methods for measuring distances between start and end points (origin and destinations). These include:

Network Analysis tools are available within the Processing Toolbox. The 'Shortest Path (point to point)' algorithm was used to generate the Network Distances from Jack the Ripper's anchor point to the location of his offences. A street network multi-line dataset is required to generate the shortest path distances. Street network data can be downloaded using the QuickOSM Plugin (YouTube walkthrough on QuickOSM).
Distance to the nearest tools are available within the processing toolbox. The 'Distance to nearest hub (points)' algorithm was used to generate the Euclidean Distance from Jack the Ripper's anchor point to the location of his offences.
Distance Matrix. Another tool available in the processing toolbox that will calculate the distances between each set of origin and destination coordinates and return as a long-list table with the distance in metres.
ORS Tools Plugin - open route service, API required. An online tutorial is available here.

5. Hot spot analysis

Next, we will look at QGIS functions that can be used to analyse spatial point patterns and identify places that have higher relative crime concentration in your study area. In the text Understanding Crime (Chainey, S. 2021 pp37-64), a number of key theoretical principles and terms are introduced which you should familiarise yourself with: what is meant by a hotspot, neighbourhood and situational characteristics, the modifiable areal unit problem and spatial heterogeneity. You can also read more about these here: Chainey, 2014, pp.51-59.

Here we are only going to highlight the options available in QGIS to perform the following:

Crime concentration
Hot Routes
Kernel Density Estimation
Choropleth and Grid Maps
Getis Ord GI*
Sampling raster values for composite maps
DBSCAN

5.1 Crime concentration

The law of crime concentration at place (Weisburd, 2015) asserts that a small number of micro-geographic places (i.e., addresses, street segments or small grids typically no larger than 150-200m in UK studies) account for a disproportionate percentage of all incidents in any given larger study area, such as a neighbourhood or city. A large and growing volume of research consistently has found that as much as 25-50% of all crime is found to occur in typically less than 5% of all micro-spaces. Increasingly, and particularly in the design of hot spot policing studies, analysts and police practitioners have taken to calculating crime concentration in order to focus efforts on the small proportion of micro-locations making up the majority of incidents (or harm if using a weighting system like the Office for National Statistics Crime Severity Score or Cambridge Crime Harm Index). See my page on Designing a hotspot study for more details.

Calculating crime concentration involves the aggregation of point data to micro-geographic units (see 5.2 Hot Routes, 5.4 Choropleth and Grid Maps), and then rank-ordering each unit by the total number of points from highest to lowest. You then must calculate the percentage of all crimes for each unit, and a cumulative total (running total). This can be done within QGIS attribute tables and field calculator, or you can export from QGIS as a CSV and calculate these in spreadsheet software (i.e., Excel, Calc, Sheets).

The example image shows all acquisitive crimes in Cambridge BUA aggregated to street segments. The street segment with the highest count had 169 crimes (1.19% of all acquisitive crimes).

Crime concentration table data, Cambridge acquisitive crimes, QGIS

From the table, we can derive descriptive statistics of crime concentration. Of 1,775 street segments (containing crimes) in the Cambridge study area just 2.5% (#44) accounted for 25% (#3,565 crimes) of the entire count of acquisitive crimes. This information can be used to determine a 'cut-off' for how many places are considered for an intervention response or can be used in combination with other mapping techniques to visualise results. Visual output examples are provided.

Top 25% and Top 50% crime street segments, Cambridge acquisitive crime (QGIS)

Top 25% and Top 50% crime micro-grids of 125m, Cambridge acquisitive crime (QGIS)

Combined map showing Top 50% crime segments and Top 50% micro-grids (QGIS)

5.2 Hot routes

Hot routes, also known as street segment maps, are created by aggregating sums of points (crimes) to lines (streets). These can then be visualised as crime patterns along a network - usually, this is a street network but could also be transport routes. Read more about 'Hot Routes' here: JDI Brief Hot Routes.

Generating a hot routes dataset in QGIS requires a street network layer and a points dataset, which are provided in the sample dataset. An analysis of crimes along a street network can be achieved using the Visualist function 'Graduated Lines Map' or 'Graduated Lines Segment Map'. The latter option provides the user with a parameter option for street segment sizes. Functions from the Visualist plugin are accessed via the processing toolbox.

An example output is shown for central Cambridge acquisitive crime. The street segment counts have been transformed to represent the total number of crimes per 100m. The lowest category (0-12) has been turned off on the visual.

5.3 Kernel Density Estimation (KDE)

KDE is perhaps the most frequently used technique for identifying spatial point patterns and hotspots among crime analysts. Point data (geographically referenced as projected coordinates X/Easting and Y/Northing) is necessary for applying KDE functions in QGIS. Read more about KDE here: Chainey, 2014, pp.126-128.

There are some critical points to consider regarding the use of KDE in crime analysis:

Lacking universal consensus on how to set the parameters (cell size and spatial bands)
Small amounts of data can mislead the map viewer (see 4.3 Nearest Neighbour Index section of this page)
Visual allure can lead to validity and statistical robustness being overlooked.

Chainey (2021 pp.37-64) provides a detailed examination of parameter settings. You can also read about these here: Chainey, 2014, pp.140-172.

To illustrate why cell size and bandwidth require consideration, a visual of central Cambridge KDE maps are provided using different parameter settings from maps created from the same dataset. We can see that altering the size of the spatial bandwidth (radius parameter in QGIS) changes the number, size and intensity of hotspots, whereas altering the cell size (pixel parameter in QGIS) alters the smoothness of the surfaces generated. We can see that this parameter selection can influence the geographies likely to be chosen when maps are being used to inform police and public safety interventions.

The question then is how can we determine the most appropriate bandwidth and cell size?

The suggested start point is to:

Generate a bounding box for your study area. This can be done in QGIS via the processing toolbox searching for 'Bounding boxes'.
Measure the shorter side of the study area bounding box. Use the Measure Line icon on the QGIS ribbon (or shortcut key Ctrl+Shift+M). Our Cambridge BUA study area is 11,809 metres.
Divide this value by 150 and multiply by 5. 11809 / 150 = 78.7. 78.7 * 5 = 393.

There are more scientific optimisation calculations in existence, however, these tend to produce large bandwidths that are considered unsuitable for exploring density distributions of crime (Chainey, 2013).

A visual of KDE hotspots using the calculated parameters is shown. These were created using the 'Heatmap (kernel density estimation)' function within the QGIS processing toolbox. There are a range of settings contained within the function, but for now, we just want to be familiar with three. The first is 'Radius' - this is a spatial band to be set in metres (or feet in US). In the Cambridge area, we calculated that this as 393m. The second is 'Pixel size X' - this is the cell size. We only need to specify this for 'Pixel size X' ('Pixel size Y' will automatically update on doing so). In the Cambridge area, we will use the figure calculated earlier (78.7) as the cell size. On the advanced parameters setting there is an option for 'Weight from field [optional]'. If you have a weight in your data you can select it from your table here.

KDE Hotspot central Cambridge violence and sexual offences (QGIS)

KDE Hotspot harm-weighted (pseudo harm values) central Cambridge violence and sexual offences (QGIS)

In the sample dataset provided, there is a violent crime dataset with pseudo-harm scores for Cambridge. You can see the weighted and unweighted hotspot maps of violence for central Cambridge in the visual provided.

When using harm weightings, it may be beneficial to transform the values (i.e., square root or natural log to reduce skewness) to avoid the creation of what might be termed 'frequency deserts'. These are high-intensity hotspots that can occur from small numbers of harmful events (like a single multi-person homicide). Extreme variations in harm weights untransformed otherwise could produce results that mislead the map viewer.

5.4 Choropleth and Grid Maps

Choropleth maps are used for displaying distributions of points typically within administrative, statistical and police divisional boundaries (i.e., political wards, lower super output areas, census blocks, and police beats). These maps are often used for:

performance analysis and the collation of aggregated statistics (i.e., counts by beat or ward),
Compstat style accountability and tasking meetings,
combined with other aggregated datasets (i.e., Census) to generate rates or,
to perform regression analyses (i.e., exploring relationships between crime counts and other socio-demographic and economic data).

Aggregation of point data into larger geographic boundaries can hide patterns and micro-variability in point data (see Modifiable Areal Unit Problem). This can lead to the misallocation or imprecise targeting of resources. The example below illustrates this more clearly.

Read more about choropleth maps here: Chainey, 2014, pp.55-57.

Figure image from Kitchin, Lauriault and McArdle (2015) Knowing and governing cities through urban indicators, city benchmarking and real-time dashboards (see here)

These maps show the housing vacancy rate in Tallaght, Dublin. If we use the Electoral Divisions (1), then a map viewer is likely to select a single large area identifiable whereby vacancy rates are 15% to <30%. Moving to the next spatial aggregation down to Enumerator Areas (2), there are now three areas where the map viewer is likely to have interest, one of which would have been missed entirely at the previous spatial aggregation.

These are still quite large areas geographically, and they are hiding more significant patterns. When moving to the next spatial aggregation down to Small Areas (3), we can see now there are four highly concentrated micro-places where housing vacancy rates all exceed 30% plus.

Suppose these were maps showing repeat domestic abuse or household burglary rates. In maps 1 and 2 we would miss entirely one of the highest areas of concern. Not only that, we would allocate and spread resources thinly across a larger area than was necessary, and potentially fail to have any impact as a result.

Grid maps using uniform polygons, as opposed to different shape and size boundaries, are a way of overcoming the problems associated with choropleth maps but still have limitations. A critique in terms of practicality is that they can be difficult to interpret visually, creating challenges when it comes to choosing areas of focus (see section 5.1 Crime concentration for more practical way of displaying and using grid map data). Read more about grid maps here: Chainey, 2014, pp.57-58.

More frequently hexagons are being used to display aggregated thematic data. Read more about why here: Why hexagons?

Grid Map using 125m grids (left) and Choropleth Map using Census 2011 output areas (right) showing all violent crime in Cambridge BUA (QGIS)

Generating choropleth and grid thematic maps in QGIS requires a polygon layer and a points dataset, which are provided in the sample dataset. These can be achieved using two options from the Visualist function 'Choropleth Map' or 'Grid Map'. The latter option provides the user with a parameter option for grid sizes. Functions from the Visualist plugin are accessed via the processing toolbox.

An example output is shown for Cambridge BUA violent crimes. The grid map shows the counts of violent crimes for 125m grids across the study area. The choropleth map uses the counts of violent crimes for each Census Output Area (2011). These have been transformed to a density value, which was calculated by dividing the number of crimes by the geographical area.

5.5 Getis Ord GI*

Getis Ord GI star, or Gi* statistic compares local and global averages (of aggregated points across uniform polygons - such as grids or hexagons) to identify areas that are significantly different when compared to an entire study area. Read more detail about this method here: Chainey, 2014, pp.176-182.

The Gi* method assists in defining areas that are hot and being able to distinguish from those that aren't without ambiguity. This helps address the limitations of KDE methods, which do not test or display statistically significant concentrations and the smooth transitions between high and low values it produces can inadvertently deceive or mislead map viewers.

The Gi* method can be combined with other techniques to produce composite map layers, an example is shown in the next section 5.6.

Calculating Gi* in QGIS requires point data to be aggregated into polygons. In the example given, we create first create a 'Grid Map' which counts the number of violent crimes per 125m square grid across Cambridge BUA. We then use the grid map with the 'Spatial Autocorrelation' function. Both these functions can be found within the processing toolbox and are from the Visualist plugin. We previously used the spatial autocorrelation tool to calculate Moran's I (see section 4.2), but this time we select 'Getis-Ord Gi*' from the LISA indicator drop-down menu.

Spatial Autocorrelation Map parameter options (QGIS)

Default output using Spatial Autocorrelation Map for Getis-Ord Gi* (QGIS)

The modified output which filters statistically significant grids and high z-scores (QGIS)

Once run, the default output produces a graduated values map whereby the Gi* statistic has been converted to and displayed as z-scores. We can filter the default output to isolate only those grid cells that are most statistically significant (the 'GETISORD_P' column, p value) AND possessing high Gi* statistic values above n standard deviations from the mean (the 'GETISORD_Z' column, z score).

In the modified output example, grids whereby the z-score was greater than or equal to 2.576 were retained. You may need to determine a value specific to your study area to avoid the overidentification of clusters. One approach is the Bonferroni correction. Read more about this here: Chainey, 2014, pp.179-180.

To calculate the correct z-score value for our Cambridge BUA example, we need to know the total number of grids (#6,152). If we wanted a significance level of p <= 0.05, then we must calculate the proportion of cells representing 5% of the study area, or 0.05/6,152 = 0.00000812744. We can find the z-score for this using the percentile to-z-score calculator found here: https://measuringu.com/calculators/zcalcp/. When submitting the 0.00000812744 value for our study area of 6,152 cells, a Bonferroni corrected p <= 0.05 statistical significance threshold value was calculated as 4.462.

5.6 Sampling raster values for composite maps

So far each of the techniques available to us offers both advantages and limitations. Therefore in practice, it can be more useful for analysts to use a combination of techniques in their analysis of spatial data when determining where to target and deploy resources. Different combinations that you may wish to consider can include:

Getis-Ord Gi* significant cells with the highest 25-50% of street segments overlaid,
KDE hotspots with the highest 25-50% street segments overlaid,
Getis-Ord Gi* significant cells coloured by KDE density values.

GI* significant clusters (p <= 0.05 and z >= 4.462), and KDE density values (QGIS)

In this example we use the outputs from the previous section. Here we filter the Spatial Autocorrelation map output to display grids whereby the 'GETISORD_P <= 0.05 AND GETISORD_Z >= 4.462' (see a short video from Map Academy using QGIS for filtering using greater than or less than).

To add the KDE density values to the grids there are four steps.

Generate a KDE heatmap using the same crime point data you used to generate the Gi* spatial autocorrelation map. Do this using 'Heatmap (kernel density estimation)'. This creates a new layer called Heatmap.
Generate a point layer by extracting centroids from the spatial autocorrelation map using the drop-down menu 'Vector > 'Geometry Tools > Centroids...'. This will create a new layer called Centroids.
Extract the values from the Heatmap by joining them to the Centroids. This can be done using the 'Sample raster values' algorithm in the processing toolbox. This will create a new layer called Sampled.
Using the object or feature ID column in Sampled, this data can be joined to the spatial autocorrelation map. Right-click the spatial autocorrelation map and navigate to 'Properties > Joins'. Once joined you can then display the sampled values in the map symbology.

You will now have a map identifying the most spatially significant locations represented using a graduated colour scale of values (KDE density values) that are more interpretable to end users than p values and z scores.

5.7 DBSCAN

DBSCAN, or Density-based spatial clustering of applications with noise, is a clustering method which helps identify clusters of points based on their location. The function ignores points that are likely to be random (referred to as noise). The process requires two parameter inputs the user provides the algorithm. The algorithm uses these guidelines to identify clusters. The input parameters are 1) the minimum number of points a cluster should contain and 2) the maximum spatial distance between each point in a cluster.

In the example maps provided, the minimum cluster size was set as 50 points and the maximum distance between points was 50 metres. This was done using three years of violence data for Cambridge. Please read the following on parameter inputs here.

There are no current guidelines on how to set parameters for DBSCAN when observing crime data. As a start, we might want to think about the volume of crime over time when we select the minimum number of points. For instance, if using 12-month data for serious woundings, then a cluster containing a mean of 1 event per month might be of interest. This might be much higher for theft from motor vehicles (i.e., 1 per week). The maximum distance between points, this could be guided by the NNI (see section 4.3). Alternatively, you may iteratively increase your distance between points and experiment with different outputs (i.e., 25m, 50m, 75m, 100m). This can be guided by other practical requirements, such as how many clusters can be targeted and over what area, given the resources available.

DBSCAN clusters of violence crimes in the north of Cambridge. Each colour represents a cluster containing a minimum of 50 points whereby the maximum distance between points is 50 metres. The gray smaller points represent random 'noise' (QGIS)

DBSCAN clusters of violence in the north of Cambridge. In this representation, the clusters are coloured according to the total number of points within them. We can use this to isolate the highest volume clusters that may be of most interest (QGIS)

Once you have clusters, these can be transformed into polygon files. One method for doing this is to split the clusters up into separate layers by their CLUSTER_ID field. This can be done using the 'Split vector layer' in the processing toolbox. These layers can then be turned into polygons by applying a dissolved buffer, using the 'Buffer' option in the processing toolbox. Run these as a batch process to save having to manually do each layer individually.

The example shown here was created by generating buffers around the largest clusters in Cambridge. The mapped area here identifies three geographic areas in the north of the city.

DBSCAN output transformed to polygons using dissolved buffers (QGIS)

6. Temporal, risky facilities and near repeat techniques

Next, we will look at QGIS functions that can be used to analyse spatial-temporal data and identify disproportionality. We will also look at simple methods that can be created with the combined use of QGIS and spreadsheet software.

6.1 Temporal stability index (TSI)

Previously mentioned was the law of crime concentration at small places (section 5.1). It's also often found that there is strong stability in this concentration over time. The 'Temporal Stability Index' (Chainey, 2014, pp.222-232) is a relatively simple measure one can use to assess the stability of crime levels segmented by equal temporal periods within geographical areas (i.e., hotspots, neighbourhoods or other bounded polygons).

To calculate the TSI you will need to create a geographically referenced dataset, where each row represents your geographic area of interest, each column represents an equally sized temporal period (i.e., periods of 4-weeks) and the values are counts of crime. This data is used to calculate an index score to denote whether or not the levels of crime are more equally spread over a specified period (stable level of crime), or concentrated in a shorter period of time (such as a crime spike). In the example provided, we use the values in the data column to calculate a proportional value displayed in the right-hand column 'TSI'. If you were calculating a TSI in Excel, the formula would be as follows:

1 - SUM((01/11/2022 column value/Grand Total column value)^2), repeated for each time period...)

The value for each time period is calculated as a fraction of the total for each geographic unit and then squared. The squared values for each time period are added together, and subtracted from 1. A TSI value of 1 indicates perfect heterogeneity - crime is equally distributed across each time period. A TSI value of 0 indicates complete homogeneity - crime occurred in just one time period.

TSI data example for violence at Uber Hexagons resolution 9 (QGIS)

An example output is provided. Here the map viewer is presented with micro-locations in Cambridge which cumulatively made up approximately 50% of all violent crimes (concentration), and a TSI value of greater than 0.75 for the prior six time periods (representing stability) this was an arbitrary cut-off selected). The most recent violent events are overlaid. This has been achieved using the function 'Choropleth Map' from the Visualist Plugin and filtering using greater than or less than (CumFreq <= 0.5 AND TSI >= 0.75).

The practical use of TSI is in determining areas of prioritisation.

In many cases when deciding where to allocate resources, we want to be able to find not only places that are the hottest, but also those which are most likely to continue to remain hot. Whether using the TSI in a tactical or strategic sense across differing units of geography, it can be a helpful calculation for finding the areas in which an intervention or response will be most beneficial to reduce crime.

6.2 Risky facilities (proportional symbols map)

Facilities are places with specific functions, such as restaurants, bars, schools or car parks for example. How much crime a facility experiences can vary significantly, and generally speaking there is a wide range in the volume of crime experienced by the same types of facility. Risky facilities refer to the small proportion of any specific type of facility that accounts for the majority of crime. Read more about risky facilities here: Eck, J and Clarke, R.V. (2007) Understanding Risky Facilities. Pop Center Tool Guide No. 6.

The Visualist Plugin for QGIS provides a function called 'Proportional Symbols Map' which can be used to visualise the distribution of risky facilities. Your data must first be aggregated to a georeferenced location. In the example, shoplifting offence data at supermarkets in Cambridge BUA has been used. Each supermarket is represented by a geographic coordinate (X/Easting, Y/Northing) and the number of shoplifting offences. Additionally, data was obtained on the area of each supermarket (metres squared) using polygon data from Open Street Maps. This was used to generate a rate of shoplifting offences per 100m squared.

It is important to consider appropriate denominators to determine a meaningful rate. In our supermarket shoplifting data, there are two locations which have almost the same number of offences. Supermarket A had an offence count of 182 and Supermarket B 181. However, Supermarket A has a floor space of just 155m sq, whereas Supermarket B is 8,848m sq. When shoplifting is expressed as a rate of offences per 100m sq then Supermarket A (117 offences per 100m sq) is significantly more risky than Supermarket B (2 offences per 100m sq). For location context, Supermarket A is in a high-footfall city centre express store with a continuous stream of customers throughout the day making short-duration visits to buy smaller numbers of carriable items. Supermarket B is an edge-of-city superstore best accessed by vehicle, with peaks and lulls in visitor activity during the day and trips more likely to be long-duration visits (i.e., a family weekly shop).

Proportional symbols map, shoplifting offences at supermarkets in Cambridge; NB Legend for this visual does not produce symbols by graduated sizes (QGIS)

6.3 Location quotients

An alternative way of comparing levels of risk across geographical areas or among facilities is the Location Quotient (LQ). Keeping with our shoplifting at supermarkets data example, we observe the relationship between offending and floor-space. The LQ is an alternative metric comparing concentration across space. LQ, which can be described as a rate ratio, compares the level of crime within a sub-area (i.e., a supermarket, grid cell, neighbourhood, district etc) to the equivalent rate within the larger region (i.e., all supermarkets in the study area, or the entire study area). Here we use the supermarket data to demonstrate how the LQ is calculated.

Using Supermarket J in the table as our example. The LQ of 23.6 tells us shoplifting offences are 22.6x higher than we would expect them to be at this supermarket when compared across all supermarkets. This was calculated as:

(Supermarket Offences / Study Area Offences) / (Supermarket Area Sq / Study Area Sq)

(182 / 1837) / (155 / 36874)

Sometimes we might be interested in how much crime occurs within the vicinity of a particular location. For example, is violence more prevalent in proximity to bars? We can use the method of LQ statistics with counts of crimes occurring within Multi-Ring Buffers to achieve this in QGIS. In order to do this we need to first calculate the area of the entire study area, and the area of each ring buffer; and the total number of crimes for each of these.

Supermarket shoplifting offences data for Cambridge BUA

Multi-Ring Buffers around bars and pubs in Cambridge (QGIS)

In the example, the 'Multi-Ring Buffer' Plugin Tool was used to draw four 100m donut ring buffers around Cambridge pubs. The area was calculated. 'Choropleth Map' was used to determine the number of violent crimes in each ring buffer. LQ calculations show that violence is 3.6x more likely within 100m of a bar/pub, and this risk decays as one moves further away from bars.

Read more about Location Quotients here: Using LQ Technique to Analyze Residential Burglary

6.4 Near space-time clusters (ST-DBSCAN)

Spatial-Temporal Density-Based Spatial Clustering of Applications with Noise, or ST-DBSCAN for short, is a core tool within QGIS that allows us to detect space-time hotspots in our data. Similar to the DBSCAN (see Section 5.7) there are predefined parameters for the user to input. These are the minimum cluster sizes (how many offences should be in a detected cluster), the maximum distance between each offence in a cluster, and the maximum time duration (i.e., hours, days, weeks) between each offence in a cluster. This is the closest we can currently get to being able to perform near-repeat analyses of crime completely within QGIS. You can read more on this website about near-repeat victimisation and how this kind of analysis can be useful here: Analysing near space-time patterns.

To use ST-DBSCAN in QGIS the function can be accessed via the processing toolbox searching for 'ST-DBSCAN clustering'. The example provided shows the results of a cluster of 10 violent crimes belonging to one of 16 unique clusters identified using violent crime data for Cambridge (temporal data was synthesised as this is not available open source). We can see in this example that these offences occurred within a relatively small area (no more than 100m between each point) and timeframe - 27th August to 6th September.

To date, there are a finite number of accessible studies detailing the use of ST-DBSCAN in crime analysis.

Viewing results of a ST-DBSCAN and a docked table with the crime details for this cluster (QGIS)

7. Changing patterns of crime

7.1 Visualising changes in crime

Using basic maps to visualise performance data can be done on the fly with QGIS, providing the data is organised in a wide format (there is also a GroupStats Plugin for creating pivot and contingency tables, if you need to do this in QGIS rather than spreadsheet software).

The example provided used a dataset containing a row of all police precincts in the NYPD geographical area. Each column in the dataset represented the total number of shootings for different periods of time. Here we map the change in the number of shootings in 2022 calendar year compared to 2021 (Brooklyn Borough displayed), labelling each precinct by the number change in shootings.

You can see in the Layer Properties dialogue box how we have calculated the change in the Value box by specifying a calculation. Here we are using the 'Poisson e-test' calculation to show significant changes. You can read more about that here: Testing changes in short-run crime patterns.

7.2 Atlases

Atlases in QGIS are useful when you want to create a series of layout pages from a single map document (i.e., CompStat style maps for different Precincts). For those familiar with ArcGIS Pro, this is similar to 'Data Driven Pages'. You can see an example output for NYPD shooting maps in the screenshot. It's possible to incorporate other visuals within an Atlas to look and feel more like reports. Other items could include charts, tables and dynamic text narrative for example. You can learn more about styling and visual allure in the QGIS Map Design book.

Screenshot of a simple QGIS Atlas for shootings events by precinct using QGIS Print Layout

8. Additional tools

8.1 Using Data Plotly for creating charts

If you need to create charts using your data in QGIS, there is a Data Plotly Plugin available. This provides a variety of basic chart functions that can be built with your point and polygon data (bar charts, boxplots, histograms, scatter plots). These are dynamic plots displayed in a web browser and can be updated quickly based on the map view or selection. For example, if you want to quickly generate charts to show day and time distributions across the study area, you can select the points within a specific hotspot(s) and update the charts according to data selection. Two examples are provided below, displaying time and day information using the Cambridge BUA synthetic violent crime data columns. The html code from the charts can be added to print layouts, to combine map and chart visuals in report style layouts for download and dissemination.

2D Histogram Chart showing the volume of offences by day and hour across Cambridge (QGIS, Data Plotly)

Bar and histogram showing the volume of violent crimes in central Cambridge by day and hour (QGIS, Data Plotly)

8.2. Adding open source base map tiles

To add open-source base map layers on QGIS, this can be done either using the QuickMapServices plugin (see video guide here), the Python Console (see here for how to guide), or by manually adding using XYZ Tiles on the Browser Panel.

To add manually, right click on the XYZ Tiles option in the Browser Panel. From here name the connection (i.e. Bing VirtualEarth) and add the URL (http://ecn.t3.tiles.virtualearth.net/tiles/a%7Bq%7D.jpeg?g=1).

URL and Min/Max Zoom levels are available in the code section of the following page.

9. Installing python packages for plugins

9.1. Installing Python packages

Some plugins in QGIS require that you install Python packages to use them.

In order to install packages required for QGIS Plugins you should navigate to the QGIS folder from the Windows Start bar applications menu. Within this folder you will see OSGeo4W Shell. Right-click on the icon and select the option to Open file location.

Once you have opened the file location, right-click OSGeo4W Shell and select the option to Run as administrator. Select Yes when the dialogue box appears, and then the Administrator: OSGeo4W Shell will open and you will be ready to add the packages from the command line.

Assuming you only have one version of QGIS installed, the command line should display the directory of your QGIS (C:\Program Files\QGIS 3.28.12>). To navigate the directory upwards use cd.. and downwards use cd Folder Name. Alternatively, you can first type o4w_env to take you to the directory needed to install packages.

Here you can see I have as an example installed pandas using 'pip install pandas'.

Here are some useful plugins for crime analysis and the Python packages that will require installation:

Cluster Analysis (k-means and hierarchical) requires pip install scikit_learn numpy pandas *may need to install one at a time*
Density Analysis requires pip install h3
H3 Toolkit requires pip install h3
Spatial Analysis Toolbox requires pip install pandas geopandas libpysal esda mgwr *may need to install one at a time*