Crime affects everyone involved, and in recent years with the increase in unemployment and the impact of recession, crime in Calgary has been on the rise
As someone who has lived in the Calgary area for a while now, I want to know which areas to lookout for, and ways I can keep myself safe
Most of it is congregated in certain sectors of the city
Most of it is robbery of vehicles, homes and businesses
Violent crime is concentrated towards the inner city
Our first step in the process was to call from an API from Open Calgary to get the business license data, and then turn that into the count of businesses per community up to 2018
Next we scraped median income data from Great News. This reflects one of the drawbacks in our project as the Median Incomes are from 2014, but still effective
Next we scraped from Open Calgary again to call from an API to get the crime data for 2018. We have count crimes, count disorders, and crime rate per community. Crime rate is measured as (count of crimes/population of community) *100,000
Finally, we got assessed.value. Which is the mean of assessed property values of houses in the community as of 2018
Latitude and Longitude came with the business license data and was used in the PowerBI visuals
Our data collection process is more in-depth in the code posted below, or as a file in the Dropbox
High level of correlation between number of businesses in a community and disorders/crime
High level of correlation between community assessed property value and the crime rate
High level of correlation between community crime and disorder count
Negative correlation between median income and all other variables, except assessed property value
Important predictors
Count Crimes
Median Income
Assessed Value
Median Income has a negative coefficient
Adjusted R-squared value of 0.5074 - moderate accuracy for our regression fit
High Median Income Communities with Low Crime Rates:
Strathcona Park
Cougar Ridge
Discovery Ridge
Evergreen
Hamptons
Tuscany
High Median Income/Property Value Communities with High Crime Rates:
Eau Claire
Britannia
Elboya
Rosscarrock
Hillhurst
Low Median Income Communities with High Crime Rates:
Forest Lawn
Southview
Bridgeland/Riverside
Crescent Heights
Cliff Bungalow
Mission
South Calgary
Bankview
Sunalta
Rosscarrock
Kingsland
Communities with Low Property Values & High Crime Rates:
Erlton
Greenview
Low Median Income/Property Value Communities with Low Crime Rates:
Kincora
Copperfield
Bridlewood
Communities with High # of Businesses & High Crime Rates:
Cliff Bungalow
Mission
Kingland
Forest Lawn
Bridgeland & Riverview
Crescent Heights
Hillhurst
Sunalta
South Calgary
Rosscarrock
Kingsland
There is a negative relation between the community median income and crime rate
TRUE
2. There is a negative relation between community median property value and crime rate
FALSE
3. There is a positive relation between the number of businesses and crime rate
TRUE
Best Predictors for Crime Rate?
Count of Crimes
Median Household Income
Assessed Property Value
library(httr)
library(dplyr)
library(lubridate)
library(jsonlite)
library(stringr)
library(plotly)
library(rvest)
library(car)
library(ggfortify)
library(corrplot)
url <-'https://data.calgary.ca/resource/vdjc-pybd.json?$limit=50000&$offset=0'
business.license.data <- httr::GET(url = url)
business.license.data <- content(business.license.data, as = "text")
busn.license <- do.call("rbind.data.frame", lapply(business.license.data, fromJSON))
busn.license$comdistnm <- as.factor(busn.license$comdistnm)
# scraping the business license data
table(busn.license$comdistnm)
as.data.frame(table(busn.license$comdistnm))
busn.license.count <- as.data.frame(table(busn.license$comdistnm)) # taking just a count of businesses per community
names(busn.license.count) <- c("community.name", "count.businesses") # renaming columns
#scrape median household incomes in calgary
scrape_income_html <- read_html(
"https://great-news.ca/demographics/?fbclid=IwAR0IXYkRJbOm2REaocdcfmZ_0ORh2htlbmRm-iFIu9k90NrCBbG3E0Axn08")
Table.Income <- html_nodes(scrape_income_html,
css = "table")
scraped_income_data <- as.data.frame(html_table(Table.Income))
scraped_income_data$Community <- toupper(scraped_income_data$Community)
#filter the income data by the communities that show up in our data frame
filtered_scraped_income_data <- scraped_income_data %>%
filter((scraped_income_data$Community %in% busn.license.count$community.name))
filtered_scraped_income_data$Median.Household.Income <- gsub("[\\$,]", "", filtered_scraped_income_data$Median.Household.Income)
income.data <- as.data.frame(filtered_scraped_income_data$Community)
income.data$median.income <- filtered_scraped_income_data$Median.Household.Income
# Rename the community name column
names(income.data)[names(income.data) == "filtered_scraped_income_data$Community"] <- "community.name"
#this table has the median household income data
community.information <- busn.license.count %>%
filter((busn.license.count$community.name %in% scraped_income_data$Community))
community.information <- merge(community.information, income.data, by.x = "community.name")
community.information$median.income <- as.numeric(community.information$median.income)
# Crime Rate, count of crime, and count of disorder
# Population Data is extracted from City of Calgary open data using an API ##
pop.url <- 'https://data.calgary.ca/resource/cc4n-ndvs.json'
calgary.population.data.raw <- httr::GET(url = pop.url)
raw_content_population <- httr::content(calgary.population.data.raw, as = "text")
population.data <- do.call("rbind.data.frame", lapply(raw_content_population, fromJSON))
population.data$res_cnt <- as.numeric(as.character(population.data$res_cnt))
population.data <- cbind(population.data[, 1:128], population.data$multipolygon)
## This Data frame contains info for only residential communities in Calgary ##
residential <- population.data %>%
filter(class == "Residential")
residential.communities <- unique(residential$name)
## Extracted Crime Data from Calgary Open Data ##
url <- 'https://data.calgary.ca/resource/848s-4m4z.json?$limit=130000&$offset=0'
calgary.incident.raw.data <- httr::GET(url = url)
raw_content_incident <- httr::content(calgary.incident.raw.data, as = "text")
incident.data <- do.call("rbind.data.frame", lapply(raw_content_incident, fromJSON))
incident.data <- cbind(incident.data[ , 1:10], incident.data$geocoded_column)
## We isolated crime data to 2018 incidents, for the sake of simplicity ##
crime.data <- incident.data %>%
filter(group_category == "Crime" & year == "2018")
## converted certain colunms from character to numeric ##
crime.data$latitude <- as.numeric(as.character(crime.data$latitude))
crime.data$longitude <- as.numeric(as.character(crime.data$longitude))
crime.data$count <- as.numeric(as.character(crime.data$count))
## Grouped crime data by community name ##
crime.by.community <- crime.data %>%
group_by(community_name) %>%
summarise(crime.count = sum(count))
## Filtered solely for the residential communities we defined above ##
crime.by.community.residential <- crime.by.community %>%
filter(community_name %in% residential.communities)
crime.by.community.residential <- crime.by.community.residential[order(crime.by.community.residential$community_name), ]
residential.communities <- unique(crime.by.community.residential$community_name)
residential <- residential %>%
filter(name %in% residential.communities)
residential <- residential[order(residential$name), ]
## Combined the population data with the crime data ##
residential.stats <- cbind(residential, crime.by.community.residential)
residential.stats.2018 <- cbind(residential.stats[ , 1:10], residential.stats$crime.count)
residential.stats.2018$crime.rate <- (residential.stats.2018$`residential.stats$crime.count`/residential.stats.2018$res_cnt)*100000
## We do the same thing we did above for crime but this time for disorders ##
## Isolate the data and then merge with data with the dataframe we made above ##
incident.data$count <- as.numeric(as.character(incident.data$count))
disorder.data <- incident.data %>%
filter(group_category == "Disorder" & year == "2018")%>%
group_by(community_name)%>%
summarise(disorder.count = sum(count)) %>%
arrange(community_name)
disorder.data <- disorder.data[-c(1:47),]
names(disorder.data) <- c("name", "count.disorders")
residential.stats.2018.v2 <- merge(residential.stats.2018, disorder.data, by.x = "name")
## removed some redundant data and then named the colunms ##
residential.stats.2018.v3 <- residential.stats.2018.v2[,-c(2:10)]
row.names(residential.stats.2018.v3) <- NULL
names(residential.stats.2018.v3) <- c("community.name", "crime.count", "crime.rate", "disorder.count")
community.information1 <- merge(community.information, residential.stats.2018.v3, by.x = "community.name")
cor(community.information1[,2:6])
# adding in Community Assessed Home Values
PA.Summary = read.csv('PA_Summary.csv')
names(PA.Summary) <- c("community.name", "Assessed.Value", "lat", "long") # changing names
community.information2 <- merge(community.information1, PA.Summary, by.x = "community.name") # making the merge
corr <- cor(community.information2[,2:7]) # cool looking at correlation
corrplot(corr, method = "circle", type = "upper", order = "hclust")
names(community.information2) <- c("community.name", "count.businesses", "median.income",
"count.crimes", "crime.rate", "count.disorders",
"Assessed.Value", "lat", "long")
# i renamed because the # count thought it was a comment in rcode
# regression starts here
fit <- lm(crime.rate ~ count.crimes + count.businesses + count.disorders + median.income + Assessed.Value, data = community.information2)
summary(fit)
# count disorders, count crimes, and count businesses all have high p-values. Probably due to correlation
null.model = lm(crime.rate ~ 1, data = community.information2)
full.model = lm(crime.rate ~ count.crimes + count.businesses + count.disorders + median.income + Assessed.Value, data = community.information2)
step.forward <- step(null.model,
scope = list(lower = null.model, upper = full.model),
direction = "forward")
summary(step.forward)
step.backward <- step(full.model, direction = "backward")
summary(step.backward)
summary(step.forward)
# I will go forward with our step.backward model
# it was expected that we would cut disorders and count.businesses, due to seeing how highly correlated it was to count.crimes.
summary(step.backward)
vif(fit)
vif(step.backward)
# vif goes down with no disorder and count.business
autoplot(fit, which = 2, ncol = 1) # testing normality
autoplot(step.backward, which = 2, ncol = 1) #testing normality w/ step.backwards. cuts businesscount and disorder
autoplot(fit, which = 4, ncol = 1) #cooks distance outliers
autoplot(step.backward, which = 4, ncol = 1) # cooks distance w/ step.backwards. cuts businesscount and disorder
cooks.D <- as.vector(cooks.distance(step.backward))
p <- 3 # Number of predictors in our model.
n <- nrow(community.information2) # n is the number of observations
influential.outliers <- which(cooks.D > qf(0.05, p + 1, n - p -1))
community.information2[influential.outliers, ]
# 4 major outliers. Beltline makes sense
# removing the MAJOR outliers
community.information3 <- community.information2[-influential.outliers, ]
fit.no.outliers <- lm(crime.rate ~ count.crimes + median.income + Assessed.Value, data = community.information3)
summary(fit.no.outliers)
autoplot(fit.no.outliers, which = 2, ncol = 1) # testing normality
autoplot(fit.no.outliers, which = 4, ncol = 1) #cooks distance outliers
cooks.D <- as.vector(cooks.distance(fit.no.outliers))
p <- 3 # Number of predictors in our model.
n <- nrow(community.information3) # n is the number of observations
influential.outliers <- which(cooks.D > qf(0.05, p + 1, n - p -1))
community.information3[influential.outliers, ]
# as we can see, with the cook's distance plot as well, this outlier is not nearly as influential as the previous 4 outliers.
# this cook's distance plot only goes past 0.15 for its influential outlier, as opposed to going past 4 cook's distance in the previous analysis
# if we compare the two normality qqplot's we can see that removing the outliers just steepens the normality plot. It does not
# fix the normality issues the plot is still only mostly normal within the first standard deviation. (-/+)
# removing the outliers dropped the R-squared by 0.2, which does drop the accuracy of our fit, but shows us how much the
# outliers were affecting the regression
# this is the Proportion tests
hoods <- unique(community.information3$community.name)
hoods1 <- residential.stats.2018.v2 %>%
filter(name %in% hoods) %>%
arrange(name)
community.information3$population <- hoods1$res_cnt
total.crime <- sum(community.information3$count.crimes)
total.population <- sum(community.information3$population)
for(i in 1:nrow(community.information3)){
results <- prop.test(x = c(community.information3[i,4], total.crime), n = c(community.information3[i,10], total.population), alternative = "greater")
community.information3[i,11] <- results$p.value
}
## This dataframe shows the top five worst neighbourhoods in Calgary for Crime ##
greatercrime <- community.information3[c(which(community.information3$V11 <= 0.05)), ]
greatercrime <- greatercrime[order(greatercrime$crime.rate, decreasing = TRUE), ]
greatercrime <- greatercrime[1:5, ]
for(i in 1:nrow(community.information3)){
results <- prop.test(x = c(community.information3[i,4], total.crime), n = c(community.information3[i,10], total.population), alternative = "less")
community.information3[i,12] <- results$p.value
}
## This dataframe shows the top five best neighbourhoods in Calgary for Crime ##
lowercrime <- community.information3[c(which(community.information3$V12 <= 0.05)), ]
lowercrime <- lowercrime[order(lowercrime$crime.rate), ]
lowercrime <- lowercrime[1:5, ]
## This is the code for the 3d Scatter Plots ##
Sys.setenv("plotly_username"="btma431group")
Sys.setenv("plotly_api_key"="YIIqXN1J7J5jhCLUW3DQ")
crime.rate.plot <- plot_ly(community.information3, x = community.information3$median.income, y = community.information3$Assessed.Value, z = community.information3$crime.rate,
marker = list(color = "red")) %>%
layout(title = "Crime, Income and Property Value", scene = list(xaxis = list(title = "Median Income"), yaxis = list(title = "Property Value"), zaxis = list(title = "Crime Rate")))
chart_link <- api_create(crime.rate.plot, filename = "3d Scatter Crime")
chart_link
disorder.rate.plot <- plot_ly(community.information3, x = community.information3$median.income, y = community.information3$Assessed.Value, z = community.information3$disorder.rate,
marker = list(color = "blue")) %>%
layout(title = "Disorders, Income and Property Value", scene = list(xaxis = list(title = "Median Income"), yaxis = list(title = "Property Value"), zaxis = list(title = "Disorder Rate")))
chart_link2 <- api_create(disorder.rate.plot, filename = "3d Scatter Disorder")
chart_link2