Variables used in cluster analysis:
Farm Density: This variable shows the number of farms in each Colorado county, that is, the farm density per county. It is the most important factor in deciding where to open a farm store, because farm owners and workers will be our largest target market.
Population: This is the population of each Colorado county. We include it in the clustering because the general population will also be a target market for our prospective business.
We saw that three distinct clusters emerge when we compare these variables against one another:
Cluster 1: High Population, Average Farm Density
Cluster 2: Average Population, High Farm Density
Cluster 3: Low Population, Low Farm Density
After looking at these clusters and what they represent, we quickly ruled out cluster 3, since a county with low population and low farm density would not be a sensible place to locate a farm store.
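One way to arrive at labels such as "High Population, Average Farm Density" is to average each variable within each cluster and compare the results. The sketch below illustrates that idea on made-up data; the `counties` data frame and its values are hypothetical and are not taken from Merged_Data.csv, so the numbers it prints say nothing about the real counties.

```r
# Hypothetical county data; the values are invented for illustration only
set.seed(123)
counties <- data.frame(
  Farm_Density = c(rnorm(10, 20, 3), rnorm(10, 60, 5), rnorm(10, 8, 2)),
  Population   = c(rnorm(10, 500000, 50000),
                   rnorm(10, 60000, 8000),
                   rnorm(10, 5000, 1000))
)

# Scale both variables so neither dominates the distance, then cluster with k = 3
fit <- kmeans(scale(counties), centers = 3, nstart = 25)

# Mean of each variable within each cluster, in the original units
profile <- aggregate(counties, by = list(cluster = fit$cluster), FUN = mean)
profile
```

Reading the rows of `profile` side by side shows which cluster is high, average, or low on each variable, which is the comparison behind the cluster names above.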
setwd("~/AREC_ps10")
# This script clusters Colorado counties by farm density and population
# load packages
library(pacman)
p_load(tidyverse,ggplot2,skimr,GGally,broom,ranger,rsample,caret)
p_load(janitor) # installs janitor only if it is missing, then loads it
# read in dataset
raw <- read_csv("Merged_Data.csv")
sumstats <- skimr::skim(raw)
sumstats #print the summary stats
ggpairs(raw,columns = c("Value"))
raw %>%
  select(Farm_Density, Population) %>% #subset only our variables of interest
  mutate(across(everything(),log)) %>% #log transform each of the variables
  ggpairs() #plot the transformed variables
raw %>%
  filter(Farm_Density>35) %>% #keep only counties with farm density above 35
  select(Farm_Density, Population) %>% #subset only our variables of interest
  mutate(across(everything(),log)) %>% #log transform each of the variables
  ggpairs() #plot the transformed variables
#######Cluster Analysis
# scale data
data_scaled <- raw %>%
  select(Farm_Density, Population) %>% #subsetting only the quantitative data
  scale()
# perform k-means clustering with k=3
set.seed(123) # for reproducibility
kmeans_fit <- kmeans(data_scaled,
                     centers = 3, #the number of clusters
                     nstart = 25) #the number of random starts
# add cluster labels to the county-level data; raw already holds County and fips,
# so no join is needed (joining raw to a subset of itself only duplicates columns as .x/.y)
# the label is stored as a factor so the random forest treats it as a class
data_clustered <- raw %>%
  mutate(cluster = factor(kmeans_fit$cluster))
#######Classification
#Divide data into training and testing sample
set.seed(123)
data_split <- initial_split(data_clustered,prop=.7)
train_data <- training(data_split)
test_data <- testing(data_split)
#Fit the random forest model to predict cluster membership
rf_model <- ranger(cluster ~ Farm_Density + Population, #specify the model like a regression
                   data = train_data,
                   num.trees = 500)
#Predict classification of test data
rf_predict <- predict(rf_model,data = test_data)
#Build the confusion matrix
cm <- confusionMatrix(rf_predict$predictions, #calling our predictions from the previous command
                      test_data$cluster) #comparing our modeled classification against the k-means labels
#print the output
cm
#Predict the cluster for every county and attach it to the data
all_predict <- predict(rf_model,data = data_clustered)
output_data <- data_clustered %>%
  mutate(pred_cluster = all_predict$predictions)
write_csv(output_data,"analyzed_data.csv")
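The script fixes the number of clusters at 3. A common way to sanity-check such a choice is an elbow plot of the total within-cluster sum of squares across several values of k. The sketch below shows the idea on hypothetical data; `counties` and its values are invented for illustration, and this check is not part of the original analysis.

```r
# Hypothetical county data; the values are invented for illustration only
set.seed(123)
counties <- data.frame(
  Farm_Density = c(rnorm(10, 20, 3), rnorm(10, 60, 5), rnorm(10, 8, 2)),
  Population   = c(rnorm(10, 500000, 50000),
                   rnorm(10, 60000, 8000),
                   rnorm(10, 5000, 1000))
)
scaled <- scale(counties)

# Total within-cluster sum of squares for k = 1 through 6
wss <- sapply(1:6, function(k) {
  kmeans(scaled, centers = k, nstart = 25)$tot.withinss
})

# Look for the k after which the curve flattens out
plot(1:6, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

If the curve bends sharply at k = 3 and flattens afterwards, that supports using three clusters; a gradual curve would suggest the choice is less clear-cut.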
In sheet 1 you see that the 3 clusters, represented by a circle, a square, and a plus symbol, separate into their own recognizable groups. This scatter plot shows that each cluster represents a distinct group of counties with little overlap with the other clusters.
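A scatter plot of this kind can be approximated in R with ggplot2 by mapping the cluster label to the point shape. This is a minimal sketch on hypothetical data, not a reproduction of sheet 1; the `counties` data frame, its values, and the axis labels are assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical county data; the values are invented for illustration only
set.seed(123)
counties <- data.frame(
  Farm_Density = c(rnorm(10, 20, 3), rnorm(10, 60, 5), rnorm(10, 8, 2)),
  Population   = c(rnorm(10, 500000, 50000),
                   rnorm(10, 60000, 8000),
                   rnorm(10, 5000, 1000))
)

# Cluster on the scaled variables and store the label as a factor
counties$cluster <- factor(
  kmeans(scale(counties), centers = 3, nstart = 25)$cluster
)

# One shape per cluster: 1 = open circle, 0 = open square, 3 = plus
cluster_plot <- ggplot(counties,
                       aes(x = Population, y = Farm_Density, shape = cluster)) +
  geom_point(size = 3) +
  scale_shape_manual(values = c(1, 0, 3)) +
  labs(x = "Population", y = "Farm density", shape = "Cluster")
cluster_plot
```

Using shapes rather than only colors keeps the groups distinguishable in grayscale printouts.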
In sheet 2 you see a map visualization of the clusters based on the counties they fall in. Cluster 1, the blue area, sits around Denver, where population is high and farm density is low. Cluster 2 has average population and high farm density. Finally, cluster 3 has low population and low farm density, so we did not pay much attention to the red counties, as they are not viable options for locating the store.