Project 2 Problem Set 3

Part 2

Variables used in cluster analysis:

Farm Density: The number of farms in each Colorado county, that is, the county's farm density. This is the most important factor in deciding where to open a farm store, because farm owners and workers will be our largest target market.

Population: The population of each Colorado county. We include it in the clustering because the general population is also a target market for our prospective business.

Comparing the two variables against each other, we see three distinct clusters:

Cluster 1: High Population, Average Farm Density

Cluster 2: Average Population, High Farm Density

Cluster 3: Low Population, Low Farm Density

After looking at these clusters and what they represent, we quickly ruled out cluster 3, since a county with low population and low farm density would not be a sensible place to locate a farm store.

setwd("~/AREC_ps10")

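# This script clusters Colorado counties on farm density and population, then fits a random forest to the cluster labels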

# load packages

library(pacman)

p_load(tidyverse,ggplot2,skimr,GGally,broom,ranger,rsample,caret,janitor) #p_load installs any missing package and then loads it

# read in dataset

raw <- read_csv("Merged_Data.csv")

sumstats <- skimr::skim(raw)

sumstats #print the summary stats

ggpairs(raw,columns = c("Value"))

raw %>%

select(Farm_Density, Population) %>% #subset only our variables of interest

mutate(across(everything(),log)) %>% #log transform each of the variables.

ggpairs() #plot the transformed variables

raw %>%

filter(Farm_Density>35) %>% #keep only counties with farm density above 35

select(Farm_Density, Population) %>% #subset only our variables of interest

mutate(across(everything(),log)) %>% #log transform each of the variables.

ggpairs() #plot the transformed variables

#######Cluster Analysis

# scale data

data_scaled <- raw %>%

select(Farm_Density, Population) %>% #subsetting only the quantitative data

scale()

# perform k-means clustering with k=3

set.seed(123) # for reproducibility

kmeans_fit<- kmeans(data_scaled,

centers=3, #the number of clusters

nstart = 25) #the number of random starts

#create a dataframe with the county level attribute data

location_info <- raw %>%

select(County,Farm_Density, Population, fips)

# add cluster labels to dataset and join to location info

data_clustered <- raw %>%

mutate(cluster = kmeans_fit$cluster) %>%

inner_join(location_info,by="fips")

#######Classification

#Divide data into training and testing sample

set.seed(123)

data_split <- initial_split(data_clustered,prop=.7)

train_data <- training(data_split)

test_data <- testing(data_split)

#Fit the random forest model

rf_model <- ranger(factor(cluster) ~ Farm_Density.x + Population.x, #predict the cluster label; the .x suffix comes from the join above

data = train_data,

num.trees = 500)

#Predict classification of test data

rf_predict <- predict(rf_model,data = test_data)

cm <- confusionMatrix(rf_predict$predictions, #calling our predictions from the previous command

factor(test_data$cluster, levels = levels(rf_predict$predictions))) #comparing our modeled classification against the true cluster labels

#print the output

cm

all_predict <- predict(rf_model,data = data_clustered)

output_data <- data_clustered %>%

mutate(pred_sub_category = all_predict$predictions)

write_csv(output_data,"analyzed_data.csv")
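The cluster labels above (high, average, low) can be checked by averaging each variable within each cluster. The following is a minimal sketch of that check, not part of our original script: summarize_clusters is a hypothetical helper, and the toy values are made up for illustration and are not the Colorado county data.

```r
# Sketch only: these toy values are hypothetical, not the Colorado county data.
toy <- data.frame(
  Farm_Density = c(5, 6, 7, 40, 42, 45, 20, 22, 21),
  Population   = c(1000, 1200, 900, 30000, 28000, 31000, 500000, 520000, 480000)
)

# summarize_clusters is a hypothetical helper, not part of the original script.
# It clusters on the scaled variables, then reports means in the original units.
summarize_clusters <- function(df, k = 3, seed = 123) {
  set.seed(seed)                                      # for reproducibility
  fit <- kmeans(scale(df), centers = k, nstart = 25)  # same settings as the script above
  out <- aggregate(df, by = list(cluster = fit$cluster), FUN = mean)
  out$n_counties <- as.vector(table(fit$cluster))     # rows per cluster, in cluster order
  out
}

res <- summarize_clusters(toy)
print(res)
```

With the real data, the same call on the Farm_Density and Population columns of raw would show which cluster has the highest mean for each variable.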

Part 3

In sheet 1, the three clusters, represented by a circle, a square, and a plus symbol, separate into recognizable groups. The scatter plot shows that each cluster represents a distinct group with little overlap with the others.

In sheet 2 you see a map of the clusters by county. Cluster 1, the blue area around Denver, is high population and average farm density. Cluster 2 is average population and high farm density. Finally, cluster 3, shown in red, is low population and low farm density, so we did not pay much attention to the red counties, as they are not viable locations for the store.
