Intro
A common problem that entrepreneurs face is where to locate their company or business to capitalize and maximize profits. In this project we looked at where a business owner who is looking to open a farm store should locate in Colorado. We took into account population per county, total farms per county, and Tractor Supply locations. We felt Tractor Supply would be one of the main competitors within the region and wanted to make sure to not be located in the same vicinity of their current stores. When first looking into this we felt that farmers would be vital to our success but also non-farmer traffic as well since we will be looking to carry a variety of products that would also cater to the everyday people. Once we located our data that would allow us to create the visualizations we began to discover where would be the best fit for these future locations to have the highest farm density, largest populations, while still being a reasonable distance away from the nearest Tractor Supply. We also went ahead and looked at the actual population and farms per county on a land area basis after the presentation and those can be seen in the updated visualizations in the appendix. We calculated this by taking the values per county and then dividing them by the total land area for the same counties.
Question
What county or counties should you look at opening a local farm store that caters to not only the farming community in Colorado to garner the highest level of interest from potential customers?
Background
Your largest regional competitor across the state is Tractor Supply and you will not want to compete in small areas with an already establish company.
Farm density per county will be important to look at as those customers will be your largest target market.
The general population per county though will also be important to consider as farms stores also provide a lot of household hardware, pet supplies, and gardening/planting items that nonfarm population may be interested in.
Data
Tractor Supply Locations
POI Factory
This data provided us with Tractor Supply Locations across the United States, and we refined it to only Colorado locations.
Farm Density Data
We gathered this data from the USDA/NASS Quick Stats that gave us farm density in Colorado per county.
Estimated Population Data
We gathered this data from the Colorado Department of Local Affairs, it is total estimated population per county in Colorado
Analysis
We looked at doing a cluster analysis that compared farm density and population per county to give us clusters that were correlated. We later then did some more clusters based on the farm density and population density based on the land area within the county and got similar and different results. We believed that this was the best option because we were not looking into future values, only the current values that we found, and we were assuming the market for farm stores was not in equilibrium. Our data broke into 3 distinct clusters in both analyses, but they were a little different but the main take away from both sets of clusters was that they were very distinct in how they broke out. They included; low population low farm density, low population high farm density, and high population low farm density. We then compared those clusters back to the Tractor Supply locations to see what areas we would think would be the best areas to locate a new farm store. When we did the analysis by area we seen that Denver broke into its own cluster and the surrounding counties also came into interest, but after looking at the values and thinking about them the counties were very small and then were able to get high values based on the land area, but for example Broomfield was in the cluster of interest but only has 33 farms in the entire county and very dense population which is not exactly the target market we are looking for. So we thought that doing an analysis by land area was helpful for looking at traveling distances to a farm store, but counties like that have to be closely looked at to make sure that there is actually enough target market within the county even if it is small and in surrounding counties to sustain opening a business like a farm store.
Conclusion
After analyzing our results, we feel that the best fit to locate would be in southeast and northern Weld County or Larimer County. The reason for this is because they are an area of high farm density as well has having an overall population of 1.2 million and 1.5 million. Also, with locating in the areas suggested it allows us to be a safe distance away from the nearest Tractor Supply’s. We feel that for these reasons these are the most profitable areas to locate and begin to build the brand of their company before going into direct competition. While there any many other counties that seemed reasonable after we broke down the data by land area, when you look at the actual figures a lot of those counties rule themselves out as they are large enough that you wouldn’t expect someone to travel all the way across from other counties to buy products and some have very low number of farms but when you divide it by their small land area they look like they have a lot per the square area. So, we still believe that Larimer or Weld would be the best place to target if you were going to open a new farm store.
setwd("~/AREC_ps10")
# This script
# load packages
library(pacman)
p_load(tidyverse,ggplot2,skimr,GGally,broom,ranger,rsample,caret)
install.packages("janitor")
library(janitor)
# read in dataset
raw <- read_csv("Merged_Data.csv")
sumstats <- skimr::skim(raw)
sumstats #print the summary stats
ggpairs(raw,columns = c("Value"))
raw %>%
select(Farm_Density, Population) %>% #subset only our variables of interest
mutate(across(everything(),log)) %>% #log transform each of the variables.
ggpairs() #plot the tranformed variables
raw %>%
filter(Farm_Density>35) %>% #keep only stores with median distance from home <48km and median dwell less than 90 minutes
select(Farm_Density, Population) %>% #subset only our variables of interest
mutate(across(everything(),log)) %>% #log transform each of the variables.
ggpairs() #plot the tranformed variables
###################################################
#######Cluster Analysis
# scale data
data_scaled <- raw %>%
select(Farm_Density, Population) %>% #subsetting only the quantitative data
scale()
write_csv(data_scaled,"data_scaled.csv")
# perform k-means clustering with k=3
set.seed(123) # for reproducibility
kmeans_fit<- kmeans(data_scaled,
centers=3, #the number of clusters
nstart = 25) #the number of random starts
#create a dataframe with the store level attribute data not included in the clustering
location_info <- raw %>%
select(County,Farm_Density, Population, fips)
# add cluster labels to dataset and join to location info
data_clustered <- raw %>%
mutate(cluster = kmeans_fit$cluster) %>%
inner_join(location_info,by="fips")
##########
##Regression
m1 <- lm(Farm_Density.x ~ Population.x, #specifying the regression formula
data = data_clustered)
summary(m1)
new_data <- expand_grid(Farm_Density.x=seq(0,100000,1000),
Population.x=seq(0,5000,500))
fitted_data <- augment(m1,newdata = new_data) %>%
rename(number=.fitted)
write_csv(fitted_data,"fitted_data.csv")
#############################
#######Classification
#Divide data into training and testing sample
set.seed(123)
data_split <- initial_split(data_clustered,prop=.7)
train_data <- training(data_split)
test_data <- testing(data_split)
#Fit the random forest model
rf_model <- ranger(factor(Farm_Density.x) ~ Population.x, #specify the model like a regression
data = train_data,
num.trees = 500)
#Predict classification of test data
rf_predict <- predict(rf_model,data = test_data)
#
cm <- confusionMatrix(rf_predict$predictions, #calling our predictions from the previous command
factor(test_data$data_clustered)) #comparing our modeled classification against the true data
#print the output
cm
all_predict <- predict(rf_model,data = data_clustered)
output_data <- data_clustered %>%
mutate(pred_sub_category = all_predict$predictions)
write_csv(output_data,"analyzed_data.csv")
sink("Project_2_PS_2.txt")
source("Project_2_PS_2.R",echo=TRUE)
sink()
> setwd("~/AREC_ps10")
> # This script
>
>
> # load packages
> library(pacman)
> p_load(tidyverse,ggplot2,skimr,GGally,broom,ranger,rsample,caret)
> install.packages("janitor")
Error in install.packages : Updating loaded packages
> library(janitor)
> # read in dataset
> raw <- read_csv("Merged_Data.csv")
[1mindexing[0m [34mMerged_Data.csv[0m [======================================================================================] [32m378.81MB/s[0m, eta: [36m 0s[0m
> sumstats <- skimr::skim(raw)
> sumstats #print the summary stats
── Data Summary ────────────────────────
Values
Name raw
Number of rows 63
Number of columns 14
_______________________
Column type frequency:
character 4
logical 1
numeric 9
________________________
Group variables None
── Variable type: character ──────────────────────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max empty n_unique whitespace
1 Program 0 1 6 6 0 1 0
2 State 0 1 8 8 0 1 0
3 Ag District 0 1 9 22 0 6 0
4 County 0 1 4 11 0 63 0
── Variable type: logical ────────────────────────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean count
1 Watershed 63 0 NaN ": "
── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 Year 0 1 2017 0 2017 2017 2017 2017 2017 ▁▁▇▁▁
2 State ANSI 0 1 8 0 8 8 8 8 8 ▁▁▇▁▁
3 Ag District Code 0 1 51.4 30.6 10 20 60 75 90 ▇▁▁▇▅
4 County ANSI 0 1 61.5 36.6 1 30 61 92 125 ▇▇▇▇▇
5 Farm_Density 0 1 617. 669. 12 252. 399 845 4062 ▇▂▁▁▁
6 str_state 0 1 8 0 8 8 8 8 8 ▁▁▇▁▁
7 str_county 0 1 61.5 36.6 1 30 61 92 125 ▇▇▇▇▇
8 fips 0 1 8061. 36.6 8001 8030 8061 8092 8125 ▇▇▇▇▇
9 Population 0 1 91635. 184559. 788 5865 15717 45809 730395 ▇▁▁▁▁
> ggpairs(raw,columns = c("Value"))
Farm Density Per County