The data we used for this study came from the California State Water Resources Control Board.1 The original data sets had 17 attributes: Sample ID, Station ID, Station Name, Sample Date, Sample Time, Parameter, Qualifier, Result, Unit, Method, type, County, Station Description, Beach Name, Latitude, Longitude, and Creation Date. Parameter refers to the bacteria type measured and Result refers to the observed bacteria level. Method is the sampling method used to collect the bacteria. For the purposes of our study, we were only interested in eight of these attributes: Sample Date, Parameter, Result, Unit, Method, Latitude, Longitude, and Beach Name. The data came from three separate data sets, each with observations from different years: 2021 to 2023.
All of the pre-processing of the data was done using RStudio and several R libraries, namely ‘dplyr’, ‘tidyr’, and ‘stringr’. First, once the data was gathered then uploaded onto and read by RStudio, we combined the three sets into one using vertical (up-down) merging. Next, we subset the data with only the variables we were interested in. We also noticed that the Sample Date variable was read as a string, so we converted it to a Date data type and formatted it accordingly. We also filtered the Unit variable to only include ‘MPN/100ml’, as it had the most number of readings and allowed us to compare the data with our safe bacteria thresholds. Lastly, we removed any duplicate observations from the set.
In order to study the trends over time for monthly (30-day) means for bacteria level, we aggregated the observations for each beach/month/bacteria combination into an average. To do this, we grouped our data by Beach Name, Sample Date (which was now formatted as Year-Month), and Parameter, then took the mean Result based on those groupings. This gave us a table that included a mean bacteria level for each combination. We did the same for daily bacteria levels as well, since some days had multiple samples from the same beach. This resulted in a table with mean daily readings for each beach/day/bacteria combination. To ensure these tables were ready for use in a clean map visualization, we also took the mean Latitude and Longitude for each Beach Name grouping, since multiple stations exist at a beach, each with their own latitude and longitude.
Here is an example of the R code we used to clean the data and to create new tables for analysis:
beach_qual <- rbind(beach_qual_2021, beach_qual_2022, beach_qual_2023) %>%
select(c(SampleDate, parameter, Result, unit, method, Beach.Name, Latitude, Longitude)) %>%
mutate(SampleDate = as.Date(SampleDate,
format="%m/%d/%Y")) %>%
arrange(SampleDate, Station_ID) %>%
distinct()
beach_qual_mean <- beach_qual %>%
mutate(SampleDate = format(SampleDate, format="%Y-%m")) %>%
group_by(Beach.Name, SampleDate, parameter, unit) %>%
summarise(mean_by_month = mean(Result, na.rm=TRUE),
Latitude = mean(Latitude, na.rm=TRUE),
Longitude = mean(Longitude, na.rm=TRUE)) %>%
arrange(SampleDate) %>%
filter(unit == 'MPN/100ml')
Many beaches had little to no data, so when conducting our analysis on the cleanliness of individual beaches, we decided to only look at those beaches that had an average of at least 10 readings per month over the course of the three year span, or at least 360 individual observations. When determining cleanliness for a body of water, many entities consider at least five readings over the course of a 30-day period to be sufficient.
Additionally, the bacteria types E. Coli and Fecal Coliforms were tested less often than Enterococcus and Total Coliforms, so any analysis concerning these two bacteria may not be fully representative of the true underlying trends.