Because the analysis needed NYC-specific zip codes, I filtered the data to the five NYC counties: New York (Manhattan), Queens, Kings (Brooklyn), Richmond (Staten Island), and Bronx. To get the spatial data, I reduced the Census ZCTA dataset to zip codes whose numeric values were under 11700, then inner joined the two tables with pandas to make a working zip code dataframe.
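A minimal sketch of this filtering and join, under stated assumptions: the column names (`zip`, `county`, `zcta`) and the toy rows are hypothetical, since the actual column names of the source datasets are not given here.

```python
import pandas as pd

NYC_COUNTIES = ["New York", "Queens", "Kings", "Richmond", "Bronx"]

# Hypothetical zip code table with a county column (rows are illustrative).
zips = pd.DataFrame({
    "zip": ["10001", "11201", "11795", "12207"],
    "county": ["New York", "Kings", "Suffolk", "Albany"],
})

# Hypothetical ZCTA table; a real one would also carry a geometry column.
zcta = pd.DataFrame({"zcta": ["10001", "11201", "11795", "12207"]})

# Keep only rows in the five NYC counties.
nyc_zips = zips[zips["county"].isin(NYC_COUNTIES)]

# Keep only ZCTAs whose numeric value is under 11700.
zcta_small = zcta[zcta["zcta"].astype(int) < 11700]

# Inner join: only zip codes present in both tables survive.
zip_df = nyc_zips.merge(zcta_small, left_on="zip", right_on="zcta", how="inner")
```

With these toy rows, only 10001 and 11201 remain in `zip_df`.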
I reorganized the dataframe's columns so that it could be queried more easily. After formatting the zip code dataframe for later merging, I took the median incomes for households, families, married couples, and non-families and averaged them for each zip code into an "income" column. I then created a categorical feature, "income_category", that uses regular expressions to break the average median income range into sections of $20,000. Finally, I left joined the tables to create a zip code - income dataframe, preserving every zip code.
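The averaging and binning can be sketched as follows. This is an illustration rather than the code used in the project: it swaps in `pandas.cut` for the regular-expression approach, and the median-income column names and values are invented.

```python
import pandas as pd

# Hypothetical median-income columns for three zip codes (values invented).
income_df = pd.DataFrame({
    "zip": ["10001", "10002", "10003"],
    "household": [90000, 35000, 120000],
    "family": [110000, 40000, 150000],
    "married_couple": [130000, 50000, 180000],
    "non_family": [70000, 25000, 90000],
})
median_cols = ["household", "family", "married_couple", "non_family"]

# Average of the four medians per zip code.
income_df["income"] = income_df[median_cols].mean(axis=1)

# Build $20,000-wide bin edges that cover the observed income range.
width = 20_000
upper = int(income_df["income"].max() // width + 1) * width
edges = list(range(0, upper + width, width))
labels = [f"${lo:,}-${lo + width:,}" for lo in edges[:-1]]

# right=False makes each bin include its lower edge and exclude its upper edge.
income_df["income_category"] = pd.cut(
    income_df["income"], bins=edges, labels=labels, right=False
)
```

A left join such as `zip_df.merge(income_df, on="zip", how="left")` would then keep every zip code from the zip code table, including those with no income data.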
The DSNY datasets were not originally organized by zip code, only by operational district and community district. I wrote getDistrictZip(), getZip(), and getZipDistrict(), which iterate through the district, litter basket, and zip code dataframes and use the Shapely polygon methods overlaps() and within() to fill in the information each dataset lacked. I was then able to group the baskets and drop off sites by the zip code they reside in and merge them into the zip code - income dataframe.
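The spatial lookups can be illustrated with a small sketch. The helper names `find_zip` and `district_zips` are hypothetical stand-ins (not the project's getZip() or getDistrictZip()), and the square polygons are invented for illustration.

```python
from shapely.geometry import Point, Polygon

# Toy square "ZCTA" polygons keyed by zip code (coordinates invented).
zcta_polygons = {
    "10001": Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
    "10002": Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
}

def find_zip(point, polygons):
    """Return the first zip code whose polygon contains the point, else None."""
    for zip_code, polygon in polygons.items():
        if point.within(polygon):
            return zip_code
    return None

def district_zips(district, polygons):
    """Return zip codes whose polygons overlap, or lie inside, a district."""
    return [
        zip_code
        for zip_code, polygon in polygons.items()
        if district.overlaps(polygon) or polygon.within(district)
    ]

basket = Point(0.5, 0.5)   # falls inside the first square
outside = Point(5, 5)      # falls inside no polygon
district = Polygon([(0.5, -1), (1.5, -1), (1.5, 2), (0.5, 2)])
```

Here `find_zip(basket, zcta_polygons)` returns "10001", while `find_zip(outside, zcta_polygons)` returns None. A point that falls inside no polygon gets no zip code and would drop out of a per-zip count.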
To visualize the litter baskets, food drop off sites, zip codes, and DSNY districts, I used Folium to create a map of these features. I wrote getXCoord() and getYCoord() to extract the locations from the csv datasets. Using Folium's MarkerCluster class, I created markers for both the litter baskets and the food drop off sites and placed them on the map. I then added the shapefile polygons of the ZCTA data and the DSNY districts as separate layers. Finally, I uploaded the map to Datapane so it could be embedded into the website.
To check the validity of some of my predictions, I used Seaborn and Matplotlib to create a bar plot of the 20 zip codes with the most public litter baskets, obtained by sorting the number of baskets in ascending order. I also created a pie chart of the existing food scrap drop off sites, showing each zip code's share of the total number of sites.
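The data preparation behind these two charts might look like the sketch below. The column names and counts are invented, and the top 2 is taken instead of the top 20 to keep the example small. Plotting is omitted.

```python
import pandas as pd

# Hypothetical per-zip counts (values invented).
counts = pd.DataFrame({
    "zip": ["10001", "10002", "10003", "10004"],
    "num_baskets": [120, 45, 300, 80],
    "num_sites": [2, 0, 5, 3],
})

# Ascending sort, then take the tail so the largest counts come last.
top_n = counts.sort_values("num_baskets", ascending=True).tail(2)

# Each zip code's share of all drop off sites, in percent.
counts["site_share_pct"] = 100 * counts["num_sites"] / counts["num_sites"].sum()
```

`top_n` could then be passed to a Seaborn bar plot and `site_share_pct` to a Matplotlib pie chart.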
For the tobacco retail dealer license data, I wanted to analyze whether the number of active cigarette sellers in a zip code correlates with other litter resources and measurements. I reduced the dealer dataframe to the relevant columns, such as business name, zip code, and spatial data. I then grouped the data by zip code and left joined the result to the main dataframe.
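The group-and-join step can be sketched as below. The business names and rows are invented, and filling missing counts with 0 is an assumption of this sketch rather than something stated above.

```python
import pandas as pd

# Hypothetical dealer records (names and zip codes invented).
dealers = pd.DataFrame({
    "business_name": ["A Deli", "B Mart", "C Smoke", "D News"],
    "zip": ["10001", "10001", "10003", "10001"],
})
main_df = pd.DataFrame({"zip": ["10001", "10002", "10003"]})

# Count dealers per zip code.
dealer_counts = (
    dealers.groupby("zip").size().rename("num_tobacco_dealers").reset_index()
)

# Left join keeps every zip code in the main dataframe.
main_df = main_df.merge(dealer_counts, on="zip", how="left")

# Assumption: a zip code with no matching dealers is treated as having 0.
main_df["num_tobacco_dealers"] = main_df["num_tobacco_dealers"].fillna(0).astype(int)
```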
I followed the same routine with the other DSNY datasets: I made a column for community district and reduced the dataframe to the district columns and the columns for the different types of garbage collection. I first merged it with the DSNY district dataframe and used zipScrapSite() to determine whether each district had food scrap sites. I then left joined the result to the main dataframe on zip code.
For the scorecard ratings, I dropped the columns that do not pertain to the district or the current month's rating and grouped by district. I then merged this into the main dataframe through the district associated with each zip code.
From there I used Seaborn to create a heatmap of the correlation matrix between income, litter baskets, drop off sites, tobacco sellers, types of garbage collected, and scorecard ratings. This helped me draw connections to compare against the rest of my hypothesis. Because the correlation between the number of cigarette sellers and the number of baskets was higher than the others, I used sklearn's train_test_split and statsmodels to train and create a linear regression model for that pair. Using the R-squared value, I was able to determine how much of the variance in the number of baskets can be explained by the number of cigarette sellers.
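A small sketch of the correlation matrix step, with invented values and only three of the features. The income column name and the data are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical per-zip features (values invented).
features = pd.DataFrame({
    "income": [42000, 58000, 75000, 91000, 120000],
    "num_baskets": [150, 90, 210, 130, 300],
    "num_tobacco_dealers": [30, 12, 45, 20, 52],
})

# Pairwise Pearson correlation between every pair of columns.
corr = features.corr(method="pearson")
```

The resulting square dataframe is what a call such as `seaborn.heatmap(corr, annot=True)` would render as a heatmap.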
Only 2 of the top 20 zip codes had an income below the median.
Of the more than 23,000 litter baskets that were sorted by zip code, only 31.19% are in low-income areas.
The analysis of DSNY districts shows that every district has at least one food scrap drop off site.
However, based on zip code and income, only 33% of all sites are in lower-income neighborhoods.
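Shares like these can be computed along the following lines. This sketch assumes that "low-income" means an income below the median across zip codes, which is one possible reading, and all values are invented.

```python
import pandas as pd

# Hypothetical per-zip incomes and basket counts (values invented).
df = pd.DataFrame({
    "zip": ["10001", "10002", "10003", "10004"],
    "income": [30000, 50000, 90000, 130000],
    "num_baskets": [100, 200, 300, 400],
})

# Assumption: a zip code counts as low-income if it is below the median income.
median_income = df["income"].median()
low_income = df["income"] < median_income

# Percentage of all baskets that sit in low-income zip codes.
low_share_pct = 100 * df.loc[low_income, "num_baskets"].sum() / df["num_baskets"].sum()
```

With these toy numbers, the two zip codes below the median hold 300 of the 1,000 baskets, so `low_share_pct` is 30.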
# of Public Litter Baskets (Organized by Zip Code): 23,197*
# of Food Scrap Drop Off Locations (By Zip Code): 148*
# of DSNY Districts: 59 (3 in Staten Island (SI), 12 in Manhattan (MN), 14 in Queens (QW, QE), 12 in the Bronx (BX), 18 in Brooklyn (BKN, BKS))
*Some were not considered since their point spatial data was not contained in a ZCTA polygon object during analysis
After cleaning and filtering all the features into one main geodataframe, I created a correlation matrix and rendered it as a heatmap.
I chose Pearson correlation specifically because it measures the linear relationship between the variables by zip code.
From this heatmap, we can observe a correlation of 0.6 between num_tobacco_dealers (# of cigarette retail sellers) and num_baskets (# of public litter baskets) by zip code.
Building the linear regression with the statsmodels library gives an R-squared (coefficient of determination) value of 0.364.
This means that the number of cigarette retail sellers accounts for only 36.4% of the variance in the number of public litter baskets.
The regression line does not seem to fit most of the points closely, and the weight of the outliers above the line shifts the best fit.
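The correlation and R-squared figures above are consistent with each other. For a simple linear regression with one predictor and an intercept, evaluated on the data it was fit on, R-squared equals the square of the Pearson correlation, and 0.6 squared is 0.36. Exact agreement with 0.364 is not expected, since 0.6 is rounded and the model was fit on a training split. The sketch below checks the identity with NumPy on invented data. It is not the statsmodels model from the project.

```python
import numpy as np

# Invented counts per zip code, for illustration only.
num_tobacco_dealers = np.array([5, 12, 20, 33, 41, 58, 64, 80], dtype=float)
num_baskets = np.array([40, 95, 60, 150, 210, 130, 300, 260], dtype=float)

# Ordinary least squares fit of baskets on dealers, with an intercept.
slope, intercept = np.polyfit(num_tobacco_dealers, num_baskets, deg=1)
predicted = slope * num_tobacco_dealers + intercept

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((num_baskets - predicted) ** 2)
ss_tot = np.sum((num_baskets - num_baskets.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Pearson correlation between the same two variables.
r = np.corrcoef(num_tobacco_dealers, num_baskets)[0, 1]
```

On any such dataset, `r_squared` and `r ** 2` agree up to floating-point error.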
From the analysis conducted in this project, I can conclude that my original hypothesis was mostly false. While lower-income areas contain 31.19% of litter baskets and 33% of food scrap sites, this does not mean there is a strong relationship between income and sanitation factors. The correlation matrix and the linear regression model demonstrate that income does not contribute to these factors. To look further into this question, other factors must be studied and analyzed.