In my spare time I like to dabble in data. Below are some data-related projects I've undertaken along with notes, code, and notebooks for them.
Growing up in The Bronx, I remember visiting a single supermarket throughout my life. Fortunately for my family and me, the supermarket was within walking distance of our building and the supermarket staff would deliver groceries to our apartment. It wasn't until moving away from The Bronx that I realized just how lucky we were to be able to afford our groceries and to not have to use public transportation or costly cabs to get them. This wasn't the case for many people in The Bronx and throughout the country. Upon learning of "food deserts" across the US, I became interested in how the phenomenon occurs in the uniquely densely populated areas of NYC and what the city is doing to address the problem. Below I highlight details of my exploration into the problem of food insecurity in NYC in the best way I know how (math and data). At the end of each section for this project, you will find links to the Python scripts used for the analysis, as well as links to the datasets used, where possible.
1. Understanding the problem
My exploration began with understanding the relevant factors contributing to fresh food accessibility (or lack thereof). To this end I consider, throughout the city:
The number of fresh-food, full-service supermarkets
The number of bodegas (delis), which tend to provide only relatively cheap, unhealthy, processed food options
The population
The average number of vehicles available per household
The average cost of fresh food groceries
The per-capita income
The rate of diet-related serious illnesses such as:
Type II Diabetes
Heart disease
Cancer
For each of these factors, I explore both publicly and privately available data and look for the story in the numbers. The remainder of the study is broken up into 4 parts:
Obtaining and visualizing the relevant data
Building a quantitative model to encapsulate how each of these varies across the city
Looking for quantitative correlations (or lack thereof) between each of these factors
Drawing conclusions and making recommendations for addressing the problem of food insecurity in NYC
2. Obtaining and Visualizing the Relevant Data
Visualization of the available data allows us to understand any additional transformations we'd have to consider before modeling the state of food deserts in NYC. To access the data, I made use of NYC's Open Data portal as well as third party websites to create my own datasets where necessary. I began by understanding the spatial distribution of foodstores in NYC, with use of a dataset containing information on Retail Foodstores throughout NY state[1]. This data was parsed, along with data on zipcodes pertaining to NYC and geospatial data to show the distribution of foodstores throughout the city.
Categorizing stores
The label "food store" includes establishments such as bodegas (delis), small grocers, bakeries, supermarkets, and much more. As we are interested in understanding the number of both fresh-food markets and bodegas as a measure of fresh food availability, the first challenge is understanding which of these food stores can be categorized as either. A reasonable approach would be to consider the area of the location in question. To explore this, below I show the distribution in area of four different categories of stores: (1) those with the words "Deli" or "Grocery" in the name (Fig. 1, top left); (2) those with the words "Supermarket", "Market", or "Cooperative" in the name (Fig. 1, top right); (3) those with just the word "Supermarket" in the name; and (4) all food stores in NYC. I distribute the stores over 4 square footage bins (0-2,000 Sq. Ft., 2,000-5,000 Sq. Ft., 5,000-15,000 Sq. Ft., and above 15,000 Sq. Ft.), following an extensive study by Olivia Limone and Nadia Sanchez. It is generally true that bodegas/delis tend to be smaller in size than full-service supermarkets. However, a binary approach where we, e.g., "classify all food stores below a given square footage as delis and all others as fresh food markets" may misrepresent the number of stores in each category, due to the large disparity in the number of locations in different categories. For instance, considering the area distribution of stores, I find that there are:
~3,700 bodegas below 2,000 Sq. Ft., suggesting that at least 600 (14%) bodegas may be misclassified as fresh food stores
Only roughly 400 total stores which can definitively be classified as Supermarkets, 25% of which fall below 2,000 Sq. Ft. and may be misclassified as bodegas
Roughly 700 (53%) potential fresh food stores may be misclassified as bodegas
Roughly 100 (25%) supermarkets may be misclassified as bodegas
If we consider all NYC food stores and classify by store size, we may expect a city-wide bodega:fresh-food-store ratio of roughly 2:1. However, considering names of stores as classification criteria instead, we may expect a city-wide bodega:fresh-food-store ratio of 3:1 or an even more dire bodega:supermarket ratio of roughly 10:1.
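The size-based binning described above can be sketched as follows. This is a minimal illustration, not the code in size_distribution_histograms.py: the function name, the label strings, and the choice to put a store sitting exactly on a bin edge into the larger bin are all assumptions.

```python
from bisect import bisect_right
from collections import Counter

# Bin edges (in square feet) taken from the text; labels are illustrative.
BIN_EDGES = [2_000, 5_000, 15_000]
BIN_LABELS = ["0-2,000", "2,000-5,000", "5,000-15,000", "above 15,000"]


def size_bin(square_feet: float) -> str:
    """Return the label of the square-footage bin a store falls into.

    Assumption: a store exactly on an edge (e.g. 2,000 Sq. Ft.) goes
    into the larger bin.
    """
    return BIN_LABELS[bisect_right(BIN_EDGES, square_feet)]


def size_histogram(areas):
    """Count how many stores fall into each bin."""
    return Counter(size_bin(area) for area in areas)


# Hypothetical store areas, for illustration only.
print(size_histogram([800, 1_500, 2_000, 6_000, 20_000]))
```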
The classification of food stores into fresh food stores or other is clearly a complicated problem, because the limited data available from the NYC Open Data portal is not necessarily informative for classification purposes. In this study, I will proceed with what I consider potentially more informative (words in the names of stores) as a way to classify stores. Full-service supermarkets will be categorized as stores which contain "Supermarket" in the name. Fresh food markets (which include supermarkets) will be categorized as stores which contain either "Supermarket", "Market", or "Cooperative" in the name; from visual inspection of the dataset this category also includes specialty markets that mainly sell one type of product (such as butchers, fish markets, and bakeries) as well as small grocers and mini-markets (delis that sell limited produce). Finally, bodegas will be categorized as stores which contain either "Deli" or "Grocery" in the name. I will present my analyses considering fresh food stores generally, but also just supermarkets, as I believe the latter is potentially more informative in terms of food accessibility (I assume that, typically, people with limited time would prefer to visit a single full-service supermarket location to meet all of their food needs rather than visiting multiple smaller specialty markets).
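A rough sketch of the name-based classification described above is given here. The keywords come from the text; the matching rule (case-insensitive substring match) and the decision to return every label that applies, rather than picking one, are assumptions, and the store names are made up.

```python
# Keywords from the text. Matching by case-insensitive substring is an
# assumption; it will also match names such as "Delicious", so a real
# pipeline would likely need manual review of edge cases.
SUPERMARKET_WORDS = ("supermarket",)
FRESH_FOOD_WORDS = ("supermarket", "market", "cooperative")
BODEGA_WORDS = ("deli", "grocery")


def classify_store(name: str) -> set[str]:
    """Return every category whose keywords appear in the store name."""
    lowered = name.lower()
    labels = set()
    if any(word in lowered for word in SUPERMARKET_WORDS):
        labels.add("supermarket")
    if any(word in lowered for word in FRESH_FOOD_WORDS):
        labels.add("fresh_food")
    if any(word in lowered for word in BODEGA_WORDS):
        labels.add("bodega")
    return labels


# Hypothetical store names, for illustration only.
for store in ["Example Supermarket", "Corner Deli", "Fish Market", "Bakery"]:
    print(store, "->", sorted(classify_store(store)))
```

Returning a set keeps ambiguous names (for example one containing both "Deli" and "Market") visible instead of silently forcing them into one category.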
[1] Retail Food Store Data: https://data.ny.gov/Economic-Development/Retail-Food-Stores/9a8c-vfzj/about_data
Fig. 1: Distributions of the sizes of stores in the dataset based on categorization. I consider categorizing stores based on keywords in their names (blue figures), or by size alone (red figure). I find it optimal to categorize stores based on keywords in their names to avoid erroneous categorization.
Alternate Supermarket Dataset
For an alternative look at the distribution of supermarkets in NYC, I cross-referenced the data filtered by the name criterion (see above) for categorizing supermarkets with data obtained from a third-party website which lists supermarket locations around the country: supermarketpage. To acquire the data from supermarketpage, I scraped each page with location information on NY state supermarkets. I then parsed the scraped HTML to extract the full street address of each location, including zipcode. I cross-referenced the zipcode of each extracted street address with the NYC Open Data census tracts to find which locations are within NYC. Finally, I used the Bing Maps API to query the geolocation of each full address. The result is shown in the image below (Fig. 2, left). I find a total of 596 supermarkets throughout NYC using supermarketpage, which is close in number to the 414 supermarkets found using the name criterion filtering of data obtained using the NYC Open Data portal discussed above (see Fig. 1, bottom left). I also find that the geolocations of the supermarkets from supermarketpage coincide well with those from the NYC Open Data filtered dataset (see Fig. 2, right). However, each dataset appears to contain location information missing in the other. For example, the supermarketpage data shows that the area directly south of Central Park is densely populated with supermarkets, whereas the NYC Open Data filtered data shows a relative dearth of supermarkets in that area. To cover all potential NYC supermarket locations given these two datasets, I combined them and classified locations which are within 200 ft. of another location as duplicate entries, to avoid overcounting as much as possible. The combined dataset of NYC supermarkets includes a total of 902 locations, resulting in a city-wide bodega:supermarket ratio of nearly 5:1.
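The 200 ft. de-duplication step could look something like the sketch below. It assumes each location is a (latitude, longitude) pair in degrees and uses the haversine great-circle distance; the function names and the pairwise comparison strategy are illustrative assumptions, not the code actually used.

```python
import math

EARTH_RADIUS_FT = 20_902_231  # mean Earth radius (about 6,371 km) in feet


def haversine_ft(a, b):
    """Great-circle distance in feet between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_FT * math.asin(math.sqrt(h))


def merge_without_duplicates(primary, secondary, threshold_ft=200.0):
    """Combine two location lists, skipping points near an already kept one.

    Every point in `primary` is kept. A point from `secondary` is added only
    if it is more than `threshold_ft` away from all points kept so far.
    """
    merged = list(primary)
    for point in secondary:
        if all(haversine_ft(point, kept) > threshold_ft for kept in merged):
            merged.append(point)
    return merged


# Hypothetical coordinates: the first secondary point is about 36 ft from
# the primary point (a duplicate), the second is over half a mile away.
open_data = [(40.8500, -73.8800)]
scraped = [(40.8501, -73.8800), (40.8600, -73.8800)]
print(len(merge_without_duplicates(open_data, scraped)))
```

The pairwise comparison is quadratic in the number of locations, which is fine for on the order of a thousand supermarkets but would need a spatial index for much larger datasets.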
Fig. 2: Geospatial locations of supermarkets from a third-party website (supermarketpage.com, left) and those in the NYC Open Data dataset (right). The full list of supermarkets is the combination of each of these with duplicates removed.
Other Relevant Quantities
One of several metrics I use to quantify the extent of food insecurity in NYC is physical access to fresh food. To account for New Yorkers that have access to vehicles, and thereby to supermarkets which are further away, I consider the estimated percentage of households with at least one vehicle available throughout the city using consumer market data from SimplyAnalytics. Below I show this as a heat map (Fig. 3, left) for The Bronx as an example. In order to account for population density and affordability, I also consider the estimated population and per-capita income for each census tract in NYC using data from SimplyAnalytics.
Fig. 3: Heatmap of several quantities relevant to food accessibility and insecurity. I focus on The Bronx as an example. Shown from left to right are the percentage of households with at least 1 vehicle available, the population, and the per-capita income.
As I am also interested in understanding the public health outcomes related to food insecurity, I use census-tract level data available from the CDC on chronic diseases. These data show an epidemiological model-based prevalence of several chronic illnesses. Below I showcase an example of the geospatial distribution of some relevant illnesses (including High Blood Pressure, Diabetes, and Obesity).
Fig. 4: Heatmap of the prevalence of several illnesses that may be connected to food insecurity. Shown from left to right are the prevalence of high blood pressure, diabetes, and obesity.
Datasets used:
[1] https://data.ny.gov/Economic-Development/Retail-Food-Stores/9a8c-vfzj/about_data
[2] https://data.cityofnewyork.us/Health/Recognized-Shop-Healthy-Stores/ud4g-9x9z/about_data
[4] https://data.cityofnewyork.us/City-Government/2010-Census-Tracts/fxpq-c8ku
Python scripts:
pull_supermarkets_from_web.py: scrapes supermarketpage.com for supermarkets in NYC and uses Bing maps API to get their geolocations
size_distribution_histograms.py: shows histograms for the size distribution of supermarkets, bodegas, etc.
3. Building a model
With some understanding of the relevant quantities and how they change around the city, I decided to look for correlations between them to try to get a quantitative understanding of the extent of food insecurity in NYC. I quantify each relevant quantity using two metrics. The first metric assesses the nature of food accessibility in each census tract, by enumerating the number of supermarkets within walking and driving distance. The second metric assesses food expense, by considering the ratio of estimated food costs to per-capita income.
First Metric: food accessibility
In this study, I came across two terms commonly used when considering this problem:
Food Desert: formally defined in this USDA study as a region with at least 500 people and/or 33% of the population with zero supermarkets within 1 mile. Given the dense population of NYC, and the difficulty of walking 1 mile for groceries, I instead consider 0.5 miles a walkable distance and the reference population as 1,000 people. I interpret this to mean that for a region in NYC to be considered a food desert, the population:accessible-supermarket ratio (i.e., the Food Desert Index) should be above 1,000:1. An area with a ratio above this threshold meets my criterion for being categorized as a food desert.
Food Swamp: generally defined as urban environments where unhealthy food options (snack foods, fast food, etc.) outnumber fresh-food options.
To quantify the food accessibility throughout the city, I first discretize each NYC census tract into equidistant points on a grid. Gridpoints are separated by a distance of 0.15 miles, which I found to be a suitable resolution to ensure each census tract is sampled at least once. Below (Fig. 5) I show an example of what this discrete grid of points looks like for census tracts in The Bronx, with different gridpoint colors used for each census tract. To account for New Yorkers that have access to vehicles (and thereby to supermarkets which are further away), I also consider the estimated percentage of households with at least one vehicle available (see Fig. 3, left).
Fig. 5: Discretized grid of sample points used to calculate the average quantitative indices considered in this study. I show the grid in The Bronx as an example.
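One way to build such a grid is sketched below. It assumes each census tract is a simple polygon whose vertices are already in a planar coordinate system measured in miles (real tract geometries would first need to be projected), and it uses a standard ray-casting point-in-polygon test. The function names and the half-spacing offset of the first gridpoint are assumptions.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    `polygon` is a list of (x, y) vertices in order, without repeating
    the first vertex at the end.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) toward +x cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def grid_points(polygon, spacing=0.15):
    """Return equally spaced sample points that fall inside the polygon."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    nx = int((max_x - min_x) / spacing) + 1
    ny = int((max_y - min_y) / spacing) + 1
    points = []
    for i in range(nx):
        for j in range(ny):
            # Offset by half a step so points do not sit on the bounding box.
            x = min_x + (i + 0.5) * spacing
            y = min_y + (j + 0.5) * spacing
            if point_in_polygon(x, y, polygon):
                points.append((x, y))
    return points


# Hypothetical 1 mile x 1 mile square tract.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(len(grid_points(square)))
```

A tract smaller than the grid spacing can still end up with no sample points under this scheme, so a fallback (for instance sampling one interior point per tract) would be needed to guarantee full coverage.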
For each of the above sample points (Fig. 5), I ask:
How many bodegas are within walking distance? (0.5 miles)
How many supermarkets are within walking distance? (0.5 miles)
How many supermarkets are within driving distance? (2 miles - typically a 10-15 minute drive in the busier areas of the city)
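The three questions above amount to counting stores within a fixed radius of each sample point. A minimal sketch, assuming all locations are (latitude, longitude) pairs in degrees and using straight-line (great-circle) distance rather than street-network distance, could be:

```python
import math

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles


def haversine_mi(a, b):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(h))


def count_within(point, stores, radius_mi):
    """Number of stores no further than `radius_mi` from `point`."""
    return sum(1 for store in stores if haversine_mi(point, store) <= radius_mi)


def access_counts(point, bodegas, supermarkets, walk_mi=0.5, drive_mi=2.0):
    """Answer the three questions for one sample point."""
    return {
        "bodegas_walk": count_within(point, bodegas, walk_mi),
        "supermarkets_walk": count_within(point, supermarkets, walk_mi),
        "supermarkets_drive": count_within(point, supermarkets, drive_mi),
    }


# Hypothetical coordinates, for illustration only.
sample = (40.850, -73.880)
bodegas = [(40.851, -73.880), (40.852, -73.881)]
supermarkets = [(40.853, -73.880), (40.870, -73.880), (40.900, -73.880)]
print(access_counts(sample, bodegas, supermarkets))
```

The walking and driving radii are the 0.5 and 2 mile values from the list above; everything else (names, return format) is illustrative.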
I then consider different metrics meant to assess the extent of each of these terms. To quantify the extent of food swamps and deserts in NYC, I calculate, for each census tract
where
I then consider the average value of the above metrics for each census tract, which I show below (Fig. 6).
Fig. 6: Food Desert (left) and Food Swamp (right) Indices throughout the city. Higher values for each index signify a higher chance of the region being a food desert or swamp. I highlight regions which may be classified as food deserts and swamps using arrows.
In the left panel of Fig. 6, regions which appear as pink meet the criterion for being categorized as food deserts (i.e., with Food Desert Index < 0.001). Areas which appear as near-white or green are borderline food deserts (i.e., with Food Desert Index ~ 0.001) or do not meet my criterion for being food deserts, respectively. I find that there are a significant number of areas around NYC which may be considered food deserts, with the following neighborhoods potentially housing them:
Manhattan:
Clinton, Hell's Kitchen, and Chelsea
Bronx:
Hunts Point, Soundview (partially), Eastchester, Co-op City (partially), Mott Haven, Highbridge, Throgs Neck (borderline food desert)
Brooklyn:
small parts of East Flatbush, Sunset Park, and Windsor Terrace
Queens:
Blissville, Long Island City (partially), South Corona
Howard Beach, Breezy Point, Far Rockaway (these neighborhoods are along shores of different kinds so my assessment may not be suitable here as the buffers I consider include water!)
Neighborhoods along the Easternmost part of Queens may be misrepresented as food deserts because I don't account for supermarket locations on nearby Long Island
Staten Island:
Large swaths of Staten Island appear as food deserts or borderline food deserts. However, closer inspection of these census tracts reveals that large parks, golf courses, and cemeteries cover the midsection of the island but are not categorized as such in the NYC Open Data portal. Despite this, there appear to be true food desert locations on Staten Island because there are so few supermarket locations. Moreover, the analysis for the southern tip of Staten Island does not include data from nearby supermarkets in Perth Amboy, NJ (a 15 minute drive away). For these reasons, I exclude Staten Island from the remainder of the discussion.
In the right panel of Fig. 6, regions which appear as purple meet the criterion for being categorized as food swamps (i.e., with Food Swamp Index < 1). Areas which appear as near-white or green are borderline food swamps (i.e., with Food Swamp Index ~ 1) or do not meet my criterion for being food swamps, respectively. There are larger swaths of NYC that may be categorized as food swamps, where snack and processed food options may significantly outnumber fresh food options. I identify the following areas of NYC as potentially housing food swamps:
Almost the entirety of The Bronx (with the exception of the wealthier reaches in the Northern parts of the borough)
all of Upper Manhattan
Large parts of North Brooklyn and South Queens
Based on Fig. 6, food swamps may be a more relevant public health problem in NYC. Although there are a significant number of potential food deserts throughout the city, the area corresponding to food swamps is much larger.
Second Metric: food expense
Another important factor for food access is food expense. To assess the expense of food throughout the city, I consider similar dimensionless parameters to the food desert and swamp indices, but instead considering per-capita income and food price data. Factors I consider include:
Per-capita income data acquired from SimplyAnalytics
Food price data acquired by scraping the data on this site: Northeast region food pricing from the Bureau of Labor Statistics
According to the US Census Bureau, there are on average 2.56 members per household in NYC. I use this to determine the average cost of groceries per month. Assuming a cost of $400/month for a single person, this would come out to $1,024/month for a household. This estimate is higher than the national average, but lower than other estimates of grocery costs in NYC. I leave this estimate as a free parameter in my model to understand how modifying it affects my conclusions.
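The arithmetic behind these numbers can be written out as a short sketch. The per-person cost is kept as a parameter, in the spirit of treating it as a free parameter of the model. The `food_cost_share` helper (per-person food cost divided by monthly per-capita income) is an illustrative assumption about how the burden could be expressed, not the index definition used in this study.

```python
def household_food_cost(cost_per_person=400.0, household_size=2.56):
    """Estimated monthly grocery cost for an average household."""
    return cost_per_person * household_size


def food_cost_share(monthly_income_per_person, cost_per_person=400.0):
    """Fraction of one person's monthly income spent on food (assumed form)."""
    return cost_per_person / monthly_income_per_person


# $400 per person and 2.56 people per household, as in the text.
print(round(household_food_cost(), 2))

# Hypothetical resident earning $1,200/month: about a third goes to food.
print(round(food_cost_share(1200.0), 3))
```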
Given these assumptions, I calculate the food expense index by considering the ratio of monthly per-capita income to approximate monthly food costs:
where
Fig. 7: Food Expense Index throughout the city. Higher values of the index correspond to regions where estimated fresh food costs are a significant monthly financial burden.
As expected, this metric mostly tracks income throughout the city, but it allows me to highlight the regions where estimated monthly food costs may be a significant financial burden to the residents there. These areas (some of which are highlighted with arrows in Fig. 7) are regions where the estimated monthly food cost is close to monthly income (per person). The lowest ratios are close to 1-to-3, suggesting that a third of the monthly income goes to food alone. Some of these regions also coincide with food deserts and swamps, which provides some initial insight into the problem of food insecurity in NYC: regions where food is a significant financial burden may coincide with regions where fresh food is not readily available and tend to coincide with regions where cheaper, unhealthy food options may vastly outnumber fresh food.
4. Quantifying Food Insecurity in The Bronx, Manhattan, Brooklyn, and Queens
Given the set of metrics considered in the "Building A Model" section above, I considered the potential correlations that exist in the data. In particular, I am interested in which census tracts have low combined food availability and food affordability, and whether auxiliary data can inform us about these census tracts. To begin this exploration, below I show how the Food Expense Index correlates with the Food Desert and Food Swamp indices.
Fig. 8: Relationship between different indices that contribute to food insecurity in NYC. Left: Relationship between the Food Expense Index and the Food Desert Index. Points that fall above the black line can be classified as food deserts. There is no significant correlation between the Food Desert and Food Expense indices, indicating that food deserts, if they exist, can be found across income ranges. Right: Relationship between the Food Expense Index and the Food Swamp Index. There is a significant correlation between the Food Swamp and Food Expense indices, indicating that food swamps are likelier to be found in regions where fresh food is a significant financial burden (i.e., low-income regions).
The left panel of Fig. 8 shows the relationship between the Food Desert Index and the Food Expense Index. I mark with a black line the criterion for a census tract to be considered a food desert. Anything above the black line qualifies as a food desert. The right panel shows the corresponding relationship for the Food Swamp Index. I note:
Food deserts are not significantly prevalent in NYC, with only ~10% of census tracts potentially qualifying as such.
There is no significant correlation between food affordability in an area and the likelihood that the area is a food desert. Moreover, when food deserts do appear, one can find them in both low- and high-income regions.
However, there is a significant correlation between food affordability in an area and the likelihood that the area is a food swamp. Lower-income areas where fresh food is a significant financial burden (typically ~30% of the monthly income per person) also tend to be areas where unhealthy food options significantly outnumber fresh food options. This trend may be pointing to one of the realities of food insecurity for lower-income residents of NYC. For these residents, fresh food is a significant financial burden and they tend to be surrounded by plentiful unhealthy and cheaper processed food options. They may be incentivized to spend their already limited income on these readily available, typically cheaper, unhealthy food options.
For a combined look at food insecurity - be it from accessibility (measured by the Food Desert Index), higher access to unhealthy foods (measured by the Food Swamp Index), or financial access to fresh food (measured by the Food Expense Index) - I consider the combined, normalized food metric, defined as:
where each index is normalized to be at most equal to 1, such that
if the Food Desert Index is equal to 1, the area is very likely to be a food desert. The lower the value, the unlikelier it is to be a food desert.
if the Food Swamp Index is equal to 1, the area is very likely to be a food swamp. The lower the value, the unlikelier it is to be a food swamp.
if the Food Expense Index is equal to 1, food costs are a significant financial burden to the typical area resident. The lower the value, the lower the financial burden of fresh food.
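The exact definition of the combined metric is not reproduced in the text here. Purely as an illustration, the sketch below assumes the simplest combination consistent with the description: each component has already been normalized to lie between 0 and 1 (with 1 being the worst case) and the components are summed, so the result ranges from 0 to 3. Both the summation and the function name are assumptions.

```python
def combined_food_metric(desert, swamp, expense):
    """Sum three normalized indices (assumed form of the combined metric).

    Each argument must already be normalized to the range [0, 1],
    where 1 means the strongest signal of that kind of food insecurity.
    """
    for name, value in (("desert", desert), ("swamp", swamp), ("expense", expense)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} index must be normalized to [0, 1], got {value}")
    return desert + swamp + expense


# Hypothetical tract: not a food desert, but a strong food swamp with a
# moderate food cost burden.
print(round(combined_food_metric(0.2, 0.9, 0.5), 2))
```

Under this assumed form, a value above 1 can only occur when at least two components are non-zero or one is at its maximum alongside another contribution.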
I then consider how this metric varies around the city and how it correlates with quantities such as the per-capita income, access to vehicles, and rates of diet-related illnesses such as diabetes, heart disease, and cancer. Below, I show the geospatial distribution of the Food Insecurity Index. I highlight potentially troublesome areas with the Food Insecurity Index above 1, pointing to a significant amount of food insecurity stemming from either lack of access to fresh-food supermarkets, an overabundance of cheap unhealthy food options, or fresh-food costs being a significant financial burden (or all three). I label each of these areas using A-F for the sake of discussion.
Fig. 9: Left: Heatmap of the Food Insecurity Index, which accounts for fresh food access, prevalence of cheap and unhealthy food options, and financial burden of fresh food. Right: Currently open NYC FRESH stores (as of 06/22) (obtained from this site).
My analysis suggests that food accessibility is not a significant problem in NYC. Rather, the prevalence of cheap, unhealthy food options and the high financial burden of fresh food are more significant factors when considering food insecurity.
The City of New York has thus far mainly considered access to fresh food as a way to alleviate food insecurity, with initiatives such as the Food Retail Expansion to Support Health (FRESH) program, which gives financial incentives to real estate developers that include fresh food markets in their development plans for new housing. So far, roughly 30 FRESH markets have opened, mainly in regions with high food insecurity (see Fig. 9, right panel). Each of these FRESH stores is included in my original analysis. Removing these stores from the analysis does not significantly change my findings. In other words, there are plenty of fresh food stores available to the residents of those areas, even without the FRESH stores. To address food insecurity, it may instead be more efficient to target solutions at the financial burden of fresh food and the prevalence of unhealthy food options in low-income communities.
5. Conclusion: Impacts on Public Health and What to Do
What is the impact of food insecurity on public health?
To get an idea of how food insecurity may cause negative public health outcomes, I considered the correlation of the Food Insecurity Index with the prevalence of several chronic illnesses throughout the city. For each illness considered, I calculate the Pearson correlation coefficient, which gives a quantitative measure of the correlation between the disease prevalence and food insecurity. Below I show examples of the potential relationship between the prevalence of relevant illnesses and the food insecurity index. Above each figure I also show the Pearson correlation coefficient.
Fig. 10: Correlation between the prevalence of diet-related diseases and the Food Insecurity Index. The title of each panel shows the Pearson correlation coefficient for that case.
Below I also include a table of the Pearson correlation coefficient between the Food Insecurity Index and the prevalence of other diseases. I highlight entries that show a positive correlation.
Tab. 1: Pearson correlation coefficient between the prevalence of chronic illnesses and food insecurity in NYC.
I typically find positive (albeit weak) correlations between the prevalence of certain diseases and the Food Insecurity Index. This suggests that NYC residents in areas with relatively higher food insecurity (as calculated by my model) are at higher risk of serious chronic illnesses. This presents a public health problem for the city that may be addressed by specifically targeting food insecurity.
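For reference, the Pearson correlation coefficient between two equally long lists of values (for example, the Food Insecurity Index and a disease prevalence per census tract) can be computed as in the sketch below. The function name and the sample numbers are illustrative; in practice a library routine such as `scipy.stats.pearsonr` would do the same job.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up values for five hypothetical census tracts.
insecurity = [1, 2, 3, 4, 5]
prevalence = [2, 1, 4, 3, 5]
print(round(pearson_r(insecurity, prevalence), 2))
```

A value near +1 or -1 indicates a strong linear relationship, while values near 0 (like the weak positive correlations reported here) indicate that the linear relationship explains little of the variation.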
Clustering
Below I also use a k-means clustering algorithm to identify potential clusters of census tracts, characterized by their Food Insecurity Index, per-capita income, and prevalence of several diet-related ailments. In the left panel of Fig. 11 I show an example of what these clusters look like, focusing on the prevalence of high blood pressure. I can identify the following 3 clusters:
Red: census tracts with relatively high food insecurity, relatively low income, and relatively high prevalence of diet-related ailments.
Light green: census tracts with relatively low food insecurity, relatively high income, and relatively low prevalence of diet-related ailments.
Dark green: somewhere in-between these two groups, with medium levels of food insecurity, income, and diet-related ailment prevalence.
In the right panel of Fig. 11 I show the spatial distribution of these clusters to examine where in the city we should focus solutions to alleviate food insecurity.
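To illustrate the clustering step, here is a small self-contained k-means sketch on (Food Insecurity Index, per-capita income, disease prevalence) triples. It is not the code used for Fig. 11: the z-score standardization, the deterministic farthest-point initialization, and all sample values are assumptions chosen to keep the example short and reproducible. A library implementation such as scikit-learn's `KMeans` would be the usual choice.

```python
import math


def standardize(rows):
    """Z-score each column so features on different scales are comparable."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]


def kmeans(rows, k=3, max_iter=100):
    """Lloyd's algorithm with a simple farthest-point initialization."""
    centroids = [rows[0]]
    while len(centroids) < k:
        # Next centroid: the row farthest from all centroids chosen so far.
        centroids.append(max(rows, key=lambda r: min(math.dist(r, c) for c in centroids)))
    labels = []
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(r, centroids[j])) for r in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Made-up tracts: (food insecurity index, per-capita income in $, prevalence in %).
tracts = [
    (1.8, 20_000, 35), (1.7, 22_000, 34),   # high insecurity, low income
    (1.0, 45_000, 28), (1.1, 47_000, 27),   # in between
    (0.3, 90_000, 20), (0.4, 95_000, 21),   # low insecurity, high income
]
labels, _ = kmeans(standardize(tracts), k=3)
print(labels)
```

Standardizing first matters because income is numerically thousands of times larger than the other two features and would otherwise dominate the distance calculation.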
What can be done about food insecurity in NYC?
Food access is not the main factor contributing to food insecurity in NYC. The bigger issues contributing to food insecurity are the combination of the financial burden of fresh food and the prevalence of food swamps. Thus, a more effective way of addressing food insecurity throughout the city may be to target these factors instead. For example, the City of New York could:
Increase financial incentives for buying fresh food. Typically in NYC this has taken the form of disincentives for unhealthy foods (for example, sugar taxes). However, without directly alleviating the financial burden associated with fresh food, these solutions only increase the financial burden of living in a food swamp. A double-sided solution to the problem, where cheap, unhealthy food options are disincentivized and purchasing of fresh food is incentivized (for example, via vouchers), is crucial.
Reduce the prevalence of food swamps. This could, for example, take the form of significant property taxes on chain fast-food restaurants in low-income areas, to disincentivize them from opening new locations there. It could also take the form of subsidizing fresh food inventory in bodegas throughout the city, by providing cheap fresh food inventory to bodegas with a price cap to pass savings on to customers.
Each of the aforementioned solutions would act to reduce the Food Swamp and Food Expense indices, thereby reducing food insecurity, ideally with the effect of reducing the prevalence of serious diet-related illnesses.
In this project, I consider a historical view of housing prices, demographics, income, and services available to understand the recent and distant history of gentrification in NYC. [work in progress]