Predicting NYC 311 Call Volume
311 is a non-emergency phone number that allows New Yorkers to report problems, access municipal services, and request information. NYC311 is available 24 hours a day, 7 days a week, 365 days a year. NYC311 provides access to NYC government services through the call center, social media, mobile app, text, and video relay service.
NYC311 serves the public and handles all requests for government and non-emergency services, connecting residents, business owners, and visitors with the information and people who can help them best. The NYC311 service helps agencies improve service delivery by allowing them to focus on their core missions and manage their workload efficiently. It also provides insight to improve city government through accurate, consistent measurement and analysis of service delivery.
New York City, the most populous city in the United States, is home to over 8.4 million residents. To serve this massive and diverse population, the city operates the nation’s largest and most complex municipal government, with more than 350,000 city employees and 120 agencies, offices, and organizations offering over 4,000 different services to residents.
The number of 311 calls for service in New York City has increased every year since 2012. Resource allocation and the management of non-emergency incidents are important problems that need to be addressed for the smooth functioning of cities.
If NYC311 had the ability to accurately anticipate call volume for each responding agency and accurately predict the resolution time for different complaints, they could more efficiently staff government agencies and more effectively serve their communities.
The NYC311 dataset is automatically updated daily and was made public in October 2011. The log of requests dates back to 2010. In this dataset, each row represents a 311 service request. In total, the dataset has more than 25 million rows and 39 columns.
The dataset can be accessed via the Socrata Open Data API (SODA). SODA provides programmatic access to this dataset, including the ability to filter, query, and aggregate data. This is useful when dealing with such a large dataset because we can pull in only the specific data we want with a query. For example, this analysis will pre-filter on service requests where Status=Closed.
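As an illustration, a minimal pull of closed requests through SODA might look like the following. This is a sketch, not the project's actual query: the dataset identifier "erm2-nwe9" and the column names status and created_date are my assumptions about the public 311 dataset, and the sodapy client is just one of several ways to call the API.

```python
# Sketch: pull closed 311 requests created in 2020 through the SODA API.
# The dataset id "erm2-nwe9" and column names are assumed, not taken from this project.
import pandas as pd
from sodapy import Socrata

client = Socrata("data.cityofnewyork.us", None)   # unauthenticated access is rate-limited

records = client.get(
    "erm2-nwe9",
    where="status = 'Closed' AND created_date >= '2020-01-01T00:00:00'",
    limit=50000,                                   # SODA pages results; loop with an offset for more
)
requests_df = pd.DataFrame.from_records(records)
```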
The dataset does not need much cleaning as it is already tidy. Tidy data makes it easy to extract needed variables because it provides a standard way of structuring a dataset. When a dataset is tidy, every column is a variable, every row is an observation, and every cell is a single value. In the NYC311 dataset, each row represents a single 311 entry event.
Key features I will be using in my machine learning models include borough, time of request creation, complaint type, response time, and assigned agency. These features are mainly text, numeric, and date values.
This study will use groups as the unit of analysis. Data will be examined and grouped by agency, borough, and complaint type.
What has been done already in this area or with this dataset by others?
Most studies and projects using this dataset have been exploratory in nature. Because the NYC 311 dataset does not need much cleaning, it is a good dataset for both researchers and students to practice new data visualization techniques on.
The location features in this dataset, which include latitude, longitude, and zip code, are also popular to work with, although I will not be including them in this study. Researchers use this information to build interactive visualizations in Tableau.
The figure above was created by Avonlea Fisher (2020) as practice with WordCloud implementation.
On the left, Aaron Owen (2017) experimented with Shiny, an R package for building interactive web applications.
On the right is another figure that was created by Fisher (2020) to experiment with Plotly functionality.
The analysis on the right by Humberto Hernandez in 2018 used a combination of text mining and sentiment analysis to see how satisfied NYC residents are with the 311 service. Surprisingly, the study found that the majority of “NYC311” tweets (~67%) expressed a positive or neutral sentiment.
A project completed by Ikonomakis, Cambiaghi, & Cannistrà in 2017 incorporated machine learning, as it aimed to predict what could happen to your house if you live in one of the main neighborhoods of New York, at any given time of the year. The analysis allowed the user to select a borough, month, and time of day, and it would predict the complaint type.
What are the gaps?
Because most research with this dataset is exploratory in nature, there are few studies that incorporate machine learning into their analysis. The models I have found aim to predict the most common complaint type. I have not found models that aim to predict how long it will take to resolve a 311 request based on complaint type.
Additionally, at the beginning of 2020, this dataset was updated to include complaint types related to COVID-19. These include "Business not in compliance," "Mass Gathering," "Restaurant/Bar Not in Compliance," "Business not allowed to be open," and "Social Distancing." Because these features are new, there is room for analysis on COVID-related complaint types.
Are you trying to close the gaps or trying to create something novel?
This project aims to close gaps in prior research.
Can we train a model to accurately predict the response time for certain complaint types? This would help New Yorkers set expectations for the time it would take for issues to get resolved and provide government agencies with key information on where resources are needed.
For my exploratory data analysis, I pulled all NYC 311 service requests from 2020 with a status=closed. I only included closed tickets because I wanted to create a column that looks at the time difference between opening and closing a ticket.
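A sketch of that step, continuing from the pull above and assuming the public dataset's created_date and closed_date column names, might look like this; it also builds the daily call-count series that the graphs below and the time series model later rely on.

```python
# Sketch: derive resolution time per ticket and the daily call-volume series.
import pandas as pd

requests_df["created_date"] = pd.to_datetime(requests_df["created_date"])
requests_df["closed_date"] = pd.to_datetime(requests_df["closed_date"])

# Hours between a ticket being opened and closed
requests_df["resolution_hours"] = (
    requests_df["closed_date"] - requests_df["created_date"]
).dt.total_seconds() / 3600

# Number of 311 requests created per day in 2020 (the series modeled later)
daily_calls = requests_df.set_index("created_date").resample("D").size().rename("call_count")
```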
The below graph shows the 311 request history for 2020:
As you can see in the above graph, there are several days where 311 request volume is higher than average. The next two visualizations are deeper dives into the spikes on 7/4 and 7/5.
On the left, we are looking at when the spikes in 311 requests occurred on these peak days. On the right, we are looking at what complaint types were recorded along with these requests.
The image above shows the top 15 complaints New Yorkers are making. Residential Noise is by far the most common complaint. I took this a step further by looking at trends in Residential Noise. The image on the right shows that Residential Noise complaints are highest in warm weather months.
This chart on the left is an example of the average time to resolve an issue for a sample of complaint types. As you can see, NYC is quicker to respond to urinating in public complaints compared to wood pile complaints.
On the right, we can see the breakdown of 311 requests by borough. It is important to understand the population of these boroughs when viewing the distribution of 311 request volume. The image below is taken directly from the Census Bureau and includes population estimates for each borough. As you can see, the distribution of 311 request volume is in line with population estimates.
The graph on the left looks at the descriptor feature. For background, the descriptor is associated with the complaint type feature and provides further detail on the incident or condition. Descriptor values depend on the complaint type and are not always required.
The graph shows the top descriptors broken out by borough. It suggests that the Bronx logs the most loud music and party complaints of any borough in NYC.
The graph below shows the average resolution time for a 311 request by agency (in days). The Department of Information Technology & Telecommunications takes the longest to respond to requests, while the Department of Homeless Services (DHS) and the New York Police Department (NYPD) are the quickest to respond. The nature of the requests each department responds to also plays a role in how quickly they respond. See the charts below for more details.
It makes sense that the NYPD responds faster to requests when they are responding to items like Illegal Fireworks and Blocked Driveways.
DOITT takes longer to respond to complaints because it focuses solely on Public Payphone complaints and LinkNYC complaints. LinkNYC is replacing payphones across NYC and provides fast, free public Wi-Fi, phone calls, device charging, and a tablet for access to city services, maps, and directions.
To predict the response time for a 311 request, I will implement an ARIMA model. Creating a model that can forecast response times would be valuable to city residents so they can set expectations for how long their issues will take to be resolved.
Auto Regressive Integrated Moving Average (ARIMA) is a statistical model used for forecasting time series data. The ARIMA forecasting algorithm is based on the idea that the information in the past values of the time series can alone be used to predict the future values.
AR: Autoregression. A model that uses the dependent relationship between an observation and some number of lagged observations.
I: Integrated. The use of differencing of raw observations (e.g. subtracting an observation from an observation at the previous time step) in order to make the time series stationary.
MA: Moving Average. A model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.
Each of these components is explicitly specified in the model as a parameter.
An ARIMA model is one where the time series has been differenced at least once to make it stationary and the AR and MA terms are combined.
Predicted Yt = constant + linear combination of lags of Y (up to p lags) + linear combination of lagged forecast errors (up to q lags); this is written out explicitly after the parameter list below.
An ARIMA model is characterized by 3 terms/parameters: ARIMA(p,d,q)
p: the order of the AR term
d: the number of differencing required to make the time series stationary (degree of differencing)
q: the order of the MA term
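Written out under these definitions, the prediction equation sketched above takes roughly the textbook ARIMA(p,d,q) form below, where Y′ is the (d-times) differenced series, the φ's are AR coefficients, the θ's are MA coefficients, and the ε's are past forecast errors. This is the standard formulation, not an equation taken from the project itself.

```latex
\hat{Y}'_t = c
  + \phi_1 Y'_{t-1} + \phi_2 Y'_{t-2} + \cdots + \phi_p Y'_{t-p}                          % AR: up to p lags of Y'
  + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \cdots + \theta_q \epsilon_{t-q}  % MA: up to q lagged errors
```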
A linear regression model is constructed and the data is prepared by a degree of differencing in order to make it stationary (to remove trend and seasonal structures that negatively affect the regression model).
A value of 0 can be used for a parameter, which indicates that that element is not used in the model. This way, the ARIMA model can be configured to perform the function of an ARMA model, or even a simple AR, I, or MA model.
If a time series has seasonal patterns, then we would need to add seasonal terms and the model would become a SARIMA (Seasonal ARIMA).
The objective is to identify the values of p, d and q.
Let’s recall the visualization of the number of 311 calls made per day in 2020.
First, we will look at the d (degree of differencing) parameter:
The first step in building an ARIMA model is to make the time series stationary.
To check whether the series is stationary, we use the Augmented Dickey-Fuller test (from the statsmodels package).
For an ARIMA model, you need differencing only if the series is non-stationary. Otherwise, no differencing is needed and d=0.
If the p-value is > 0.05, we go ahead with finding the order of differencing.
The p-value here is 0.346121, which is > 0.05, so we proceed with finding the order of differencing.
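A minimal version of that check, assuming the daily_calls series built earlier, looks like this with statsmodels:

```python
# Sketch: Augmented Dickey-Fuller test on the daily call-count series.
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, *_ = adfuller(daily_calls.dropna())
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.6f}")

# p-value > 0.05 -> cannot reject the unit-root null, so the series needs differencing
if p_value > 0.05:
    print("Series looks non-stationary; try d >= 1.")
```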
First-order differencing achieves a p-value of 0 and an autocorrelation plot that stays mostly within the confidence boundaries. However, the autocorrelation spikes outside the boundary at every 7th lag, revealing a weekly seasonality in the data. We need to implement SARIMA to account for that.
If the series has well-defined seasonal patterns, we enforce seasonal differencing (D=1) for the given seasonal frequency.
Because we can see clear weekly cycles in the autocorrelation plot above, we need to account for seasonality. As shown below, the seasonal spikes remain intact after the usual (lag-1) differencing, whereas seasonal differencing brings them back in line.
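As a sketch of how those two differencing passes and their autocorrelation plots can be produced (again using the assumed daily_calls series):

```python
# Sketch: ordinary vs. weekly seasonal differencing, with ACF plots for each.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

ordinary_diff = daily_calls.diff().dropna()    # lag-1 differencing removes trend
seasonal_diff = daily_calls.diff(7).dropna()   # lag-7 differencing removes the weekly cycle

fig, axes = plt.subplots(2, 1, figsize=(10, 6))
plot_acf(ordinary_diff, lags=40, ax=axes[0], title="ACF after lag-1 differencing")
plot_acf(seasonal_diff, lags=40, ax=axes[1], title="ACF after lag-7 seasonal differencing")
plt.tight_layout()
plt.show()
```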
As shown below, a model of SARIMA(3,1,1)(1,0,1)[7] is best.
The (3,1,1) terms make sense given our earlier visualizations: p=3 (AR) fits the spikes we see in the autocorrelation plot of the first-order differencing, d=1 (I) is the single differencing that gave us a p-value of 0, and q=1 (MA) would take more visualizations to confirm but fits the understanding that recent events matter most for this data.
(1,0,1,7) are the seasonal hyperparameters (P,D,Q) calculated for the 7-day season.
Next, I'll plug these values into the model and look at some visualizations.
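A sketch of that fit with statsmodels' SARIMAX, which also produces the four diagnostic panels discussed next; the exact training window and fit options used in the project are not shown in the text, so treat these as placeholders.

```python
# Sketch: fit SARIMA(3,1,1)(1,0,1)[7] to the daily call counts and plot diagnostics.
from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(
    daily_calls,
    order=(3, 1, 1),              # p, d, q
    seasonal_order=(1, 0, 1, 7),  # P, D, Q and the 7-day season
    enforce_stationarity=False,
    enforce_invertibility=False,
)
results = model.fit(disp=False)

# Residuals, residual density, Q-Q plot, and correlogram
results.plot_diagnostics(figsize=(12, 8))
```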
Top Left: The residual errors seem to fluctuate around a mean of zero and with the exception of a few spikes, have a reasonably uniform variance.
Top Right: The density plot suggests a fairly normal distribution with mean zero.
Bottom Left: All the dots should fall perfectly in line with the red line. Any significant deviations would imply the distribution is skewed.
Bottom Right: The Correlogram, aka, ACF plot shows the residual errors are not autocorrelated.
Any autocorrelation would imply that there is some pattern in the residual errors that is not explained by the model, so you would need to add more X’s (predictors) to the model.
Overall, it's not perfect: the spike from the storm still seems to be throwing off some of our values (shown by the heavy tails in the Q-Q plot). The correlogram, however, looks good. Now it's time to forecast.
So how did the SARIMA model do with forecasting? To evaluate this, I calculated the Root Mean Square Error (RMSE) and the R-squared of these 2 lines. These are good indicators of model performance because they represent the goodness of fit of a regression model.
I found that this model achieves an R-square of 0.71 which is fairly high, especially for a forecast. The ideal value for an R-squared is 1, so the closer the value is to 1, the better the model is fitted.
On the right side of this graph, the red line highlighted by the grey bar is a 14-day forecast with no actual data.
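A sketch of how that scoring and the 14-day extension can be produced from the fitted results object above; the project's exact evaluation split is not documented, so this simply compares the model's fitted values against the observed series.

```python
# Sketch: score the fit with RMSE and R-squared, then forecast 14 days ahead.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

fitted = results.fittedvalues
actual = daily_calls.loc[fitted.index]

rmse = np.sqrt(mean_squared_error(actual, fitted))
r2 = r2_score(actual, fitted)
print(f"RMSE: {rmse:.1f} calls/day, R-squared: {r2:.2f}")

# 14 days beyond the observed data, with confidence intervals for the grey band
forecast = results.get_forecast(steps=14)
print(forecast.predicted_mean)
print(forecast.conf_int())
```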
Let’s revisit my overall research question for this project. Can we train a model to accurately predict the response time for certain complaint types?
When we are evaluating how the SARIMA forecast model performed, it’s important to remember that this R-squared score comes from a model evaluated on the same data it was trained on, so it makes sense that the score would be high. Remember, ARIMA at its core is based on the idea that the information in past values alone can be used to predict future values.
The differencing applied in SARIMA also means the R-squared is computed on a stationary series; an ordinary R-squared computed on a series with trends or seasonal patterns can be misleadingly inflated.
A high R-squared score could also be driven by overfitting. In addition, the residual errors mostly fluctuated around a mean of zero, but there were several spikes. These spikes are due to unpredictable, stochastic events that influenced call volume, which in turn we would expect to influence response time. Examples of these events I found included extreme weather events, the pandemic, utility outages, and protests. The unpredictable nature of these large events makes it difficult to accurately predict response times.
To further evaluate how my SARIMA model performed, I wanted to compare it to a Random Forest model, a supervised, decision-tree-based algorithm that can be used for both classification and regression tasks. The X-values (features) I chose for this analysis were agency, complaint type, and borough, used to predict a Y-value of response time in hours.
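One plausible implementation of that comparison uses scikit-learn with one-hot encoding of the three categorical features and the regressor variant of Random Forest, since the target (response time in hours) is continuous. The column names, encoding, and settings here are assumptions, not the project's exact configuration.

```python
# Sketch: Random Forest on agency, complaint type, and borough -> resolution hours.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

features = ["agency", "complaint_type", "borough"]     # column names assumed
X = requests_df[features]
y = requests_df["resolution_hours"]

pipeline = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), features)]
    )),
    ("forest", RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
print(f"R-squared on held-out data: {pipeline.score(X_test, y_test):.2f}")
```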
The Random Forest model generated an R-squared score of 0.13. With an ideal R-squared score of 1, Random Forest didn't do nearly as well as SARIMA.
While this is a somewhat disappointing score, few other model types would run in a reasonable time on the volume of data involved.
I found NYC311 call data to be unpredictable: stochastic events influence call volume, which in turn may impact response time.
Examples of this from the 2020 data include extreme weather events, the pandemic, utility outages, and protests.
Additional datasets could be integrated to model potential stochastic events and how these impact call volumes. For example, the integration of weather pattern data could assist in tagging calls in response to extreme weather.