Latent Dirichlet allocation (LDA) is a text analysis method used to identify the most important terms in a body of text. Before applying the LDA model, the text data needs to be preprocessed. One such step uses the Inverse Document Frequency (IDF), which is the inverse of the fraction of documents in the collection that contain a specific term. The more documents a term appears in, the lower its IDF score. By computing IDF scores for all words and sorting them, the stop words can be collected. The number of topics, K, is an adjustable parameter of the LDA model and is used to cluster the articles into K types. After brief experimentation, we settled on K = 6, since it gave us a good number of keywords and was easy to interpret.
LDA allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. If these observations are words that are collected into text files, it posits that each text file is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics. As mentioned before, we use this model to determine the important keywords in articles about LinkNYC and broadband access.
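The IDF-based stop word step described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the toy corpus, the function names, and the choice of the common log-scaled form idf = log(N / df) are all assumptions.

```python
import math
from collections import Counter


def idf_scores(documents):
    """Return {term: IDF} for a list of tokenized documents."""
    n_docs = len(documents)
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    # Assumed log-scaled variant of the inverse document fraction.
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def stop_word_candidates(documents, n=3):
    """Return the n terms with the lowest IDF (ties broken alphabetically)."""
    scores = idf_scores(documents)
    return sorted(scores, key=lambda term: (scores[term], term))[:n]


# Hypothetical toy corpus, already lowercased and tokenized.
docs = [
    "the kiosk offers free wifi".split(),
    "the city expands broadband access".split(),
    "the kiosk replaces the payphone".split(),
]
candidates = stop_word_candidates(docs, n=2)
```

Terms that occur in every document receive the lowest score, so they surface first in the sorted list and can be reviewed as stop word candidates before the LDA model is fitted.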
Hypothesis testing is a statistical inference method used to assess whether the observed data provide enough evidence to reject a given hypothesis. The Z-test is a commonly used statistical test for large samples. It assumes that the data follow a normal distribution.
We use a two-sample Z-score for comparing two means to analyze whether there are statistically significant differences between areas with LinkNYC kiosks and areas without them. We calculate the Z-score with the equation below. We examine whether the LinkNYC launch has impacted city services using the same approach.
Two-sample Z-score for comparing two means equation:
$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$
Where:
x̅1 - x̅2 is the difference between the sample means of 311 Service Requests in areas with and without LinkNYC kiosks;
σ1 and σ2 are the standard deviations of 311 Service Requests in areas with and without LinkNYC kiosks;
n1 and n2 are the number of 311 Service Requests in areas with and without LinkNYC kiosks.
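As a hedged illustration, the standard two-sample Z statistic can be computed from the quantities defined above as in the sketch below. The function name and the input numbers are hypothetical and are not taken from the 311 data.

```python
import math


def two_sample_z(mean1, mean2, sd1, sd2, n1, n2):
    """Two-sample Z statistic for the difference between two means."""
    # Standard error of the difference between the two sample means.
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


# Hypothetical inputs chosen so the arithmetic is easy to follow:
# the standard error is sqrt(16/100 + 9/100) = 0.5.
z = two_sample_z(mean1=12.0, mean2=10.0, sd1=4.0, sd2=3.0, n1=100, n2=100)
```

The sign of the result follows the order of the two groups, so the absolute value is what gets compared against a two-sided critical value.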
We performed several Z-tests on 311 Service Request data at the 95%, 99%, and 99.9% confidence levels. The two-sided critical values associated with these levels are 1.96, 2.576, and 3.29053.
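The critical values quoted above can be reproduced from the standard normal distribution. The sketch below assumes two-sided tests; under that assumption, 3.29053 is the value obtained for a 99.9% level. The helper name is hypothetical.

```python
from statistics import NormalDist


def two_sided_critical_value(confidence):
    """Critical |Z| for a two-sided test at the given confidence level."""
    alpha = 1.0 - confidence
    # Half of alpha sits in each tail of the standard normal distribution.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


critical = {level: two_sided_critical_value(level) for level in (0.95, 0.99, 0.999)}
```

A computed Z-score is then significant at a given level when its absolute value exceeds the corresponding entry in `critical`.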
Service request counts on census blocks with and without LinkNYC kiosks
First, we compared Service Request counts in areas with LinkNYC kiosks and areas without such kiosks.
H0: There is no difference in the number of 311 total Service Requests between census blocks with LinkNYC kiosks installed and census blocks without kiosks.
H1: There is a statistically significant difference in the number of 311 total Service Requests between census blocks with LinkNYC kiosks installed and census blocks without kiosks.
Using the formula above, we calculated Z = 24.44. As the Z-score is much higher than 3.29053, we conclude, at the 99.9% confidence level, that total 311 complaint counts in areas with kiosks are much lower than in areas without kiosks.
Service Request counts before and after LinkNYC
Next, we compared total 311 complaint counts before and after LinkNYC kiosks were built.
H0: There is no difference in the number of 311 total Service Requests before and after LinkNYC kiosks were installed.
H1: There is a statistically significant difference in the number of 311 total Service Requests before and after LinkNYC kiosks were installed.
Using the formula above, we calculated Z = 2.58. As the Z-score is higher than 2.576, we conclude, at the 99% confidence level, that total 311 complaint counts dropped after LinkNYC kiosks were introduced to the area.
Service Requests about sidewalk noise before and after LinkNYC
Next, we compared 311 sidewalk noise complaint counts before and after LinkNYC kiosks were built.
H0: There is no difference in the number of 311 sidewalk noise Service Requests before and after LinkNYC kiosks were installed.
H1: There is a statistically significant difference in the number of 311 sidewalk noise Service Requests before and after LinkNYC kiosks were installed.
Using the formula above, we calculated Z = 3.53. As the Z-score is higher than 3.29053, we conclude, at the 99.9% confidence level, that 311 sidewalk noise complaint counts increased after LinkNYC kiosks were introduced to the area.
Clustering analysis is an unsupervised learning method that groups similar observations into a number of clusters. We used K-means clustering to group the LinkNYC PUMAs described in Section 2.1.3 into clusters using two variables: the median number of EBT-related LinkNYC calls made in the area and the percentage of households. Tables containing descriptive statistics for both variables can be found in the figure below. The clusters are also depicted by color on the corresponding map.
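To make the clustering step concrete, here is a minimal pure-Python sketch of K-means (Lloyd's algorithm) on two features. The data points, the choice of k = 2, and the deterministic initialization from the first k points are assumptions for illustration only; they are not the PUMA data or the settings used in this analysis.

```python
import math


def k_means(points, k, iterations=100):
    """Lloyd's algorithm on 2-D points; returns (labels, centroids)."""
    # Deterministic start for this sketch: the first k points are the centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        # Stop once neither the assignments nor the centroids change.
        if new_labels == labels and new_centroids == centroids:
            break
        labels, centroids = new_labels, new_centroids
    return labels, centroids


# Hypothetical (median EBT calls, household percentage) pairs for six areas.
areas = [(2.0, 5.0), (3.0, 6.0), (20.0, 30.0), (22.0, 28.0), (2.5, 5.0), (21.0, 31.0)]
labels, centroids = k_means(areas, k=2)
```

In practice the two variables are on different scales, so standardizing them before clustering is a common choice; that step is omitted here to keep the sketch short.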
We find that the following areas share the same profile based on our two factors, ordered from lowest to highest means:
Based on time series analysis and an ARMA model, we looked at a possible relationship between the number of LinkNYC kiosks and unemployment rates. We noticed a downward trend in the monthly unemployment rate in New York, as seen in Figure 1 below (time series analysis of unemployment change over time in New York). The blue line shows the change in the unemployment rate over time. The red line is the trend of the data as fitted by the ARMA model. We then observed the opposite trend in the number of LinkNYC kiosk activations.
From Figure 2 below (changes in unemployment and LinkNYC station activation), we can see that the number of new LinkNYC stations increases month over month from late 2015 to late 2018. Although there might be some correlation between these two variables, we could not conclude a causal relationship between them. A number of external factors could explain the correlation, such as the steady growth of the US economy and the predetermined incremental increase in LinkNYC stations in the city.
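Pearson's correlation coefficient:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
Where:
x_i and y_i are the paired observations of the two variables;
x̅ and y̅ are the sample means of the two variables;
n is the number of paired observations.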
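As a sketch of how a Pearson correlation coefficient can be computed for two monthly series, the example below uses the standard definition. The function name and both series are hypothetical placeholders, not the unemployment or kiosk activation figures from this study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical monthly values: a falling rate and a rising kiosk count.
unemployment_rate = [5.2, 5.0, 4.9, 4.7, 4.5, 4.4]
active_kiosks = [100, 250, 400, 600, 850, 1000]
r = pearson_r(unemployment_rate, active_kiosks)
```

A value of r close to -1 for series like these reflects only that one trends down while the other trends up; it says nothing about whether one causes the other.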
In a nutshell, there are positives as well as negatives to working with different kinds of data. Even though these datasets allowed us to understand LinkNYC usage better, many of the analyses had limitations that constrained further work. For instance, demographic data allowed us to see certain correlations but also reminded us that correlation does not imply causation. The higher usage of EBT calls in younger and lower-income neighborhoods does not establish causation, only a correlation to investigate further. This led us to study correlations with food stamps, which showed that areas such as Manhattan, Staten Island and a few other boroughs had high correlations with the percentage of households receiving food stamps. Many people made EBT calls for food stamp enquiries, which helps explain why usage was high in lower-income areas. Even with 311 requests, the types of requests being made are greatly affected by LinkNYC.