Hypothesis

1st Hypothesis:

“A theft will most likely happen in community 23 from 18:00 to 20:00 on Friday in Chicago”

Logistic Regression:

We first tried linear regression on the attributes we chose, but found no linear relationship among them, so we switched to logistic regression.

We created a data frame with the community area, the weekday number, a theft indicator column, and dummy variables for time_section. We then set X to the community area, weekday number, and dummy variables, and y to the theft indicator column. The regression gave a probability of 0.27365 for a theft in community 23 from 18:00 to 20:00 on a Friday, which is low. This method therefore does not support our hypothesis.
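The steps above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes pandas and scikit-learn (the report does not name the library used), and the data, the column names (`community_area`, `weekday`, `theft`), the weekday encoding, and the time-section numbering are all made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the Chicago data (all values are assumptions).
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "community_area": rng.integers(1, 78, size=n),
    "weekday": rng.integers(0, 7, size=n),        # assumed: 0 = Monday ... 4 = Friday
    "time_section": rng.integers(0, 12, size=n),  # assumed: 2-hour bins, 9 = 18:00-20:00
    "theft": rng.integers(0, 2, size=n),          # 1 if the record is a theft
})

# One dummy column per time section; the first is dropped to avoid collinearity.
dummies = pd.get_dummies(df["time_section"], prefix="ts", drop_first=True, dtype=float)
X = pd.concat([df[["community_area", "weekday"]].astype(float), dummies], axis=1)
y = df["theft"]

model = LogisticRegression(max_iter=1000).fit(X, y)

# A single query row: community 23, Friday, 18:00-20:00.
query = pd.DataFrame(0.0, index=[0], columns=X.columns)
query["community_area"] = 23
query["weekday"] = 4
query["ts_9"] = 1.0

# Column 1 of predict_proba is the probability of class 1 (theft).
p_theft = model.predict_proba(query)[0, 1]
print(f"P(theft) = {p_theft:.5f}")
```

The query row reuses the training columns so that the dummy variables line up with the ones the model was fitted on.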

Decision Tree:

A decision tree classifier takes two arrays as input: an array X holding the community area, weekday number, and dummy variables, and an array y holding the class label (theft). We used X and y to build a decision tree model and measured its accuracy on the test data and labels.

The model's accuracy was 0.77227, which we considered reasonably accurate. We then used the model to test the hypothesis: the predicted probability of a theft in community 23 from 18:00 to 20:00 on a Friday is 0.17544, which is low. This method therefore does not support our hypothesis.
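A minimal sketch of this train, test, and query workflow is shown below. It assumes scikit-learn and uses synthetic data; the feature layout, the 70/30 split, the tree depth, and the 25% theft rate are illustrative assumptions, not values from the project. It also prints the accuracy of always predicting the majority class, since an accuracy figure is easier to interpret next to that baseline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumed layout): community area, weekday number, 18:00-20:00 dummy.
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.integers(1, 78, size=n),
    rng.integers(0, 7, size=n),
    rng.integers(0, 2, size=n),
])
y = (rng.random(n) < 0.25).astype(int)  # 1 = theft, assumed rate of 25%

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

# Accuracy on the held-out sample, next to the majority-class baseline.
accuracy = tree.score(X_test, y_test)
baseline = max(y_test.mean(), 1 - y_test.mean())

# Probability of theft for community 23, Friday (assumed code 4), 18:00-20:00.
query = np.array([[23, 4, 1]])
p_theft = tree.predict_proba(query)[0, 1]

print(f"accuracy = {accuracy:.5f}, baseline = {baseline:.5f}, P(theft) = {p_theft:.5f}")
```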

KNN:

We built a KNN model on the same training sample and evaluated it on the same test sample. The accuracy is 0.72615, which we considered reasonably accurate. Using the model, the predicted probability for the hypothesis is 0, so this method does not support our hypothesis.
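A probability of exactly 0 is a natural outcome for KNN. Assuming a scikit-learn style classifier with uniform weights (an assumption, since the report does not name the library or the value of k), the predicted probability is simply the share of the k nearest training points that belong to the class. The hand-made sample below illustrates this with k = 5.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny hand-made sample (assumed layout): community area, weekday, 18:00-20:00 dummy.
X_train = np.array([
    [23, 4, 1],
    [23, 4, 1],
    [23, 3, 1],
    [22, 4, 1],
    [24, 4, 1],
    [40, 1, 0],
    [41, 2, 0],
    [60, 6, 0],
])
y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1])  # 1 = theft

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

query = np.array([[23, 4, 1]])

# Indices of the 5 nearest training rows: all of them are non-theft records.
neighbors = knn.kneighbors(query, return_distance=False)[0]

# With uniform weights, P(theft) = (theft neighbors) / k = 0 / 5.
p_theft = knn.predict_proba(query)[0, 1]
print(sorted(neighbors.tolist()), p_theft)
```

A result of 0 therefore means that none of the k closest training records were thefts, not that a theft is impossible.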

Naive Bayes:

We built a Naive Bayes model on the same training sample and evaluated it on the same test sample. The accuracy is 0.7709, which we considered reasonably accurate. Using the model, the predicted probability for the hypothesis is 0.27375. Since the probability is low, this method does not support our hypothesis.
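As a rough sketch, the same workflow with a Naive Bayes classifier might look like this. The report does not say which Naive Bayes variant was used, so `GaussianNB` from scikit-learn is an assumption, as are the synthetic data and the feature layout.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data (assumed layout): community area, weekday number, 18:00-20:00 dummy.
rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([
    rng.integers(1, 78, size=n),
    rng.integers(0, 7, size=n),
    rng.integers(0, 2, size=n),
]).astype(float)
y = rng.integers(0, 2, size=n)  # 1 = theft

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

nb = GaussianNB().fit(X_train, y_train)
accuracy = nb.score(X_test, y_test)

# predict_proba returns one column per class, in the order given by nb.classes_.
proba = nb.predict_proba(np.array([[23.0, 4.0, 1.0]]))[0]
p_theft = proba[list(nb.classes_).index(1)]

print(f"accuracy = {accuracy:.5f}, P(theft) = {p_theft:.5f}")
```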

Support Vector Machine:

The SVM model takes a long time to produce a result because it computes the distance between each pair of data points. The Chicago crime dataset has more than 220,000 data points, so training is slow. We do not report a result here, but the code is still in the program.
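A quick back-of-the-envelope calculation shows why pairwise computations become expensive at this size. The figure of 220,000 is taken as a lower bound from the text; the 8 bytes per value (a 64-bit float) is an assumption, and real SVM implementations usually cache only part of the pairwise matrix rather than storing all of it.

```python
n_points = 220_000  # lower bound given in the text

# Number of distinct pairs of points: n * (n - 1) / 2.
n_pairs = n_points * (n_points - 1) // 2

# Memory for a full n x n matrix of 64-bit floats, in gigabytes (assumed storage).
full_matrix_gb = n_points ** 2 * 8 / 1e9

print(f"{n_pairs:,} pairs, about {full_matrix_gb:.1f} GB for a full matrix")
```

This gives roughly 24 billion pairs, which is consistent with the long running time observed.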

Random Forest:

We applied this method in the same way as the previous ones. The accuracy of the model is 0.7683, and the predicted probability for the hypothesis is 0.18855. Since the probability is low, this method does not support our hypothesis.
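The sketch below shows the same workflow with a random forest, again assuming scikit-learn and synthetic data; the number of trees and the feature layout are illustrative choices. It also checks that, in this library, the forest's predicted probability is the average of the probabilities predicted by its individual trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (assumed layout): community area, weekday number, 18:00-20:00 dummy.
rng = np.random.default_rng(2)
n = 2000
X = np.column_stack([
    rng.integers(1, 78, size=n),
    rng.integers(0, 7, size=n),
    rng.integers(0, 2, size=n),
])
y = rng.integers(0, 2, size=n)  # 1 = theft

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2
)

forest = RandomForestClassifier(n_estimators=50, random_state=2).fit(X_train, y_train)
accuracy = forest.score(X_test, y_test)

query = np.array([[23, 4, 1]])
p_theft = forest.predict_proba(query)[0, 1]

# Average the per-tree probabilities by hand for comparison.
per_tree = np.mean([t.predict_proba(query)[0, 1] for t in forest.estimators_])

print(f"accuracy = {accuracy:.5f}, P(theft) = {p_theft:.5f}, tree mean = {per_tree:.5f}")
```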

2nd Hypothesis:

“The time that crime most likely happens in Chicago is the same as in Montgomery.”

T-test:

We used the function “stats.ttest_ind()” from the scipy.stats module in Python to test our hypothesis. The inputs are the “time_hour” values for Chicago and the “time_hour” values for Montgomery. The function returns the t-statistic and the p-value. The p-value is 1.599e-08, which is very low, so we reject the null hypothesis that the mean hour is the same in the two cities. This method therefore does not support our hypothesis.
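A minimal sketch of the test is shown below. The two arrays are synthetic stand-ins for the “time_hour” columns, and the significance level of 0.05 is an assumed threshold that the report does not state. The call uses the function's default settings, which assume equal variances; passing `equal_var=False` would run Welch's version instead.

```python
import numpy as np
from scipy import stats

# Synthetic hour-of-day samples (assumptions, not the project's data).
rng = np.random.default_rng(0)
chicago_hours = rng.integers(0, 24, size=5000)      # hours 0..23
montgomery_hours = rng.integers(6, 24, size=5000)   # hours 6..23, shifted later

# Two-sample t-test on the means of the two independent samples.
t_stat, p_value = stats.ttest_ind(chicago_hours, montgomery_hours)

alpha = 0.05  # assumed significance level
reject_equal_means = p_value < alpha

print(f"t = {t_stat:.3f}, p = {p_value:.3e}, reject equal means: {reject_equal_means}")
```

Note that the test compares the mean hour of the two samples, so a low p-value indicates that the average time of day differs between the cities.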

3rd Hypothesis:

“Driving under the influence will most likely happen in Silver Spring from 16:00 to 18:00 on Tuesday in Montgomery”

Logistic Regression:

We created a data frame with cityNum, day_of_week, an indicator column for driving under the influence, and dummy variables for time_section.

We then set X to cityNum, day_of_week, and the dummy variables, and y to the indicator column.

The regression gave a probability of 0.01037 for driving under the influence in Silver Spring from 16:00 to 18:00 on a Tuesday, which is very low. This method therefore does not support our hypothesis.

4th Hypothesis:

“The day of the week that crime most likely happens in Chicago is the same as in Montgomery”

T-test:

We used the function “stats.ttest_ind()” from the scipy.stats module in Python to test our hypothesis. The inputs are the “day_of_week” values for Chicago and the “day_of_week” values for Montgomery.

The function returns the t-statistic and the p-value. The p-value is 8.0979e-192, which is very low, so we reject the null hypothesis that the mean day_of_week value is the same in the two cities. This method therefore does not support our hypothesis.