Due in class on the last day of class (Apr 30, 2024).
Choose one for Project 1 (any choice should have a regression component).
Grading:
1) 20% Description of the problem or dataset, including how you processed the data
2) 20% Analysis, e.g., what did you find interesting or any salient pattern in the data? Please give interpretation.
3) 20% R code for regression analysis (and Python or other scripts used to process the data)
4) 20% plots and visualization
5) 20% writing of the report (correctness, clarity, etc.)
1. Car prices by year in current US market
For your favorite car model (e.g., Ford Taurus), find the average price of cars of this model manufactured in each year from 1997 to 2015 (in the US). You can collect this from websites such as cars.com, truecar.com, etc. When you process the data, try to store other information about each car, such as mileage, condition, etc.; you may need this data for project 1. Or, you can simply use the data in the given example here. Or, you can use a dataset from the UC Irvine Machine Learning Repository to carry out a linear regression and answer similar questions. The following are requirements:
a) A detailed description of how you obtained the data (be cautious about potential bias in data collection).
b) For each year, you need to find the price of at least 100 cars, and then calculate the average
(click here for example on how to extract the car price from a messy text file)
c) Produce a year-price scatter plot (click here for an example)
d) Identify the years during which the drop in prices slows down. If you want to buy a used car, or if you have a new car, when would you buy or sell it? Why?
e) Carry out linear regression on average price vs. year
Produce another scatter plot and add the regression line
Report output of the linear regression
g) Include the data (only the average prices and the years) as part of your submission
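For steps b) and e), the preprocessing could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions: the line format (a four-digit model year followed later by a price such as "$6,000") is hypothetical and may differ from the linked example file, the function names are illustrative, and the sample listings are made up. The small least-squares fit is only a sanity check on the averages; the regression itself should still be carried out and reported in R.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) line format: a 4-digit model year, then later a "$12,345" price.
LINE_RE = re.compile(r"\b(19\d{2}|20\d{2})\b.*?\$([\d,]+)")

def average_price_by_year(lines):
    """Return {year: mean price} over the lines that match the assumed format."""
    prices = defaultdict(list)
    for line in lines:
        m = LINE_RE.search(line)
        if m:  # lines without a year and a price are skipped
            year = int(m.group(1))
            price = float(m.group(2).replace(",", ""))
            prices[year].append(price)
    return {y: sum(p) / len(p) for y, p in sorted(prices.items())}

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

# Made-up listings, for illustration only.
sample = [
    "2010 Ford Taurus SE, 85,000 mi, $6,000",
    "2010 Ford Taurus SEL, 70,000 mi, $8,000",
    "2012 Ford Taurus SE, 60,000 mi, $9,000",
    "2014 Ford Taurus Limited, 40,000 mi, $13,000",
    "no price listed",
]
averages = average_price_by_year(sample)
intercept, slope = fit_line(list(averages), list(averages.values()))
```

With real data, each year should contribute at least 100 prices before averaging, and the resulting year/average pairs are what requirement g) asks you to submit.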
2. Global terrorism data
This is a dataset from kaggle.com, which consists of more than 150,000 terrorist attacks during 1970-2015. Here is the link for the data.
a) A detailed description of the dataset.
Define major attacks as those involving more than 10 casualties, small attacks as those involving 3-10 casualties, and minor attacks otherwise. For each of minor, small, and major attacks, complete b)-d).
b) Produce a scatter plot of year vs. number of attacks for major attacks and minor attacks, respectively.
c) Tell whether there were years when the trend of #attacks vs. year changed
d) Carry out linear regression on #attacks vs. year
Produce another scatter plot and add the regression line
Report output of the linear regression.
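The grouping into minor, small, and major attacks and the per-year counts needed for b)-d) could be prepared along these lines. This is a minimal sketch that assumes each attack has already been reduced to a (year, casualties) pair; how casualties are computed from the dataset's columns is left to you and should be documented in your report. The function names and the sample records are hypothetical.

```python
from collections import Counter

def classify(casualties):
    """Apply the thresholds above: more than 10 is major, 3-10 is small, otherwise minor."""
    if casualties > 10:
        return "major"
    if casualties >= 3:
        return "small"
    return "minor"

def attacks_per_year(records):
    """Count attacks per year within each category; records are (year, casualties) pairs."""
    counts = {"minor": Counter(), "small": Counter(), "major": Counter()}
    for year, casualties in records:
        counts[classify(casualties)][year] += 1
    return counts

# Made-up records, for illustration only.
records = [(1970, 0), (1970, 12), (1970, 5), (1971, 2), (1971, 30), (1971, 11)]
counts = attacks_per_year(records)
```

Each entry of `counts` maps year to number of attacks, which can be written out and loaded into R for the scatter plots and regressions.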
3. Coronavirus Datasets
The recent outbreak of Coronavirus in Wuhan, China has attracted wide attention. Data have been collected by multiple sources. It would be interesting to analyze the data, discover patterns of spread, and gain insights. Also, as the virus is still spreading at a large scale and at an unknown rate, it would be highly valuable to give a reasonable estimate of the expected casualties. A wide collection of datasets can be found on reddit.com.