For your favorite car model (e.g., Ford Taurus), find the average price of cars of that model manufactured in each year from 1997 to 2015 (in the US). You can collect this from websites such as cars.com or truecar.com. When you process the data, try to store other information about each car, such as mileage and condition; you may need this data for project 1. Alternatively, you can simply use the data in the given example here, or you can use a dataset from the UC Irvine Machine Learning Repository to carry out a linear regression and answer similar questions. The following are the requirements:
0) The project must be on one of the following topics: 1) Data visualization; 2) Linear regression; 3) Clustering
1) The presentation will be held on the date and time scheduled by UMassD for the DSC101 final exam, and the final report is due at 11:59pm EST on the day of the presentation.
2) The project consists of a presentation and a final report, each worth 50 points.
3) To provide max flexibility, there are three options for the course project:
a) Simply choose one of the labs on a relevant topic (visualization, clustering or linear regression) and resubmit it, with an additional presentation.
20% of the grade for the project report will be automatically deducted under this option.
b) Choose one of the labs on a relevant topic (visualization, clustering or linear regression) and redo it with new datasets, with an additional presentation;
c) A new project (the following are example projects) with report and presentation.
The following are example projects, which you may use, or you may prefer to choose your own data. See also here for more example projects from previous years, here for project ideas or here for example data.
The project report should consist of the following. A description of the data set (please write in your own words), including where the data set is from, what the data is for, what the features and response variable are, the number of instances, the number of features etc. Then provide a description of the method or analysis, including how you preprocess and analyze (or visualize) the data, what the findings are, and a possible interpretation of the findings. The report should also have a section for conclusions or discussion of your analysis or findings. Please also cite any references you use in the report. The report is expected to be 8-12 A4 pages with font size 12, single spacing and 1-inch margins; a report that is too long or too short will incur a penalty to your project grade.
1. Car prices by year in the current US market
a) A detailed description of how you obtained the data (be cautious about potential bias in data collection).
b) For each year, you need to find the prices of at least 100 cars, and then calculate the average
(click here for example on how to extract the car price from a messy text file)
c) Produce a year-price scatter plot (click here for an example)
d) Identify the years during which the drop in prices slows down. If you want to buy a used car, or if you have a new car, when would you buy or sell it? Why?
e) Carry out linear regression on average price vs. year
Produce another scatter plot and add the regression line
Report output of the linear regression
g) Include the data (only the average prices and the years) as part of your submission
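Steps b) and e) above could be sketched as follows. This is only a minimal illustration in plain Python, not the linked example: the listing lines are made-up placeholder values, and the names `parse_listing`, `average_price_by_year` and `fit_line` are hypothetical helpers. It assumes each listing line starts with the model year and contains a price written with a dollar sign.

```python
import re
from collections import defaultdict

# Made-up placeholder listings (NOT real data); a real project needs
# at least 100 listings per model year.
SAMPLE_LISTINGS = [
    "2012 Ford Taurus SEL - 85,300 mi - $9,500 - good condition",
    "2012 Ford Taurus Limited, 60,120 miles, asking $11,500",
    "2014 Ford Taurus SE | 40,000 mi | $14,000",
    "2014 Ford Taurus SHO | 35,500 mi | $16,000",
    "2010 Ford Taurus SE, price: $7,000, 110,000 mi",
    "2010 Ford Taurus SEL, price: $8,000, 98,000 mi",
]

YEAR_RE = re.compile(r"\b(19\d{2}|20\d{2})\b")  # first 4-digit year in the line
PRICE_RE = re.compile(r"\$\s*(\d[\d,]*)")       # digits and commas after a "$"


def parse_listing(line):
    """Return (year, price) for one listing line, or None if either is missing."""
    year = YEAR_RE.search(line)
    price = PRICE_RE.search(line)
    if not year or not price:
        return None
    return int(year.group(1)), float(price.group(1).replace(",", ""))


def average_price_by_year(lines):
    """Group parsed prices by model year and return {year: average price}."""
    prices = defaultdict(list)
    for line in lines:
        parsed = parse_listing(line)
        if parsed:
            year, price = parsed
            prices[year].append(price)
    return {year: sum(p) / len(p) for year, p in sorted(prices.items())}


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


averages = average_price_by_year(SAMPLE_LISTINGS)
years = list(averages)
slope, intercept = fit_line(years, [averages[y] for y in years])
print(averages)
print(f"average price changes by about {slope:.0f} USD per model year")
```

The `averages` dictionary is the year and average-price table requested in g), and the fitted slope and intercept define the regression line to draw over the scatter plot.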
2. Global terrorism data
This is a dataset from kaggle.com, which consists of more than 150,000 terrorist attacks during 1970-2015. Here is the link for the data.
a) A detailed description of the dataset.
Define major attacks as those involving more than 10 casualties, small attacks as those involving 3-10 casualties, and minor attacks otherwise. For each of minor, small, and major attacks, complete b)-d).
b) Produce a scatter plot of year vs. number of attacks for major attacks and minor attacks, respectively.
c) Identify any years in which the trend of #attacks vs. year changes
d) Carry out linear regression on #attacks vs. year
Produce another scatter plot and add the regression line
Report output of the linear regression.
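The casualty thresholds defined above can be turned into per-year attack counts with a short sketch like the one below. The records are made-up placeholder values rather than rows from the Kaggle file, the helper names `classify_attack` and `attacks_per_year` are hypothetical, and how a casualty count is derived from the dataset's columns is left as an assumption for you to decide and document.

```python
from collections import Counter


def classify_attack(casualties):
    """Apply the thresholds above: >10 major, 3-10 small, otherwise minor."""
    if casualties > 10:
        return "major"
    if casualties >= 3:
        return "small"
    return "minor"


# Made-up placeholder (year, casualties) records, NOT real data.
SAMPLE_ATTACKS = [
    (1970, 0), (1970, 12), (1970, 4),
    (1971, 2), (1971, 25), (1971, 11),
    (1972, 3), (1972, 10),
]


def attacks_per_year(records):
    """Return {category: Counter({year: number of attacks})}."""
    counts = {"minor": Counter(), "small": Counter(), "major": Counter()}
    for year, casualties in records:
        counts[classify_attack(casualties)][year] += 1
    return counts


counts = attacks_per_year(SAMPLE_ATTACKS)
for category, by_year in counts.items():
    print(category, sorted(by_year.items()))
```

Each category's sorted (year, count) pairs are the points for the scatter plot in b) and the input to the regression in d).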
3. Earthquake data
This is a dataset from kaggle.com, which lists the date, time, and location of all earthquakes with a magnitude of 5.5 or higher.