Due 11:59 pm Thur Dec 15
In this project stage you will combine the two tables (and add any other table/data that you want), then do an analysis. I discuss these two steps below.
Combining the Two Tables
1) You will start by creating the schema of a table E, which is the target table into which you will merge the two tables A and B. If tables A and B have identical schemas, then table E has that same schema. Otherwise, the schema of table E will be the union of the schemas of tables A and B.
2) Next, you will write a Python script to combine the tuples of tables A and B to create the tuples for table E. Note that we have discussed such a step several times in class. We will discuss it in more detail in an upcoming lecture.
3) Finally, you will execute this Python script to populate table E.
Note: In this step, if you want to add more data, such as combining a table D with tables A and B to create table E, that is fine too.
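The three steps above can be sketched roughly as follows. This is only an illustration, not the required implementation. It assumes each table is a list of dicts, that matching tuples are given as (index in A, index in B) pairs, that unmatched tuples are carried over into E, and that the merging function always prefers the value from Table A unless it is missing. The attribute names and helper names are hypothetical.

```python
import csv
import io

def union_schema(schema_a, schema_b):
    """Attributes of A, followed by attributes that appear only in B."""
    return list(schema_a) + [attr for attr in schema_b if attr not in schema_a]

def is_missing(value):
    """Treat None and blank strings as missing values."""
    return value is None or str(value).strip() == ""

def merge_pair(tuple_a, tuple_b, schema_e):
    """Merge two tuples: take A's value unless it is missing, then fall back to B's."""
    merged = {}
    for attr in schema_e:
        value_a, value_b = tuple_a.get(attr), tuple_b.get(attr)
        if not is_missing(value_a):
            merged[attr] = value_a
        else:
            merged[attr] = "" if is_missing(value_b) else value_b
    return merged

def combine_tables(table_a, table_b, matches, schema_e):
    """Build table E from matched pairs plus (by assumption) all unmatched tuples."""
    matched_a = {i for i, _ in matches}
    matched_b = {j for _, j in matches}
    table_e = [merge_pair(table_a[i], table_b[j], schema_e) for i, j in matches]
    table_e += [merge_pair(t, {}, schema_e)
                for i, t in enumerate(table_a) if i not in matched_a]
    table_e += [merge_pair(t, {}, schema_e)
                for j, t in enumerate(table_b) if j not in matched_b]
    return table_e

def to_csv_text(table_e, schema_e):
    """Serialize table E as CSV text (write this string to a file to submit it)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=schema_e)
    writer.writeheader()
    writer.writerows(table_e)
    return buffer.getvalue()

# Hypothetical toy data: B has an extra attribute "city".
table_a = [{"name": "Ann Lee", "age": "34"}, {"name": "Bob Ray", "age": ""}]
table_b = [{"name": "Robert Ray", "age": "41", "city": "Madison"},
           {"name": "Cy Dole", "age": "29", "city": "Austin"}]
matches = [(1, 0)]  # assumed output of an earlier matching step

schema_e = union_schema(["name", "age"], ["name", "age", "city"])
table_e = combine_tables(table_a, table_b, matches, schema_e)
print(to_csv_text(table_e, schema_e))
```

A different merging function (for example, keeping the longer of two name strings) can be swapped into merge_pair on a per-attribute basis. Whichever rules you choose are what the report should describe.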
Performing an Analysis on the Combined Table
You are now ready to perform an analysis on the combined table. The analysis is of your own choosing, but it must involve one of the key techniques that we will cover in class: classification, clustering, correlation discovery, anomaly detection, or OLAP-style exploration. I will discuss this further in class.
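As one possible starting point, here is a minimal sketch of correlation discovery between two numeric attributes of the combined table. It assumes table E is a list of dicts with string values (as read from a CSV file). The attribute names "age" and "income" are hypothetical, and tuples with missing or non-numeric values are simply skipped.

```python
import math

def numeric_pairs(table, attr_x, attr_y):
    """Collect (x, y) values from rows where both attributes parse as numbers."""
    xs, ys = [], []
    for row in table:
        try:
            x = float(row.get(attr_x, ""))
            y = float(row.get(attr_y, ""))
        except (TypeError, ValueError):
            continue  # skip tuples with missing or non-numeric values
        xs.append(x)
        ys.append(y)
    return xs, ys

def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if it is undefined."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sample of table E; the third tuple has a missing age.
sample_e = [
    {"age": "34", "income": "52000"},
    {"age": "41", "income": "61000"},
    {"age": "", "income": "48000"},
    {"age": "29", "income": "45000"},
]

ages, incomes = numeric_pairs(sample_e, "age", "income")
r = pearson(ages, incomes)
print(f"tuples used: {len(ages)}, correlation(age, income) = {r:.3f}")
```

Skipping incomplete tuples is only one way to handle missing data. If many tuples are dropped this way, that is worth noting as a limitation of the analysis.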
What to submit
Submit the following on your group's website:
A CSV file storing Table E.
A PDF file that discusses the following issues:
How did you combine the two tables A and B to obtain E? Did you add any other table? Did you run into any issues during the combination? Discuss the combination process in detail, including the merging functions you used when merging tuples. (Example: to merge two age values, always select the age value from the tuple in Table A, unless that value is missing, in which case select the value from the tuple in Table B.)
Statistics on Table E: what is the schema of Table E, and how many tuples does it contain? Give at least four sample tuples from Table E.
What was the data analysis task that you wanted to do? (Example: we wanted to know if we can use the rest of the attributes to accurately predict the value of the attribute loan_repaid.) For that task, describe in detail the data analysis process that you went through.
Give any accuracy numbers that you have obtained (such as precision and recall for your classification scheme).
What did you learn/conclude from your data analysis? Were there any problems with the analysis process and with the data?
If you had more time, what would you propose to do next?
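Some of the numbers requested above can be computed with a short script. The sketch below is one way to do it, under these assumptions: Table E is available as CSV text with a header row, and the analysis is a binary classification whose true and predicted labels are lists of 0/1 values. The data and the function names are made up for illustration. The loan_repaid attribute is borrowed from the example above.

```python
import csv
import io

def table_statistics(csv_text, n_samples=4):
    """Return the schema, the tuple count, and the first few sample tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {
        "schema": reader.fieldnames,
        "num_tuples": len(rows),
        "samples": rows[:n_samples],
    }

def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical contents of the Table E CSV file.
csv_text = (
    "name,age,loan_repaid\n"
    "Ann Lee,34,1\n"
    "Bob Ray,41,0\n"
    "Cy Dole,29,1\n"
    "Di Fox,52,1\n"
    "Ed Gil,38,0\n"
)

stats = table_statistics(csv_text)
print("schema:", stats["schema"])
print("number of tuples:", stats["num_tuples"])
for sample in stats["samples"]:
    print(sample)

# Hypothetical true labels and classifier predictions for loan_repaid.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In a real report, the predictions should come from tuples that were held out from training, so that the accuracy numbers reflect performance on unseen data.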