Syllabus

Location and Time - CSI 257, 11:20-12:35 TR

Professor - Dr. Mark Lewis, Office: CSI 270H, Phone: 999-7022, e-mail: mlewis@trinity.edu. The best way to reach me typically is by e-mail. I check it frequently and try to respond promptly.

Office Hours - See my T-mail calendar or by appointment. I'm in my office a lot, so you should feel free to drop by. If you are coming from lower campus, you can always call or write a short e-mail to see if I'm in and available at that time. I can also do some virtual office hours using Google+ and Hangouts.

Text - "Think Like a Data Scientist" by Brian Godsey. I also plan to post videos students can use for review/learning purposes. Here are some other books that you might consider buying if you are interested.

Course Description - This course is intended to be an introduction to data science, focusing largely on the computational aspects of the field. We will specifically learn about the Spark framework for big data analytics. Apache Spark is a distributed data framework developed to be used on large clusters to look at data sets that cannot be processed on single machines. We will also talk about the nature of data science, and introduce enough stats to allow you to perform the required analytics work. The course itself will be very hands-on, and the workload will largely focus on writing code to perform analyses of various data sets.

Work/Time Expectations - You signed up for an upper division Lewis course the first time I have ever taught it. I own you. You are the guinea pigs for this course.

Coding Practices - A lot of what you do for this course will be writing code that you turn in to me. Code should be well formatted and reasonably documented. Code written outside of class that you turn in to me should be of your own construction. All code you turn in is pledged.

Grades - The grade for this course will be composed of five components. These components and what they entail are discussed below. This table summarizes how each component contributes to your grade in the course. All items turned in for a grade in this course are to be pledged. For code, the pledge statement should be put in a comment at the top of the code.

Data Sets - At the beginning of the semester I am going to ask you to go out and find three data sets that you find interesting. You will write a short paper where you describe each of these data sets, why you find them interesting, and what questions you would want to answer using them. At the end of the semester, we will revisit this, and you will do an analysis of what you wrote originally, taking into account what you learned during the semester.

In-class Problems - Roughly every week there is a day when you will be working in small groups answering questions on data sets using the techniques that we have been talking about. For each of these you will have to submit to me your answers to the questions and the code that you used to get those answers. These will be graded on a scale of 0-2. You get 1 point if you are there and work, but don't get most of the answers. You get a 2 if your group gets most of the answers. You can get back 1 point by adding correct answers with code to your write-up for the between-class problems.

Between-class Problems - Roughly every week you will also have a more significant problem to work on that you will do on your own outside of class. The solution and the code that solves it will be due to me before the next class day. Late submissions lose 25% if they come in one day late and 50% if they come in on the second day, and they aren't accepted after that.
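
As a rough illustration of the late-penalty arithmetic for between-class problems, here is a minimal Python sketch. It assumes a "day" means a 24-hour period after the deadline and that any part of a day counts as a full day; the syllabus does not spell that out. The function name late_adjusted_score and its parameters are hypothetical and used only for this example.

```python
import math
from typing import Optional


def late_adjusted_score(raw_score: float, hours_late: float) -> Optional[float]:
    """Apply the late penalty to a raw score.

    Returns the adjusted score, or None if the submission is too late
    to be accepted. Assumes (not stated in the syllabus) that lateness
    is counted in 24-hour days, rounded up.
    """
    if hours_late <= 0:
        return raw_score  # on time: no penalty

    days_late = math.ceil(hours_late / 24)
    if days_late == 1:
        return raw_score * 0.75  # 25% off for the first day
    if days_late == 2:
        return raw_score * 0.50  # 50% off for the second day
    return None  # not accepted after the second day


if __name__ == "__main__":
    for hours in (0, 5, 30, 49):
        print(hours, late_adjusted_score(80, hours))
```

Under those assumptions, a submission that would have earned 80 points is worth 60 if it is five hours late, 40 if it is thirty hours late, and is not accepted at forty-nine hours late.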

Quizzes - There will also be six quizzes given during the course of the semester. These quizzes will cover readings about data science as well as aspects of Spark that we have talked about.

Final Project - At the end of the semester each of you will do a larger project that you will present to the class during the normally scheduled final time.