Syllabus

Location and Time - CSI 388, 11:20-12:35 TR

Professor - Dr. Mark Lewis, Office: CSI 270H, Phone: 999-7022, e-mail: mlewis@trinity.edu. The best way to reach me typically is by e-mail. I check it frequently and try to respond promptly.

Office Hours - See my T-mail calendar or by appointment. I'm in my office a lot, so you should feel free to drop by. If you are coming from lower campus, you can always call or write a short e-mail to see if I'm in and available at that time. I can also do some virtual office hours using Hangouts.

Text - "Introduction to Machine Learning" by Ethem Alpaydin. I have also posted videos students can use for review/learning purposes. Here are some other books that you might consider buying if you are interested:

Course Description - This course is intended to be an introduction to big data processing and machine learning. We will specifically learn about the Spark framework for big data analytics. Apache Spark is a distributed data-processing framework designed to run on large clusters and analyze data sets that are too large to process on a single machine. We will also cover the basics of machine learning, including the mathematical theory behind it, and you will write code to implement many of these algorithms. The course itself will be very hands-on, and the workload will largely focus on writing code to perform analyses of various data sets as well as writing your own basic ML library.

Work/Time Expectations - You signed up for an upper-division Lewis course. I have now taught this course a few times, but I'm still trying to find a format and workload that I think optimizes your learning. You have been warned.

Coding Practices - A lot of what you do for this course will be writing code that you turn in to me. Code should be well formatted and reasonably documented. Code you turn in to me should be of your own construction. All code you turn in is pledged.

Grades - The grade for this course will be composed of five components. These components and what they entail are discussed below. The Modified Grading Scheme at the end of this syllabus summarizes how each component contributes to your grade in the course. All items turned in for a grade in this course are to be pledged. For code, the pledge statement should be put in a comment at the top of the code.

Data Sets - At the beginning of the semester I am going to ask you to go out and find three data sets that you find interesting. You will write a short paper where you describe each of these data sets, why you find them interesting, and what questions you would want to answer using them. At the end of the semester, we will revisit this and you will do an analysis of what you wrote originally, taking into account what you learned during the semester.

Coding Problems - Roughly every 1-2 weeks you will also have a problem set to work on that you will do on your own outside of class. Your work on these will include writing Spark code to do the analytics and then doing a write-up that gives your answers to the questions. All of the above will appear in a GitHub repository.

Custom ML Code - At regular intervals, you will turn in implementations of your own machine learning algorithms. These will be in a separate GitHub repository. They will be auto-graded by unit tests.
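
As a rough illustration of what a unit-tested submission might look like, here is a minimal Python sketch of a least-squares line fit with two tests. This is not an actual assignment or the actual test suite: the function name `fit_line`, the test class, and the test values are all hypothetical, and the language and test framework used in the course may differ.

```python
import unittest


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Hypothetical example for illustration; not an actual course assignment.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of cross deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


class FitLineTest(unittest.TestCase):
    """Hypothetical tests in the style an auto-grader might use."""

    def test_recovers_exact_line(self):
        # Points generated from y = 2x + 1 with no noise.
        slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
        self.assertAlmostEqual(slope, 2.0)
        self.assertAlmostEqual(intercept, 1.0)

    def test_rejects_constant_x(self):
        # A vertical set of points has no well-defined slope.
        with self.assertRaises(ValueError):
            fit_line([1.0, 1.0], [2.0, 3.0])
```

Saved as a file, tests like these could be run with `python -m unittest <filename>`. The point of the sketch is only that each algorithm exposes a small, well-defined function whose output can be checked automatically.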

Final Project - At the end of the semester each of you will do a larger project that you will present to the class during the normally scheduled final time. Most of the time this is done using one of the data sets that you identified at the beginning of the semester. The project must involve not only a large data set but also several of the machine learning algorithms that we have discussed and worked with along with several plots that illustrate your conclusions.

Modified Grading Scheme

    • Points (Minimum values. I might make some of the Custom ML assignments worth more.)

      • Data Sets - 100

      • Data Assignments (9) - 100 each

      • Custom ML (5) - 100 each

      • Final Project - 400

    • Grades

      • A : 1650

      • A- : 1580

      • B+ : 1520

      • B : 1450

      • B- : 1380

      • C+ : 1320

      • C : 1250

      • C- : 1180

      • D+ : 1100

      • D : 1000
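
The cutoffs above can be read as a simple lookup from a point total to a letter grade. The following Python sketch assumes that each number is the minimum total needed for that grade and that any total below the D cutoff is an F. Neither assumption is stated explicitly above, and the names `CUTOFFS` and `letter_grade` are made up for this illustration.

```python
# Minimum point totals copied from the Grades list above, highest first.
CUTOFFS = [
    ("A", 1650), ("A-", 1580), ("B+", 1520), ("B", 1450), ("B-", 1380),
    ("C+", 1320), ("C", 1250), ("C-", 1180), ("D+", 1100), ("D", 1000),
]


def letter_grade(total_points):
    """Return the first letter grade whose cutoff the total meets."""
    for letter, minimum in CUTOFFS:
        if total_points >= minimum:
            return letter
    return "F"  # Assumption: the list above does not say what falls below D.
```

For example, under these assumptions a total of 1500 points falls between the B cutoff (1450) and the B+ cutoff (1520), so `letter_grade(1500)` returns `"B"`.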