CS620: Introduction to Data Science and Analytics
Fall 2023
Course Overview
Welcome to CS620/DASC600 Data Science.
Note that the course content will be delivered via https://canvas.odu.edu/ for all registered students. The recitation sections will be delivered each Tuesday by Dr. Yi He. We will use the recitation time to discuss the weekly course content, assignments, class activities, and the project work.
Data science is an interdisciplinary blend of the analytical, computational, and statistical skills necessary to extract knowledge from large and complex sets of data. The proliferation of such data has led to an acute shortage of students with data science skills in the local, national, and global economies.
This course will introduce students to this rapidly growing field of Data Science and equip them with some of its basic principles and tools as well as its general mindset. Students will learn concepts, techniques, and tools they need to deal with various facets of data science practices. Cross-listed with DASC 600.
Course Objectives
Students completing this course should be able to:
Define and explain the key concepts and models relevant to data science.
Understand the processes of data science: identifying the problem to be solved, data collection, preparation, modeling, evaluation and visualization.
Develop an appreciation of the many techniques for data modeling.
Be comfortable using commercial and open-source tools, such as Python and its associated libraries, for data analytics and visualization.
Basic Information
Instructor: Yi He
Office: E&CS 3108
Email: yihe@cs.odu.edu
Office Hour: Tuesday, 10 AM – 11 AM, or by appointment
Classroom: DRGS 1117
Meeting Time: 11 AM – 12:15 PM Tuesday
Reading & Project Write-ups: 11 AM – 12:15 PM Thursday (No in-person meetings)
Grading
Final course grades are based on the overall average. The grade windows for the class as a whole (not for individual students) may be widened if the instructor finds it appropriate. The final score in % will be rounded to the nearest whole number. Plus (+) or minus (–) grades may be assigned at the instructor's discretion.
A: 94-100, A-: 90-93, B+: 87-89, B: 84-86, B-: 80-83, C+: 78-79, C: 74-77, C-: 70-73, Fail (Grade F): 0-69
Grading correction: Requests to correct an assignment or exam grade should be sent to the instructor within 1 week of receiving the grade, or before the end of the semester, whichever comes first. After that, your grade will not be adjusted. If you find a mistake in grading, please let the instructor know. Your grade will not be lowered as a result of a correction request.
There is no separate grading scale for PhD students, but PhD students will typically be held to a higher standard.
The scores you receive on the various graded tasks in the class will be weighted as follows:
Homework Assignments (5): 25%
We will have five homework assignments, in total worth 25% of your overall grade.
In-class Activities + Discussion Forum Interactions: 15%
Class activities and participation in the discussion are both important to your success in the course. As one measure of your participation and course preparation, we will have class activities related to lecture topics to supplement the learning.
Final Exam: 20%
The final examination will be a comprehensive (covering all the modules), closed-book exam and will be scheduled during the last week of the class. In the week before the final exam, I will post a study guide to help students prepare for the written examination. You may bring one standard 8.5" by 11" piece of paper with any notes you deem appropriate or significant (front and back) to the final exam.
Data Project: 40%
The data project is an opportunity to tackle a more challenging data science activity. Details, requirements, and submission information will be on the project section of the course web page. For the project, you will work individually or in a team of 2-3 students on a problem of your choosing that is interesting, significant, and relevant to data science. The ultimate goal of your course project is to tackle an interesting real-world problem. All members of a group will receive the same grade on group work. Therefore, it is in your interest to choose group members (ideally in the first week of the class) who have the same goals in the class as you do. It is also in your interest to work together and ensure that all tasks are completed effectively. Your scores on group work may be adjusted based on your contribution. The goal of your data project is to apply the techniques learned in each week of the class to your dataset (exploration, wrangling, machine learning, visualization). We are going to use Google Colab (Colaboratory) (https://colab.research.google.com/), a free Jupyter notebook environment that requires no setup and runs entirely in the cloud. With Colaboratory you can write and execute code, save and share your analyses, and access powerful computing resources, all for free from your browser.
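As a rough illustration of how the weights and letter-grade windows in this Grading section combine, here is a minimal Python sketch. The component names, the sample scores, and the round-half-up rule are assumptions made for the example (the syllabus only says the final score is rounded to the nearest whole number), and the instructor may still adjust the windows.

```python
import math

# Weights from the Grading section, in percent of the overall grade.
WEIGHTS = {
    "homework": 25,
    "activities": 15,
    "final_exam": 20,
    "project": 40,
}

# Lower bounds of the letter-grade windows listed in the syllabus.
LETTER_CUTOFFS = [
    (94, "A"), (90, "A-"), (87, "B+"), (84, "B"), (80, "B-"),
    (78, "C+"), (74, "C"), (70, "C-"), (0, "F"),
]


def overall_score(component_scores):
    """Weighted average of component percentages, rounded to a whole number."""
    total = sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS) / 100
    # Round half up: an assumption, since the syllabus only says "nearest".
    return math.floor(total + 0.5)


def letter_grade(score):
    """Map a whole-number score to the first window whose lower bound it meets."""
    for cutoff, letter in LETTER_CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"


# Hypothetical component scores (each out of 100), for illustration only.
example = {"homework": 92, "activities": 100, "final_exam": 85, "project": 90}
score = overall_score(example)
print(score, letter_grade(score))
```

With these made-up scores, the weighted sum is 23 + 15 + 17 + 36 = 91, which falls in the A- window.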
Textbook
No textbook is required in this course in general. Below are some recommended materials that could consolidate your background knowledge, so as to facilitate your understanding of what shall be covered in this course.
Top 10 Algorithms in Data Mining, by Xindong Wu, CRC Press, 2009
A Hands-On Introduction to Data Science, by Chirag Shah, Cambridge University Press, April 2, 2020
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, by William McKinney, O'Reilly, 2nd edition (October 20, 2017)
Data Science from Scratch: First Principles with Python, by Joel Grus, O'Reilly, 1st edition, 2015
Communication
Piazza: All questions will be fielded through Piazza. The primary benefit is that for many questions everyone can see the answer and other students can answer as well. I will endorse good student responses. Additionally, I expect you to actively participate in online discussions on Piazza. You can post public messages, or private messages that can only be seen by the instructor. You will be signed up with your ODU email, but you may switch to another email.
Canvas: Canvas will be used primarily for weekly course module content and grade dissemination.
Email: Email should only be used in rare instances. I will probably point you back to Piazza if your question is related to course materials or relevant to other students in the class.
If you send me an email (for an urgent matter such as a health issue), please be sure to include your name and the course number in the body of the email. You should also use an appropriate subject line, such as “CS620-Health”. Failure to follow these guidelines may result in a delayed response.
Course Schedule (Tentative)
Please check the course website periodically for updates. Substantial changes will also be announced via email.
Homework Assignments
The homework assignments are to be done individually. There will be five homework assignments.
Homework 1: Due Sunday, Sep. 9, 11:55 pm - submit your web URL to Piazza and your answer to Part 2 to Canvas.
Homework 2: Due Sunday, Sep. 23, 11:55 pm - update and commit your .py file through GitHub Classroom (find more information in PLE/Piazza).
Homework 3: Due Sunday, Oct. 21, 11:55 pm - submit your LastName-hw3.py file to Canvas.
Homework 4: Due Sunday, Nov. 18, 11:55 pm - submit your LastName-hw4.pdf file to Canvas.
Homework 5: Due Sunday, Dec. 9, 11:55 pm - submit your LastName-hw5.pdf file to Canvas.
Data Project
Milestone Due Dates:
Abstract (5 pts): Sunday, Sep. 12, on Colab; submit your URL to Piazza
Progress Checks I and II (10 pts): Oct. 10, Nov. 14
Final Report and YouTube Presentation/Demo (30 pts): Dec. 12, on Colab and YouTube
Introduction
The data project is an opportunity to tackle a more challenging data science activity. For the project, you will work individually or in a team of 2-3 students on a problem of your choosing that is interesting, significant, and relevant to data science. The more members you have (2 or 3), the higher my expectations will be compared to an individual project, so choose carefully. The ultimate goal of your course project is to tackle an interesting real-world problem. All members of a group will receive the same grade on group work. Therefore, it is in your interest to choose group members (ideally in the first week of the class) who have the same goals in the class as you do. It is also in your interest to work together and ensure that all tasks are completed effectively. Your scores on group work may be adjusted based on your contribution. In practice, you will apply the techniques learned in each week of the class to your dataset (exploration, wrangling, machine learning, visualization). You can utilize any resources for this project, but I highly recommend using Google Colab (Colaboratory) (https://colab.research.google.com/), a free Jupyter notebook environment that requires no setup and runs entirely in the cloud. With Colaboratory you can write and execute code, save and share your analyses, and access powerful computing resources, all for free from your browser.
The assignment is flexible: choose a topic of interest to you and your group and carry out a cohesive, complete project based around it. The range of possible topics is broad. However, the project you pick should incorporate a dataset and a wide range of data science techniques. I think the most interesting problems will be ones in which you identify and work with a "client" to develop a solution to one of their problems. Such clients can include organizations with which you are involved, your work site, etc. You can propose to carry out a project as part of a larger effort. However, you will then need to be able to separate the contribution made by this class's project from the rest.
You will need to prepare a written project abstract in Google Colab and get it approved by the instructor, submit project progress reports, prepare a written final report, and give a demo. Peer assessment: individual students' project grades will be influenced by their teamwork as evaluated by their project group members. This will be applied as an overall weight to the term project grade.
Project Abstract
The abstract (in Google Colab) should include the following information:
○ Each member's name, email, and web portfolio link in the very first lines of the Colab.
○ Data Source (if any)
URL,
a short description
first few records of the dataset (head())
○ Your end goal with this dataset (build a recommender system, prediction model/classifier, evaluation of models, visualizing something, infer something, or something else)
○ Any secondary datasets you are planning to utilize to augment your primary dataset (should be clearly specified that this is a secondary dataset)
○ Project Plan/ Gantt Chart. Team member contribution plan (if a team project)
● You need to have an acceptable abstract submitted by the deadline. Without an abstract, you will receive zero for your project grade.
● Submit your Colab link to Piazza thread.
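For the "first few records of the dataset (head())" item above, a Colab cell could look something like the following sketch. It assumes the pandas library (available by default in Colab); the inline CSV text and its column names are made-up placeholders standing in for your real data source, which you would normally load by passing its URL or file path to `pd.read_csv`.

```python
import io

import pandas as pd

# Made-up placeholder data, for illustration only. In a real abstract you
# would instead call pd.read_csv() with the URL or path of your data source.
SAMPLE_CSV = io.StringIO(
    "record_id,category,value\n"
    "1,a,10\n"
    "2,b,20\n"
    "3,a,30\n"
    "4,c,40\n"
    "5,b,50\n"
    "6,a,60\n"
    "7,c,70\n"
)

df = pd.read_csv(SAMPLE_CSV)

# head() returns the first five rows by default.
preview = df.head()
print(preview)

# Reporting the overall shape (rows, columns) alongside the preview gives
# readers a sense of the dataset's size.
print(df.shape)
```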
Project Progress Checks I and II (continue your report at Colab)
In these progress checks, you should assess the progress you are making on your project and update the work plan as necessary. Continue your earlier Colab document, documenting your progress on the project. Start with your proposal or previous progress report (if any) and add the following content to your progress report.
The progress report should cover:
Design decisions,
The target audience for the project (who the users are)
What tools/technologies will you use and why?
Data Models - optional
System architecture and implementation details (if any) - optional
What is the overall design? What are the core features?
How do the requirements/design satisfy the needs of your users?
Screenshots of the user interface or a prototype (if any).
Code snippets or, for a Colab project, complete code segments.
Current Status
What is the current status of the implementation?
What is left to do?
Complete the report assuming that ultimately this will be your final project report submission.
Project Presentation (10 Minutes Video)
Use Zoom (you have access to Zoom Pro via ODU https://www.odu.edu/ts/collaboration-tools/zoom) or any other video recording tool to record a video of your project work that is 10 minutes or less (a 2-3 pt penalty will be applied if it is longer than 10 minutes) and upload it to YouTube. You can show your implementation/demo (use the screen share option) and also your presentation slides or Google Colab. Your presentation/demo should succinctly tell us *why* we should care and *what* interesting insight you have about the chosen data project. Give us some insight into the tough / cool / interesting aspects of your project. This is your time to shine, so carefully prepare what exactly you want to show off that will impress us in this summary. View the audience as potential upper management in your company, so convince us that your problem is important and that you have the appropriate insight about the dataset.
Follow these guidelines when preparing the Summary section for the talk (this should be at the very end of your Colab):
Put the project title and your name at the very beginning of the Summary section of the Colab, or on your very first slide.
Have a clear outcome presented (charts, graphs, key conclusions).
Know what you want your audience to take away from your presentation. Ideally, you would like the audience to leave with an understanding of what you’re doing and why you’re doing it.
Tell a Story
You may like to present your punch line first, then tell it like a story, with a beginning, a middle, and an end. It’s not easy to condense your project into a short presentation, so you may find it easier to break your presentation down into smaller sections. Try writing an opener to catch the audience's attention, then highlight your key findings, and finally give a summary to restate the importance of your work.
Engage your audience!
Embed your YouTube Video in your Colab at the Summary Section.
Project Final Report
The final report is a comprehensive report describing the project. This should be a "complete" document, so it should include front matter (title page, abstract, table of contents, chapters) or a sidebar index that connects to your report elements. The report should include the problem statement, your design and implementation, and your results and evaluation. This report should stand by itself as the archival description of the project.
This is the continuation of your same Google Colab project document (Abstract, Progress checks).
Project Title
Team member name, email, web portfolio link in the very first lines of the Colab.
Colab file title should be "YourLastName_CS620_DataProject"
Your results and evaluation (or evaluation strategy)
What metrics did you use (or will you use) to evaluate the success of your project?
Performance measures (how you measure them)?
Other criteria?
You should address the same questions as those you have addressed in the previous reports (abstract, progress checks), only with more details, especially regarding some of the challenges that you need to solve and your experimental results if any.
You should also include your conclusions from the study and point out how your work can be further extended (i.e., future work).
References if available (this should be the very last section of your Colab)
Provide as much context as you can, including any kind of diagrams and illustrations that will make it easier for us to understand and evaluate your effort. Graphics are always helpful; you have probably heard the saying “a picture is worth a thousand words”!
Do not just show the diagrams: for all figures, tables, charts, and diagrams, provide some narrative discussion! Unfortunately, diagrams, particularly technical diagrams, are rarely if ever self-explanatory. You should document the alternative solutions that you considered as well as the arguments for the final choice. Diagrams only represent your final solution; they do not explain why you decided on this solution or what alternatives were considered. Hence, all diagrams must be accompanied by an explanation and a discussion of alternatives and tradeoffs. Anything that could lead to ambiguity or misunderstanding on the reviewer’s part should be clearly explained. Explanations should be written in prose, with key arguments highlighted in bullet points.
There is no limit on the number of pages (or size) for the report. Of course, you should avoid stuffing your report with redundant or irrelevant material.
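As one possible illustration of the "metrics" and "performance measures" questions in the final report outline, the sketch below computes accuracy, precision, recall, and F1 for a binary classifier in plain Python. This is only an example under assumptions: the labels are made up, the function names are hypothetical, and your project may call for entirely different measures (for instance, regression error or ranking quality).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive and truth != positive:
            fp += 1
        elif pred != positive and truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 as a dictionary."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Whichever measures you choose, the report should state how they were computed and on which portion of the data.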
Data sources for projects
Huge collection of Awesome Public Datasets
Amazon AWS Public Data Sets
Amazon question/answer dataset from Julian McAuley
Amazon product data from Julian McAuley
Stanford Large Network Dataset Collection from Jure Leskovec
Nice list of datasets hosted by archive.org
Collection of 200,000+ Jeopardy! questions
Another good list, this one from KDnuggets
ProPublica data store