DSCI 550: Data Science at Scale

Class Info:

Spring Semester, 2022

Location: OHE 100D and online

Time: Th 3:30-6:50pm

Class number: 32413D

Class number: 32426D (DEN)


Dr. Keith Burghardt


E-Mail: keithab@isi.edu

Office Hours: Thu 6:45pm-7:45pm (right after class, online): https://usc.zoom.us/j/91219692056

Dr. Goran Muric


E-Mail: gmuric@isi.edu

Office Hours: Thu 6:45pm-7:45pm (right after class, online): https://usc.zoom.us/j/91219692056

Teaching Assistant

DSCI 550 Overview

This overview course gives students a broad understanding of informatics topics for Big Data and practical experience with key Big Data informatics techniques. Topics include a roadmap of informatics, the data lifecycle, the role of the data scientist, and analyzing and exploring Big Data through real-world use cases in data analytics. Understanding Big Data requires understanding digital file formats, how to detect them, and how to extract data from them. Emphasis areas include document type detection; parsing and extraction; metadata understanding and analysis; language identification and detection from files; and file formats and their representation. The class also has a specific focus on content detection and analysis from large data sets. Datasets used in the course are public datasets collected by the instructor or his collaborators involved in national Big Data initiatives, including DARPA, NASA, and other projects. The course is designed to be accessible to students with intermediate-level programming experience in Python and Java. It introduces students to topical software frameworks for Big Data, including Tika, Solr, ElasticSearch™, TensorFlow, Nutch, and Apache Hadoop™. The course combines lectures, in-class discussion, readings, group-based assignments, and a final exam.

The objective of this course is to train students to be able to understand Big Data and Large Data Environments, e.g., file formats, their representation, and how to automatically extract information from large datasets of files. Specifically, students successfully completing this course will achieve three main objectives:

  1. Develop sufficient proficiency in Big Data frameworks to write software capable of automatically extracting information from data, including its text, metadata, and language.

  2. Develop sufficient proficiency in techniques for working with large data sets collected from the Web and other sources (intranets, science data sets, public data sets).

  3. Develop sufficient proficiency in Python and Java to write and execute software that is “File Aware” and that automatically extracts text and metadata from large data sets.

The primary teaching methods will be discussion, case studies, and lectures. Students are expected to perform directed self-learning outside of class, which includes, among other things, a considerable amount of literature review. Leadership training in open source is provided and encouraged, and students leave with open source experience that makes them more marketable to companies and institutions hiring in Big Data and Data Science.

In addition to foundations and practical experience with Big Data and Data Science, the class will introduce students to the state of the art in content detection research, the state of the practice, and future trends. Students are expected to attend class regularly, participate (as directed) in all class discussions, and, most importantly, have fun!


Statement on Academic Conduct and Support Systems

Academic Conduct

Plagiarism - presenting someone else's ideas as your own, either verbatim or recast in your own words - is a serious academic offense with serious consequences. Please familiarize yourself with the discussion of plagiarism in SCampus in Section 11, Behavior Violating University Standards. Other forms of academic dishonesty are equally unacceptable. See additional information in SCampus and university policies on scientific misconduct.

Discrimination, sexual assault, and harassment are not tolerated by the university. You are encouraged to report any incidents to the Office for Equity, Equal Opportunity, and Title IX or to the Department of Public Safety. This is important for the safety of the USC community. Another member of the university community - such as a friend, classmate, advisor, or faculty member - can help initiate the report, or can initiate the report on behalf of another person. The Sexual Violence Prevention & Services provides 24/7 confidential support, and the sexual assault resource center webpage sarc@usc.edu describes reporting options and other resources.

Support Systems

A number of USC's schools provide support for students who need help with scholarly writing. Check with your advisor or program staff to find out more. Students whose primary language is not English should check with the American Language Institute, which sponsors courses and workshops specifically for international graduate students. The Office of Disability Services and Programs provides certification for students with disabilities and helps arrange the relevant accommodations. If an officially declared emergency makes travel to campus infeasible, USC Emergency Information will provide safety and other updates, including ways in which instruction will be continued by means of Blackboard, teleconferencing, and other technology.

Statement on Diversity

The diversity of the participants in this course is a valuable source of ideas, problem solving strategies, and engineering creativity. We encourage and support the efforts of all of our students to contribute freely and enthusiastically. We are members of an academic community where it is our shared responsibility to cultivate a climate where all students and individuals are valued and where both they and their ideas are treated with respect, regardless of their differences, visible or invisible.


Chris A. Mattmann and Jukka Zitting. Tika in Action. 256 pages. New York: Manning Publications, November 2011. ISBN: 9781935182856.

Chris A. Mattmann. Machine Learning with TensorFlow, 2nd Edition. 456 pages. New York: Manning Publications, December 2020. ISBN: 9781617297717.



(subject to change; check regularly)


Many thanks to Dr. Chris Mattmann, who created this course, wrote the textbooks, and helped us prepare the lectures.