
Data Science

Graph databases

FlockDB, Neo4j, Pregel; triple databases for RDF
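
As a rough illustration of what a triple database stores: RDF data is a set of (subject, predicate, object) statements, and queries are patterns with wildcards. The sketch below is a toy in-memory version of that idea. The class name `TripleStore`, its methods, and the sample data are all hypothetical and are not the API of any of the systems named above. Real triple stores identify resources by IRIs and are queried with SPARQL.

```python
from collections import namedtuple

# One RDF-style statement: subject, predicate, object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])


class TripleStore:
    """Toy in-memory triple store (hypothetical, for illustration only)."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        """Store one statement; duplicates are ignored because a set is used."""
        self._triples.add(Triple(subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return all triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.object == obj)
        )


# Made-up sample data.
store = TripleStore()
store.add("alice", "knows", "bob")
store.add("bob", "knows", "carol")
store.add("alice", "worksAt", "UF")

# Pattern query: whom does alice know?
friends_of_alice = [t.object for t in store.match(subject="alice", predicate="knows")]
```

A linear scan over a set is enough for a sketch. Production systems typically keep several indexes over the triples so that any pattern can be answered without scanning everything.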

Pre-requisites and Co-requisites
Information and Database Systems I (CIS 4301), Data Structures and Algorithms (COP 3530), and Probabilities and Statistics (STA 5325/5328), or their equivalents, are prerequisites.

Course Objectives
In this course, we will discuss recent work and publications in Data Science/Big Data Analytics research, with emphasis on systems and algorithms for large-scale advanced data analysis. Each student will be responsible for presenting one or more research papers in class and participating in discussions of papers presented by other students. Each student will also complete a class project, which carries the largest weight in the final grade. Every student should be comfortable with programming (C/Java/C++/Scala) and should preferably have prior experience with data mining or machine learning.

Course Materials
  • Books are recommended but NOT required.
  • Assigned papers for each class.
  • Supplementary online courses for additional background in NLP, ML, and PGM.


Course Outline and Topics
This course will cover recent developments across a broad range of Data Science problems, with particular focus on systems and algorithms that enable advanced (statistical/machine learning) data analysis. The topics are as follows:
  • Big Data Analysis Systems and Frameworks
    • Map-Reduce – Mahout, Spark/Shark
    • Parallel DB – MADlib, Tuffy/Felix
    • Others: GraphLab, SciDB, DataPath
  • Big Data Analysis Models and Algorithms 
    • Structured Data Mining
    • Text Analysis
    • Image Retrieval
    • Unsupervised Learning
    • Dimensionality Reduction
  • New Research Trends and Applications 
    • Crowd-sourcing, Human intelligence
    • Probabilistic Databases, Knowledge Bases
    • Data Visualization, Data Cleaning, Data Integration
    • E-discovery, EMR
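
For readers new to the Map-Reduce model named in the topic list, the following is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to word counting. It only illustrates the programming model. The function names are hypothetical and do not correspond to the APIs of Mahout, Spark, or any other framework listed, and the sample documents are made up. Real systems distribute each stage across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Made-up input documents.
counts = word_count(["big data systems", "big data algorithms"])
```

Because each map call sees only one document and each reduce call sees only one key, both stages can run in parallel without coordination. That independence is what the frameworks above exploit at scale.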

Additional Reading
Project Proposal Write-up (1-2 pages)

1. Names of the group members and UFIDs
2. Title of the project
3. Proposal should include the following topics:
a. What do you propose to do? What is the data? What is the data product? Or, which piece of a framework do you aim to build for Data Science?
b. What is the related work and state-of-the-art?
c. Why is it an important/interesting course project for Data Science?
d. What are the novelties of this project?
e. What is the end goal? A system? A new algorithm?
f. What is the measure of success? Evaluation? Prototype demo?

More on projects

1. Second Language Acquisition Through Artificial Neural Networks
2. A Study of Semi-Supervised Feature Selection Algorithms
3. Map-Reduce For Motif Search in Biological Networks
4. PANDA: Predictive Analysis with iNtegration of Doctoral Assessment
5. ProbKB: Managing Web-Scale Evolving Knowledge
6. Personalized Knowledge Base for Food, Diet and Nutrition
7. Movie/Congressional Dialog Extraction and Analysis
8. Text Analytics over Online Course Discussion Forums
9. Image Extraction and Retrieval
10. Knowledge Extraction from Formal Text: Wikipedia and News Articles (Reference: NIST KBA Competition 2012/2013)
11. Stock Trend Prediction based on Text Data
12. Knowledge Extraction from Informal Text: Twitter
a. Public Health Tracking and Visualization
b. Early Mental Health Problem Detection
13. Social Network Analysis
a. Influential People Identification
b. Community Detection and Link Prediction
14. E-discovery: Smart Ranking
a. Indexing and Ranking for Information Retrieval
b. Topic Modeling and Visualization

Review: Héctor Corrada Bravo and Raghu Ramakrishnan. "Optimizing MPF Queries: Decision Support and Probabilistic Inference". In Proceedings of SIGMOD '07.
Review: Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Ng, and Kunle Olukotun. "Map-Reduce for Machine Learning on Multicore". NIPS 2006.
Alon Halevy, Peter Norvig, and Fernando Pereira (Google). "The Unreasonable Effectiveness of Data".