Data Science

Data science (DS) is a new field that builds on computer science, statistics, mathematics, e-science, and other disciplines to develop principles, algorithms, and best practices for generating, processing, and using data. Virtually all parts of society are becoming data driven: large volumes of data are being collected, stored, and analyzed to glean insights. Data science emerged in response to this need. There is as yet no commonly accepted definition of DS. For our purposes, we define DS as pursuing three goals:

    • extracting actionable insights from raw data (a.k.a., building the "infamous" raw-data-to-insight pipeline).

    • developing data-intensive artifacts, such as knowledge bases/graphs and recommender systems.

    • designing data-intensive experiments to answer questions, e.g., A/B testing.

Overall Agenda

We believe that DS will become increasingly critical, that the database community has much to contribute, and that it should seek to play a leadership role in pushing this field forward. Today, however, no clear DS agenda exists. In this direction, we work toward developing such an agenda. The key distinguishing aspects of this agenda are the following:

    • It focuses on data quality, and targets data cleaning and integration for now.

    • It seeks to build end-to-end systems as part of current open-source ecosystems of DS tools.

    • It integrates research, system building, education, and outreach.

Our key observation is that going from raw data to insights typically happens in two stages. In the first stage, we need to do data cleaning and integration, i.e., data must be extracted from multiple sources, cleaned, and integrated into a single unified high-quality data set. In the second stage, analyses are performed on this data set to infer insights. Data cleaning and integration thus play a major role in the DS pipeline. This is also the part where current ecosystems of open-source DS tools are quite weak, and yet where the database community has been quite strong. As a result, the database community has a great opportunity to develop such tools for current DS ecosystems. Based on this observation, our current DS agenda is as follows:

    • We seek to make the case that the database community has a lot to contribute in terms of data cleaning/integration to DS, but the work needs to be done in integration with current ecosystems of DS tools, where data cleaning/integration has received relatively little effort.

    • We work on research, system building, education, and outreach on data cleaning and integration, as described in that direction.

    • For additional education, training, and outreach:

      • We design new DS courses and new degree programs for DS.

      • We explore systematic ways to train the rest of UW-Madison in DS, and to help the rest of UW-Madison (i.e., domain scientists at UW) do DS. This includes, for example, collecting training materials, designing short DS courses, and building DS tools.

      • We work on developing a university-wide DS hub.

    • We partner with domain scientists at UW-Madison to develop high-value data repositories and tools.

This is a broad DS agenda that requires extensive collaboration with other groups in the CS department, with many groups/centers/institutions across the campus, and with organizations/companies outside the campus.
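
To make the two-stage view of the raw-data-to-insight pipeline more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a description of any system built by the group: the records, field names, and the functions clean, integrate, and average_salary_by_city are all hypothetical, and real data cleaning and integration (e.g., entity matching) is far harder than the exact-key merge used here.

```python
from statistics import mean

# Hypothetical raw records from two sources; all names and values are illustrative.
source_a = [
    {"name": " Alice Smith ", "city": "madison", "salary": "52000"},
    {"name": "Bob Jones", "city": "Chicago", "salary": "61000"},
]
source_b = [
    {"name": "alice smith", "city": "Madison", "salary": "52000"},
    {"name": "Carol Lee", "city": "madison", "salary": None},
]


def clean(record):
    """Stage 1a: normalize whitespace and case, and parse the numeric field."""
    salary = record["salary"]
    return {
        "name": " ".join(record["name"].split()).title(),
        "city": record["city"].strip().title(),
        "salary": float(salary) if salary is not None else None,
    }


def integrate(*sources):
    """Stage 1b: merge cleaned records into one data set.

    Records with the same (name, city) are assumed to be the same entity;
    the first record seen for each key is kept.
    """
    unified = {}
    for source in sources:
        for raw in source:
            rec = clean(raw)
            key = (rec["name"], rec["city"])
            unified.setdefault(key, rec)
    return list(unified.values())


def average_salary_by_city(records):
    """Stage 2: a simple analysis over the unified data set."""
    by_city = {}
    for rec in records:
        if rec["salary"] is not None:  # skip records with missing values
            by_city.setdefault(rec["city"], []).append(rec["salary"])
    return {city: mean(values) for city, values in by_city.items()}


unified = integrate(source_a, source_b)
insights = average_salary_by_city(unified)
```

In this sketch the two spellings of the same person collapse into one record only after cleaning, which is why the analysis in the second stage depends on the quality of the first.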

Current Progress

    • What is Our Agenda for Data Science?, by A. Doan, position statement, CIDR-2017.

    • We have been working on research, system building, education, and outreach for data cleaning and integration, as described in that direction.

    • For additional education, training, and outreach:

      • A new DS course has been designed and taught, and another course (in collaboration with colleagues in the database group) is being designed.

      • A professional MS program in DS, housed in the CS department, is currently being developed.

      • We are working on helping other departments at UW-Madison (e.g., statistics) design and teach their DS programs.

      • Training materials are being collected and will eventually be published in a repository such as BigGorilla.

      • A university-wide DS Hub is being formed, and AnHai Doan is serving on the Steering Committee. AnHai also serves on the Advisory Committees of several joint DS efforts between UW-Madison and companies.

    • Several high-value data repositories at UW-Madison are being built. More soon.

Miscellaneous