1st International Workshop on Big Data Discovery & Curation

Co-located with KDD 2014

Sunday August 24, 2014, New York, USA



August 24, 2014. Venue: Bloomberg Headquarters, 731 Lexington Ave, New York, NY 10022

 Time  Title  Author/Speaker
 2:00-2:45 Keynote: The Data Curation Experiment on Web Tables
 Sunita Sarawagi, IIT Bombay, India
 2:45-3:15 Invited Talk: The confluence of digital curation and data analysis
 Wray Buntine, Monash University
 3:15-3:35  Research Talk: From Information Management to Information Orchestration: The need for an Evolving, Self-Organizing, Self-Governing Data Lake
 Peter Schwarz, Mary Roth, Eser Kandogan, Joshua Hui, Holger Kache, Kevin Shank, IBM
 3:35-3:50  Demo Talk: Self-Generated Health Information Exchange (SGHIx): Protecting and Providing Benefits for Aggregating Personal Health Data
 Thomas P. Caruso, Lauren K. Li, J. Marshall Presnell, Robert G. Capra, Ashwani Kaul, Gary Marchionini, and Linda L. Dimitropoulos
 3:50-4:05  Demo Talk: Data book Architecture: A Policy-driven Framework for Discovery and Curation of Federated Data
 Hao Xu, Mike Conway, Arcot Rajasekar, Reagan Moore, Akio Sone, Jane Greenberg, and Jonathan Crabtree
 4:05-4:30  Break  
 4:30-5:00 Invited Talk: Mining Topics in Documents: Standing on the Shoulders of Big Data
 Brett Zhiyuan Chen, University of Illinois at Chicago (UIC)
 5:00-5:30  Invited Talk: From Music Fandom to Artists and Back
 Maarten de Rijke, University of Amsterdam
 5:30-6:00  Conclusion Talk: Research topics in Data Curation and Cognitive Computing
 Mary Roth, IBM Research


The Data Curation Experiment on Web Tables: Curating structured data from Web tables presents challenges of scale and diversity not seen in traditional enterprise data. The first challenge is that the schema of "organic" Web tables is non-existent or noisy. We discover the schema of a Web table by annotating its cells, columns, and column pairs with the entities, types, and relationships, respectively, of a well-defined ontology. The second challenge is extracting structured data from table columns that are predominantly textual. We present statistical methods for segmenting textual cells into structured fields by exploiting the prevalent data redundancy on the Web. Finally, for curating numbers from Web tables, we discuss how to tackle the challenge of inferring the units of quantitative columns via CFG-based parsers and several tricks that tap a large corpus of Web tables.

The confluence of digital curation and data analysis: While my own research addresses high-performance non-parametric modelling of unstructured data, for instance citation networks and abstracts, and other mixed text and network data, in my role as an educator I look at data science generally. Digital curation and data analysis are complementary parts of the data lifecycle. While data analysis is our rocket science, the sexiest job of the 2010s, digital curation is the engine in the backroom. To appreciate the importance of digital curation in our community, consider the work that went on in the background during the development of the Reuters RCV1 (news) collection or any of the TREC collections. In this talk we will briefly look at the Australian landscape in open data and health data and consider the digital curation and archiving task from the naive perspective of an old data analyst: what are the challenge problems in digital curation and archiving that data analysis can help with?

Mining Topics in Documents: Standing on the Shoulders of Big Data: Automatically extracting knowledge from different data sources or domains is an important problem. In the era of big data, with a large number of different domains, it is even more challenging. How can high-quality knowledge be automatically extracted? How useful is the extracted knowledge for an application? How does the diversity of domains influence performance? In this talk, I will introduce the work of my KDD 2014 paper. The paper proposes to learn as humans do, i.e., retaining the results learned in the past and using them to help future learning. When faced with a new task, the algorithm first mines some reliable (prior) knowledge from the past domain learning/modeling results and then uses it to improve the new learning. In this paper, topic modeling is used as the example learning task. The aim is to discover meaningful topics from each individual domain. In more detail, I will introduce how high-quality knowledge is automatically extracted and how to deal with knowledge that is inappropriate for a new domain.

From Music Fandom to Artists and Back: In this talk I will report on ongoing work on real-time analysis of self-reported music listening behavior. Six tweets per second are self-reports by people of the music they are listening to right now. This translates to roughly 500K tweets per day, or about 180M per year, which amounts to a significant volume of information for analyzing music listening behavior around the globe. The analysis starts by recognizing artists and songs, a basic entity linking task that is made challenging by the highly dynamic nature of the domain.

A core component here is how to map short, noisy, and unedited text to a knowledge base for performing behavioral analysis. Real-time aggregation of the data, using open APIs, has interesting new applications for both music fans and artists. Music fans can check which song or artist is popular or emerging right now, and artists can discover new potential audiences or gauge how likely their music is to become the next big sound.

I will discuss the trajectory of research we have followed in this area over the last two years, and how it led to the creation of 904Labs, an Amsterdam startup created to make search and recommendation self-learning. This is joint work with Wouter Weerkamp (904Labs) and Manos Tsagkias (University of Amsterdam and 904Labs).

Workshop Co-chairs:
Rajeev Gupta (IBM Research, India)
Ganesh Ramakrishnan (Indian Institute of Technology, Mumbai, India)

Program Committee:
Prasad Deshpande (IBM Research, India)
Sriram Padmanabhan (IBM Santa Teresa Lab, San Jose)
Pauli Miettinen (Max-Planck-Institut für Informatik, Saarbrücken, Germany)
Rahul Gupta (Google, Mountain View, CA, USA)
Rainer Gemulla (Max-Planck-Institut für Informatik, Saarbrücken, Germany)
Tamrapani Dasu (AT&T Research)