August 24, 2014. Venue: Bloomberg Headquarters, 731 Lexington Ave, New York, NY 10022
The Data Curation Experiment on Web Tables: Curating structured data from Web tables presents challenges of scale and diversity not seen in traditional enterprise data. The first challenge is that the schema of "organic" Web tables is missing or noisy. We discover the schema of a Web table by annotating its cells, columns, and column pairs with the entities, types, and relationships, respectively, of a well-defined ontology. The second challenge is extracting structured data from table columns that are predominantly textual. We present statistical methods for segmenting textual cells into structured fields by exploiting the data redundancy that is prevalent on the Web. Finally, for curating numbers from Web tables, we discuss how to tackle the challenge of inferring the units of quantitative columns via CFG-based parsers, along with several tricks that tap a large corpus of Web tables.
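A minimal sketch of the unit-inference idea, using a regular-expression tokenizer as a simple stand-in for the CFG-based parser described above. The unit table, quantity names, and function names here are invented for illustration; a real system would derive unit candidates from a large corpus of Web tables:

```python
import re
from collections import Counter

# Toy synonym table mapping unit tokens to (quantity, canonical unit).
UNIT_SYNONYMS = {
    "km": ("length", "kilometre"),
    "kilometers": ("length", "kilometre"),
    "mi": ("length", "mile"),
    "kg": ("mass", "kilogram"),
    "lbs": ("mass", "pound"),
    "$": ("currency", "US dollar"),
}

TOKEN = re.compile(r"""
    (?P<unit_pre>[$])?                    # prefix unit, e.g. "$1,200"
    \s*(?P<value>-?\d[\d,]*(?:\.\d+)?)    # the numeric value
    \s*(?P<unit_post>[A-Za-z]+)?          # suffix unit, e.g. "3.2 km"
""", re.VERBOSE)

def parse_quantity(cell):
    """Parse a cell like '3.2 km' or '$1,200' into (value, quantity, unit)."""
    m = TOKEN.fullmatch(cell.strip())
    if not m:
        return None
    value = float(m.group("value").replace(",", ""))
    unit_tok = m.group("unit_pre") or m.group("unit_post") or ""
    quantity, unit = UNIT_SYNONYMS.get(unit_tok.lower(), (None, None))
    return value, quantity, unit

def infer_column_unit(cells):
    """Majority-vote the (quantity, unit) pair over a column's parseable cells."""
    votes = Counter(p[1:] for c in cells if (p := parse_quantity(c)) and p[2])
    return votes.most_common(1)[0][0] if votes else (None, None)
```

Voting over the whole column, rather than trusting any single cell, is what lets the inference survive noisy or unparseable entries.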
The confluence of digital curation and data analysis: While my own research addresses high-performance non-parametric modelling of unstructured data (for instance, citation networks, abstracts, and other mixed text-and-network data), in my role as an educator I look at data science more generally. Digital curation and data analysis are complementary parts of the data lifecycle. While data analysis is our rocket science, the sexiest job of the 2010s, digital curation is the engine in the back room. To appreciate the importance of digital curation to our community, consider the work that went on in the background during the development of the Reuters RCV1 (news) collection or any of the TREC collections. In this talk we will briefly look at the Australian landscape in open data and health data, and consider the digital curation and archiving task from the naive perspective of an old data analyst: what challenge problems does digital curation and archiving have that data analysis can help with?
Mining Topics in Documents: Standing on the Shoulders of Big Data: Automatically extracting knowledge from different data sources or domains is an important problem. It is even more challenging in the era of big data, given the large number of different domains. How can high-quality knowledge be extracted automatically? How useful is the extracted knowledge to an application? How does the diversity of domains influence performance? In this talk, I will introduce the work of my KDD 2014 paper. The paper proposes to learn as humans do, i.e., retaining the results learned in the past and using them to help future learning. When faced with a new task, the algorithm first mines reliable (prior) knowledge from past domain learning/modeling results and then uses it to improve the new learning. In this paper, topic modeling is used as the example learning task; the aim is to discover meaningful topics from each individual domain. In more detail, I will describe how high-quality knowledge is extracted automatically and how to deal with knowledge that is inappropriate for a new domain.
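To illustrate the retain-and-reuse idea, here is a toy sketch (not the paper's actual algorithm) that mines frequently co-occurring word pairs from past topic-modeling results as "must-link" prior knowledge for a new domain. The example topics are invented:

```python
from collections import Counter
from itertools import combinations

# Past topic-modeling results: one list of top words per discovered topic.
past_topics = [
    ["battery", "life", "charge", "screen"],
    ["battery", "charge", "power", "adapter"],
    ["price", "cheap", "value", "battery"],
    ["screen", "display", "resolution", "battery"],
]

def mine_must_links(topics, min_support=2):
    """Word pairs co-occurring in at least `min_support` past topics are
    treated as reliable (prior) knowledge: they likely belong together."""
    counts = Counter()
    for topic in topics:
        counts.update(combinations(sorted(set(topic)), 2))
    return {pair for pair, c in counts.items() if c >= min_support}

must_links = mine_must_links(past_topics)
# A new domain's topic model can then boost the probability that such
# pairs land in the same topic; pairs that contradict the new domain's
# own statistics would be treated as inappropriate and discarded.
```

The support threshold is what makes the mined knowledge "reliable": a pair seen across many past domains is less likely to be an artifact of any single one.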
From Music Fandom to Artists and Back: In this talk I will report on ongoing work on real-time analysis of self-reported music listening behavior. Six tweets per second are self-reports by people of what music they are listening to right now. This translates to about 500K tweets per day, or 180M per year, a significant volume of information for analyzing music listening behavior around the globe. The analysis starts by recognizing artists and songs, a basic entity linking task that is made challenging by the highly dynamic nature of the domain.
A core component here is how to map short, noisy, and unedited text to a knowledge base for performing behavioral analysis. Real-time aggregation of the data, using open APIs, enables interesting new applications for both music fans and artists. Music fans can check what song or artist is popular or emerging right now, and artists can discover new potential audiences or gauge how likely their music is to become the next big sound.
I will discuss the trajectory of research we have followed in this area over the last two years, and how it led to the creation of 904Labs, an Amsterdam startup that makes search and recommendation self-learning. This is joint work with Wouter Weerkamp (904Labs) and Manos Tsagkias (University of Amsterdam and 904Labs).
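To illustrate the entity-linking step on noisy tweet text, here is a minimal greedy n-gram matcher against a toy knowledge base. The KB entries and function names are invented for illustration; a production system would link against a full catalogue and handle fuzzy matches:

```python
import re

# Toy knowledge base of known artists and songs.
KB = {
    "daft punk": "artist",
    "get lucky": "song",
    "arctic monkeys": "artist",
}

def normalize(text):
    """Lowercase and strip URLs, hashtags, mentions, and punctuation."""
    text = re.sub(r"https?://\S+|[@#]\w+", " ", text.lower())
    return re.sub(r"[^a-z0-9 ]+", " ", text)

def link_entities(tweet, max_ngram=3):
    """Greedy longest-match lookup of word n-grams against the KB."""
    words = normalize(tweet).split()
    found, i = [], 0
    while i < len(words):
        for n in range(min(max_ngram, len(words) - i), 0, -1):
            cand = " ".join(words[i:i + n])
            if cand in KB:
                found.append((cand, KB[cand]))
                i += n
                break
        else:
            i += 1
    return found
```

The normalization step is doing most of the work here: it is what bridges the gap between the unedited language of tweets and the canonical names in the knowledge base.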
Rajeev Gupta (IBM Research, India)
Ganesh Ramakrishnan (Indian Institute of Technology, Mumbai, India)
Sriram Padmanabhan (IBM Santa Teresa Lab, San Jose, CA, USA)
Pauli Miettinen (Max-Planck-Institut für Informatik, Saarbrücken, Germany)
Rahul Gupta (Google, Mountain View, CA, USA)
Rainer Gemulla (Max-Planck-Institut für Informatik, Saarbrücken, Germany)
Tamraparni Dasu (AT&T Research)