Datasets
1. Event recommendation
Here you can download the seminar recommendation datasets, described in the paper:
Einat Minkov, Ben Charrow, Jonathan Ledlie, Seth Teller, Tommi Jaakkola, Collaborative Future Event Recommendation, CIKM 2010
User study data is made available courtesy of Nokia Research Center.
Download here (4.75 MB)
The compressed file includes the following folders:
- data-all: A collection of seminar announcements published via MIT CSAIL email list between May '02 and June '09.
- mit-user-study: this user study includes a subset of the seminar announcements corpus, pertaining to 15 consecutive weeks, starting in Sep '07. Duplicate messages and messages that do not include a seminar announcement have been removed from this dataset. The mit-labels file details user preferences: for every week, seminars that the user would have liked to attend are labeled with '1', or '0' otherwise. In the study, users have been instructed to select at least one seminar per week that they would have liked to attend; however, if they would have preferred to attend none of the seminars offered on a given week, this is indicated by setting a 'no-interest' field to 'true' for that week. (These indications are missing for several users, who participated in the initial stages of the study.) The user study is anonymized, and user ids are arbitrary.
- cmu-user-study: this user study includes another subset of the seminar announcement corpus, pertaining to 15 consecutive weeks starting from the 6th week of '09. In this dataset, header information (except for seminar title and speaker name) has been removed. This is intended to focus user attention on the seminar's content and speaker information, rather than on the venue, home institute of the speaker, etc. (The corresponding source messages with full headers are available in the data-all folder.) The enclosed cmu-labels file is organized in the same fashion as described above. The user study is anonymized, and user ids are arbitrary.
2. Personal name annotation in Email
Due to privacy issues, it is very hard to get hold of large and realistic email corpora. Here you can find two email datasets, as well as a dataset of newsgroup text, all annotated with personal name spans.
The full description of these datasets, including relevant statistics and references, is available in:
Einat Minkov, Richard C. Wang, William W. Cohen, Extracting Personal Names from Emails: Applying Named Entity Recognition to Informal Text, HLT/EMNLP 2005
- The email corpora given here were extracted from the Enron corpus, made public by the Federal Energy Regulatory Commission. A version of this data was later purchased by the CALO project, and made available for research purposes.
- The first dataset, 'Enron-Meetings', consists of all messages located in folders named "meetings" or "calendar" (excluding a few very large files). Most of these messages are meeting related. The second subset, 'Enron-Random', was formed by uniformly sampling a user name (out of 158 users) and then randomly sampling an email from that user.
- As a second type of informal text, we also annotated a collection of newsgroups postings. The 'Newsgroups' dataset was extracted from the 20Newsgroups corpus, by Vitor R. Carvalho.
- These datasets are given here in Minorthird format (plain text, with separate labels files), as well as in a 'general' format, where the personal name labels are embedded in the text using XML tags.
- The zipped files unpack into a directory tree. The separation into train and test folders corresponds to the data splits described in the above-mentioned paper. Further separation is for convenience only.
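The two-stage sampling used to form 'Enron-Random' can be illustrated with a minimal sketch. The function name, the dictionary layout of the mailboxes, and sampling with replacement are assumptions made for illustration; the actual procedure and data layout used to build the dataset are not specified here.

```python
import random

def sample_enron_random(mailboxes, n, seed=0):
    """Two-stage sampling sketch (hypothetical helper, not the original script).

    mailboxes: dict mapping a user name to a list of that user's messages.
    Returns n (user, message) pairs, drawn with replacement (an assumption).
    """
    rng = random.Random(seed)
    # Only users with at least one message can contribute a sample.
    users = sorted(user for user, msgs in mailboxes.items() if msgs)
    sample = []
    for _ in range(n):
        user = rng.choice(users)               # stage 1: uniform over users
        message = rng.choice(mailboxes[user])  # stage 2: uniform within the user
        sample.append((user, message))
    return sample
```

Sampling the user first gives every user the same chance of being represented, regardless of mailbox size; sampling messages directly from the pooled corpus would instead favor users with very large mailboxes.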
Download: Enron-Meetings: Minorthird format
Enron-Random: Minorthird format
Newsgroups: Minorthird format
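For readers working with the 'general' format, the sketch below shows one way to turn inline XML name tags into plain text plus character-offset spans. The tag name PERSON and the function name are hypothetical, chosen only for illustration; check the downloaded files for the tag actually used. The output is a generic stand-off representation and does not reproduce the Minorthird labels file syntax.

```python
import re

def extract_name_spans(tagged_text, tag="PERSON"):
    """Strip inline name tags; return (plain_text, spans).

    The tag name is an assumption. Each span is a (start, end) pair of
    character offsets into plain_text, with end exclusive.
    """
    pattern = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(tag)), re.DOTALL)
    parts = []
    spans = []
    pos = 0     # read position in the tagged text
    offset = 0  # length of the plain text produced so far
    for match in pattern.finditer(tagged_text):
        before = tagged_text[pos:match.start()]
        parts.append(before)
        offset += len(before)
        name = match.group(1)
        spans.append((offset, offset + len(name)))
        parts.append(name)
        offset += len(name)
        pos = match.end()
    parts.append(tagged_text[pos:])  # text after the last tag
    return "".join(parts), spans

plain, spans = extract_name_spans(
    "Meet <PERSON>Jane Doe</PERSON> and <PERSON>Bob</PERSON> at noon."
)
names = [plain[start:end] for start, end in spans]
```

Here `names` holds the annotated strings recovered from the offsets, which is a quick check that the spans line up with the stripped text.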
3. Personal name disambiguation and threading
Here you can download Enron corpora and datasets, used for the general problems of entity disambiguation and the extraction of inter-entity relations. Email here is represented as a relational database, which includes text. Specifically, the tasks considered in these subsets of the Enron corpus are person name disambiguation in email and intelligent message threading.
Two variations of the data are provided:
A. raw email messages, and the corresponding datasets (queries and correct answers), as used in
Einat Minkov, William W. Cohen, Andrew Y. Ng,
Contextual Search and Name Disambiguation in Email using Graphs,
SIGIR 2006
Download:
Person name disambiguation corpora
B. graph files (net relations and entity declarations), and the corresponding datasets, as used in
Einat Minkov, William W. Cohen,
Learning to Rank Typed Graph Walks: Local and Global Approaches,
WebKDD and SNA-KDD joint workshop 2007
Download:
Person name disambiguation corpora
Note: the corpora files of (A) and (B) are different representations of the same data (reply lines have been removed in the latter). The datasets are mostly identical, except that some examples were moved from the training and test sets to a development set.