Datasets

A list of datasets for doing research on email.

Enron Email Dataset
http://www.cs.cmu.edu/~enron/
A paper describing the dataset:
http://www.ceas.cc/papers-2004/168.pdf
The Enron Dataset reconstruction project:
EnronData.org

The BC3: British Columbia Conversation Corpus
The First Publicly Available Annotated Corpus for Email Summarization
Contains 40 threads (3,222 sentences) from the W3C corpus, annotated with extractive summaries, abstractive summaries, speech acts, meta sentences, and subjectivity.
http://cs.ubc.ca/labs/lci/bc3.html

W3C Corpus (TREC Enterprise Track)
The W3C mailing list corpus, crawled in 2004, was used for the email search and expert search tasks of the TREC Enterprise Track in 2005 and 2006.
The corpus can be obtained by following the standard TREC procedures.

Attachment Prediction Dataset
A copy of the Enron dataset which indicates which messages had attachments is available from Mark Dredze.
 
EnronSent Corpus
The EnronSent corpus is a special preparation of a portion of the Enron Email Dataset, designed for use in corpus linguistics and language analysis. It contains 96,107 messages from the "Sent Mail" directories of all the users in the corpus. It has been cleaned for use with conventional corpus linguistics tools, and an attempt has been made to remove as much non-human-generated text as possible from the raw messages in the original data.
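
When working with raw messages like those in the original Enron data, a common first step is to separate the headers from the body text. The following is a minimal sketch using Python's standard email module. It assumes each message is available as a plain RFC 822 text string; the function name summarize_message and the sample message are hypothetical and not taken from any of the corpora listed here.

```python
from email import message_from_string


def summarize_message(raw: str) -> dict:
    """Parse one raw RFC 822 message and return a few commonly used fields."""
    msg = message_from_string(raw)
    # For a non-multipart message, get_payload() returns the body as a string.
    # Multipart messages are skipped here to keep the sketch short.
    body = "" if msg.is_multipart() else msg.get_payload()
    return {
        "from": msg.get("From"),
        "to": msg.get("To"),
        "subject": msg.get("Subject"),
        "body": body.strip(),
    }


# Hypothetical sample message, written for illustration only.
SAMPLE = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Meeting\n"
    "\n"
    "See you at 10.\n"
)

summary = summarize_message(SAMPLE)
```

Real corpus files may use different encodings or contain forwarded and quoted text, so a production pipeline would likely need extra cleaning beyond this.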

Person Name Annotations
Einat Minkov provides email data annotated with person names: http://www.cs.cmu.edu/~einat/datasets.html

Conversation Threads, Multi-Lingual Conversations, Communication Network