Data Provided or Permissible
This page lists the training data that is permissible for training MT systems and language models for ASR.
Provided Data
MT Training and Development Data:
ASR AM Training Data (English, German): sign this agreement, scan it, and send it to gretter@fbk.eu.
Other Permissible Data
Parallel:
All data released for the translation tasks at WMT 2017 (and previous WMT editions)
Parallel corpora from Wikipedia for the en-{cs,de,fr} (and en-vi) pairs, provided by Krzysztof Wołk of the Polish-Japanese Academy of Information Technology
QED Corpus v1.4 except for the talks listed as not permissible
Monolingual:
TED
Any subtitles of a TED or TEDx talk that is not listed as non-permissible.
LDC
LDC2004T19 Fisher English Training Speech Part 1 Transcripts
LDC2009S04 2007 NIST Language Recognition Evaluation Test Set
LDC2006S35 CSLU: Multilanguage Telephone Speech Version 1.2
LDC2011T11 Arabic Gigaword Fifth Edition
LDC2016T16 English Speed Networking Conversational Transcripts
LDC2016V01 HAVIC Pilot Transcription
LDC2016T03 NewSoMe Corpus of Opinion in Blogs
LDC2014T23 Fisher and CALLHOME Spanish--English Speech Translation
LDC2012T11 American English Nickname Collection
LDC2012S05 USC-SFI MALACH Interviews and Transcripts English
LDC2005T35 American National Corpus (ANC) Second Release
LDC2005S13 Fisher English Training Part 2, Speech
LDC2005T19 Fisher English Training Part 2, Transcripts
LDC2005S16 RT-04 MDE Training Data Speech
LDC2005T24 RT-04 MDE Training Data Text/Annotations
LDC2005S25 Santa Barbara Corpus of Spoken American English Part IV
LDC2004S11 2002 Rich Transcription Broadcast News and Conversational Telephone Speech
LDC2004S13 Fisher English Training Speech Part 1 Speech
LDC2004S02 ICSI Meeting Speech
LDC2004T04 ICSI Meeting Transcripts
LDC2004S05 ISL Meeting Speech Part 1
LDC2004T10 ISL Meeting Transcripts Part 1
LDC2004S09 NIST Meeting Pilot Corpus Speech
LDC2004T13 NIST Meeting Pilot Corpus Transcripts and Metadata
LDC2004S08 RT-03 MDE Training Data Speech
LDC2004T12 RT-03 MDE Training Data Text and Annotations
LDC2004S10 Santa Barbara Corpus of Spoken American English Part III
LDC2004S07 Switchboard Cellular Part 2 Audio
LDC2003T02 1998 HUB5 English Transcripts
LDC2003S06 Santa Barbara Corpus of Spoken American English Part II
LDC2002S11 1997 HUB4 English Evaluation Speech and Transcripts
LDC2002S23 1997 HUB5 English Evaluation
LDC2002S10 1998 HUB5 English Evaluation
LDC2002S09 2000 HUB5 English Evaluation Speech
LDC2002T43 2000 HUB5 English Evaluation Transcripts
LDC2002S13 2001 HUB5 English Evaluation
LDC2002S06 Switchboard-2 Phase III Audio
LDC2002T31 The AQUAINT Corpus of English News Text
LDC2002S04 Translanguage English Database (TED) Speech
LDC2002T03 Translanguage English Database (TED) Transcripts
LDC2002S35 Voicemail Corpus Part II
LDC2001S13 Switchboard Cellular Part 1 Audio
LDC2001S15 Switchboard Cellular Part 1 Transcribed Audio
LDC2001T14 Switchboard Cellular Part 1 Transcription
LDC2001S94 TDT3 English Audio
LDC2000S86 1998 HUB4 Broadcast News Evaluation English Test Material
LDC2000S88 1999 HUB4 Broadcast News Evaluation English Test Material
LDC2000S92 TDT2 Careful Transcription Audio
LDC2000T44 TDT2 Careful Transcription Text
LDC99L23 American English Spoken Lexicon
LDC99S79 Switchboard-2 Phase II
LDC99S84 TDT2 English Audio
LDC99S82 USC Marketplace Broadcast News Speech
LDC99T36 USC Marketplace Broadcast News Transcripts
LDC97S66 1996 English Broadcast News Dev and Eval (HUB4)
LDC97S44 1996 English Broadcast News Speech (HUB4)
LDC97T22 1996 English Broadcast News Transcripts (HUB4)
LDC98S71 1997 English Broadcast News Speech (HUB4)
LDC98T28 1997 English Broadcast News Transcripts (HUB4)
LDC98S75 Switchboard-2 Phase I
LDC98T25 TDT Pilot Study Corpus
LDC98S77 Voicemail Corpus Part I
LDC96S36 Boston University Radio Speech Corpus
LDC96S46 CALLFRIEND American English-Non-Southern Dialect
LDC96S47 CALLFRIEND American English-Southern Dialect
LDC97L20 CALLHOME American English Lexicon (PRONLEX)
LDC97S42 CALLHOME American English Speech
LDC97T14 CALLHOME American English Transcripts
LDC97S62 Switchboard-1 Release 2
Miscellaneous: