corpus

UT-Podcast

This corpus is designed to support English dialect research. It includes three dialects: Australian English, United Kingdom English, and American English, abbreviated as AU, UK, and US, respectively. The corpus contains both audio and text, and is provided to support acoustic and language modeling for accent research. The data was collected from public sources on the internet and is intended for academic research only; it is not to be used for commercial purposes or applications.

Download Link: Click here to Download (485.7 MB)

References:

John H.L. Hansen, Gang Liu, "Unsupervised accent classification for deep data fusing of acoustic and language information," Speech Communication (accepted Nov. 2015, to appear in Spring 2016)

Rahul Chitturi, John H.L. Hansen, "Dialect Classification for Online Podcasts Fusing Acoustic and Language Based Structural and Semantic Information," ACL-08: HLT: Association for Computational Linguistics (ACL): Human Language Technologies Conf., pp. 21-24, Columbus, Ohio, June 15-20, 2008