Taiwanese Across Taiwan Corpus
The Taiwanese Across Taiwan (TAT) corpus is a large-scale database of native Taiwanese read speech (article reading) collected across Taiwan.
References
- Yuan-Fu Liao, Chia-Yu Chang, Hak-Khiam Tiun, Huang-Lan Su, Hui-Lu Khoo, Jane S Tsay, Le-Kun Tan, Peter Kang, Tsun-guan Thiann, Un-Gian Iunn, et al. 2020. Formosa speech recognition challenge 2020 and Taiwanese across Taiwan corpus. In 2020 23rd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), pages 65–70. IEEE.
- Yuan-Fu Liao, Hui-Lu Khoo, Un-Gian Iunn, Tsun-Guan Thiann, Jane S. Tsay, Le-Kun Tan, Huang-Lan Su, Hak-Khiam Tiun, Peter Kang, Li-Chen Chang, Su-Lian Liao, Hong-Hūi Tân, Siok-Hong Liau, Chhun-Sui Na, et al. 2022. Taiwanese across Taiwan corpus and its applications. In 2022 25th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), to appear. IEEE.
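Taiwanese Speech Corpus (300 hours * 6 tracks)
Introduction
TAT (Taiwanese Across Taiwan) is a corpus of Taiwanese read speech. Using texts originally written in Taiwanese as prompts, it collects Taiwanese speech in the different accents found across Taiwan, recorded simultaneously with 6 microphones. After two rounds of manual transcript correction, the recordings were compiled into a speech corpus for research and development of speech recognition technology. So far 600 speakers have been recorded, half an hour each, for a total of 300 hours of speech (6 tracks). The corpus is divided into 3 volumes: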
TAT-Vol1~2 (100*6 hours)
TAT-MOE (200*6 hours)
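In addition, to build Taiwanese speech synthesizers, we recorded in a studio one male and one female speaker each for the dominant Taiwanese accent (Kaohsiung accent) and the second most dominant accent (Taipei accent), 10 hours of speech per speaker. These sets are: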
TAT-TTS-M1~2
TAT-TTS-F1~2
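Of these, the TAT-Vol1~2, TAT-TTS-M1~2, and TAT-TTS-F1~2 corpora have been licensed to the Association for Computational Linguistics and Chinese Language Processing (ACLCLP) for distribution. Applicants must apply to the association, sign the license agreement, and agree to abide by its terms.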
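The sizes quoted in this section can be cross-checked with a few lines of arithmetic. The following is only a sanity-check sketch; the variable names are illustrative, and it assumes the three volumes are TAT-Vol1, TAT-Vol2, and TAT-MOE.

```python
# Figures taken from the introduction above.
SPEAKERS = 600
HOURS_PER_SPEAKER = 0.5
TRACKS = 6  # six microphones recorded simultaneously

speech_hours = SPEAKERS * HOURS_PER_SPEAKER   # hours of distinct speech
track_hours = speech_hours * TRACKS           # hours of audio across all tracks

# Per-volume speech hours as listed (Vol1~2 combined, then MOE).
volume_hours = {"TAT-Vol1~2": 100, "TAT-MOE": 200}
assert sum(volume_hours.values()) == speech_hours

# TTS sets: 2 accents x (1 male + 1 female) x 10 hours each.
tts_hours = 2 * 2 * 10

print(speech_hours, track_hours, tts_hours)  # 300.0 1800.0 40
```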
Microphones
ZOOM XYH-6, left channel (XYH-6-X)
ZOOM XYH-6, right channel (XYH-6-Y)
Condenser microphone (condenser)
Lavalier microphone (lavalier)
iOS phone recording (ios)
Android phone recording (android)
Audio file (wav) format
Sampling format: 16 kHz, 16-bit PCM
File format: *.wav
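As a rough illustration, a downloaded file can be checked against the stated format with Python's standard `wave` module. The function name `check_wav_format` is hypothetical and not part of the corpus tooling; the channel count is not checked because the description above does not state it.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_sample_width=2):
    """Return True if the WAV file is 16 kHz, 16-bit (2-byte) uncompressed PCM."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == expected_rate
            and wav.getsampwidth() == expected_sample_width
            and wav.getcomptype() == "NONE"  # "NONE" means uncompressed PCM
        )
```

Calling `check_wav_format("some_file.wav")` on a corpus file (the path here is a placeholder) should return `True` if the file matches the format listed above.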
JSON file (metadata) format
File format: *.json
{
"音檔長度": "6.69",
"漢羅台文": "我欲坐八點十六分往屏東的車幫",
"台羅": "guá beh tsē peh tiám tsa̍p-la̍k hun óng pîn-tong ê tshia-pang",
"台羅數字調": "gua2 beh4 tse7 peh4 tiam2 tsap8-lak8 hun1 ong2 pin5-tong1 e5 tshia1-pang1",
"白話字": "góa beh chē peh tiám cha̍p-la̍k hun óng pîn-tong ê chhia-pang",
"字數": "14",
"提示卡編號": "0012",
"句編號": "1.1",
"發音人": "IUF008",
"性別": "女",
"年齡": "20",
"教育程度": "大學",
"出生地": "屏東縣東港鎮",
"現居地": "台中市西區",
"腔調": "高屏普通腔",
"錄音環境": "安靜隔音室內",
"提示卡切換速度": "快",
"總錄音時間(分)": "100"
}
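All values in the metadata are stored as strings under Chinese keys, so a loader typically needs to convert the numeric fields. The sketch below is one possible way to do this with the standard `json` module; the `Utterance` class, its English field names, and the assumption that "音檔長度" is measured in seconds are illustrative and not defined by the corpus. The sample record is a subset of the example above.

```python
import json
from dataclasses import dataclass


@dataclass
class Utterance:
    duration_sec: float   # "音檔長度" (audio length; unit assumed to be seconds)
    hanlo: str            # "漢羅台文" (Han-Lo mixed-script text)
    tailo_numeric: str    # "台羅數字調" (Tâi-lô with numeric tones)
    char_count: int       # "字數" (number of characters)
    speaker: str          # "發音人" (speaker ID)
    gender: str           # "性別"
    age: int              # "年齡"
    accent: str           # "腔調"


def parse_metadata(text: str) -> Utterance:
    """Parse one TAT-style metadata record, converting numeric strings."""
    record = json.loads(text)
    return Utterance(
        duration_sec=float(record["音檔長度"]),
        hanlo=record["漢羅台文"],
        tailo_numeric=record["台羅數字調"],
        char_count=int(record["字數"]),
        speaker=record["發音人"],
        gender=record["性別"],
        age=int(record["年齡"]),
        accent=record["腔調"],
    )


SAMPLE = """{
"音檔長度": "6.69",
"漢羅台文": "我欲坐八點十六分往屏東的車幫",
"台羅數字調": "gua2 beh4 tse7 peh4 tiam2 tsap8-lak8 hun1 ong2 pin5-tong1 e5 tshia1-pang1",
"字數": "14",
"發音人": "IUF008",
"性別": "女",
"年齡": "20",
"腔調": "高屏普通腔"
}"""

utt = parse_metadata(SAMPLE)
print(utt.speaker, utt.duration_sec, utt.char_count)  # IUF008 6.69 14
```

Fields not shown here, such as "出生地" (birthplace) or "錄音環境" (recording environment), can be added to the class in the same way.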
Audio/Text Samples
Release
GitLab server at https://speech.nchc.org.tw/