Taiwanese Across Taiwan Corpus

The Taiwanese Across Taiwan (TAT) corpus is a large-scale database of native Taiwanese read speech (article reading) collected across Taiwan.

References

  1. Yuan-Fu Liao, Chia-Yu Chang, Hak-Khiam Tiun, Huang-Lan Su, Hui-Lu Khoo, Jane S. Tsay, Le-Kun Tan, Peter Kang, Tsun-Guan Thiann, Un-Gian Iunn, et al. 2020. Formosa speech recognition challenge 2020 and Taiwanese Across Taiwan corpus. In 2020 23rd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), pages 65–70. IEEE.

  2. Yuan-Fu Liao, Hui-Lu Khoo, Un-Gian Iunn, Tsun-Guan Thiann, Jane S. Tsay, Le-Kun Tan, Huang-Lan Su, Hak-Khiam Tiun, Peter Kang, Li-Chen Chang, Su-Lian Liao, Hong-Hūi Tân, Siok-Hong Liau, Chhun-Sui Na, et al. 2022. Taiwanese Across Taiwan corpus and its applications. In 2022 25th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), to appear. IEEE.

Taiwanese Speech Corpus (300 hours × 6 tracks)

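Introduction

TAT (Taiwanese Across Taiwan) is a corpus of Taiwanese read speech. Using texts originally written in Taiwanese, it collects Taiwanese speech in the different accents found across Taiwan, recorded simultaneously with six microphones. After recording, the transcripts were manually corrected twice, and the data were compiled into a speech corpus for research and development of speech recognition technology. So far 600 speakers have been recorded, half an hour each, for a total of 300 hours of speech (6 tracks). The corpus is divided into three sets, comprising: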

  • TAT-Vol1~2 (100 hours × 6 tracks)

  • TAT-MOE (200 hours × 6 tracks)
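
The totals quoted above follow from simple arithmetic, which the minimal Python sketch below reproduces. The variable names are illustrative only, and "track-hours" is a label introduced here for the amount of audio summed over all six microphones; it is not a term used by the corpus authors.

```python
# Figures taken from the corpus description above.
speakers = 600
hours_per_speaker = 0.5
tracks = 6  # six microphones recorded simultaneously

# Hours of distinct speech, and hours of audio summed over all tracks.
speech_hours = speakers * hours_per_speaker
track_hours = speech_hours * tracks

# Hours per released set, as listed above.
set_hours = {"TAT-Vol1~2": 100, "TAT-MOE": 200}
total_set_hours = sum(set_hours.values())

print(speech_hours)     # 300.0
print(track_hours)      # 1800.0
print(total_set_hours)  # 300
```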

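In addition, to build a Taiwanese speech synthesizer, we made studio recordings of one male and one female speaker each for the dominant accent (Kaohsiung accent) and the second most dominant accent (Taipei accent), with 10 hours of speech per speaker. These are: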

  • TAT-TTS-M1~2

  • TAT-TTS-F1~2

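Of these, the TAT-Vol1~2, TAT-TTS-M1~2, and TAT-TTS-F1~2 corpora have been licensed to the Association for Computational Linguistics and Chinese Language Processing (ACLCLP) for distribution. Applicants must apply to the Association, sign the license agreement, and agree to abide by its terms.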

Microphones

  • ZOOM XYH-6, left channel (XYH-6-X)

  • ZOOM XYH-6, right channel (XYH-6-Y)

  • Condenser microphone (condenser)

  • Lavalier microphone (lavalier)

  • iOS phone recording (ios)

  • Android phone recording (android)

Audio file (wav) format

Sampling format: 16 kHz, 16-bit PCM

Audio file format: *.wav
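
As a rough illustration of the stated format, the sketch below uses Python's standard `wave` module to check whether a file is 16 kHz, 16-bit PCM. The function name `matches_tat_format` and the synthetic demo file are hypothetical and introduced only for this example. The demo file is written as mono, which is an assumption, since this page does not state the channel count of the corpus files.

```python
import os
import tempfile
import wave


def matches_tat_format(path):
    """Return True if the WAV file is 16 kHz, 16-bit uncompressed PCM."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == 16000      # 16 kHz sampling rate
            and wav.getsampwidth() == 2      # 2 bytes = 16 bits per sample
            and wav.getcomptype() == "NONE"  # uncompressed PCM
        )


# Demonstration on a synthetic one-second silent file (not corpus data).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)   # assumption: mono
    wav.setsampwidth(2)   # 16-bit samples
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

demo_ok = matches_tat_format(demo_path)
print(demo_ok)
```

The same function could be pointed at a downloaded corpus file to confirm it matches the documented sampling format before further processing.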

JSON (metadata) format

Metadata file format: *.json

{
  "音檔長度": "6.69",
  "漢羅台文": "我欲坐八點十六分往屏東的車幫",
  "台羅": "guá beh tsē peh tiám tsa̍p-la̍k hun óng pîn-tong ê tshia-pang",
  "台羅數字調": "gua2 beh4 tse7 peh4 tiam2 tsap8-lak8 hun1 ong2 pin5-tong1 e5 tshia1-pang1",
  "白話字": "góa beh chē peh tiám cha̍p-la̍k hun óng pîn-tong ê chhia-pang",
  "字數": "14",
  "提示卡編號": "0012",
  "句編號": "1.1",
  "發音人": "IUF008",
  "性別": "女",
  "年齡": "20",
  "教育程度": "大學",
  "出生地": "屏東縣東港鎮",
  "現居地": "台中市西區",
  "腔調": "高屏普通腔",
  "錄音環境": "安靜隔音室內",
  "提示卡切換速度": "快",
  "總錄音時間(分)": "100"
}
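
The metadata record above can be read with any JSON parser. The following minimal Python sketch loads a trimmed copy of the example record, converts the string-valued numeric fields, and splits the tone-numbered romanization into syllables. The English glosses in `FIELD_GLOSSES` are informal translations added here for readers and are not part of the corpus; treating "音檔長度" as a duration in seconds is likewise an assumption, since the page does not state the unit.

```python
import json

# Informal English glosses for the metadata keys (translations added here,
# not defined by the corpus).
FIELD_GLOSSES = {
    "音檔長度": "audio length",
    "漢羅台文": "Taiwanese text (Han characters and romanization)",
    "台羅": "Tai-lo romanization",
    "台羅數字調": "Tai-lo romanization with tone numbers",
    "白話字": "Peh-oe-ji romanization",
    "字數": "character count",
    "提示卡編號": "prompt card number",
    "句編號": "sentence number",
    "發音人": "speaker ID",
    "性別": "gender",
    "年齡": "age",
    "教育程度": "education level",
    "出生地": "birthplace",
    "現居地": "current residence",
    "腔調": "accent",
    "錄音環境": "recording environment",
    "提示卡切換速度": "prompt card switching speed",
    "總錄音時間(分)": "total recording time (minutes)",
}

# The example record from this page, trimmed to the fields used below.
record_text = """
{
  "音檔長度": "6.69",
  "漢羅台文": "我欲坐八點十六分往屏東的車幫",
  "台羅數字調": "gua2 beh4 tse7 peh4 tiam2 tsap8-lak8 hun1 ong2 pin5-tong1 e5 tshia1-pang1",
  "字數": "14",
  "發音人": "IUF008"
}
"""
record = json.loads(record_text)

# All values are stored as strings, so numeric fields need explicit conversion.
duration = float(record["音檔長度"])  # assumed to be seconds
char_count = int(record["字數"])

# Spaces separate words and hyphens join syllables within a word, so
# replacing hyphens with spaces yields one token per syllable.
syllables = record["台羅數字調"].replace("-", " ").split()
tones = [int(s[-1]) for s in syllables]  # trailing digit is the tone number

for key, value in record.items():
    print(f"{FIELD_GLOSSES.get(key, key)}: {value}")
print(len(syllables) == char_count)
```

In this example the number of romanized syllables equals the value of "字數", which suggests a simple consistency check between transcript fields; whether that holds for every record in the corpus is not stated on this page.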


Audio/Text Samples

Release