Note: This site has been moved to https://sites.google.com/nycu.edu.tw/speechlabx
The Formosa Speech Recognition Challenge 2020 (FSR-2020) is the second event of the Formosa Speech in the Wild (FSW) project, organized by National Taipei University of Technology (NTUT).
Taiwanese (a.k.a. Taiwanese Hokkien, Hoklo, Taigi, Southern Min or Min-Nan) is a language spoken natively by about 70% of the population of Taiwan. Although the number of Taiwanese speakers continues to drop, especially among the younger generations, it is not yet too late to save this language. We therefore welcome participants from both the academic and industrial sectors to FSR-2020. Students are especially welcome to compete for the Student Award.
The free Taiwanese Across Taiwan corpus "TAT-Vol1" was collected across Taiwan in 2019.
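The scores for the Chinese and Taiwanese-character outputs have been updated, and the new reference answers have also been updated on GitLab.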
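For the Chinese output, files in the following two categories were identified and excluded from scoring:
1. Files containing non-Taiwanese speech (identified by manually listening to the audio): 225 files
2. Files whose text contains digits or English: 177 files
In total, 390 files are excluded from scoring (some files fall into both categories).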
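For the Taiwanese-character output, files in the following two categories were identified and excluded from scoring (the check was carried out with the help of an assistant who graduated from a Taiwanese-language department):
1. Files containing non-Taiwanese speech (identified by manually listening to the audio): 631 files
2. Files whose text contains digits, English, or non-Taiwanese text: 482 files
In total, 1,113 files are excluded from scoring.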
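The exclusion described above amounts to removing a set of file IDs before scoring. The following is a minimal sketch, assuming the exclusion lists are available as sets of file IDs; the function name, variable names, and IDs are hypothetical and are not taken from the organizers' scripts.

```python
def drop_excluded(answers, *exclusion_lists):
    """Return the answers whose file ID is in none of the exclusion lists."""
    # A file that appears on several lists is counted only once in the union.
    excluded = set().union(*exclusion_lists)
    kept = {fid: text for fid, text in answers.items() if fid not in excluded}
    return kept, excluded


# Hypothetical IDs, for illustration only.
answers = {"1": "a", "2": "b", "3": "c", "4": "d"}
non_taiwanese_speech = {"2", "3"}  # found by listening to the audio
digits_or_english = {"3"}          # found by inspecting the text

kept, excluded = drop_excluded(answers, non_taiwanese_speech, digits_or_english)
print(sorted(kept))   # ['1', '4']
print(len(excluded))  # 2, not 3, because file "3" is on both lists
```

Taking the union rather than adding the list sizes is the reason the Chinese-output total of 390 is smaller than 225 + 177 = 402.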
Meaning of the IDs below: the letter is the team code, the first digit is the Track, and the second digit is the submission number.
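As an illustration of this ID convention, the sketch below splits an ID into its three parts. It assumes the layout is one or more letters followed by exactly two digits; the sample ID "A12" and the function name are hypothetical.

```python
import re

# Assumed layout: one or more letters, then exactly two digits (e.g. "A12").
ID_PATTERN = re.compile(r"^([A-Za-z]+)(\d)(\d)$")


def parse_result_id(result_id):
    """Split a result ID into (team code, track, submission)."""
    match = ID_PATTERN.match(result_id.strip())
    if match is None:
        raise ValueError(f"unrecognized result ID: {result_id!r}")
    team, track, submission = match.groups()
    return team, int(track), int(submission)


print(parse_result_id("A12"))  # ('A', 1, 2)
```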
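For the scoring method, please refer to the PILOT-TEST description.
The current results are a beta version. If you find any problems, please contact us as soon as possible!
The current results are computed as CER (character error rate).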
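Track 3 results are split into two parts for reference:
Tâi-lô with tone numbers (tones considered) and Tâi-lô with tone numbers (tones ignored).
The results are computed as SER (syllable error rate).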
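The two Track 3 variants can be sketched as a syllable-level edit distance, computed once with the tone digits kept and once with them stripped. This is only an illustration under stated assumptions: syllables are separated by whitespace, the tone is a trailing digit, and the error rate is the edit distance divided by the number of reference syllables. The function names and the hypothesis string are hypothetical, and the organizers' scoring script may differ.

```python
import re


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def syllables(text, keep_tone=True):
    """Split on whitespace; optionally strip the trailing tone digit."""
    tokens = text.split()
    if not keep_tone:
        tokens = [re.sub(r"\d+$", "", t) for t in tokens]
    return tokens


def ser(ref, hyp, keep_tone=True):
    """Syllable error rate of hyp against ref."""
    ref_syl = syllables(ref, keep_tone)
    if not ref_syl:
        raise ValueError("reference must contain at least one syllable")
    return edit_distance(ref_syl, syllables(hyp, keep_tone)) / len(ref_syl)


ref = "tsit4 ma2 si7 am3 si5 peh4 tiam2"
hyp = "tsit4 ma2 si7 am3 si7 peh4 tiam2"  # hypothetical output, one wrong tone
print(ser(ref, hyp, keep_tone=True))   # 1 of 7 syllables differs
print(ser(ref, hyp, keep_tone=False))  # 0.0
```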
Track 2 (Taiwanese characters) results are computed directly as CER (character error rate).
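For reference, CER is commonly computed as the character-level edit distance divided by the number of reference characters. The sketch below follows that common definition and ignores whitespace; both choices are assumptions, and the function names and the hypothesis string are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of hyp against ref, ignoring whitespace."""
    ref_chars = [c for c in ref if not c.isspace()]
    hyp_chars = [c for c in hyp if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


ref = "這馬是暗時八點"
hyp = "這馬是暗時九點"  # hypothetical output with one wrong character
print(cer(ref, hyp))  # 1 of 7 characters differs
```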
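Track 1 (Chinese) results are reported in two ways: a direct CER (character error rate) and a translation score computed with BLEU.
The BLEU variant used is Google-BLEU (GLEU):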
https://colab.research.google.com/github/gcunhase/NLPMetrics/blob/master/notebooks/gleu.ipynb
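Sentence-level GLEU is usually defined as the minimum of n-gram precision and n-gram recall over all 1-grams to 4-grams. The sketch below implements that definition with the standard library only. Tokenizing Chinese text into single characters is an assumption, as are the function names and the hypothesis string; the organizers' tokenization is not stated here. NLTK provides a comparable function as `nltk.translate.gleu_score.sentence_gleu`.

```python
from collections import Counter


def ngram_counts(tokens, max_n=4):
    """Count every n-gram of order 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def sentence_gleu(ref_tokens, hyp_tokens, max_n=4):
    """Minimum of n-gram precision and recall against a single reference."""
    ref_counts = ngram_counts(ref_tokens, max_n)
    hyp_counts = ngram_counts(hyp_tokens, max_n)
    # Counter intersection keeps the smaller count of each shared n-gram.
    matches = sum((ref_counts & hyp_counts).values())
    total_ref = sum(ref_counts.values())
    total_hyp = sum(hyp_counts.values())
    if total_ref == 0 or total_hyp == 0:
        return 0.0
    return min(matches / total_hyp, matches / total_ref)


ref = list("現在是晚上八點")  # character-level tokens (assumption)
hyp = list("現在是晚上九點")  # hypothetical output with one wrong character
print(sentence_gleu(ref, ref))  # 1.0
print(sentence_gleu(ref, hyp))  # 15 of 22 n-grams match
```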
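Build an automatic speech recognizer (ASR) for Taiwanese that outputs one of the following (choose at least one Track):
Track 1: Traditional Chinese characters, i.e., Taiwanese speech to Chinese characters (translation)
Track 2: Taiwanese Southern Min Recommended Characters by the Ministry of Education of Taiwan (Taiwanese Han characters, following the Ministry's 臺灣閩南語推薦用字 list; Han characters take priority)
Track 3: Taiwan Minnanyu Luomazi Pinyin (Tâi-lô with tone numbers, following the Ministry's 臺灣閩南語羅馬字拼音方案; citation tones, not sandhi tones, are used)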
For example:
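Track1 - 現在是晚上八點 (we convert synonymous characters, such as 臺 and 台, to a unified form before scoring; in addition to CER, a translation score is also computed)
Track2 - 這馬是暗時八點 (everything except loanwords is written in Han characters; synonymous characters are likewise unified first)
Track3 - tsit4 ma2 si7 am3 si5 peh4 tiam2 (citation tones are used)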
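The unification of synonymous characters mentioned for Track 1 and Track 2 can be pictured as a character mapping applied to both the reference and the recognizer output before scoring. In the sketch below, only the 臺/台 pair comes from the text above. Choosing 台 as the unified form, the table name, and the function name are assumptions, and the organizers' actual table is not shown here.

```python
# Hypothetical variant table: only the 臺/台 pair is taken from the challenge text,
# and mapping towards 台 is an assumption made for this illustration.
VARIANTS = str.maketrans({"臺": "台"})


def normalize(text):
    """Map each variant character to its unified form."""
    return text.translate(VARIANTS)


print(normalize("臺灣") == normalize("台灣"))  # True
```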
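File name: please name the file "affiliation+team name+participant" to avoid misidentification (earlier submissions that did not do this are fine; we will check the email address).
Answer format: ID answer (same as Kaldi: one column for the audio file ID, one column for the speech recognizer output)
Examples: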
Track1:
1 現在是晚上八點
2 今天是六月十九
3 你還不承認
Track2:
1 這馬是暗時八點
2 今仔日是六月十九
3 你閣毋承認
Track3:
1 tsit4 ma2 si7 am3 si5 peh4 tiam2
2 kin1 a2 git8 si7 lak8 gueh8 tsap8 kau2
3 li2 koh4 m7 sing5 jin7
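A file in this format can be read by splitting each line at the first run of whitespace, so that the ID is kept apart from an answer that itself contains spaces (as in Track 3). The sketch below is a minimal reader under that assumption; the function name is hypothetical, and a real file would presumably be opened with UTF-8 encoding.

```python
def parse_answers(lines):
    """Parse 'ID answer' lines into a dict mapping ID to answer text."""
    answers = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(maxsplit=1)
        file_id = parts[0]
        text = parts[1] if len(parts) > 1 else ""  # empty recognizer output
        if file_id in answers:
            raise ValueError(f"line {line_no}: duplicate ID {file_id!r}")
        answers[file_id] = text
    return answers


sample = """\
1 tsit4 ma2 si7 am3 si5 peh4 tiam2
2 kin1 a2 git8 si7 lak8 gueh8 tsap8 kau2
3 li2 koh4 m7 sing5 jin7
"""
answers = parse_answers(sample.splitlines())
print(answers["3"])  # li2 koh4 m7 sing5 jin7
```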
This challenge is based on the "TAT-Vol1" corpus.
"TAT-Vol1" consists of about 100 speakers recruited across Taiwan, totaling about 50 hours (Training + Eval + Test sets).
This data is released here for FREE under a Non-Commercial Use Only license. Please read and accept the License.
Baseline Scripts: Kaldi-based baseline recipes are provided on GitHub so that students can develop their own systems easily and quickly. --> https://github.com/t108368084/Taiwanese-Speech-Recognition-Recipe
2020/06/01 --- Registration Open
2020/07/01 --- Training Data Release
2020/09/01 --- Pilot-Test (dry-run only) Data Release
2020/09/21 --- Pilot-Test (dry-run only) Result Submission
2020/09/30 --- Pilot-Test (dry-run only) Performance Notification
2020/12/01 --- Registration Close
2021/01/01 --- Final-Test Data Release
2021/01/08 --- Final-Test Result Submission
2021/01/15 --- Final-Test Performance Notification (released)
2021/02/28 --- Paper Submission (rescheduled from 2021/01/22)
2021/03/19 --- Result/Award Announcement and Workshop (T.B.D.) (rescheduled from 2021/01/31)
PS: Pilot-Test (dry-run) is only used to make sure everything for the final-test is fine, not for scoring!
Yuan-Fu Liao (廖元甫)
Associate Professor, Department of Electronic Engineering, National Taipei University of Technology