Formosa Speech Recognition Challenge 2020 - Taiwanese ASR

Call-for-Participants

Formosa Speech Recognition Challenge 2020 (FSR-2020) is the second event of the Formosa Speech in the Wild (FSW) project, which is organized by National Taipei University of Technology (NTUT).

Taiwanese (a.k.a. Taiwanese Hokkien, Hoklo, Taigi, Southern Min or Min-Nan) is a language spoken natively by about 70% of the population of Taiwan. Although the number of Taiwanese speakers continues to drop, especially among the younger generations, it is not yet too late to save this language. We therefore welcome participants from both academia and industry to FSR-2020. Students are especially welcome to compete for the Student Award.

Call for Participants - Formosa Speech Recognition Challenge 2020.pdf

Key Messages

  • The free Taiwanese Across Taiwan corpus "TAT-Vol1", collected across Taiwan in 2019.

  • NEWS

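  • If your account or corpus access is missing, or you still have any problems, please notify us as soon as possible so we can handle it!

  • The newly released training corpus is now available! The corpus contents are shown in the figure below. Please sign the new license agreement (files below) and send the signed scan to formosa@speech.ntut.edu.tw to obtain the account and password for downloading the newly released training corpus.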

智慧財產保護暨保密切結書.pdf
english-version智慧財產保護暨保密切結書.pdf
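  • Reference corpus usable for Track1: the YouTube channel 『民視戲劇館 Formosa TV Dramas』, including programs such as 浪濤沙, 風水世家, 幸福來了, 春花望露, and 車站人生. Please be mindful of intellectual property rights.

  • The reference answers for the Pilot-Test data have been released!

  • Please note: teams that do not submit results according to the competition rules will not receive the formal TAT corpus license, must delete the corpus completely and stop using it, and may later be named on a published blacklist!

  • If you have not received a speech.nchc.org.tw account (the email server seems to have some problems), or still cannot download the training corpus, please contact me as soon as possible!

  • The Pilot-Test data has been released! This test set is only for verifying the procedure; it is relatively easy and is not scored.

  • Please submit Pilot-Test results no later than 9/21.

  • The Final-Test will be much harder than the Pilot-Test. It will include data collected from media (in preparation; a separate license must be signed; to be announced later).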

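Final-test results (updated)

The Chinese and Taiwanese-text scores have been updated, and the new reference answers have also been updated on GitLab.

For the Chinese output, files in the following two categories were identified and excluded from scoring:

1. Files containing non-Taiwanese speech (identified by manually listening to the audio): 225 files

2. Files whose text contains digits or English: 177 files

In total, 390 files are excluded from scoring (some files fall into both categories).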


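For the Taiwanese-text output, files in the following two categories were identified and excluded from scoring:

The Taiwanese-text check was carried out with the help of an assistant who graduated from a Taiwanese language department.

1. Files containing non-Taiwanese speech (identified by manually listening to the audio): 631 files

2. Files whose text contains digits, English, or non-Taiwanese text: 482 files

In total, 1,113 files are excluded from scoring.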
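The relation between the per-category counts and the totals can be checked with inclusion-exclusion. The sketch below assumes each stated total is the size of the union of the two categories; the helper name `overlap` is hypothetical and only used for illustration.

```python
def overlap(count_a, count_b, total_excluded):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return count_a + count_b - total_excluded


# Chinese output: 225 non-Taiwanese-speech files, 177 digit/English files, 390 excluded
chinese_overlap = overlap(225, 177, 390)

# Taiwanese-text output: 631 and 482 files, 1,113 excluded
taiwanese_overlap = overlap(631, 482, 1113)

print(chinese_overlap, taiwanese_overlap)
```

Under that assumption, 12 Chinese-output files fall into both categories. For the Taiwanese-text output the total equals the plain sum, so either no file falls into both categories or the total was not deduplicated.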


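Meaning of the IDs below: the letter is the team code, the first digit is the Track, and the second digit is the submission.

For the scoring method, see the PILOT-TEST description.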

Pilot-test results

(BETA version)

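The current results are a beta version; if you find any problems, please contact us as soon as possible!

The current results are computed as CER (character error rate).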
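As an illustration of the metric, CER is commonly defined as the Levenshtein (edit) distance between the reference and hypothesis character sequences, divided by the reference length. The sketch below follows that common definition; the organizers' actual scoring script is not shown here, and the removal of spaces before scoring is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref_text, hyp_text):
    """Character error rate: character-level edit distance / reference length."""
    # Assumption: whitespace is not scored, so it is removed first.
    ref_chars = list(ref_text.replace(" ", ""))
    hyp_chars = list(hyp_text.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One substituted character in a 7-character reference
print(cer("現在是晚上八點", "現在是早上八點"))
```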

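  • Track3 results are split into two parts for reference:

Tai-lo numeric-tone romanization (tones considered) & Tai-lo numeric-tone romanization (tones ignored)

Results are computed as SER (syllable error rate).

  • Track2 (Taiwanese text) results are computed directly as CER (character error rate).

  • Track1 (Chinese) results are given in two ways: direct CER (character error rate) and a translation score computed with BLEU.

The BLEU variant used is Google-BLEU (GLEU):
https://colab.research.google.com/github/gcunhase/NLPMetrics/blob/master/notebooks/gleu.ipynb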
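For readers unfamiliar with Google-BLEU (GLEU), a sentence-level score is usually computed as the minimum of n-gram precision and n-gram recall over all 1- to 4-grams. The following is a minimal standalone sketch of that definition, not the linked notebook's code; the function names are hypothetical, and tokenizing Chinese text by character is an assumption.

```python
from collections import Counter


def ngram_counts(tokens, max_n=4):
    """Count every n-gram of length 1..max_n in a token sequence."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def simple_gleu(ref_tokens, hyp_tokens, max_n=4):
    """Sentence-level GLEU: min(n-gram precision, n-gram recall)."""
    ref = ngram_counts(ref_tokens, max_n)
    hyp = ngram_counts(hyp_tokens, max_n)
    matches = sum((ref & hyp).values())  # clipped n-gram matches
    total_ref = sum(ref.values())
    total_hyp = sum(hyp.values())
    if total_ref == 0 or total_hyp == 0:
        return 0.0
    precision = matches / total_hyp
    recall = matches / total_ref
    return min(precision, recall)


# Assumption: one token per character
ref = list("現在是晚上八點")
hyp = list("現在是早上八點")
print(simple_gleu(ref, hyp))
```

Unlike CER, this score rewards overlapping word sequences, so a translation that differs from the reference by a synonym still receives partial credit.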

TRACKs

  • Build an automatic Taiwanese speech recognition (ASR) system that outputs one of the following (choose at least one Track):

  1. Traditional Chinese characters (繁體中文字), i.e., Taiwanese Speech to Chinese Characters (translation)

  2. Taiwanese Southern Min Recommended Characters by the Ministry of Education of Taiwan (Taiwanese Han characters, following the Ministry of Education's 臺灣閩南語推薦用字, with Han characters preferred)

  3. Taiwan Minnanyu Luomazi Pinyin (the numeric-tone form 『台羅拼音數字調』 of the Ministry of Education's 臺灣閩南語羅馬字拼音方案, using citation tones)

For example:

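  1. Track1 - 現在是晚上八點 (variant characters such as 臺 & 台 will be converted by us to a unified form before scoring; in addition to CER, a translation score is computed)

  2. Track2 - 這馬是暗時八點 (everything except loanwords is written in Han characters; variant characters are also normalized first)

  3. Track3 - tsit4 ma2 si7 am3 si5 peh4 tiam2 (citation tones)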
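Because Track3 output is a sequence of syllables that each end in a tone digit, a syllable error rate can be computed either with or without the tones. The sketch below assumes space-separated syllables and a trailing tone digit, as in the example above; the helper names are hypothetical and this is not the organizers' scoring script.

```python
import re


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def strip_tones(syllables):
    """Drop the trailing tone digit of each syllable, e.g. 'tsit4' -> 'tsit'."""
    return [re.sub(r"\d+$", "", s) for s in syllables]


def ser(ref_line, hyp_line, use_tones=True):
    """Syllable error rate: syllable-level edit distance / reference length."""
    ref = ref_line.split()
    hyp = hyp_line.split()
    if not use_tones:
        ref, hyp = strip_tones(ref), strip_tones(hyp)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


ref = "tsit4 ma2 si7 am3 si5 peh4 tiam2"
hyp = "tsit4 ma2 si7 am3 si7 peh4 tiam2"  # fifth syllable has the wrong tone only
print(ser(ref, hyp, use_tones=True))
print(ser(ref, hyp, use_tones=False))
```

A hypothesis that gets every syllable right but one tone wrong is penalized only in the tone-sensitive score.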

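Submission format:

File name: please use "affiliation + team name + participant" as the file name to avoid misidentification (earlier submissions without this are fine; we will check the email address).

Answer format: ID answer (same as Kaldi: one column for the audio file ID, one column for the recognizer output)

Examples: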

  • Track1:

1 現在是晚上八點

2 今天是六月十九

3 你還不承認

  • Track2:

1 這馬是暗時八點

2 今仔日是六月十九

3 你閣毋承認

  • Track3:

1 tsit4 ma2 si7 am3 si5 peh4 tiam2

2 kin1 a2 git8 si7 lak8 gueh8 tsap8 kau2

3 li2 koh4 m7 sing5 jin7
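A file in this "ID answer" layout can be read by splitting each line at the first run of whitespace. The sketch below is one way to do that; the function name `parse_submission` is hypothetical, and skipping blank lines and rejecting duplicate IDs are assumptions rather than stated rules.

```python
def parse_submission(text):
    """Parse 'ID answer' lines into a dict mapping audio ID -> recognizer output."""
    results = {}
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(maxsplit=1)  # split only at the first whitespace run
        utt_id = parts[0]
        answer = parts[1] if len(parts) > 1 else ""  # empty output is kept as ""
        if utt_id in results:
            raise ValueError(f"duplicate ID {utt_id!r} on line {line_no}")
        results[utt_id] = answer
    return results


sample = """1 tsit4 ma2 si7 am3 si5 peh4 tiam2

2 kin1 a2 git8 si7 lak8 gueh8 tsap8 kau2

3 li2 koh4 m7 sing5 jin7
"""
submission = parse_submission(sample)
print(submission["3"])
```

Splitting only once keeps the spaces inside a Track3 answer intact, while Track1 and Track2 answers without internal spaces parse the same way.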


Database

  • This challenge is based on the "TAT-Vol1" corpus.

  • "TAT-Vol1" consists of about 100 speakers recruited across Taiwan, about 50 hours in total (Training + Eval + Test sets).

  • This data is released here for FREE under a Non-Commercial Use Only license. Please read and accept the License.

  • Baseline Scripts: Kaldi-based baseline recipes are provided on GitHub so that students can develop their own systems easily and quickly. --> https://github.com/t108368084/Taiwanese-Speech-Recognition-Recipe

Important Dates

  • 2020/06/01 --- Registration Open

  • 2020/07/01 --- Training Data Release

  • 2020/09/01 --- Pilot-Test (dry-run only) Data Release

  • 2020/09/21 --- Pilot-Test (dry-run only) Result Submission

  • 2020/09/30 --- Pilot-Test (dry-run only) Performance Notification

  • 2020/12/01 --- Registration Close

  • 2021/01/01 --- Final-Test Data Release

  • 2021/01/08 --- Final-Test Result Submission

  • 2021/01/15 --- Final-Test Performance Notification (released)

  • 2021/02/28 --- Paper Submission (originally 2021/01/22)

  • 2021/03/19 --- Result/Award Announcement and Workshop (T.B.D.) (originally 2021/01/31)

PS: Pilot-Test (dry-run) is only used to make sure everything for the final-test is fine, not for scoring!

Contact