for South Africa's eleven official languages - subsidiary resources
The LWAZI speech corpus contains telephone speech from approximately 200 speakers per language, in each of the eleven official languages of South Africa.
The main purpose of this site is to serve as a repository for the data partition lists and pronunciation dictionaries that were used in the experiments reported in the updated corpus paper: Charl van Heerden, Neil Kleynhans, Marelie Davel, “Improving the Lwazi ASR baseline,” accepted for publication in Proc. INTERSPEECH, San Francisco, USA, Sept. 2016.
The LWAZI corpus can be obtained from the RMA at http://rma.nwu.ac.za/.
-------------------------------------------------------
md5sum gzip tar archive
-------------------------------------------------------
a1c0ee1a3c22fe85f721c5810bb33341 asr.lwazi.afr.1.0.zip
8eb6e313f34b716cf3d6772a4757a947 asr.lwazi.eng.1.0.zip
daef774901b84dafe69b61c6ca227f3c asr.lwazi.nbl.1.0.zip
706db8facc89900551c695cd9a574d09 asr.lwazi.nso.1.0.zip
6b13668a4810fda2b749b090ac4be3e2 asr.lwazi.sot.1.0.zip
46020f9e3d5a9df4ebe6d92791322e5c asr.lwazi.ssw.1.0.zip
ffd794de1b38a63d0775d95b264be8c8 asr.lwazi.tsn.1.0.zip
655d3a7f968d7ae07a1f0a2ae71aed2e asr.lwazi.tso.1.0.zip
2b39c1947907b56777f144fb9cefa4c4 asr.lwazi.ven.1.0.zip
d323f972da751ece86b45f2a9a6dae49 asr.lwazi.xho.1.0.zip
2128b450e5cf8de8f79a1acd1f27af88 asr.lwazi.zul.1.0.zip
-------------------------------------------------------
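One way to use the checksums listed above is to compare them against the MD5 digests of the downloaded archives. The following is a minimal sketch in Python, assuming the archives sit in a single local directory under the file names shown in the table; the function names `md5_of_file` and `verify_archives` are illustrative and not part of any Lwazi tooling.

```python
import hashlib
from pathlib import Path

# Checksums copied from the table above, keyed by archive file name.
EXPECTED_MD5 = {
    "asr.lwazi.afr.1.0.zip": "a1c0ee1a3c22fe85f721c5810bb33341",
    "asr.lwazi.eng.1.0.zip": "8eb6e313f34b716cf3d6772a4757a947",
    "asr.lwazi.nbl.1.0.zip": "daef774901b84dafe69b61c6ca227f3c",
    "asr.lwazi.nso.1.0.zip": "706db8facc89900551c695cd9a574d09",
    "asr.lwazi.sot.1.0.zip": "6b13668a4810fda2b749b090ac4be3e2",
    "asr.lwazi.ssw.1.0.zip": "46020f9e3d5a9df4ebe6d92791322e5c",
    "asr.lwazi.tsn.1.0.zip": "ffd794de1b38a63d0775d95b264be8c8",
    "asr.lwazi.tso.1.0.zip": "655d3a7f968d7ae07a1f0a2ae71aed2e",
    "asr.lwazi.ven.1.0.zip": "2b39c1947907b56777f144fb9cefa4c4",
    "asr.lwazi.xho.1.0.zip": "d323f972da751ece86b45f2a9a6dae49",
    "asr.lwazi.zul.1.0.zip": "2128b450e5cf8de8f79a1acd1f27af88",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archives(directory, expected=EXPECTED_MD5):
    """Map each expected archive name to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == checksum:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Calling `verify_archives("downloads")` (with `downloads` standing in for wherever the archives were saved) returns a dictionary with one status per archive; any entry other than `"ok"` suggests the file should be downloaded again.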
The dictionaries made available on this site are derived works of the "NCHLT-inlang Pronunciation Dictionaries" by the Meraka Institute, CSIR and the North-West University, available from the RMA and released under a Creative Commons Attribution 3.0 Unported License (CC BY 3.0). When using these dictionaries, please cite the following papers:
To access the lists and dictionaries, click on any of the languages below (to download the lists and dictionaries for all of the languages, click here):
Below are the current best results we've been able to obtain on the Lwazi corpus.
Test set
Dev set