Publications

Gender Identification in Sepedi Speech Corpus

August 2021. 2021 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems (icABCD), Durban, South Africa

Gender identification is the task of identifying the gender of a speaker from the audio signal. Most gender identification systems are developed using datasets from well-resourced languages, and there has been little focus on building such systems for under-resourced African languages. This paper presents the development of a gender identification system using a Sepedi speech dataset with a total duration of 55.7 hours, comprising 30,776 male and 28,337 female samples. We build gender identification systems using three models: a multilayer perceptron (MLP), a convolutional neural network (CNN), and a long short-term memory (LSTM) network. Mid-term features are computed from time-domain, frequency-domain and cepstral-domain features, and normalised using the Z-score normalisation technique. XGBoost is used as a feature selection method to select the important features. On data with seen speakers, the MLP achieved an F-score and accuracy of 94%, while the LSTM and CNN each achieved an F-score and accuracy of 97%. We further evaluated the models on data with unseen speakers, where all the models achieved good F-scores and accuracies.
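The Z-score normalisation step described in the abstract can be sketched in a few lines; the feature matrix below is synthetic and stands in for the mid-term feature vectors extracted from the Sepedi recordings.

```python
import numpy as np

# Synthetic stand-in for a matrix of mid-term feature vectors
# (rows = segments, columns = features).
rng = np.random.default_rng(0)
features = rng.normal(loc=5.0, scale=2.0, size=(100, 34))

# Z-score normalisation: subtract the per-feature mean and divide by the
# per-feature standard deviation, so every feature has zero mean and
# unit variance before it reaches the classifier.
mu = features.mean(axis=0)
sigma = features.std(axis=0)
normalised = (features - mu) / sigma
```

After this step each column of `normalised` has mean approximately 0 and standard deviation approximately 1, which keeps large-magnitude features from dominating distance-based or gradient-based learners.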

A Cross-platform Interface for Automatic Speaker Identification and Verification  

August 2021. 4th International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems (icABCD), Durban, KwaZulu-Natal, South Africa

Automatic speaker recognition, the task of automatically identifying and/or verifying the identity of a speaker from a recording of a speech sample, has been studied for many years. Speaker recognition technologies have improved in recent years and are becoming inexpensive and reliable methods for identifying and verifying people. Although automatic speaker recognition research has now spanned over 50 years, little of it has addressed low-resourced South African indigenous languages. In this paper, a multilayer perceptron (MLP) classifier model is trained and deployed behind a graphical user interface for real-time identification and verification of native Sepedi speakers. Sepedi is a low-resourced language spoken by the majority of residents of the Limpopo province of South Africa. The data used to train the speaker recognition system is obtained from the NCHLT (National Centre for Human Language Technology) project. A total of 34 short-term acoustic features of speech are extracted with the pyAudioAnalysis library, and scikit-learn is used to train the MLP classifier model, which performs well with an accuracy of 95%. The GUI is developed with Qt Creator and PyQt4 and achieves a true acceptance rate (TAR) of 66.67% and a true rejection rate (TRR) of 13.33%.

Practical Approach on Implementation of WordNets for South African Languages  

January 2021. Proceedings of the 11th Global Wordnet Conference, University of South Africa (UNISA)

This paper proposes the implementation of WordNets for five South African languages, namely Sepedi, Setswana, Tshivenda, isiZulu and isiXhosa, to be added to the Open Multilingual Wordnet (OMW) in the Natural Language Toolkit (NLTK). The African WordNets are converted from Princeton WordNet (PWN) 2.0 to 3.0 to match the synsets in PWN 3.0. After conversion, there were 7157, 11972, 1288, 6380, and 9460 lemmas for Sepedi, Setswana, Tshivenda, isiZulu, and isiXhosa respectively. Setswana, isiXhosa, and Sepedi contain more lemmas than 8 of the languages in OMW, and isiZulu contains more lemmas than 7 of the languages in OMW. A library has been published for the continuous development of the African WordNets in OMW using NLTK.
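The PWN 2.0-to-3.0 conversion amounts to rekeying each language's lemma entries by a synset-offset mapping table; the sketch below illustrates the idea with invented offsets and lemmas (the real mapping files and lemma lists are far larger).

```python
# Hypothetical PWN 2.0 -> 3.0 synset offset mapping (invented pairs).
pwn20_to_30 = {
    "00001740-n": "00001930-n",
    "00002056-n": "00002312-n",
}

# Hypothetical African-language lemma entries keyed by PWN 2.0 offset.
sepedi_lemmas_20 = {
    "00001740-n": ["selo"],
    "00002056-n": ["mmele"],
}

# Rekey every entry whose PWN 2.0 offset has a PWN 3.0 counterpart;
# entries without a counterpart are dropped, which is why lemma counts
# are reported after conversion.
sepedi_lemmas_30 = {
    pwn20_to_30[off]: lemmas
    for off, lemmas in sepedi_lemmas_20.items()
    if off in pwn20_to_30
}
```

Once rekeyed to PWN 3.0 offsets, the entries line up with the synsets that OMW and NLTK expose.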

Emotional Speaker Recognition based on Machine and Deep Learning 

November 2020. 2nd International Multidisciplinary Information Technology and Engineering Conference (IMITEC), Kimberley, South Africa

Speaker recognition is the task of recognising a speaker from the characteristics of their voice, and speaker recognition technologies have been widely used in many domains. Most speaker recognition systems are trained on clean, neutral recordings; however, their performance tends to degrade when recognising emotional speech. This paper presents an emotional speaker recognition system trained with machine learning and deep learning algorithms on time-domain, frequency-domain and spectral features extracted from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). We trained and compared five machine learning models (Logistic Regression, Support Vector Machine, Random Forest, XGBoost, and k-Nearest Neighbours) and three deep learning models (Long Short-Term Memory network, Multilayer Perceptron, and Convolutional Neural Network). On evaluation, the deep learning models outperformed the machine learning models, attaining the highest accuracy of 92% and surpassing the state-of-the-art models in emotional speaker detection from speech signals.
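The model-comparison protocol can be sketched as a loop over classifiers scored with cross-validation; the snippet below uses a subset of the classifiers named above and synthetic features standing in for the RAVDESS features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic feature matrix and labels standing in for emotional-speech data.
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Score each candidate model with 5-fold cross-validation.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
best = max(scores, key=scores.get)
```

The same loop extends naturally to the remaining classifiers in the paper's comparison; only the `models` dictionary changes.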

The Effects of Acoustic Features of Speech for Automatic Speaker Recognition

August 2020. 3rd International Conference on Advances in Big Data, Computing and Data Communication Systems (icABCD) 2020, Durban, KwaZulu-Natal, South Africa

Automatic speaker recognition, the task of automatically determining or verifying the identity of a speaker from a recording of his or her speech, has been studied for many decades. One of the steps that most significantly influences speaker recognition performance is feature extraction. Acoustic features of speech have been widely researched around the world; however, there is limited research on African indigenous languages, and on South African official languages in particular. This paper presents the effects of acoustic features of speech on the performance of speaker recognition systems, focusing on South African low-resourced languages. The study investigates these features using the National Centre for Human Language Technology (NCHLT) Sepedi speech data. Time-domain, frequency-domain and cepstral-domain features are evaluated on four machine learning algorithms: K-Nearest Neighbours (K-NN), two kernel-based Support Vector Machines (SVM), and a Multilayer Perceptron (MLP). The results show that performance is poor for time-domain features, good for frequency-domain features, and even better for cepstral-domain features, while the combination of all three feature sets yields the highest accuracy and F1 score of 98%.
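Two of the short-term time-domain features discussed above can be computed in a few lines; the signal here is a synthetic 440 Hz tone rather than real speech, so the numbers are only illustrative.

```python
import numpy as np

# Synthetic 1-second, 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
signal = 0.5 * np.sin(2 * np.pi * 440 * t)

# Split into non-overlapping 25 ms frames (400 samples at 16 kHz).
frame_len = 400
frames = signal[: len(signal) // frame_len * frame_len].reshape(-1, frame_len)

# Zero-crossing rate: fraction of adjacent sample pairs that change sign.
zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)

# Short-term energy: mean squared amplitude per frame.
energy = (frames ** 2).mean(axis=1)
```

For a pure 0.5-amplitude sine the per-frame energy is 0.25 x 0.5 = 0.125, and the zero-crossing rate tracks the tone's frequency (about two crossings per period); on speech these features vary frame to frame and feed the classifiers as part of the feature vector.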

Effects of Language Modelling for Sepedi-English Code-Switched Speech in Automatic Speech Recognition System

August 2020. 3rd International Conference on Advances in Big Data, Computing and Data Communication Systems (icABCD) 2020, Durban, KwaZulu-Natal, South Africa

Speech is the primary means of communication among people, and spoken dialogue systems provide a means for people to interact with computer systems. Automatic speech recognition (ASR) itself forms part of a spoken dialogue system. Such systems perform well for European languages, but more challenges are encountered in recognising South African languages. In this study, we investigate appropriate approaches to developing language models for the recognition of Sepedi-English code-switched speech, and their effect on ASR. The SRI Language Modeling (SRILM) toolkit was used to develop the language models (LMs), and the Kaldi speech recognition toolkit was used to build the ASR system and to evaluate the effects of the smoothing techniques. Four smoothing techniques were evaluated, namely Good-Turing (GT), Witten-Bell (WB), Modified Kneser-Ney (MKN), and Laplace (LP) smoothing. The Witten-Bell smoothing technique was found to outperform the other three on the Sepedi-English code-switched data, both in language modelling and in our ASR system.
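As a sketch of the technique the paper found to work best, interpolated Witten-Bell smoothing for a bigram model can be written directly from its definition: a history h reserves probability mass in proportion to T(h), the number of distinct word types ever seen after h, so P(w|h) = (c(h,w) + T(h)P(w)) / (c(h) + T(h)). The toy corpus below is illustrative, not the paper's data.

```python
from collections import Counter

# Toy corpus (invented short Sepedi-like text).
corpus = "ke a leboga ke a tseba ke tla boa".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())
vocab = set(corpus)

def p_unigram(w):
    return unigrams[w] / total

def p_wb(w, h):
    """Interpolated Witten-Bell bigram probability (h must be a seen history)."""
    c_h = sum(c for (h1, _), c in bigrams.items() if h1 == h)   # c(h)
    t_h = len({w2 for (h1, w2) in bigrams if h1 == h})          # T(h)
    return (bigrams[(h, w)] + t_h * p_unigram(w)) / (c_h + t_h)
```

Because the unigram distribution sums to one over the vocabulary, the smoothed bigram distribution for any seen history also sums to one, which is the sanity check SRILM effectively performs internally.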

Low resource language dataset creation, curation and classification: Setswana and Sepedi -- Extended Abstract 

March 2020. License: CC BY-SA 4.0

The recent advances in Natural Language Processing have largely benefited well-represented languages, leaving lesser-known global languages behind. This is in part due to the availability of curated data and research resources. One of the current challenges concerning low-resourced languages is the lack of clear guidelines on the collection, curation and preparation of datasets for different use-cases. In this work, we take on the task of creating two datasets of news headlines (i.e. short text) for Setswana and Sepedi, and of defining a news topic classification task over these datasets. We document our work, propose baselines for classification, and investigate an approach to data augmentation better suited to low-resourced languages in order to improve the performance of the classifiers.

Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi  

March 2020. License: CC BY-SA 4.0

The recent advances in Natural Language Processing have been a boon for well-represented languages in terms of available curated data and research resources. One of the challenges for low-resourced languages is the lack of clear guidelines on the collection, curation and preparation of datasets for different use-cases. In this work, we take on the task of creating two datasets of news headlines (i.e. short text) for Setswana and Sepedi, and of defining a news topic classification task. We document our work and also present baselines for classification. We further investigate an approach to data augmentation, better suited to low-resourced languages, to improve the performance of the classifiers.
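A typical short-text classification baseline of the kind proposed above is a TF-IDF pipeline with a linear classifier; the headlines and labels below are invented English placeholders, not drawn from the Setswana/Sepedi datasets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented placeholder headlines and topic labels.
headlines = [
    "team wins league title",
    "striker scores twice in final",
    "parliament debates new bill",
    "president addresses the nation",
]
labels = ["sports", "sports", "politics", "politics"]

# TF-IDF over unigrams and bigrams feeding a logistic-regression classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(headlines, labels)
prediction = model.predict(["coach praises the team"])[0]
```

The same pipeline works unchanged on headlines in any language, which is what makes it a convenient baseline for low-resourced settings.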

Automatic Speaker Recognition System based on Optimised Machine Learning Algorithms

September 2019. IEEE AFRICON 2019, Accra, Ghana

Speaker recognition is a technique that automatically identifies a speaker from a recording of their voice. Speaker recognition technologies are taking a new direction due to progress in artificial intelligence and machine learning, and have been widely used in many domains. Research in the field of speaker recognition has now spanned over 50 years, and in that time a great deal of progress has been made towards improving the accuracy of a system's decisions through more successful machine learning algorithms. This paper presents the development of an automatic speaker recognition system based on optimised machine learning algorithms, where the algorithms are tuned for better and improved performance. Five classifier models, namely Support Vector Machines, K-Nearest Neighbours, Random Forest, Logistic Regression, and Artificial Neural Networks, are trained and compared. The Artificial Neural Network obtained the best accuracy of 96%, outperforming the KNN, SVM, RF and LR classifiers.

Grammar-driven Text-to-speech Application for Articulation of Mathematical Expressions

September 2019. Southern Africa Telecommunication Networks and Applications Conference (SATNAC), Fairmont Zimbali Resort, Ballito, KwaZulu-Natal, South Africa

Natural Language Processing (NLP) is one of the fundamental components of speech synthesis, and a language grammar is one of the key requirements for NLP tasks. One of the major requirements in speech synthesis is the correctness of the grammar analysis. Grammar-based applications tend to be effective when embedded within text-to-speech (TTS) synthesis systems, which assist with correct word spelling and intonation. Spoken language plays a vital role in the educational journey of children, as their brains are naturally wired to speak but not to read and write. This paper presents the development of a grammar-driven TTS application for the reading of mathematical expressions in the Sepedi language. The application's front-end component parses mathematical expression text inputs before a TTS synthesis system processes them to produce the correct articulation of the expression. Acceptable performance is observed when the application is evaluated using word error rate for intelligibility, and subjective mean opinion scores for pronunciation, naturalness, pleasantness, understandability, and overall system impression. The application achieved an accuracy of 84.85%.
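The front-end parsing step, turning an expression into the word sequence a TTS system then synthesises, can be sketched as a simple token-to-word mapping. The sketch below is hypothetical and uses English words for readability; the real application maps tokens to Sepedi words through its grammar.

```python
# Hypothetical token-to-word tables (English stand-ins for the Sepedi grammar).
NUMBERS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
           "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
OPERATORS = {"+": "plus", "-": "minus", "*": "times", "/": "divided by",
             "=": "equals"}

def verbalise(expression: str) -> str:
    """Map each token of a flat, space-separated arithmetic expression to
    its spoken form (multi-digit numbers are read digit by digit here)."""
    words = []
    for token in expression.split():
        if token in OPERATORS:
            words.append(OPERATORS[token])
        else:
            words.append(" ".join(NUMBERS[d] for d in token))
    return " ".join(words)
```

For example, `verbalise("3 + 4 = 7")` yields "three plus four equals seven", the string a TTS back end would then articulate.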

The Effect of Data Size on Text-Independent Automatic Speaker Identification System

August 2019. 2nd International Conference on Advances in Big Data, Computing and Data Communication Systems (icABCD) 2019, Drakensberg Sun Resort, Winterton, KwaZulu-Natal, South Africa

Speaker recognition is a technique that automatically identifies a speaker from a recording of their speech utterance. Speaker recognition technologies are taking a new direction due to progress in artificial intelligence and have been widely used in many domains. Research in the field has now spanned decades and has shown fruitful results, but little work has been done for African indigenous languages with limited data resources. This paper presents how data size impacts the accuracy of automatic speaker recognition models under limited-data and larger-data settings, focusing on South African under-resourced languages. The data is acquired from the South African Centre for Digital Language Resources. Four learning models, namely Support Vector Machines (SVM), K-Nearest Neighbours, Multilayer Perceptrons and Logistic Regression (LR), are trained under four data-size settings. LR performed better than the other models with the highest accuracy of 91%, while SVM obtained the largest gain, 4% in accuracy, as the data size increased exponentially.
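The data-size experiment can be sketched as training the same model on exponentially growing subsets and recording test accuracy at each size; the features below are synthetic stand-ins for the NCHLT-derived speaker features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, linearly separable stand-in for the speaker features.
rng = np.random.default_rng(3)
n, d = 1600, 20
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) > 0).astype(int)

# Fixed held-out test set, disjoint from all training subsets.
X_test, y_test = X[-400:], y[-400:]

# Four exponentially growing data-size settings, one model fit per size.
accuracies = {}
for size in (100, 200, 400, 800):
    model = LogisticRegression(max_iter=1000).fit(X[:size], y[:size])
    accuracies[size] = model.score(X_test, y_test)
```

Keeping the test set fixed across the four settings is what makes the accuracy curve attributable to training-set size alone.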

HMM-based Speech Synthesis System incorporated with Language Identification for Low-resourced Languages

August 2019. 2nd International Conference on Advances in Big Data, Computing and Data Communication Systems (icABCD) 2019, Drakensberg Sun Resort, Winterton, KwaZulu-Natal, South Africa

Text-to-speech (TTS) synthesis systems are of benefit in learning new or foreign languages. These systems are currently available for various major languages but not for low-resourced languages, and their scarcity may make learning low-resourced languages in particular more difficult. Language-specific systems such as TTS and language identification (LID) have an important role in mitigating the historical linguistic effects of discrimination and domination imposed on low-resourced indigenous languages. This paper presents the development of a multi-language LID+TTS synthesis system that generates audio for input text in the predicted language, covering four South African languages: Tshivenda, Sepedi, Xitsonga and isiNdebele. On the front end, an LID module detects the language of the input text before the TTS synthesis module produces the output audio. The LID module, trained on a dataset of 4 million words, achieved 99% accuracy, outperforming state-of-the-art systems. The hidden Markov model method, a robust approach to building TTS voices, is used to build new voices in the selected languages. The quality of the voices is measured using the mean opinion score and word error rate metrics, with positive results for the understandability, naturalness, pleasantness, intelligibility and overall impression of the newly created TTS voices. The system is available as a website service.
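A common way to build a text-based LID front end like the one described is a character n-gram profile per language; the sketch below is hypothetical, with two invented one-line training samples rather than the 4-million-word dataset used in the paper.

```python
from collections import Counter

# Invented one-line training samples (placeholders, not the real dataset).
samples = {
    "sepedi": "ke a leboga re a go amogela",
    "tshivenda": "ndi a livhuwa ri a ni tanganedza",
}

def trigrams(text):
    """Character trigram counts, with padding so word edges count too."""
    text = f"  {text}  "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# One trigram profile per language.
profiles = {lang: trigrams(text) for lang, text in samples.items()}

def identify(text):
    """Predict the language whose profile shares the most trigram mass."""
    grams = trigrams(text)
    return max(profiles, key=lambda lang: sum((grams & profiles[lang]).values()))
```

In the full system the predicted language simply selects which TTS voice synthesises the input text.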

Automatic Speaker Recognition System based on Machine Learning Algorithms

January 2019. SAUPEC/RobMech/PRASA 2019 Conference, Bloemfontein, South Africa

Speaker recognition is a technique that automatically identifies a speaker from a recording of their speech utterance. Speaker recognition technologies are taking a new direction due to progress in artificial intelligence and have been widely used in many domains. Research in the field has now spanned decades and has shown fruitful results, but little work has been done for African indigenous languages with limited data resources. This paper presents how data size impacts the accuracy of automatic speaker recognition models under limited-data and larger-data settings, focusing on South African under-resourced languages. The data is acquired from the South African Centre for Digital Language Resources. Four learning models, namely Support Vector Machines (SVM), K-Nearest Neighbours, Multilayer Perceptrons and Logistic Regression (LR), are trained under four data-size settings. LR performed better than the other models with the highest accuracy of 91%, while SVM obtained the largest gain, 4% in accuracy, as the data size increased exponentially.

Grammar-based Speech-enabled Application for Reading Mathematical Expressions

September 2018. Southern Africa Telecommunication Networks and Applications Conference (SATNAC), Arabella, Hermanus, Western Cape, South Africa

The grammar specification component forms an important part of language learning and language processing. Text-to-speech (TTS) synthesis is the conversion of raw text input into speech output. TTS systems embedded into grammar-based applications can help learners with the correct articulation of unfamiliar words and expressions encountered in their studies. In this research project, we propose to develop a grammar-based framework that will be embedded within an existing TTS synthesis system and will assist foundation- and intermediate-phase learners to cope with the reading of mathematical expressions in their first languages. The system strives to assist learners with the articulation and pronunciation of mathematical expressions in their first language. The use of grammar-based systems in the first language will also help parents and teachers who struggle with the English-based language of instruction currently used to engage school-going learners.

Development of a Text-Independent Speaker Recognition System for Biometric Access Control

September 2018. Southern Africa Telecommunication Networks and Applications Conference (SATNAC), Arabella, Hermanus, Western Cape, South Africa

Biometric recognition is the process of authenticating access by capturing, analysing and comparing behavioural and physiological characteristics of a human being, such as the face, iris, fingerprints, palm and voice. Voice is one biometric characteristic that is not yet broadly used for person recognition (identification or verification) compared to characteristics such as fingerprints and the face. Speaker recognition (or voice recognition) is the process of recognising a speaker from a given utterance by matching the voice biometrics of the utterance against models stored beforehand. Speaker recognition technologies are taking a new direction due to progress in artificial intelligence, have been widely used in many domains including security and banking, and have matured over recent years into a low-cost and reliable approach to person recognition. In this paper, we present the development of a biometric recognition system that automatically recognises people from their voices. Speech audio data was acquired, and features were extracted for all speakers and saved to a local text file. Support vector machines (SVMs) were used to train and test the model with 10-fold cross-validation. The polynomial and linear SVMs performed well, with accuracy and F-measure above 94%, outperforming the radial basis function SVM by 35% in both accuracy and F-measure.
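The evaluation protocol, comparing SVM kernels under 10-fold cross-validation, can be sketched as below; the speaker features are synthetic Gaussian clusters, so the scores are illustrative rather than the paper's results.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for per-utterance speaker features:
# one well-separated cluster per "speaker".
rng = np.random.default_rng(4)
n_speakers, per_speaker, n_features = 4, 30, 16
X = np.vstack([rng.normal(loc=2 * i, scale=1.0, size=(per_speaker, n_features))
               for i in range(n_speakers)])
y = np.repeat(np.arange(n_speakers), per_speaker)

# Mean 10-fold cross-validated accuracy per SVM kernel.
results = {
    kernel: cross_val_score(SVC(kernel=kernel), X, y, cv=10).mean()
    for kernel in ("linear", "poly", "rbf")
}
```

Cross-validation averages out the luck of any single train/test split, which matters when comparing kernels whose scores may be close.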

The Automatic Recognition of Sepedi Speech Emotions Based on Machine Learning Algorithms

August 2018. International Conference on Advances in Big Data, Computing and Data Communication Systems (icABCD), Durban, South Africa

This paper discusses the development of a speech emotion recognition (SER) system that classifies and recognises six basic emotions (anger, sadness, disgust, fear, happiness, and neutral) from speech spoken in the Sepedi language (one of South Africa's official languages). An emotional speech corpus was collected from Sepedi speakers and from a TV broadcast drama to train three standard machine learning (ML) algorithms (KNN, SVM and MLP) as well as Auto-WEKA. A total of 34 speech features were extracted from the corpus using the pyAudioAnalysis tool, and the WEKA software was used to run the experiments with 10-fold cross-validation. Auto-WEKA surpassed all the standard algorithms in accuracy.

Development of a speech-enabled basic arithmetic m-learning application for foundation phase learners

September 2017. IEEE AFRICON 2017, Cape Town, South Africa

In very simple terms, speech synthesis is the process of generating spoken language by machine on the basis of text input; text-to-speech is a specific type that takes raw text as input and aims to mimic the human process of reading aloud. Computer-assisted learning (CAL) can be defined as learning or teaching through the use of computers with packaged knowledge content learning materials, and involves a computer program or file developed specifically for educational purposes. Mobile learning, or "m-learning", is the ability to obtain or provide educational content on personal pocket devices such as PDAs, smartphones and mobile phones, and makes sense as an educational activity only when the technology in use facilitates and supports mobility in learning. In this paper, we discuss the development of a mathematical computer-assisted learning mobile application that integrates a text-to-speech synthesis module for South African low-resourced languages, initially targeting the Sepedi language. The system is aimed at assisting mathematically illiterate persons and foundation-phase learners to learn and understand the representation and articulation of mathematical expressions incorporating the four basic arithmetic operations (addition, subtraction, multiplication, and division), and it also incorporates a few numeracy functions. The results obtained from experiments conducted with the prototype CAL system show that 80% of the participants were impressed by the developed mobile application. There is a great need to enhance the development of software applications that support teaching and learning activities at the foundation phase of education in South Africa.

Speech-enabled Application for Foundation Phase Learners

September 2016. Southern Africa Telecommunication Networks and Applications Conference (SATNAC), Fancourt, George

Computer-assisted learning (CAL) can be defined as learning or teaching through the use of computers with packaged knowledge content learning materials. Speech synthesis (also known as text-to-speech synthesis) is a computer-based process of generating spoken language utterances from input text. In this paper, we propose the development of a CAL application for mathematical expressions that integrates text-to-speech synthesis for under-resourced official languages of South Africa. The application strives to assist with the articulation of mathematical expressions incorporating the four basic arithmetic operations, namely addition, subtraction, multiplication, and division. It aims to help mathematically illiterate persons and foundation-phase learners learn and understand the representation and articulation of such expressions in a preferred local home language.