Devices that rely on speech processing for human commands, including home-automation systems, laptops, smart televisions, and vehicles, are becoming increasingly popular. However, the majority of such products available in my country operate exclusively in English, whereas, for example, Chinese users effortlessly control their devices in their native language. This contrast piqued my curiosity and led me to explore speech processing for my own language. As an undergraduate researcher, I wanted to play a role in advancing the control of devices through the Bengali language, and this served as the primary motivation behind my decision to pursue research on Bengali speech recognition.
For English speech recognition, an abundance of datasets is readily available; for Bengali, however, datasets are notably scarce. Therefore, the initial phase of this research involved the creation of a dedicated dataset.
To facilitate recognition of the alphabet, seven distinguishable vowels and consonants were selected at random. The sound-recording functionality of a OnePlus 7 smartphone was used to capture the audio. A diverse pool of 42 speakers, spanning various age groups, geographical regions, and accents, participated in the recording process. To further enhance dataset diversity, some speakers recorded the sounds a second time in a different accent.
Here is an overview of the curated dataset: it features individuals from 10 distinct regions of Bangladesh, 24 males and 18 females in total.
Convolutional Neural Networks (CNNs) are conventionally tailored for the efficient processing of single-channel audio data. For stereo (dual-channel) audio, the additional channel seldom yields substantial benefits for speech recognition and instead adds computational overhead without a commensurate improvement in performance.
To optimize the recognition task for the envisaged neural-network architecture and to elicit distinct features from each audio file, a conversion from stereo to mono was performed using the Audacity software.
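For readers who prefer a scripted pipeline, the same stereo-to-mono step can also be done programmatically. The sketch below uses the soundfile package and hypothetical file names; it is only an alternative to the Audacity step described above, not the procedure actually followed.

```python
# Alternative, programmatic stereo-to-mono conversion (the original work used Audacity).
# File names are hypothetical placeholders.
import soundfile as sf

data, sr = sf.read("vowel_stereo.wav")   # stereo data has shape (samples, 2)
if data.ndim == 2:
    data = data.mean(axis=1)             # average the two channels into one
sf.write("vowel_mono.wav", data, sr)
```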
Following the preparation of the dataset, the next step involved the extraction of Mel-Frequency Cepstral Coefficient (MFCC) features, carried out in Python using the librosa package. After loading each mono file, the audio waveform was downsampled by selecting every fifth sample, a measure implemented to reduce data volume and speed up processing. The MFCCs were then computed from the downsampled waveform, with the sampling rate set to 16 kHz.
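A minimal sketch of this extraction step, assuming a hypothetical file name and the 20 coefficients mentioned later in the architecture section, might look as follows:

```python
# Sketch of the MFCC extraction described above; exact parameters may have differed.
import librosa

y, sr = librosa.load("vowel_mono.wav", sr=None)        # load the mono recording
y = y[::5]                                             # keep every fifth sample to reduce data volume
mfcc = librosa.feature.mfcc(y=y, sr=16000, n_mfcc=20)  # 20 coefficients per frame at 16 kHz
print(mfcc.shape)                                      # (20, number_of_frames)
```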
During the data preprocessing phase, it was observed that the audio files containing vowels and consonants exhibited varying time lengths, resulting in disparate lengths of the extracted MFCC vectors. However, to facilitate the incorporation of the dataset into the neural network, it was imperative that all input data assumed a uniform size. Consequently, zero values were appended in cases where the MFCC vector length was below 13. Conversely, for vectors surpassing the length of 13, the excess MFCCs were truncated, as they held diminished significance. The choice of the specific length '13' was not arbitrary; rather, it was informed by the prevalent observation that a majority of MFCC lengths clustered around this value. Following these preprocessing steps, the extracted MFCCs were saved as numpy array files.
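Continuing the sketch above, the padding and truncation step could be implemented roughly as follows (the output file name is again a placeholder):

```python
# Pad or truncate each MFCC matrix to 13 frames, then save it as a .npy file.
import numpy as np

TARGET_LEN = 13  # most MFCC lengths clustered around this value

def fix_length(mfcc, target_len=TARGET_LEN):
    n_frames = mfcc.shape[1]
    if n_frames < target_len:
        # pad with zeros on the right so every matrix becomes 20 x 13
        return np.pad(mfcc, ((0, 0), (0, target_len - n_frames)))
    # truncate excess frames, which carry less information here
    return mfcc[:, :target_len]

np.save("vowel_mfcc.npy", fix_length(mfcc))
```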
The proposed architecture of the Convolutional Neural Network (CNN) encompasses an input layer, an output layer, two Convolutional (Conv2D) layers, two corresponding MaxPooling (MaxPooling2D) layers, a Flatten layer, and a Dense layer. The network specifications are delineated as follows, with a code sketch of the full model after the list:
For each audio file, 20 Mel-Frequency Cepstral Coefficients (MFCCs) were extracted. Consequently, the input layer is configured with a shape of 20x13x1, where 20 is the number of MFCC coefficients per frame, 13 is the length of the MFCC vector, and 1 corresponds to the mono audio channel.
In the initial convolutional layer, 128 filters were employed with a kernel size of (2, 2) and activated using the hyperbolic tangent (tanh) activation function.
Each convolutional layer was followed by a MaxPooling layer with a pool size of (2, 2) to reduce the spatial dimensions.
The second convolutional layer featured 64 filters with the same kernel size and activation function.
To mitigate overfitting, a dropout layer with a rate of 0.4, serving as a regularization technique, was incorporated.
A Flatten layer was implemented to convert the 2D data into a 1D array containing 2880 elements.
A densely connected hidden layer, comprising 128 neurons and activated with the rectified linear unit (ReLU) activation function, was added.
After another dropout layer with the same rate, the output layer was introduced with 7 units and the softmax activation function, given the multiclass nature of the classification task.
In compiling the model, the categorical crossentropy loss function and the RMSprop optimizer were employed.
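A sketch of the full model under these specifications, assuming the Keras API, is given below. The exact layer ordering and padding of the original implementation are not known, so the flattened size produced here may differ from the 2880 elements reported above.

```python
# Keras sketch of the described CNN; a reconstruction, not the original code.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(20, 13, 1)),            # 20 MFCC coefficients x 13 frames, mono
    layers.Conv2D(128, (2, 2), activation="tanh"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(64, (2, 2), activation="tanh"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Dropout(0.4),                        # regularization against overfitting
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(7, activation="softmax"),      # 7 classes (the selected vowels and consonants)
])

model.compile(loss="categorical_crossentropy",
              optimizer="rmsprop",
              metrics=["accuracy"])
model.summary()
```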
K-fold cross-validation, with K set to 4, was conducted to assess the performance of the recognition system. The dataset was partitioned into four sections, with three sections used for model training and the remaining section used for validation. Examination of the accuracy chart revealed significantly higher accuracy in vowel recognition than in consonant recognition, an observation further substantiated by a confusion matrix.
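A sketch of this evaluation is shown below. It assumes a hypothetical build_model() helper returning the CNN sketched above, feature and label arrays X and y assembled from the saved .npy files, and illustrative training settings (epochs, batch size) that were not specified here.

```python
# 4-fold cross-validation over the prepared dataset; X has shape (samples, 20, 13, 1)
# and y holds one-hot labels for the 7 classes. build_model() is a hypothetical helper.
import numpy as np
from sklearn.model_selection import KFold

kfold = KFold(n_splits=4, shuffle=True, random_state=42)
fold_accuracies = []

for train_idx, val_idx in kfold.split(X):
    model = build_model()                                  # fresh model for each fold
    model.fit(X[train_idx], y[train_idx], epochs=50, batch_size=16, verbose=0)
    _, accuracy = model.evaluate(X[val_idx], y[val_idx], verbose=0)
    fold_accuracies.append(accuracy)

print("mean validation accuracy:", np.mean(fold_accuracies))
```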
For a comprehensive report, kindly reach out via email.