This blog details my work as a Machine Learning Intern at the Indian Institute of Technology (BHU).
Eminent professors at IIT BHU are creating corpora for low-resource languages such as Bhojpuri, Maithili, and Magahi in order to develop Computational Linguistics technologies like Machine Translation models.
The corpora were collected as bilingual text files, with sentences in one language and their corresponding translations in the other, the pairs being Hindi-Bhojpuri, Hindi-Magahi, and Hindi-Maithili. But because the corpora were compiled from several sources, they had some major ambiguities (a small cleaning sketch follows this list), such as:
Duplicate Sentences
Inconsistent pair structures (comma-separated, tab-separated, space-separated)
Incorrect sentence ordering
Unwanted whitespace and blank lines
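To make these ambiguities concrete, here is a minimal cleaning sketch; the helper name clean_raw_corpus and the assumption that each pair sits on one line (tab-, comma-, or multi-space-separated) are illustrative, not the project's exact script:

```python
import re

def clean_raw_corpus(lines):
    """Normalize a raw bilingual corpus into deduplicated (source, target) pairs.

    Assumes each non-empty line holds one sentence pair separated by a tab,
    a comma, or a run of spaces (an assumption for this sketch).
    """
    seen = set()
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:                      # drop blank lines
            continue
        # normalize the observed separators to a single split point
        if "\t" in line:
            parts = line.split("\t", 1)
        elif "," in line:
            parts = line.split(",", 1)
        else:
            parts = re.split(r"\s{2,}", line, maxsplit=1)
        if len(parts) != 2:
            continue                      # skip malformed lines
        src, tgt = (p.strip() for p in parts)
        if (src, tgt) in seen:            # drop duplicate sentence pairs
            continue
        seen.add((src, tgt))
        pairs.append((src, tgt))
    return pairs
```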
In order to develop Computational Linguistics technologies for these low-resource languages, we needed to extract parallel corpora from the raw corpus.
To extract the parallel corpora, I implemented a two-step pipeline:
Developed a Language Identification (LI) model with a maximum accuracy of 99.61%.
Combined the LI model with a Python script to clean the raw corpora and extract the parallel corpora.
Splitting the Dataset: The dataset is split into a training set (80%) and a test set (20%).
Feature Extraction: From the training dataset, two types of features are extracted:
Linguistic features: text_length (the length of each text in the sample) and num_words (the number of words in each text in the sample).
N-Gram Features: Word N-Gram (1 to 3 words) and Character N-Gram (2 to 5 characters).
Combining Features: The extracted features are combined into combined_features.
Feeding to the Classifier: The combined features are then fed into a Random Forest Classifier model.
Language Identification Model: The trained Language Identification model is ready to make language predictions (a minimal training sketch follows below).
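Putting these training steps together, here is a minimal sketch assuming scikit-learn; the vectorizer settings, number of trees, and function names are illustrative choices, not necessarily the project's exact configuration:

```python
from scipy.sparse import hstack, csr_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def train_li_model(texts, labels):
    """texts: list of sentences; labels: their language tags (e.g. 'Hindi', 'Bhojpuri')."""
    # 80/20 train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42)

    # Linguistic features: text_length and num_words for each sentence
    def linguistic(batch):
        return csr_matrix([[len(t), len(t.split())] for t in batch])

    # N-gram features: word 1-3 grams and character 2-5 grams
    word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 3))
    char_vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 5))

    # combined_features: linguistic + word n-gram + char n-gram matrices
    combined_train = hstack([linguistic(X_train),
                             word_vec.fit_transform(X_train),
                             char_vec.fit_transform(X_train)])
    combined_test = hstack([linguistic(X_test),
                            word_vec.transform(X_test),
                            char_vec.transform(X_test)])

    # Random Forest Classifier on the combined features
    clf = RandomForestClassifier(n_estimators=200, random_state=42)
    clf.fit(combined_train, y_train)
    print("Test accuracy:", accuracy_score(y_test, clf.predict(combined_test)))
    return clf, word_vec, char_vec
```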
Pipeline to predict the language of a new sentence using the Language Identification model (a code sketch follows this list):
Input the new sentence.
Pre-process the sentence using techniques like tokenization.
Extract Linguistic and N-Gram features from the sentence.
Feed the combined features into our trained Language Identification Model.
The model will return the predicted language.
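The same steps can be wrapped into a small prediction helper; this is a sketch that reuses the (assumed) clf, word_vec, and char_vec objects returned by the training sketch above:

```python
from scipy.sparse import hstack, csr_matrix

def predict_language(sentence, clf, word_vec, char_vec):
    """Predict the language of one new sentence with the trained LI model."""
    text = sentence.strip()                          # basic pre-processing
    linguistic = csr_matrix([[len(text), len(text.split())]])
    features = hstack([linguistic,
                       word_vec.transform([text]),   # word 1-3 gram features
                       char_vec.transform([text])])  # char 2-5 gram features
    return clf.predict(features)[0]                  # predicted language label

# Example usage (with the objects from train_li_model above):
# print(predict_language("some sentence", clf, word_vec, char_vec))
```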
Here is how I automated the formation of the parallel corpora using my Language Identification model (a sketch of the splitting step follows this list):
Input the unorganized raw corpus of bilingual data.
The raw corpus is processed by a Python script.
The Python script removes the ambiguities (duplicate sentences, inconsistent pair structures, etc.).
The Python script returns an aligned bilingual .txt file in the following format:
Hindi sentence
Translation of the Hindi sentence in the paired language (Bhojpuri, Magahi, or Maithili)
Each sentence is then passed through the LI model, one by one, to predict its language.
Based on the predicted language, each sentence is written to its respective language file (like Hindi.txt, Bhojpuri.txt, etc.).
For example, if the raw corpus of bilingual data is Hindi-Bhojpuri, the output will be two files: Hindi.txt containing all the Hindi sentences and Bhojpuri.txt containing all the Bhojpuri sentences.
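A sketch of this splitting step is below; it reuses the illustrative predict_language helper from earlier and assumes the cleaned corpus is available as (sentence, translation) pairs:

```python
import os

def split_corpus_by_language(aligned_pairs, clf, word_vec, char_vec, out_dir="."):
    """Route each sentence of the cleaned bilingual corpus to its language file.

    aligned_pairs: list of (sentence, translation) tuples from the cleaned corpus.
    Writes one .txt file per predicted language, e.g. Hindi.txt and Bhojpuri.txt.
    """
    handles = {}
    try:
        for pair in aligned_pairs:
            for sentence in pair:
                lang = predict_language(sentence, clf, word_vec, char_vec)
                if lang not in handles:
                    path = os.path.join(out_dir, f"{lang.capitalize()}.txt")
                    handles[lang] = open(path, "w", encoding="utf-8")
                handles[lang].write(sentence + "\n")
    finally:
        for f in handles.values():
            f.close()
```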
Language identification model performance for the 3 pairs: