This blog details my work as a Machine Learning Intern at the Indian Institute of Technology (BHU).
Eminent professors at IIT BHU are creating corpora for low-resource languages such as Bhojpuri, Maithili, and Magahi in order to develop Computational Linguistics technologies like Machine Translation models.
The corpora were collected as bilingual text files, with sentences in one language and their corresponding translations in the other, the pairs being Hindi-Bhojpuri, Hindi-Magahi, and Hindi-Maithili. But because the corpora were compiled from several sources, they had some major ambiguities (a small cleaning sketch follows this list), such as:
Duplicate Sentences
Inconsistent pair structures (comma-separated, tab-separated, space-separated)
Incorrect sentence ordering
Unwanted whitespace and blank lines
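To make these ambiguities concrete, here is a minimal cleaning sketch; the helper name clean_raw_corpus and the assumption that each pair sits on one line (tab-, comma-, or multi-space-separated) are illustrative, not the project's exact script:

```python
import re

def clean_raw_corpus(lines):
    """Normalize a raw bilingual corpus into deduplicated (source, target) pairs.

    Assumes each non-empty line holds one sentence pair separated by a tab,
    a comma, or a run of spaces (an assumption for this sketch).
    """
    seen = set()
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:                      # drop blank lines
            continue
        # normalize the observed separators to a single split point
        if "\t" in line:
            parts = line.split("\t", 1)
        elif "," in line:
            parts = line.split(",", 1)
        else:
            parts = re.split(r"\s{2,}", line, maxsplit=1)
        if len(parts) != 2:
            continue                      # skip malformed lines
        src, tgt = (p.strip() for p in parts)
        if (src, tgt) in seen:            # drop duplicate sentence pairs
            continue
        seen.add((src, tgt))
        pairs.append((src, tgt))
    return pairs
```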
In order to develop Computational Linguistics technologies for these low-resource languages, we needed to extract parallel corpora from the raw corpus.
To extract the parallel corpora, I implemented a two-step pipeline:
Developed a Language Identification (LI) model with a maximum accuracy of 99.61%.
Combined the LI model with a Python script to clean the raw corpora and extract the parallel corpora.
Splitting the Dataset: The dataset is split into a training set (80%) and a test set (20%).
Feature Extraction: From the training dataset, two types of features are extracted:
Linguistic features: text_length (the length of each text in the sample) and num_words (the number of words in each text in the sample).
N-Gram Features: Word N-Gram (1 to 3 words) and Character N-Gram (2 to 5 characters).
Combining Features: The extracted features are combined into combined_features.
Feeding to the Classifier: The combined features are then fed into a Random Forest Classifier model.
Language Identification Model: The trained Language Identification model is ready to make language predictions (a minimal training sketch follows below).
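Putting these training steps together, here is a minimal sketch assuming scikit-learn; the vectorizer settings, number of trees, and function names are illustrative choices, not necessarily the project's exact configuration:

```python
from scipy.sparse import hstack, csr_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def train_li_model(texts, labels):
    """texts: list of sentences; labels: their language tags (e.g. 'Hindi', 'Bhojpuri')."""
    # 80/20 train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42)

    # Linguistic features: text_length and num_words for each sentence
    def linguistic(batch):
        return csr_matrix([[len(t), len(t.split())] for t in batch])

    # N-gram features: word 1-3 grams and character 2-5 grams
    word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 3))
    char_vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 5))

    # combined_features: linguistic + word n-gram + char n-gram matrices
    combined_train = hstack([linguistic(X_train),
                             word_vec.fit_transform(X_train),
                             char_vec.fit_transform(X_train)])
    combined_test = hstack([linguistic(X_test),
                            word_vec.transform(X_test),
                            char_vec.transform(X_test)])

    # Random Forest Classifier on the combined features
    clf = RandomForestClassifier(n_estimators=200, random_state=42)
    clf.fit(combined_train, y_train)
    print("Test accuracy:", accuracy_score(y_test, clf.predict(combined_test)))
    return clf, word_vec, char_vec
```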
Pipeline to predict the language of a new sentence using the Language Identification model (a code sketch follows this list):
Input the new sentence.
Pre-process the sentence using techniques like tokenization.
Extract Linguistic and N-Gram features from the sentence.
Feed the combined features into our trained Language Identification Model.
The model will return the predicted language.
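The same steps can be wrapped into a small prediction helper; this is a sketch that reuses the (assumed) clf, word_vec, and char_vec objects returned by the training sketch above:

```python
from scipy.sparse import hstack, csr_matrix

def predict_language(sentence, clf, word_vec, char_vec):
    """Predict the language of one new sentence with the trained LI model."""
    text = sentence.strip()                          # basic pre-processing
    linguistic = csr_matrix([[len(text), len(text.split())]])
    features = hstack([linguistic,
                       word_vec.transform([text]),   # word 1-3 gram features
                       char_vec.transform([text])])  # char 2-5 gram features
    return clf.predict(features)[0]                  # predicted language label

# Example usage (with the objects from train_li_model above):
# print(predict_language("some sentence", clf, word_vec, char_vec))
```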
Here is how I automated the formation of the parallel corpora using my Language Identification model (a sketch of the splitting step follows this list):
Input the unorganized raw corpus of bilingual data.
The raw corpus is processed by a Python script.
The Python script removes the ambiguities (duplicate sentences, inconsistent pair structures, etc.).
The Python script returns an aligned bilingual .txt file in the following format:
Hindi sentence
Translation of the Hindi sentence in the paired language (Bhojpuri, Magahi, or Maithili)
Each sentence is then passed through the LI model, one by one, to predict its language.
Based on the predicted language, each sentence is written to its respective language file (like Hindi.txt, Bhojpuri.txt, etc.).
For example, if the raw corpus of bilingual data is Hindi-Bhojpuri, the output will be two files: Hindi.txt containing all the Hindi sentences and Bhojpuri.txt containing all the Bhojpuri sentences.
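A sketch of this splitting step is below; it reuses the illustrative predict_language helper from earlier and assumes the cleaned corpus is available as (sentence, translation) pairs:

```python
import os

def split_corpus_by_language(aligned_pairs, clf, word_vec, char_vec, out_dir="."):
    """Route each sentence of the cleaned bilingual corpus to its language file.

    aligned_pairs: list of (sentence, translation) tuples from the cleaned corpus.
    Writes one .txt file per predicted language, e.g. Hindi.txt and Bhojpuri.txt.
    """
    handles = {}
    try:
        for pair in aligned_pairs:
            for sentence in pair:
                lang = predict_language(sentence, clf, word_vec, char_vec)
                if lang not in handles:
                    path = os.path.join(out_dir, f"{lang.capitalize()}.txt")
                    handles[lang] = open(path, "w", encoding="utf-8")
                handles[lang].write(sentence + "\n")
    finally:
        for f in handles.values():
            f.close()
```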
Language identification model performance for the 3 pairs: