To develop a Python program that reads an Urdu text file line by line, extracts words, stores them uniquely, and finally sorts and prints them in alphabetical order.
Python 3.x
Urdu text file ("2563.txt")
Basic understanding of string manipulation and file handling in Python
Natural Language Processing (NLP) involves analyzing and processing textual data. A fundamental first step in NLP is tokenization, the splitting of raw text into words. This experiment demonstrates basic data collection by extracting the unique words from an Urdu ebook and sorting them.
Open the file "2563.txt" in read mode using Python.
Read the file line by line.
Tokenize each line into words by splitting it on whitespace.
Maintain a list to store unique words.
If a word is not already in the list, add it.
After processing all lines, sort the list alphabetically.
Print the sorted list of words.
Code Implementation:
# Open the file in read mode with UTF-8 encoding
with open("2563.txt", "r", encoding="utf-8") as file:
    words_list = []  # List to store unique words
    for line in file:
        words = line.strip().split()  # Tokenize the line into words
        for word in words:
            if word not in words_list:
                words_list.append(word)

# Sort the words
words_list.sort()

# Print the sorted unique words
for word in words_list:
    print(word)
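The membership test `word not in words_list` scans the whole list, which is O(n) per word. A common alternative is to collect tokens in a set, whose membership checks are O(1) on average, and sort at the end. The sketch below assumes the same tokenization as the program above, but runs on an in-memory sample instead of "2563.txt":

```python
# Variant using a set for constant-time membership checks.
# sample_lines stands in for the lines of "2563.txt".
sample_lines = [
    "یہ ایک مثال ہے",
    "یہ دوسری مثال ہے",
]

unique_words = set()
for line in sample_lines:
    unique_words.update(line.strip().split())

# sorted() returns a new list; like list.sort(), it orders
# strings by Unicode code point.
for word in sorted(unique_words):
    print(word)
```

For a large ebook the set version can be substantially faster, at the cost of not preserving the order in which words were first seen.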
The program prints all unique words from the Urdu text file in sorted order. Note that Python's default string sort orders by Unicode code point, which only approximates true Urdu alphabetical order.
Observations:
Punctuation attached to words (for example the Urdu full stop '۔') is kept as part of the token, so the same word may be stored more than once.
Urdu words with similar spellings but different diacritics may be treated as distinct words.
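The punctuation issue noted above can be mitigated by stripping punctuation from each token before the uniqueness check. The sketch below is an assumption about which marks matter; the character set (Urdu full stop '۔', comma '،', question mark '؟', semicolon '؛', plus common ASCII marks) should be extended to match the actual corpus:

```python
# Strip common punctuation from token edges so that, e.g.,
# a word with and without a trailing '۔' is counted once.
# URDU_PUNCTUATION is an assumed set of marks, not exhaustive.
URDU_PUNCTUATION = "۔،؟؛!\"'()[]{}:."

def clean(word):
    # str.strip removes any of the given characters from both ends
    return word.strip(URDU_PUNCTUATION)

line = "کیا یہ ٹھیک ہے؟ جی، بالکل۔"
tokens = [clean(w) for w in line.split() if clean(w)]
print(tokens)
```

The `if clean(w)` filter drops tokens that were pure punctuation. Diacritic variants (the second observation) would need a separate normalization step, e.g. removing combining marks before comparison.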
This experiment demonstrates a simple method for collecting and processing textual data in NLP. The program extracts and organizes the unique words of a dataset, which is a fundamental step in text preprocessing.