To develop a Python program that reads an Urdu text file line by line, extracts words, stores them uniquely, and finally sorts and prints them in alphabetical order.
Python 3.x
Urdu text file ("2563.txt")
Basic understanding of string manipulation and file handling in Python
Natural Language Processing (NLP) involves analyzing and processing textual data. A fundamental first step in NLP is tokenization, the splitting of raw text into words. This experiment demonstrates basic data collection by extracting the unique words from an Urdu ebook and sorting them.
Open the file "2563.txt" in read mode using Python.
Read the file line by line.
Tokenize each line into words by splitting it on whitespace.
Maintain a list to store unique words.
If a word is not already in the list, add it.
After processing all lines, sort the list alphabetically.
Print the sorted list of words.
Code Implementation:
# Open the file in read mode with UTF-8 encoding
with open("2563.txt", "r", encoding="utf-8") as file:
    words_list = []  # List to store unique words
    for line in file:
        words = line.strip().split()  # Tokenize the line into words
        for word in words:
            if word not in words_list:
                words_list.append(word)

# Sort the words
words_list.sort()

# Print the sorted unique words
for word in words_list:
    print(word)
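The membership test `word not in words_list` scans the whole list, which is O(n) per word. A common alternative is to collect tokens in a set, whose membership checks are O(1) on average, and sort at the end. The sketch below assumes the same tokenization as the program above, but runs on an in-memory sample instead of "2563.txt":

```python
# Variant using a set for constant-time membership checks.
# sample_lines stands in for the lines of "2563.txt".
sample_lines = [
    "یہ ایک مثال ہے",
    "یہ دوسری مثال ہے",
]

unique_words = set()
for line in sample_lines:
    unique_words.update(line.strip().split())

# sorted() returns a new list; like list.sort(), it orders
# strings by Unicode code point.
for word in sorted(unique_words):
    print(word)
```

For a large ebook the set version can be substantially faster, at the cost of not preserving the order in which words were first seen.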
The program prints all unique words from the Urdu text file in sorted order. Note that Python's default string sort orders by Unicode code point, which only approximates true Urdu alphabetical order.
Observations:
Punctuation attached to words (for example the Urdu full stop '۔') is kept as part of the token, so the same word may be stored more than once.
Urdu words with similar spellings but different diacritics may be treated as distinct words.
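The punctuation issue noted above can be mitigated by stripping punctuation from each token before the uniqueness check. The sketch below is an assumption about which marks matter; the character set (Urdu full stop '۔', comma '،', question mark '؟', semicolon '؛', plus common ASCII marks) should be extended to match the actual corpus:

```python
# Strip common punctuation from token edges so that, e.g.,
# a word with and without a trailing '۔' is counted once.
# URDU_PUNCTUATION is an assumed set of marks, not exhaustive.
URDU_PUNCTUATION = "۔،؟؛!\"'()[]{}:."

def clean(word):
    # str.strip removes any of the given characters from both ends
    return word.strip(URDU_PUNCTUATION)

line = "کیا یہ ٹھیک ہے؟ جی، بالکل۔"
tokens = [clean(w) for w in line.split() if clean(w)]
print(tokens)
```

The `if clean(w)` filter drops tokens that were pure punctuation. Diacritic variants (the second observation) would need a separate normalization step, e.g. removing combining marks before comparison.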
This experiment demonstrates a simple method for collecting and processing textual data in NLP. The program extracts and organizes the unique words of a dataset, which is a fundamental step in text preprocessing.