Project 2: Monolingual Paraphrase Detection Corpus for Low Resource Pashto Language at Sentence Level
University: Nara Institute of Science and Technology
Department: Graduate School of Information Science
Lab: Natural Language Processing
Supervisor: Prof. Taro Watanabe
Task Details: I led the development of the first manually annotated Pashto paraphrase detection corpus, comprising 5,793 sentences across 10 domains. Trained XLM-RoBERTa on the dataset, achieving a 96% F1 score in Pashto and competitive scores in Indonesian and English in zero-shot settings. Planned to release 1,800 instances to benefit the research community and aid in tasks like plagiarism detection.
Impact: Accepted at LREC-COLING 2024. Provides a valuable resource for the Pashto-speaking community, supporting the integrity of written work and aiding tasks such as plagiarism detection.
University: Nara Institute of Science and Technology
Department: Graduate School of Information Science
Lab: Natural Language Processing
Supervisor: Prof. Taro Watanabe
Task Details: I led research on the visual and linguistic abilities of Vision-Language Models (VLMs), focusing on GPT-4, through multilingual and qualitative analysis. Introduced six vision-language tasks and created multilingual multimodal datasets in English, Japanese, Swahili, and Urdu for robust evaluation. Conducted comparative analyses between GPT-4 and open-source VLMs, contributing pioneering insights, especially in Swahili and Urdu.
Impact: Released on arXiv (2024). Advanced the field with first-of-its-kind analyses in Swahili and Urdu, expanding the scope and inclusivity of VLM research.
Project 4: HLU: Human Vs LLM Generated Text Detection Dataset for Urdu at Multiple Granularities
University: Nara Institute of Science and Technology
Department: Graduate School of Information Science
Lab: Natural Language Processing
Supervisor: Prof. Taro Watanabe
Task Details: Developed the HLU dataset for detecting LLM-generated text in Urdu, comprising 1,014 instances across 13 domains. Conducted human evaluation at the document, paragraph, and sentence levels, fine-tuned XLM-RoBERTa for Urdu text detection, and compared its performance against English datasets.
Impact: Accepted at COLING 2025. First attempt to address human vs. LLM-generated text detection in Urdu, providing foundational insights for NLP research in low-resource settings.