humayoun

Research

Doctoral Work

MathNat (Mathematics in controlled Natural language) is a system that attempts to formalize the language of mathematics automatically. The ultimate objective of this project, of which my thesis is a part, is to investigate how the validity of a mathematical text can be checked mechanically. This involves, among other things, translating an informal mathematical text into a formal text that can be understood by a proof assistant, theorem prover, or similar system, so that it can be verified or validated mechanically.

MathNat provides a controlled language with the look and feel of textbook mathematics. It also supports miscellaneous linguistic features that make it natural and expressive. A number of transformations are then applied to the text to formalize it completely. An overview of this work is reported in [HR10b], while a formal language for writing mathematics is proposed in [HR10a].

Among other uses, an important application of such a system (software tool), once completed, is in teaching. A teacher may use it as an assistant for showing students how a theorem can be proved using different proof techniques. Students of science and technology, as well as of the social sciences (for subjects such as Logic, which is needed for law and religious jurisprudence), can benefit from this tool.

References:

[HR10a] M. Humayoun and C. Raffalli (2010). MathAbs: A Representational Language for Mathematics. 8th International Conference on Frontiers of Information Technology. December 21-23, 2010, Islamabad, Pakistan. ACM 978-1-4503-0342-2/10/12. (Acceptance rate: 29.25%)
[HR10b] M. Humayoun and C. Raffalli (2010). MathNat - Mathematical Text in a Controlled Natural Language. Special issue: Natural Language Processing and its Applications. Journal on Research in Computing Science. Volume 46. ISSN: 1870-4069. CICLing 2010: 11th International Conference on Intelligent Text Processing and Computational Linguistics, March 21-27, 2010, Iasi, Romania. (Acceptance rate: 27%)
PhD Thesis: Thesis-Humayoun1.20110116.pdf
PhD Defence slides: Thesis-defence-slides.pdf

Natural Language Processing of South Asian languages

The first project is my Master's thesis, titled "Urdu Morphology, Orthography and Lexicon Extraction". In this work, I reported a suite of resources including a fairly complete Urdu morphology, a lexicon and a small fragment of syntax (published in [HHR07]).

The second project builds on the above work; I was involved as second author. It is an elementary open-source Urdu grammar in the GF Resource Grammar Library, reported in [VHR10].

The third project concerns Punjabi: the development of a Punjabi morphology, corpus and lexicon. The morphology is written in Grammatical Framework (GF). Half of the corpus is bootstrapped from Wikipedia, and the lexicon is extracted semi-automatically. These resources are reported in [HR10c].

The fourth project builds on the third. Like the second project, it is an elementary open-source Punjabi grammar in the GF Resource Grammar Library. Again, I was involved as second author. It is reported in [VHR11].

In most of these projects, I built corpora from online texts and extracted lexicons semi-automatically. The impact of these resources can be assessed by the fact that they are used (albeit partially) by Apertium, an open-source machine translation system. The Apertium community actively participates in the "Google Code-in" and "Google Summer of Code" contests. Two projects based on my work (one for Urdu and one for Punjabi) have been available for these contests since 2010.

References:

[HHR07] M. Humayoun, H. Hammarstrom, and A. Ranta (2007). Urdu Morphology, Orthography and Lexicon Extraction. In Ali Farghaly & Karine Megerdoomian (eds.), Proceedings of the 2nd Workshop on Computational Approaches to Arabic Script-based Languages. Pages 59–68, LSA 2007 Linguistic Institute, Stanford University, USA. (Acceptance rate: not mentioned, but frequently cited paper)
[HR10c] M. Humayoun and A. Ranta (2010). Developing Punjabi Morphology, Corpus and Lexicon. In R. Otoguro, K. Ishikawa, H. Umemoto, K. Yoshimoto, and Y. Harada, editors, Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation (PACLIC24). Pages 163–172. Tohoku University, Japan, November 2010. ISBN 978–4–905166–00–9. (Acceptance rate: 27.45%)
[VHR10] Shafqat M. Virk, M. Humayoun, A. Ranta (2010). An Open Source Urdu Resource Grammar. Proceedings of the Eighth Workshop on Asian Language Resources. August 2010, Beijing, China. Co-located with Coling 2010. (Acceptance rate: 62.86%)
[VHR11] Shafqat M. Virk, M. Humayoun, A. Ranta (2011). An Open Source Punjabi Resource Grammar. Proceedings of the 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011). (Ranking: 0.54, in range 0.00–1.00, short paper acceptance rate: 38%)

Text summarization

Urdu is the national language of Pakistan and is widely spoken in the Indo-Pak subcontinent. In this work, my students and I developed a benchmark corpus to facilitate single-document summarization for Urdu, a language spoken by millions but computationally under-resourced. I also analyzed the effects of basic preprocessing settings (some listed below) on this corpus through various benchmarking experiments.

The effect of four different stopword lists.
The effect of different stemming approaches, such as lemmatization, rule-based stemming and fixed-length stemming.
The effect of stopwords and stemming together.
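
As a rough illustration of the settings listed above (not the code used in the experiments), stopword removal and fixed-length stemming could be sketched in Python as follows. The function names, the romanized toy tokens, the three-word stopword list and the prefix length of 4 are all assumptions made for this sketch.

```python
def fixed_length_stem(token, length=4):
    """Fixed-length stemming: keep only the first `length` characters."""
    return token[:length]


def preprocess(tokens, stopwords=frozenset(), stem_length=None):
    """Drop stopwords, then optionally apply fixed-length stemming.

    Passing an empty stopword set or stem_length=None switches the
    corresponding step off, so each setting can be tested on its own
    or both can be combined.
    """
    kept = [t for t in tokens if t not in stopwords]
    if stem_length is not None:
        kept = [fixed_length_stem(t, stem_length) for t in kept]
    return kept


# Hypothetical toy data: romanized Urdu tokens and a tiny stopword list.
STOPWORDS = frozenset({"ka", "ki", "hai"})
tokens = "urdu pakistan ki qaumi zaban hai".split()

no_stopwords = preprocess(tokens, STOPWORDS)
stemmed_only = preprocess(tokens, stem_length=4)
both = preprocess(tokens, STOPWORDS, stem_length=4)
```

Each of the three calls corresponds to one of the settings above: stopwords alone, stemming alone, and both together.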

The analysis is performed using four general-purpose, state-of-the-art automatic summarization algorithms (including TextRank and LexRank). Such an evaluation is important because Urdu has rich morphology and free word order, which make it very different from English. As far as we know, this work is a pioneering effort in the context of Urdu.

Recently, we also developed a benchmark corpus of Urdu extractive summaries. The corpus contains 161 documents from the newswire domain with manually selected extractive summaries. We performed a number of experiments on the corpus to show how it can be used to develop, evaluate, and compare text summarization systems for Urdu using a supervised learning approach. Supervised learning was chosen mainly because of its general popularity and effectiveness in the field of natural language processing and its extensive use in automatic text summarization. A journal paper reporting these results has been published [HA22] (Github repository: https://github.com/humsha/CORPURES ).

References:

Muhammad Humayoun, Rao Muhammad Adeel Nawab, Muhammad Uzair, Saba Aslam and Omer Farzand (2016). Urdu summary corpus. In Nicoletta Calzolari, et al., editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA). ISBN: 978-2-9517408-9-1.
Muhammad Humayoun and Hwanjo Yu (2016). Analyzing Pre-processing Settings for Urdu Single-document Extractive Summarization. In Nicoletta Calzolari, et al., editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA). ISBN: 978-2-9517408-9-1. Github repository: https://github.com/humsha/USCorpus
[HA22] Muhammad Humayoun and Naheed Akhtar. Corpures: Benchmark corpus for Urdu extractive summaries and experiments using supervised learning. Intelligent Systems with Applications, page 200129, 2022.

Electronic Assessment

There is a dire need for systems and tools for automated assessment, termed Electronic Assessment (e-assessment) or Computer Aided Assessment (CAA). In Pakistani higher education institutions, teachers are heavily burdened with teaching, research and administrative tasks. Together with colleagues, I have co-written a project proposal on this topic (as co-principal investigator). The project aims to develop a CAA system for introductory computer programming courses (considering only C and C++, which are usually chosen for such courses). Our preliminary findings have been published in [KAH14].

Measuring plagiarism in programming assignments is an essential part of the educational process. In 2022, we published a paper [HHK22] that discusses the methods of plagiarism and its detection in introductory programming course assignments written in C++. A small corpus of assignments is made publicly available. We developed a general framework that computes the similarity between a pair of solutions, using three token-based similarity methods as features, and predicts whether the solution is plagiarized. The importance of each feature is also measured, which in turn ranks the effectiveness of each method. Finally, an artificially generated dataset improves the results compared to the original data. We achieved F1 scores of 0.955 and 0.971 on the original and synthetic datasets, respectively.
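
To give a flavour of what a token-based similarity feature looks like, here is a minimal Python sketch. The three methods used in [HHK22] are not named on this page, so Jaccard similarity over token sets is used here purely as an illustrative stand-in; the simplified tokenizer and the function names are likewise assumptions for this sketch, not the framework from the paper.

```python
import re

# Simplified, hypothetical tokenizer for C/C++-like source:
# identifiers/keywords, integer literals, or any single other symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source):
    """Split source code into a list of coarse tokens."""
    return TOKEN_RE.findall(source)


def jaccard_similarity(source_a, source_b):
    """Jaccard similarity between the token sets of two solutions.

    Returns a value in [0, 1]; 1.0 means the token sets are identical.
    """
    tokens_a, tokens_b = set(tokenize(source_a)), set(tokenize(source_b))
    if not tokens_a and not tokens_b:
        return 1.0  # two empty solutions are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Toy solution pair differing only in a variable name.
score = jaccard_similarity("int a = 1;", "int b = 1;")
```

In a setup like the one described above, a score of this kind would be one feature among several passed to a classifier that decides whether a solution pair is plagiarized.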

References:

[KAH14] Muhammad Salman Khan, Adnan Ahmad, Muhammad Humayoun (2014). A Survey of Current Opportunities for Developing Automated Assessment System for C/C++ Programming Assignments. Proceedings of 28th Annual Conference of the Asian Association of Open Universities. The Hong Kong University, Hong Kong, China.
[HHK22] Muhammad Humayoun, Muhammad Adnan Hashmi, and Ali Hanzala Khan (2022). Measuring plagiarism in introductory programming course assignments. In Proceedings of the 8th International Conference on Information Technology Trends (ITT). Higher Colleges of Technology, Dubai Men's Campus, Dubai, United Arab Emirates, 25-26 May 2022.

Text classification

In 2021, I participated in two shared tasks, and the systems I developed received recognition as follows. Both were accepted as peer-reviewed conference publications.

Detecting abusive and threatening language [Hum21a]. The submitted results were selected for the third recognition, with a monetary prize of 10K RUB (Russian rubles) from ODS Summer of Code.

Fake news detection [Hum21b]. The system ranked 5th among 18 teams. While preparing the paper submission, I improved my results beyond the second-best score in the competition.

References:

[Hum21a] M. Humayoun (2021). Experiments for Abusive and Threatening Language Detection in Urdu. In Forum for Information Retrieval Evaluation, December 13-17, 2021, India (FIRE 2021). CICLing 2021 track, International Conference on Computational Linguistics and Intelligent Text Processing.
[Hum21b] M. Humayoun (2021). Experiments for the 2021 Task: Fake News Detection in the Urdu Language. In Forum for Information Retrieval Evaluation, December 13-17, 2021, India (FIRE 2021). CICLing 2021 track, International Conference on Computational Linguistics and Intelligent Text Processing.

In a bigger picture

I want to develop the building blocks required for natural language processing of Pakistani languages. Once these building blocks are in place, in the next 3 to 5 years, I plan to focus on tasks such as machine translation, parsing and word sense disambiguation.

Notwithstanding this, I am also open to working on the regional languages of wherever I find myself living.
