Computational Resources for Pakistani Languages

Most Pakistani languages are under-resourced: very few computational resources are available for them. This work focuses on developing such resources. So far, the following have been developed:

A suite of computational resources for the Urdu, Punjabi and Sindhi languages, including fairly complete morphologies, lexicons, and elementary-level multilingual grammars implemented within the GF resource library (link: http://www.grammaticalframework.org/lib/doc/synopsis.html).

GF is a special-purpose programming language for writing grammars. It handles the complexities found in different natural languages and supports abstractions and linguistic generalizations, both within a single language and across multiple languages. See its homepage for more details (link: http://www.grammaticalframework.org/).

For the above-mentioned Pakistani languages, we have built corpora from online texts (Wikipedia, news websites, books, blogs, etc.) and extracted lexicons semi-automatically. The computational resources we develop can be used in multilingual translation systems and language-based human-computer interaction.
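As a rough illustration of the first, automatic half of semi-automatic lexicon extraction, the sketch below collects candidate word forms from raw text by frequency so that a person can review them afterwards. It is not the procedure used in this project. The function names, the `min_count` threshold, and the romanized sample sentences are all assumptions made for the example.

```python
import unicodedata
from collections import Counter


def tokenize(text):
    """Split on whitespace and strip punctuation from both ends of each token."""
    tokens = []
    for raw in text.split():
        start, end = 0, len(raw)
        # Unicode general categories starting with "P" are punctuation.
        while start < end and unicodedata.category(raw[start]).startswith("P"):
            start += 1
        while end > start and unicodedata.category(raw[end - 1]).startswith("P"):
            end -= 1
        if start < end:
            tokens.append(raw[start:end])
    return tokens


def candidate_lexicon(texts, min_count=2):
    """Return word forms seen at least min_count times, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    frequent = [word for word, count in counts.items() if count >= min_count]
    # Sort by descending frequency, then alphabetically for a stable order.
    return sorted(frequent, key=lambda word: (-counts[word], word))


if __name__ == "__main__":
    # Hypothetical romanized sample sentences standing in for a real corpus.
    sample = ["kitab achi hai.", "yeh kitab nai hai!", "(kitab)"]
    for word in candidate_lexicon(sample):
        print(word)
```

The frequency threshold filters out typos and one-off forms in noisy web text. The surviving candidates would still need manual checking and assignment to inflection classes, which is the manual half of a semi-automatic process.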

Urdu resources: For morphology and lexicon, visit the homepage; for grammar, visit the GF resource library and the GF homepage.

Punjabi resources: For morphology and lexicon, visit the homepage; for grammar, visit the GF resource library and the GF homepage.

Sindhi resources: Will be uploaded soon.

 

Contact Person:

Dr. Muhammad Humayoun, Assistant Professor, COMSATS Institute of Technology, Lahore, Pakistan. Homepage

Other Contributors:

Shafqat Mumtaz Virk, PhD student, Chalmers University of Technology, Sweden. Homepage

Foreign Adviser:

Dr. Aarne Ranta, Professor of Computer Science, Department of Computer Science and Engineering, University of Gothenburg and Chalmers University of Technology. Homepage

 

Selected Publications:

  • Shafqat M. Virk, M. Humayoun and A. Ranta (2011) "An Open Source Punjabi Resource Grammar", Proceedings of the 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011), pp. 70–76 (ranking: 0.54 on a 0.00–1.00 scale; short paper acceptance rate: 38%)
  • Shafqat M. Virk, M. Humayoun and A. Ranta (2010) "An Open Source Urdu Resource Grammar", Proceedings of the Eighth Workshop on Asian Language Resources, co-located with COLING 2010, pp. 153–160 (acceptance rate: 62.86%)
  • M. Humayoun and A. Ranta (2010) "Developing Punjabi Morphology, Corpus and Lexicon", In R. Otoguro, K. Ishikawa, H. Umemoto, K. Yoshimoto, and Y. Harada (eds.), Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation (PACLIC24), pp. 163–172, ISBN 978-4-905166-00-9 (acceptance rate: 27.45%)
  • M. Humayoun, H. Hammarström and A. Ranta (2007) "Urdu Morphology, Orthography and Lexicon Extraction", In Ali Farghaly and Karine Megerdoomian (eds.), Proceedings of the 2nd Workshop on Computational Approaches to Arabic Script-based Languages, pp. 59–68 (acceptance rate not reported; frequently cited paper)

 
