- Understanding Temporal Locations and Relations (U of Rochester, NY)
- Semantic Search Engine for Searching Help Files (U of Rochester, NY)
- In-car Dialog System (Bosch RTC, CA; Summer 2008)
- Breast Cancer Detection (Bosch RTC, CA; Summer 2008)
- Bangla Phonetic Spelling Checker (CRBLP, BRAC U, Bangladesh)
- Bangla Phonetic Name Searching (CRBLP, BRAC U, Bangladesh)
- English-to-Bangla Phonetic Transliteration (CRBLP, BRAC U, Bangladesh)
- Text Input System with Phonetic Support for Mobile Devices (BRAC U, Bangladesh)
- Automated Bangla Pronunciation Generator (CRBLP, BRAC U, Bangladesh)
- Automated Part-Of-Speech (POS) Tagging for South Asian Languages (CRBLP, BRAC U, Bangladesh)
- Bangla Text Categorization (CRBLP, BRAC U, Bangladesh)
- Bangla Grammar Checker (CRBLP, BRAC U, Bangladesh)
- Backward n-gram for Bangla (CRBLP, BRAC U, Bangladesh)
Understanding Temporal Locations and Relations (U of Rochester, NY)
Short description of Understanding Temporal Locations and Relations:
This project aims to identify the temporal locations of events (time-event relations) and the temporal relations between events (event-event relations) in text, in support of better natural language understanding.
Semantic Search Engine for Searching Help Files (U of Rochester, NY)
Short description of Semantic Search Engine for Searching Help Files:
Implemented a semantic search engine for searching software help files and evaluated it on several applications, e.g. Quicken (banking software), Mac Office suite (Pages, Keynote, Numbers), Preview (Mac PDF viewer) and iTunes. We found that our system returns the expected result within the top 5 results around 80% of the time, whereas the software's own help file search engine returns the expected result only around 30% of the time.
In-car Dialog System (Bosch RTC, CA; Summer 2008)
Short description of In-car Dialog System:
During my stay at Bosch (Summer 2008), I developed syntactic and semantic grammars for a commercial in-car dialog system. The domain of the dialog system covered navigation and local businesses.
Breast Cancer Detection (Bosch RTC, CA; Summer 2008)
Short description of Breast Cancer Detection:
During my stay at Bosch (Summer 2008), I worked on a large-scale data mining project on breast cancer detection. I experimented with different machine learning techniques, e.g. Support Vector Machine (SVM), Decision Tree, KNN, Bayes Net, Neural Network, etc. I finally built the system using SVM and also implemented feature selection to improve performance and reduce computation time.
Bangla Phonetic Spelling Checker (CRBLP, BRAC U, Bangladesh)
Short description of Bangla Phonetic Spelling Checker:
The complex orthographic rules of Bangla present a significant challenge in producing suggestions for a misspelled word when employing traditional methods; one must take phonetic similarity into account for the suggested alternatives to be reasonably accurate. Several spelling checker algorithms are available for Bangla; however, none of them considers the complex orthographic rules of Bangla.
In this research project, considering the complexities of Bangla orthographic rules, I implemented a spelling checker that detects misspelled words, generates suggestions that are phonetically similar to each misspelled word, and finally ranks those suggestions. I also compared my spelling checker with the existing spelling checkers available for Bangla and reported its performance and evaluation. Besides phonetic errors, this spelling checker also provides suggestions for typographical errors. The input document can be in Bangla or even in English; for English text, it suggests Bangla words that are phonetically similar to the English words.
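As a rough illustration of the detect, suggest and rank steps described above, here is a minimal Python sketch. It is not the Puspa Speller implementation: the sound-class table, the function names and the Roman-letter placeholder strings are all assumptions made for illustration, and the actual Bangla phonetic encoding is the one described in the related publications.

```python
# Illustrative sketch only: the sound classes below are hypothetical and use
# Roman letters as placeholders; they are not the Bangla encoding from the papers.
SOUND_CLASS = {"c": "k", "q": "k", "z": "s", "v": "f"}
SILENT = set("aeiouhwy")  # letters ignored by this toy phonetic key


def phonetic_key(word):
    """Map similar-sounding letters to one code and collapse adjacent repeats."""
    key = []
    for ch in word.lower():
        if ch in SILENT:
            continue
        code = SOUND_CLASS.get(ch, ch)
        if not key or key[-1] != code:
            key.append(code)
    return "".join(key)


def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def suggest(word, dictionary):
    """Return ranked suggestions, or an empty list if the word is in the dictionary."""
    if word in dictionary:
        return []  # step 1: the word is not misspelled
    key = phonetic_key(word)
    # step 2: candidates are dictionary words with the same phonetic key
    candidates = [w for w in dictionary if phonetic_key(w) == key]
    # step 3: rank by surface edit distance, ties broken alphabetically
    return sorted(candidates, key=lambda w: (edit_distance(word, w), w))


words = {"kat", "cot", "cut", "dog"}  # placeholder dictionary
print(suggest("kut", words))
```

Grouping candidates by phonetic key first and ranking by edit distance second keeps phonetically similar words in the list even when their spelling differs by several letters.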
Related Publication of Bangla Phonetic Spelling Checker:
1. Naushad UzZaman and Mumit Khan, A Bangla Phonetic Encoding for Better Spelling Suggestions, Proc. 7th International Conference on Computer and Information Technology (ICCIT 2004), Dhaka, Bangladesh, December 2004.
2. Naushad UzZaman and Mumit Khan, A Double Metaphone Encoding for Bangla and its Application in Spelling Checker, Proc. 2005 IEEE International Conference on Natural Language Processing and Knowledge Engineering, pp. 705-710, Wuhan, China, October 30 - November 1, 2005.
3. Naushad UzZaman and Mumit Khan, A Comprehensive Bangla Spelling Checker, Proc. International Conference on Computer Processing on Bangla (ICCPB-2006), Dhaka, Bangladesh, 17 February, 2006.
4. Naushad UzZaman, Phonetic Encoding for Bangla and its Application to Spelling Checker, Name Searching, Transliteration and Cross Language Information Retrieval, Undergraduate Thesis (Computer Science), BRAC University, May 2005.
Related software of Bangla Spelling Checker:
This spelling checker package is implemented, released and distributed under the GNU General Public License in:
i) Puspa Speller, spelling checker engine package [description of puspa speller]. Available online for download at <http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=180247> or <http://sourceforge.net/projects/puspaspeller>
ii) BanglaPad, open source, full-featured cross-platform Unicode rich text editor capable of editing Bangla [description of BanglaPad], uses this spelling engine for spelling suggestions. Available online for download at <http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=180246>
Bangla Phonetic Name Searching (CRBLP, BRAC U, Bangladesh)
Short description of Bangla Name Searching:
Almost any word can be a Bangali name, and the name in turn is often spelled in many different ways, all of which are considered correct and interchangeable. The reason for the spelling complication is two-fold: (1) there is a large gap between the script and pronunciation in Bangla, largely attributed to the large scale Sanskritization process that started in the 12th century and continued throughout the middle ages, and (2) typical Bangla names have very different origins, from the indigenous names derived primarily from Sanskrit, to the imported Muslim names from Persian and Arabic, Christian names from Portuguese, and even the names from popular Western TV soap-operas. However, there is always a large degree of phonetic similarity in the spelling variants of a name, which is the key to searching and matching names in records.
In this project, I implemented a name searching algorithm, taking into account the various spelling and phonetic rules in use, which can be used by applications to search for and match names. A name-searching algorithm may employ various figures of merit to narrow the list of possibilities when searching for similar names; I demonstrated one such figure of merit using name encoding and edit distance that has shown good promise.
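One way to picture a figure of merit built from name encoding and edit distance is the sketch below. It is only an assumption-laden illustration: the `encode` function is a toy stand-in that drops vowels and glides from romanized names, not the Double Metaphone encoding for Bangla from the publications, and `search` and `max_distance` are hypothetical names.

```python
DROPPED = set("aeiouwy")  # toy rule: ignore vowels and glides (an assumption)


def encode(name):
    """Toy phonetic code: keep consonants, collapse adjacent repeats."""
    code = []
    for ch in name.lower():
        if ch.isalpha() and ch not in DROPPED and (not code or code[-1] != ch):
            code.append(ch)
    return "".join(code)


def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def search(query, names, max_distance=1):
    """Return (name, distance) pairs whose encodings are close to the query's."""
    target = encode(query)
    scored = [(name, edit_distance(target, encode(name))) for name in names]
    hits = [(name, d) for name, d in scored if d <= max_distance]
    return sorted(hits, key=lambda pair: (pair[1], pair[0]))


records = ["Rahman", "Rahaman", "Chowdhury", "Roman"]
print(search("Rehman", records))
```

Comparing encodings rather than raw spellings means that spelling variants of the same name tend to land at distance 0, while the threshold still admits near misses.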
Related Publication of Bangla Name Searching:
1. Naushad UzZaman and Mumit Khan, A Double Metaphone Encoding for Approximate Name Searching and Matching in Bangla, Proc. The Fourth IASTED International Conference on Computational Intelligence, pp. 108-113, Calgary, Alberta, Canada, July 2005.
2. Naushad UzZaman, Phonetic Encoding for Bangla and its Application to Spelling Checker, Name Searching, Transliteration and Cross Language Information Retrieval, Undergraduate Thesis (Computer Science), BRAC University, May 2005.
English-to-Bangla Phonetic Transliteration (CRBLP, BRAC U, Bangladesh)
Short description of English-to-Bangla Transliteration:
A transliteration scheme from Roman (English) to Bangla can help increase the use of Bangla in essential and diverse computing areas such as word processing, and many web applications, e.g. online groups, forums, blogs, etc. The Bangla script's irregular phonetic nature and its large repertoire of consonant clusters (juktakkhors) create a large gap between the pronunciation and the orthography for a given Bangla word. In this project, I have implemented a comprehensive Roman (English)-to-Bangla transliteration scheme that is designed to handle the full complexity of the Bangla script.
Related Publication of English-to-Bangla Transliteration:
1. Naushad UzZaman, Arnab Zaheen and Mumit Khan, A Comprehensive Roman (English) to Bangla Transliteration Scheme, Proc. International Conference on Computer Processing on Bangla (ICCPB-2006), Dhaka, Bangladesh, 17 February, 2006.
2. Naushad UzZaman, Phonetic Encoding for Bangla and its Application to Spelling Checker, Name Searching, Transliteration and Cross Language Information Retrieval, Undergraduate Thesis (Computer Science), BRAC University, May 2005.
Related software of English-to-Bangla Transliteration:
This English-to-Bangla transliteration package is implemented, released and distributed under the GNU General Public License in:
iii) Pata, English-to-Bangla transliteration package [description of pata]. Available online for download at <http://sourceforge.net/projects/pata>
Text Input System with Phonetic Support for Mobile Devices (BRAC U, Bangladesh)
Short description of Text Input System for mobile devices:
The popular T9 text input system for mobile devices uses a predictive dictionary-based disambiguation scheme, enabling a user to type in commonly used words with low overhead. We present a new text input system called T12, which in addition to providing T9’s capabilities, also allows a user to cycle through the possible choices based on phonetic similarity, and to elaborate commonly used abbreviations, acronyms and other short forms. This ability to cycle through the possible choices acts as a spelling checker, which provides suggestions from the dictionary with similar pronunciation as the input word.
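The dictionary-based disambiguation that T9 relies on can be sketched as follows. This covers only the T9-style lookup that T12 builds on, not T12's phonetic cycling or abbreviation expansion; the word list, the frequencies and the function names are made up for illustration.

```python
from collections import defaultdict

# Standard letter layout of a 12-key phone keypad.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {ch: digit for digit, letters in KEYPAD.items() for ch in letters}


def to_digits(word):
    """Convert a word to the key sequence a user would press (one press per letter)."""
    return "".join(LETTER_TO_DIGIT[ch] for ch in word.lower())


def build_index(word_freq):
    """Map each key sequence to its words, most frequent first."""
    index = defaultdict(list)
    for word in sorted(word_freq, key=lambda w: (-word_freq[w], w)):
        index[to_digits(word)].append(word)
    return index


def candidates(index, digits):
    """Words the user can cycle through for a given key sequence."""
    return list(index.get(digits, []))


# Hypothetical frequencies, for illustration only.
word_freq = {"good": 50, "home": 80, "gone": 30, "hood": 5, "cat": 10}
index = build_index(word_freq)
print(candidates(index, "4663"))
```

Several words share the key sequence 4663, which is exactly the ambiguity that the ordering by frequency is meant to resolve.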
Related Publication of Text Input System for mobile devices:
1. Naushad UzZaman and Mumit Khan, T12: An Advanced Text Input System With Phonetic Support For Mobile Devices, Proc. The Second IEE International Conference of Mobile Technology, Applications and Systems, Guangzhou, China, November 2005.
Automated Bangla Pronunciation Generator (CRBLP, BRAC U, Bangladesh)
Short description of Automated Bangla Pronunciation Generator:
In this project we implemented a rule-based pronunciation generator for Bangla words. It takes a word and finds the pronunciations for the graphemes of the word. A grapheme is a unit in writing that cannot be analyzed into smaller components. Resolving the pronunciation of a polyphone grapheme (i.e. a grapheme that can generate more than one phoneme) is the major hurdle that the Automated Pronunciation Generator (APG) encounters. Bangla is partially phonetic in nature, so we can define rules to handle most of the cases. Moreover, we still lack a balanced corpus that could be used for a statistical pronunciation generator. As a result, for the time being, a rule-based approach to implementing the APG for Bangla turns out to be efficient.
This work was supervised by Dr. Mumit Khan and me, Naushad UzZaman.
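To show what resolving a polyphone grapheme with rules can look like, here is a small sketch. The rules are not the APG's Bangla rules: they are hypothetical context rules over Roman letters, chosen only because they are easy to read, and the phoneme symbols are placeholders.

```python
# Each rule is (grapheme, following letters that trigger it, phoneme).
# A context of None marks the default reading. Rules are tried in order.
# These are hypothetical placeholder rules, not the APG's Bangla rules.
RULES = [
    ("c", set("eiy"), "s"),  # polyphone grapheme: one reading before e, i, y
    ("c", None, "k"),        # default reading elsewhere
    ("g", set("eiy"), "j"),
    ("g", None, "g"),
]


def pronounce(word):
    """Return one phoneme symbol per grapheme, using the first matching rule."""
    word = word.lower()
    phonemes = []
    for i, ch in enumerate(word):
        following = word[i + 1] if i + 1 < len(word) else ""
        for grapheme, context, phoneme in RULES:
            if ch == grapheme and (context is None or following in context):
                phonemes.append(phoneme)
                break
        else:
            phonemes.append(ch)  # graphemes without a rule map to themselves
    return phonemes


print(pronounce("cat"))
print(pronounce("cice"))
```

Keeping the rules in an ordered table means that a specific context rule can be listed before the default reading of the same grapheme.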
Related Publication of Rule-based Automated Pronunciation Generator:
Ayesha Binte Mosaddeque, Naushad UzZaman and Mumit Khan, Rule based Automated Pronunciation Generator, Proc. of 9th International Conference on Computer and Information Technology (ICCIT 2006), Dhaka, Bangladesh, December 2006.
Related software:
This rule based automated pronunciation generator package is implemented, released and distributed under the GNU General Public License. A web-based implementation is available at <http://student.bu.ac.bd/~u02201011/APG/>
Automated Part-Of-Speech (POS) Tagging for South Asian Languages (CRBLP, BRAC U, Bangladesh)
Short description of Automated Part-Of-Speech (POS) Tagging for South Asian Languages:
Part-of-Speech (POS) tagging is a technique for assigning an appropriate part-of-speech tag to each word of a text. The significance of parts of speech (also known as POS, word classes, morphological classes, or lexical tags) for language processing is the large amount of information they give about a word and its neighbors. POS tagging can be used in TTS (Text to Speech), information retrieval, shallow parsing, information extraction, linguistic research on corpora, and also as an intermediate step for higher-level NLP tasks such as parsing, semantics, translation, and many more. POS tagging is thus a necessary component for advanced NLP applications in Bangla or any other language.
In this project, we experimented with some of the widely used approaches for POS tagging on Bangla and two other South Asian languages (Hindi and Telugu), using corpora of different sizes (and tagsets of different sizes for Bangla) to observe their performance. We found the transformation-based Brill tagger's performance to be superior to the other approaches, though the use of this approach has been very limited until recently.
Related Publication of Part-Of-Speech (POS) Tagging for South Asian Languages:
Fahim Muhammad Hasan, Naushad UzZaman and Mumit Khan, Comparison of different POS Tagging Techniques (n-gram, HMM and Brill’s tagger) for Bangla, Proc. of International Conference on Systems, Computing Sciences and Software Engineering (SCS2 06) of International Joint Conferences on Computer, Information, and Systems Sciences, and Engineering (CIS2E 06), December 4 - 14, 2006.
Implementation of Part-Of-Speech (POS) Tagging for South Asian Languages:
For this project we used the NLTK (Natural Language Toolkit) library for the POS tagging algorithms and used one of the leading daily newspapers, The Daily Prothom-Alo, as our Bangla corpus. Additional Hindi, Telugu and Bengali corpora were collected from the SPSAL 2007 workshop.
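The project itself relied on NLTK's tagging algorithms. To stay dependency-free, the sketch below is a plain-Python unigram tagger with a default-tag fallback, which is the simplest member of the n-gram family. The romanized tokens, the tag names and the function names are placeholders, not data or code from the project.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Learn the most frequent tag for every word seen in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag_label in sentence:
            counts[word][tag_label] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default_tag="NN"):
    """Tag known words from the model; unknown words get the default tag."""
    return [(word, model.get(word, default_tag)) for word in words]


def accuracy(model, gold_sentences, default_tag="NN"):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    total = correct = 0
    for sentence in gold_sentences:
        words = [word for word, _ in sentence]
        predicted = tag(words, model, default_tag)
        for (_, guess), (_, expected) in zip(predicted, sentence):
            total += 1
            correct += guess == expected
    return correct / total if total else 0.0


# Tiny placeholder corpus of romanized tokens (illustration only).
train = [
    [("ami", "PRP"), ("bhat", "NN"), ("khai", "VB")],
    [("tumi", "PRP"), ("bhat", "NN"), ("khao", "VB")],
]
gold = [[("ami", "PRP"), ("bari", "NN"), ("jai", "VB")]]

model = train_unigram_tagger(train)
print(tag(["ami", "ruti", "khai"], model))
print(accuracy(model, gold))
```

The same train-then-score loop can be used to compare taggers on corpora of different sizes, which is the kind of comparison the experiments describe.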
Bangla Text Categorization (CRBLP, BRAC U, Bangladesh)
Short description of Bangla Text Categorization:
The goal of any classification is to build a set of models that can correctly predict the class of different objects. Text categorization is one such application and can be used in many classification tasks, e.g. news categorization, language identification, authorship attribution, text genre categorization, recommendation systems, etc. In this project we analyzed the performance of n-gram based text categorization for Bangla on a corpus from a Bangladeshi newspaper, Prothom-Alo. Our results show that n-grams of length 2 or 3 are the most useful for categorization; using n-gram lengths greater than 3 reduces categorization performance.
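A common way to do n-gram based categorization is to compare ranked character n-gram profiles with an "out-of-place" distance. The sketch below assumes that style of method; whether it matches the project's exact procedure is not stated here, and the English sample texts and category names are invented placeholders.

```python
from collections import Counter


def ngram_profile(text, n=3, top_k=300):
    """Rank the most frequent character n-grams of a text (rank 0 = most frequent)."""
    padded = "_" + text.lower().replace(" ", "_") + "_"
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, category_profile):
    """Sum of rank differences; n-grams missing from the category get a fixed penalty."""
    max_penalty = len(category_profile)
    return sum(
        abs(rank - category_profile[gram]) if gram in category_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def categorize(text, category_texts, n=3):
    """Pick the category whose profile is closest to the document's profile."""
    doc_profile = ngram_profile(text, n)
    profiles = {name: ngram_profile(sample, n) for name, sample in category_texts.items()}
    return min(profiles, key=lambda name: out_of_place(doc_profile, profiles[name]))


# Placeholder training text per category (illustration only).
samples = {
    "sports": "the team won the cricket match with a late wicket",
    "business": "the bank raised interest rates and the market fell",
}
print(categorize("cricket wicket match", samples))
```

The `n` parameter is the n-gram length, so the effect of using 2, 3 or longer n-grams can be explored by changing one argument.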
Related Publication of Bangla Text Categorization:
Munirul Mansur, Naushad UzZaman and Mumit Khan, Analysis of N-gram based text categorization for Bangla in a newspaper corpus, Proc. of 9th International Conference on Computer and Information Technology (ICCIT 2006), Dhaka, Bangladesh, December 2006.
Bangla Grammar Checker (CRBLP, BRAC U, Bangladesh)
Short description of Bangla Grammar Checker:
In this project, we implemented a statistical grammar checker that uses n-gram based analysis of words and POS tags to decide whether a sentence is grammatically correct. We applied this technique to both Bangla and English and also described the limitations of our approach along with possible solutions.
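The idea of judging a sentence by the n-grams of its POS tags can be sketched with a bigram model over tag sequences. This is a simplified illustration under assumptions: it uses tags only, no smoothing, and a zero-probability cutoff, and the tag names and training sequences are invented.

```python
from collections import Counter

START, END = "<s>", "</s>"


def train(tag_sequences):
    """Count tag unigrams (as contexts) and tag bigrams, with sentence boundaries."""
    unigrams, bigrams = Counter(), Counter()
    for tags in tag_sequences:
        padded = [START] + list(tags) + [END]
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def sentence_probability(tags, model):
    """Product of unsmoothed bigram probabilities P(tag | previous tag)."""
    unigrams, bigrams = model
    padded = [START] + list(tags) + [END]
    probability = 1.0
    for previous, current in zip(padded, padded[1:]):
        if unigrams[previous] == 0:
            return 0.0  # context never seen in training
        probability *= bigrams[(previous, current)] / unigrams[previous]
    return probability


def is_grammatical(tags, model, threshold=0.0):
    """Flag a tag sequence as grammatical if its probability exceeds the threshold."""
    return sentence_probability(tags, model) > threshold


# Placeholder tag sequences standing in for a POS-tagged training corpus.
model = train([
    ["PRP", "NN", "VB"],
    ["PRP", "JJ", "NN", "VB"],
    ["NN", "VB"],
])
print(is_grammatical(["PRP", "NN", "VB"], model))
print(is_grammatical(["VB", "PRP", "NN"], model))
```

Without smoothing, any tag bigram that never occurred in training makes the whole sentence fail, which illustrates one limitation of a purely n-gram based check on small corpora.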
Related Publication of Bangla Grammar Checker:
Md. Jahangir Alam, Naushad UzZaman and Mumit Khan, N-gram based Statistical Grammar Checker for Bangla and English, Proc. of 9th International Conference on Computer and Information Technology (ICCIT 2006), Dhaka, Bangladesh, December 2006.
Backward n-gram for Bangla (CRBLP, BRAC U, Bangladesh)
Short description of Backward n-gram for Bangla:
This work presents a directional advantage in n-gram modeling, comparing backward and forward n-gram modeling in Bangla. The most commonly used n-gram analysis is predominantly a forward n-gram. However, in Bangla it appears that a backward n-gram is repeatedly more successful and yields more grammatical results than a forward n-gram. This work hypothesizes that the rationale behind this success is the syntactic ordering of constituents in Bangla. Bangla is a head-final, specifier-initial language, as opposed to English, which is head-initial and specifier-initial. Hence in Bangla, the head comes after its argument in a phrase. If an n-gram analysis begins with a head and moves backwards, it will reach that head's own argument, but if it moves forwards, it will probably pick up the argument of another head. As the probability of occurrence of heads is higher, the probability of depending on a head is also higher, and hence a backward n-gram will probably have a greater chance of yielding grammatical results. We carried out several experiments comparing the two directions in different applications, and the results showed an advantage in the backward direction. This should prove a useful linguistic insight for n-gram based analysis that depends on variations in constituent ordering.
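The difference between the two directions can be made concrete with bigrams: a forward model predicts a word from the word before it, while a backward model predicts a word from the word after it. The sketch below is a minimal illustration with invented romanized sentences and hypothetical function names; it does not reproduce the experiments.

```python
from collections import Counter, defaultdict


def train_bigram(sentences, backward=False):
    """Count bigrams; a backward model is a forward model over reversed sentences."""
    model = defaultdict(Counter)
    for words in sentences:
        sequence = list(reversed(words)) if backward else list(words)
        for context, target in zip(sequence, sequence[1:]):
            model[context][target] += 1
    return model


def predict(model, context):
    """Most likely neighbour of the context word (ties broken alphabetically)."""
    options = model.get(context)
    if not options:
        return None
    return min(options, key=lambda word: (-options[word], word))


# Placeholder verb-final sentences (illustration only).
sentences = [
    ["ami", "bhat", "khai"],
    ["ami", "ruti", "khai"],
    ["ami", "bhat", "ranna", "kori"],
]
forward = train_bigram(sentences)
backward = train_bigram(sentences, backward=True)

print(predict(forward, "ami"))    # word most often following "ami"
print(predict(backward, "khai"))  # word most often preceding "khai"
print(predict(forward, "khai"))   # nothing ever follows the sentence-final verb
```

In these verb-final toy sentences the final word has no forward continuation at all, whereas moving backwards from it leads straight to its argument.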
Related Publication of Backward n-gram for Bangla:
Naira Khan, Md. Tarek Habib, Md. Jahangir Alam, Rajib Rahman, Naushad UzZaman and Mumit Khan, History (forward n-gram) or Future (backward n-gram)? Which model to consider for n-gram analysis in Bangla?, Proc. of 9th International Conference on Computer and Information Technology (ICCIT 2006), Dhaka, Bangladesh, December 2006.