Research projects and fun coursework

Word Prediction (thesis work)

People who are unable to speak use devices that speak for them. They type words and the device speaks those words. But it's much slower than normal speech. The goal of word prediction is to help people type faster for these devices, especially people affected by motor impairments. If you're used to typing on a phone, you've probably used this before.

In this work, I'm looking to improve the guesses of the word prediction software so that the right word shows up sooner and closer to the top of the list. Currently, the most common method is the n-gram model, which generates predictions that fit the previous few words. These predictions are generally grammatically appropriate, but I'm working to make them also match what you're talking about and your personal style.
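
To make the n-gram idea concrete, here is a minimal sketch of a bigram predictor, the simplest case where only the one previous word is used. It is an illustration under my own assumptions, not the thesis software: the function names `train_bigrams` and `predict`, the toy corpus, and the prefix filter are all hypothetical.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    follow = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, cur in zip(words, words[1:]):
            follow[prev][cur] += 1
    return follow

def predict(follow, prev_word, prefix="", k=3):
    """Return up to k likely next words after prev_word that start with prefix."""
    candidates = follow.get(prev_word.lower(), Counter())
    matches = [(w, c) for w, c in candidates.items() if w.startswith(prefix.lower())]
    # Most frequent first; ties broken alphabetically for a stable order.
    matches.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in matches[:k]]

# Toy corpus, made up for illustration only.
corpus = [
    "i want to go home",
    "i want to eat lunch",
    "i want to go outside",
]
model = train_bigrams(corpus)
print(predict(model, "to"))       # ['go', 'eat']
print(predict(model, "to", "e"))  # ['eat']
```

The prefix argument mirrors how a prediction list narrows as the user types the first letters of the next word. A model like this knows nothing about topic or personal style, which is the gap the work described above is aimed at.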

Vehicle to Grid (V2G)

There are many problems with our power infrastructure - reliance on fossil fuels (e.g., coal) and reliance on foreign oil both cause problems. Renewable resources such as wind and solar power address both problems, but they only generate power intermittently. However, with a way to store energy in batteries and use it later, renewable resources may be able to (partially) replace fossil fuels.

Also, in the existing infrastructure there is economic incentive to connect batteries to the power grid - you can "sell" the ability of your batteries to stabilize the power grid. Power companies need this for a couple of reasons. Generators take a while to respond to changes in consumer demand, leaving a gap that needs to be temporarily filled by something that responds more quickly. Similarly, generators sometimes fail and a backup must kick in quickly to prevent things like blackouts.

The goal of this project is to develop a way to meet these needs in a future where pure electric vehicles are common. While a pure electric vehicle is plugged in overnight or at work, the battery could not only be recharged but also be used to sell grid stability. In the short term, this means electric vehicle owners would make money by leaving their cars plugged in more, enough to cover the costs of charging and more.

I designed and developed the original coalition software, which treats a collection of vehicles as a distributed battery. It takes quite a lot of work to use each battery when possible without causing the car's owner any trouble. My software runs both on a server that interacts with the power company and on the electric vehicles themselves, where it tells them to charge or discharge.
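
As a rough illustration of the "distributed battery" idea, the sketch below splits a discharge request across vehicles while leaving each owner a reserve. It is not the coalition software's actual algorithm: the `Vehicle` fields, the `dispatch` function, the spare-energy-first ordering, and the numbers are all assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class Vehicle:
    name: str
    stored_kwh: float   # energy currently in the battery
    reserve_kwh: float  # energy the owner needs kept for driving (hypothetical field)
    max_kw: float       # maximum discharge rate through the plug

def dispatch(vehicles, request_kw, hours):
    """Split a discharge request across vehicles without dipping into reserves.

    Returns (plan, unmet_kw), where plan maps vehicle name to kW supplied.
    Assumes hours > 0.
    """
    plan = {}
    remaining = request_kw
    # Ask the vehicles with the most spare energy first.
    by_spare = sorted(vehicles, key=lambda v: v.stored_kwh - v.reserve_kwh, reverse=True)
    for v in by_spare:
        spare_kwh = max(0.0, v.stored_kwh - v.reserve_kwh)
        # Power this vehicle can sustain for the whole interval.
        available_kw = min(v.max_kw, spare_kwh / hours)
        share = min(available_kw, remaining)
        if share > 0:
            plan[v.name] = share
            remaining -= share
    return plan, remaining

# Made-up fleet for illustration.
fleet = [
    Vehicle("a", stored_kwh=30.0, reserve_kwh=10.0, max_kw=10.0),
    Vehicle("b", stored_kwh=12.0, reserve_kwh=10.0, max_kw=10.0),
    Vehicle("c", stored_kwh=20.0, reserve_kwh=20.0, max_kw=10.0),
]
plan, unmet = dispatch(fleet, request_kw=15.0, hours=1.0)
print(plan)   # {'a': 10.0, 'b': 2.0}
print(unmet)  # 3.0
```

Vehicle "c" is never asked to discharge because all of its energy is reserved for the owner, which is the kind of constraint that keeps the scheme from causing the owner trouble.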

Natural Language Generation in Summarization

Summarization has become an important application for language processing. Most research has focused on extracting the most important information from the original document and presenting it within some length restriction. However, the coherence of the resulting text is usually ignored in current research. I looked at applying Natural Language Generation (NLG) techniques to make the summary easier to read. I did a little work towards this goal in a class. The three problems I've identified are discourse marker generation, sentence generation, and referring expression generation.

I implemented discourse marker generation using an RST-based summarization algorithm and the RST Discourse Treebank, along with statistical methods to generate the discourse markers. Unfortunately, with a reasonably high probability threshold, I only generated a few discourse markers, at least one of which was not applicable because both text spans involved were quotations.

Parse Tree Application

Working with a system that requires a parser can be difficult. It's even more difficult to debug a grammar. To help research involving parsers move faster, I've developed a tool for visualizing and comparing parse trees. This program, dubbed Parse Tree Application (PTA), allows a user to enter a parse tree by hand or from a file and then draws the tree. Two parses of the same sentence/string may be compared visually using PTA. Click on the link above to go to this application's website.

PTA has many useful features, such as the ability to output rendered trees in EPS or SVG, rendering of feature structures associated with constituents, and drawing of some partial parses (e.g., (S (NP I) (VP went (PP to the store)))).
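
The bracketed notation in the example above can be read into a nested structure with a short recursive parser. This is only a sketch of that input format, not PTA's code; `parse_tree` and `leaves` are hypothetical names, and the sketch assumes balanced brackets and does not handle feature structures.

```python
def parse_tree(text):
    """Parse a bracketed string into nested (label, children) tuples.

    Leaves (words) are kept as plain strings.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after tree")
    return tree

def leaves(tree):
    """Collect the words of the tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1]:
        words.extend(leaves(child))
    return words

tree = parse_tree("(S (NP I) (VP went (PP to the store)))")
print(tree)
print(" ".join(leaves(tree)))  # I went to the store
```

Note that the PP here has three word children and no internal structure, which is what makes it a partial parse.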

Predicting the gender of first names

I had the idea that first names carry morphological cues to gender. To evaluate this hypothesis, I collected lists of first names, separated by gender, for a few different countries (America, Ireland, India, Greece, France). I found that I could predict gender with about 80% accuracy for each country using a simple rule learning algorithm, even simpler than sequential covering.

Even though I allowed the system to learn multiple letters, the rules that were learned used only the last letter of the name.
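
Since the learned rules ended up depending only on the last letter, their final form can be sketched as a lookup from last letter to majority gender. This is a simplified stand-in, not the rule learner I actually used; the function names are hypothetical and the tiny name list is made up purely to show the mechanics.

```python
from collections import Counter, defaultdict

def learn_last_letter_rules(names_with_gender):
    """Learn one rule per final letter: predict the majority gender seen for it."""
    counts = defaultdict(Counter)
    for name, gender in names_with_gender:
        counts[name[-1].lower()][gender] += 1
    rules = {letter: c.most_common(1)[0][0] for letter, c in counts.items()}
    # Fallback for final letters never seen in training: overall majority gender.
    default = Counter(g for _, g in names_with_gender).most_common(1)[0][0]
    return rules, default

def predict_gender(name, rules, default):
    return rules.get(name[-1].lower(), default)

# Toy training list, for illustration only.
training = [
    ("Maria", "F"), ("Anna", "F"), ("Julia", "F"),
    ("John", "M"), ("Kevin", "M"), ("Brian", "M"), ("Mark", "M"),
]
rules, default = learn_last_letter_rules(training)
print(predict_gender("Sophia", rules, default))  # F (rule for final 'a')
print(predict_gender("Ethan", rules, default))   # M (rule for final 'n')
print(predict_gender("Alex", rules, default))    # M (no rule for 'x', falls back to default)
```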

Email Classification using Naive Bayes

As a teaching assistant, I get a lot of emails from students, and sorting them becomes a real problem when I'm assisting with several courses at once. Although I did this project to fulfill the requirements of CISC889: Machine Learning in the Spring of 2004, I came up with the idea before then. I designed an implementation of Naive Bayes to predict the folder that each new email should go in. While I didn't exhaust all of my ideas on the matter, I found that I could complete the artificial task I arranged with 80-90% accuracy. Subsequently, I derived a measure of confidence in the predictions so that an email is only moved to a folder if the confidence is high enough. This method achieved roughly 95% precision (correct folder) and 90-95% recall (percent of emails classified). The primary features used in classifying email were the sender's email address, keywords in the subject, and keywords in the body of the email. When I have the opportunity, I'll place the presentation I gave on this here. I have decided, however, not to continue work on this project, because an existing tool, POPFile, has implemented Naive Bayes email sorting.
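
The general approach can be sketched as a small multinomial Naive Bayes classifier that declines to file an email when its confidence is too low. This is an illustration under assumptions, not my project code: it uses the normalized posterior probability as the confidence measure (the measure I derived may differ), add-one smoothing, and made-up folder names, addresses, and feature strings.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSorter:
    def __init__(self):
        self.folder_counts = Counter()              # emails seen per folder
        self.feature_counts = defaultdict(Counter)  # feature counts per folder
        self.vocab = set()

    def train(self, features, folder):
        self.folder_counts[folder] += 1
        for f in features:
            self.feature_counts[folder][f] += 1
            self.vocab.add(f)

    def posteriors(self, features):
        """Return P(folder | features) for every folder seen in training."""
        total = sum(self.folder_counts.values())
        log_scores = {}
        for folder, n in self.folder_counts.items():
            # Add-one smoothing so unseen features do not zero out a folder.
            denom = sum(self.feature_counts[folder].values()) + len(self.vocab)
            score = math.log(n / total)
            for f in features:
                score += math.log((self.feature_counts[folder][f] + 1) / denom)
            log_scores[folder] = score
        # Normalize in a numerically stable way.
        top = max(log_scores.values())
        unnorm = {k: math.exp(v - top) for k, v in log_scores.items()}
        z = sum(unnorm.values())
        return {k: v / z for k, v in unnorm.items()}

    def classify(self, features, threshold=0.9):
        """Return the best folder, or None if the confidence is below threshold."""
        post = self.posteriors(features)
        folder = max(post, key=post.get)
        return folder if post[folder] >= threshold else None

# Made-up training emails; the prefixes mark sender, subject, and body features.
sorter = NaiveBayesSorter()
sorter.train(["from:alice@example.edu", "subj:homework", "body:deadline"], "cisc101")
sorter.train(["from:alice@example.edu", "subj:grade", "body:homework"], "cisc101")
sorter.train(["from:bob@example.edu", "subj:project", "body:parser"], "cisc882")

email = ["from:alice@example.edu", "subj:homework"]
print(sorter.classify(email, threshold=0.8))   # cisc101
print(sorter.classify(email, threshold=0.95))  # None (left in the inbox)
```

Raising the threshold trades recall for precision: fewer emails are moved, but the ones that are moved are more likely to land in the right folder.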

Summarizing Information Graphics

I worked for about a year with Sandee Carberry and Stephanie Elzer in understanding information graphics. The idea is that graphs, such as bar charts, are designed intentionally to be understood a certain way. This intentional graphic design is a form of communication, and language processing methods can be applied to the design. For instance, highlighting a particular bar or providing a value on top of a particular bar might allow the reader to deduce certain conclusions more easily.

My part in this work was twofold: to build a collection of information graphics and to investigate the utility of captions. In particular, I looked at captions that served as titles for the graphic.

Lexical Chains

I gave a presentation of Silber and McCoy's work in summarization with lexical chains in the Fall of 2003 for CISC882: NLP. Here is the presentation I gave with minor corrections:

The Word Scramble Problem

Latent Semantic Indexing (LSI) and Latent Semantic Analysis (LSA)

I gave a presentation of LSI and LSA in the Spring of 2003 for CISC889: Statistical NLP. Here it is with minor corrections: