Recent Projects

Dissertation and Related Work

I'm currently working on measuring the impact of statistical word alignments on phrase-based SMT (PBSMT). More specifically, I'm measuring the effect that different alignment characteristics (structure, quality) have on the PBSMT pipeline (phrase-table composition, translation models, translation quality, etc.).
My objective is to provide a multivariate statistical model that can predict how changes in word alignment composition influence translation performance, and thus to provide guidelines that help in fields such as discriminative word alignment research. My intention is to be done by the end of the summer.

What's brewing?
Regression, path analysis, factor analysis, descriptive and confirmatory statistics, etc.

What else have I dealt with?
Data Mining techniques (classifiers, clustering), Design of Experiments, ANOVA. I've also had some time to explore Multinomial Bayesian Classifiers, and I implemented my own version of CNMNB for document classification.

Speaker Verification

I have since joined the Speaker Verification TEChila team @ ITESM. My contributions focus on helping deploy advanced Machine Learning techniques in the current system. I'm the Advanced Statistics guy. This has given me the opportunity to learn more about speech processing, Mel Frequency Cepstral Coefficients, Gaussian Mixture Models, and enrollment techniques (MAP, JFA). I have also been able to reinforce and put into practice concepts such as the EM algorithm for Maximum Likelihood Estimation of Factor Analysis, Likelihood Ratio tests, etc. In addition, it has allowed me to coach one Master's student on his thesis.

Student Coaching

Unofficially, I've been coaching several of my labmates with their research projects, Master's thesis writing, and especially their thesis defenses. This has been a very enjoyable experience, because it has allowed me to share some of my older-PhD-guy experience with other students.


Gale Project (2008-2009)

During my research stay at Carnegie Mellon University, I was part of the CMU/Interact team for the Gale Project under the lead of Stephan Vogel. There I was in charge of the Discriminative Word Alignment (DWA) training. I had the chance to work with two languages new to me: Chinese and Arabic. I learned about segmentation, POS tagging, labeling and many other preprocessing techniques which are very useful (especially with Chinese). I also grew fond of word alignment analysis, which became my dissertation topic.

Nahuatl Translation (Summer 2008)

During the summer of 2008, I was the lead for a Nahuatl-Spanish translation project. I led a team of three undergraduate students who were visiting ITESM for an "innovation summer camp". For this project, we had to gather resources, build parallel corpora from scratch and train language and translation models. Given the data scarcity problems and the nature of the Nahuatl language itself, the results were modest. Still, we managed to set up a translation web service based on AJAX.

Trilingual Translation (2006-2008)

My initial dissertation proposal was inspired by the fact that there is a considerable amount of information shared between languages. As a polyglot, I've experienced this myself. For instance, I consider that learning French after learning English was easier for me given that French and English share some constructs that are unfamiliar in Spanish (at least in Mexican Spanish). Learning Italian after learning French was way easier. Take for instance the phrase Fui al mercado in Spanish vs. Je suis allé au marché in French and the phrase Sono andato al mercato in Italian. Both Spanish and Italian have the subject (yo, io) implicit in the verb. On the other hand, Italian and French share the passé composé structure, where certain verbs use être/essere as their auxiliary and others use avoir/avere. In most cases, they are the same kind of verbs. Thus, knowing French and Spanish, it's easier to guess what sono andato al mercato could mean.

I wanted to use this in cases where we had a limited amount of training data between two languages, say English and Nahuatl, but we had enough Nahuatl-Spanish and Spanish-English data. In that situation, we could use the third language (Spanish) as a source of additional information. This could help to generate para-translations, which can be helpful in cases where we don't know how to translate a certain construct. This idea was inspired by Chris Callison-Burch's work on paraphrases.
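
To make the pivoting idea concrete, here is a minimal sketch, not the system I actually built. It assumes two toy phrase tables stored as nested dictionaries of conditional probabilities and combines them by summing over the shared Spanish phrases. The function name pivot_phrase_table, the table layout and every probability below are made up for illustration only.

```python
from collections import defaultdict


def pivot_phrase_table(src_to_pivot, pivot_to_tgt):
    """Combine two phrase tables through a pivot language.

    src_to_pivot[s][p] holds P(p | s) and pivot_to_tgt[p][t] holds P(t | p).
    The result holds table[s][t] = sum over p of P(p | s) * P(t | p).
    """
    table = defaultdict(lambda: defaultdict(float))
    for s, pivots in src_to_pivot.items():
        for p, prob_sp in pivots.items():
            # Pivot phrases with no entry in the second table contribute nothing.
            for t, prob_pt in pivot_to_tgt.get(p, {}).items():
                table[s][t] += prob_sp * prob_pt
    return {s: dict(targets) for s, targets in table.items()}


# Toy Nahuatl-Spanish and Spanish-English tables (illustrative numbers only).
nah_es = {"tianquiztli": {"mercado": 0.8, "plaza": 0.2}}
es_en = {
    "mercado": {"market": 0.9, "marketplace": 0.1},
    "plaza": {"square": 0.7, "market": 0.3},
}

nah_en = pivot_phrase_table(nah_es, es_en)
for target, prob in sorted(nah_en["tianquiztli"].items(), key=lambda kv: -kv[1]):
    print(f"tianquiztli -> {target}: {prob:.2f}")
```

The sum over pivot phrases is the simplest way to combine the two tables. It ignores context and treats the two translation steps as independent, which is a strong assumption when data is scarce.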

Way back...

In 2004 I did my final degree project (projet de fin d'études), doing research on multiagent-based simulation for urban traffic. I implemented a framework for these types of scenarios based on the recently revived Madkit.