Current Projects

Ranking people, places, and things according to their fame, quality, or significance is an important task, serving to direct greater attention to prominent entities at the expense of lesser ones. Top 10 (or 100) lists satisfy people's need for order, and their curiosity about other people's opinions. Rank orderings are by nature time-dependent, subjective, and culturally biased. Still, we study the problem of ranking entities (primarily people) by "significance" through algorithmic methods.

Distributed word representations (word embeddings) have recently contributed to competitive performance in language modeling and several NLP tasks. In this work, we train word embeddings for more than 100 languages using their corresponding Wikipedias. We quantitatively demonstrate the utility of our word embeddings by using them as the sole features for training a part-of-speech tagger for a subset of these languages. We find their performance to be competitive with near-state-of-the-art methods in English, Danish, and Swedish. Moreover, we investigate the semantic features captured by these embeddings through the proximity of word groupings. We will release these embeddings publicly to help researchers in the development and enhancement of multilingual applications.
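As a toy illustration of using embeddings as the sole features for tagging, the sketch below assigns each word the part of speech whose centroid vector lies nearest to the word's embedding. The real tagger trains a proper classifier on embeddings learned from Wikipedia; the vectors, words, and tag set here are invented for the example.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def tag(word, embeddings, centroids):
    """Tag a word with the part of speech whose centroid embedding is
    nearest to the word's embedding. A nearest-centroid rule stands in
    for the trained classifier used in the actual tagger."""
    vec = embeddings[word]
    return min(centroids, key=lambda t: euclidean(vec, centroids[t]))

# Invented 3-dimensional embeddings and tag centroids for illustration.
emb = {"run": [0.9, 0.1, 0.0], "dog": [0.1, 0.9, 0.1]}
cent = {"VERB": [0.8, 0.2, 0.1], "NOUN": [0.2, 0.8, 0.2]}
```

Because the embedding is the only input to `tag`, this mirrors the setup in which embeddings are the sole features available to the tagger.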

The increasing diversity of languages used on the web introduces a new level of complexity to Information Retrieval (IR) systems. We can no longer assume that textual content is written in one language or even the same language family. In this paper, we demonstrate how to build massive multilingual annotators with minimal human expertise and intervention. We describe a system that builds Named Entity Recognition (NER) annotators for 40 major languages using Wikipedia and Freebase. Our approach does not require human-annotated NER datasets or language-specific resources such as treebanks, parallel corpora, and orthographic rules. The novelty of our approach lies in using only language-agnostic techniques while achieving competitive performance. Our method learns distributed word representations (word embeddings) which encode semantic and syntactic features of words in each language. Then, we automatically generate datasets from the Wikipedia link structure and Freebase attributes. Finally, we apply two preprocessing stages (oversampling and exact surface form matching) which do not require any linguistic expertise.
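Of the two preprocessing stages, oversampling is the easiest to sketch. Below is a minimal, generic version that duplicates examples of minority classes until all classes are equally frequent; the sampling scheme actually used in the system may differ.

```python
import random
from collections import Counter

def oversample(examples, labels, seed=0):
    """Naively oversample minority classes by duplicating their examples
    (sampled with replacement) until every label is as frequent as the
    most common one. Requires no linguistic knowledge of the examples."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    by_label = {}
    for x, y in zip(examples, labels):
        by_label.setdefault(y, []).append(x)
    out_x, out_y = [], []
    for y, xs in by_label.items():
        picks = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(picks)
        out_y.extend([y] * target)
    return out_x, out_y
```

For example, `oversample(["a", "b", "c", "d"], ["x", "x", "x", "y"])` returns six examples, three per label.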

Who is similar to Barack Obama, Michael Jordan, or Wolfgang Mozart? Peoplesimilarity is a project focused on identifying analogous historical figures. We represent each figure by his or her Wikipedia page, analyzing its features using distributed word embeddings, word clusters, and topic analysis. Similarity is then calculated from the distance or divergence between the vectors representing people. The results are reasonable and interesting. Our demo uses the pre-processed results to find the nearest people for each person and lists them sorted by similarity percentage.
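A minimal sketch of the similarity computation, using cosine similarity over hand-made feature vectors (the real system derives its vectors from Wikipedia pages and may use other distances or divergences):

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two dense feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def most_similar(name, vectors, k=2):
    """Rank every other figure by cosine similarity to `name`,
    reporting similarity as a percentage, as in the demo."""
    scores = [(other, cosine_sim(vectors[name], vectors[other]))
              for other in vectors if other != name]
    scores.sort(key=lambda p: p[1], reverse=True)
    return [(other, round(100 * s, 1)) for other, s in scores[:k]]

# Invented 3-dimensional feature vectors for illustration only.
figures = {
    "Barack Obama":    [0.9, 0.1, 0.2],
    "Abraham Lincoln": [0.8, 0.2, 0.1],
    "Wolfgang Mozart": [0.1, 0.9, 0.8],
}
```

With these toy vectors, the nearest neighbor of Barack Obama is Abraham Lincoln rather than Wolfgang Mozart.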

A web application for visualizing high-dimensional word embeddings.

Generated Sentiment Lexicons for multilingual sentiment analysis.

Previous Projects
The problem of ethnicity identification from names has a variety of important applications, including biomedical research, demographic studies, and marketing. We have built an ethnicity classifier whose training data is extracted entirely from public, non-confidential sources. Our classifier uses hidden Markov models (HMMs) and decision trees to classify names into 13 cultural/ethnic groups, with per-group accuracy comparable to that of earlier binary (e.g., Spanish/non-Spanish) classifiers. We have applied this classifier to over 20 million names from a large-scale news corpus, identifying interesting temporal and spatial trends in the representation of particular cultural/ethnic groups.
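The classifier combines HMMs with decision trees; as a rough, self-contained illustration of the character-sequence idea, the sketch below trains a smoothed character-bigram (visible Markov) model per group and classifies a name by log-likelihood. This is a simplification of the actual model, and the training names, smoothing constant, and alphabet size are all invented for the example.

```python
import math
from collections import defaultdict

def train_markov(names, alpha=1.0):
    """Estimate smoothed character-bigram transition log-probabilities
    from a list of names (one such model per cultural/ethnic group)."""
    counts = defaultdict(lambda: defaultdict(float))
    for name in names:
        padded = "^" + name.lower() + "$"  # add start/end markers
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1

    def logprob(a, b):
        total = sum(counts[a].values())
        # 27 ~ alphabet size plus end marker: a rough smoothing constant.
        return math.log((counts[a][b] + alpha) / (total + alpha * 27))

    return logprob

def classify(name, models):
    """Assign the name to the group whose character model gives it the
    highest log-likelihood."""
    padded = "^" + name.lower() + "$"

    def score(lp):
        return sum(lp(a, b) for a, b in zip(padded, padded[1:]))

    return max(models, key=lambda g: score(models[g]))
```

For instance, with one model trained on a few Spanish surnames and another on a few English ones, an unseen name like "gonzalez" scores higher under the Spanish model.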

Online content analysis employs algorithmic methods to identify entities in unstructured text. Both machine learning and knowledge-base approaches lie at the foundation of contemporary named entity extraction systems. However, progress in deploying these approaches at web scale has been hampered by the computational cost of NLP over massive text corpora. We present SpeedRead (SR), a named entity recognition pipeline that runs at least 10 times faster than the Stanford NLP pipeline. The pipeline consists of a high-performance, Penn Treebank-compliant tokenizer, a near-state-of-the-art part-of-speech (POS) tagger, and a knowledge-based named entity recognizer.

TextMap tracks references to people, places, and things appearing in news text, so as to identify meaningful relationships between them. TextMap monitors the state of the world by analyzing both the temporal and spatial distribution of these entities. We currently analyze over 1000 domestic and international news sources every day. TextMap uses natural language processing techniques to identify entity references and a variety of statistical techniques to analyze the juxtapositions between them.
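As an illustration of one statistical technique that could score such juxtapositions, the sketch below ranks entity pairs by pointwise mutual information (PMI) over within-article co-occurrence. TextMap's actual scoring functions are not detailed here, and the entity names and articles are invented.

```python
import math
from collections import Counter
from itertools import combinations

def juxtaposition_scores(articles):
    """Score entity pairs by PMI over co-occurrence within articles:
    PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ), with probabilities
    estimated as fractions of articles mentioning the entities."""
    n = len(articles)
    entity = Counter()
    pair = Counter()
    for ents in articles:
        ents = set(ents)
        entity.update(ents)
        pair.update(frozenset(p) for p in combinations(sorted(ents), 2))
    scores = {}
    for p, c in pair.items():
        a, b = tuple(p)
        scores[p] = math.log(c * n / (entity[a] * entity[b]))
    return scores
```

Pairs that co-occur more often than their individual frequencies predict receive positive scores, flagging potentially meaningful relationships.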

TextMed is a search engine for medical entities: diseases, drugs, chemicals, organs, and organisms. TextMed aims to identify relationships between these medical entities. TextMed uses natural language processing techniques to track medical entity references from the scientific literature, and a variety of statistical techniques to analyze the relationships between them. TextMed presents our analysis of roughly 15 million Medline/PubMed medical abstracts, with new abstracts analyzed as they arrive each day.