Data Science Primer: for students doing their first data science project, with helpful Linux commands and advice.


Controversy Lexicon

Yelena Mejova, Amy X. Zhang, Nicholas Diakopoulos, Carlos Castillo. "Controversy and Sentiment in Online News". Computation+Journalism Symposium (CJ), 2014.

Using a multi-stage crowdsourced effort, we have created a lexicon of terms associated with controversial topics (primarily in the US press). We distinguish between controversial and weakly controversial terms, and also provide some non-controversial terms.

Enriched American Food Lexicon

Sofiane Abbar, Yelena Mejova, Ingmar Weber. "You Tweet What You Eat: Studying Food Consumption Through Twitter". Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2015.

The foods were extracted from a large sample of food-related tweets by 210K users in late 2013. We began by collecting 50M tweets through the Twitter Streaming API using a hand-picked keyword filter over the period 2013/10/29 - 2013/11/29. We then selected all geo-tagged tweets and randomly sampled 210K users from the US, for whom we collected up to 3.2K historical tweets each. We used this dataset to bootstrap a new lexicon: a crowdsourced labeling of these new tweets was used to build a Naive Bayes classifier. Finally, we selected the 500 most popular terms in the tweets that the classifier deemed food-related, then manually cleaned and annotated them with the information below. Ambiguous terms, including seeds, beverage, and brewed, as well as food characteristics like powdered, salted, and mashed, were removed.
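The classify-then-count step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the whitespace tokenizer, the "food"/"other" label names, the add-one smoothing, and the toy tweets are all assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real tweet tokenization)."""
    return text.lower().split()

def train_naive_bayes(labeled_tweets):
    """Collect per-class document and word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in labeled_tweets:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for tok in tokenize(text):
            counts[tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-posterior, using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # skip words never seen during training
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def top_terms(model, tweets, k):
    """Count terms in tweets classified as 'food' and return the k most frequent."""
    counts = Counter()
    for text in tweets:
        if predict(model, text) == "food":
            counts.update(tokenize(text))
    return [term for term, _ in counts.most_common(k)]

# Toy, made-up data standing in for the crowdsourced labels (assumption).
labeled = [
    ("pizza and pasta for dinner", "food"),
    ("best burger and fries", "food"),
    ("fresh pizza with cheese", "food"),
    ("traffic on the highway", "other"),
    ("watching the game tonight", "other"),
    ("rain and traffic again", "other"),
]
model = train_naive_bayes(labeled)
unlabeled = ["pizza pizza burger", "traffic tonight", "cheese pizza"]
print(top_terms(model, unlabeled, 2))
```

In the pipeline described above, k would be 500 and the resulting terms would still go through manual cleaning and annotation.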

Foursquare Category Hierarchy

Are you working with Foursquare data, but having difficulty with the formatting of their category documentation? The files below contain the hierarchy as described on March 10, 2015, in files that are easy to process and interact with.

Instagram restaurant tag labels

Yelena Mejova, Hamed Haddadi, Anastasios Noulas, Ingmar Weber. "#FoodPorn: Obesity Patterns in Culinary Interactions". The 5th International Conference on Digital Health, 2015.

These are the top 2000 tags (excluding tags in non-Latin alphabets) collected from Instagram images taken at restaurants across the United States during September, October, and November 2014. The tags were labeled using CrowdFlower, taking the majority label out of 3 annotations. Agreement on these tasks was very high, at 92-99% label overlap.
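Taking a majority label from three annotations can be sketched as below. The tags and label names are hypothetical. The overlap measure (the share of individual annotations that match their tag's majority label) is only one plausible reading of "label overlap", since the exact definition is not given here.

```python
from collections import Counter

def majority_label(annotations):
    """Return the most frequent label, or None if no label has a strict majority."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count > len(annotations) / 2 else None

def label_overlap(all_annotations):
    """Fraction of individual annotations that match their tag's majority label.

    This definition is an assumption made for illustration.
    """
    matches = total = 0
    for annotations in all_annotations.values():
        winner = majority_label(annotations)
        matches += sum(1 for a in annotations if a == winner)
        total += len(annotations)
    return matches / total

# Hypothetical tags with three annotations each.
annotations = {
    "pizza": ["food", "food", "food"],
    "yum": ["food", "food", "other"],
    "nyc": ["place", "place", "place"],
}
labels = {tag: majority_label(a) for tag, a in annotations.items()}
print(labels)
print(round(label_overlap(annotations), 3))
```

With three annotators and more than two possible labels, a three-way split has no majority; the sketch returns None in that case so such tags can be reviewed by hand.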

Names of Qatari Tribes and Families in Arabic and English

The tiny country of Qatar has a fascinating and rich history, some of which can be found in the names of the most prominent families in the country. In the list you can find the family names in Arabic and several versions in English.

Halal on Instagram: sub-topical lexicons

Yelena Mejova, Youcef Benkhedda, Khairani. "#Halal Culture on Instagram". Frontiers in Digital Humanities: Big Data, 2017.

These lexicons are for Arabic, English, and Indonesian languages, and were extracted from a large collection of Instagram posts mentioning #halal (in English/Indonesian and Arabic). The topics span food, religion, animal trade, health, and supplements.


Language of Politics on Twitter

AI Summer School, American University of Beirut, June 16, 2015
At International Conference on Web and Social Media 2017