Shared Evaluation

One of the most important challenges that we see in distributional semantics is fragmentation with regard to data sets, methods and evaluation metrics, which makes it difficult to compare studies and achieve scientific progress.

We want to improve the comparability of geometrical models through shared evaluation. To this end, GEMS 2011 will provide two datasets suitable for the evaluation of distributional models. These datasets cover two of the most compelling challenges in distributional semantics today: differentiation between semantic relations and compositionality. The first set will include concrete nouns belonging to different semantic classes (living, non-living, etc.) with associated sets of other words for specific semantic relations such as "attribute", "category coordinate", "event", or "metonym". The second dataset will contain phrase similarity judgments, in order to address the evaluation of distributional models in "compositional tasks", i.e. beyond single words.

Authors of papers submitted to GEMS are strongly encouraged to test and evaluate their models on the data below, or, if this is impossible, to discuss why their models are not applicable. The goal is to allow researchers to directly compare the output of their models, and to gain better insight into a quickly growing field. We suggest that participants use (concatenated) PukWaC and WaCkypedia_EN as source corpora, and that they test their models on the following test sets:

1. BLESS data (Baroni-Lenci Evaluation of Semantic Similarity)

The first data set includes 200 concrete nouns (100 animate and 100 inanimate nouns) from different classes (e.g., tools, clothing, vehicles, animals). Each target noun is associated with a set of other words (nouns, verbs or adjectives) via the following semantic relations:

- hyperonymy

e.g., yacht hyper {boat, craft, vehicle, vessel, watercraft}

- cohyponymy (aka coordinate terms)

e.g., yacht cohypo {sailboat, ferry, steamboat, motorboat}

- meronymy

e.g., yacht mero {sail, sailor, engine, motor, cabin, fin, keel}

- typical attribute

e.g., yacht attri {expensive, fancy, large, big, luxurious, wooden}

- typical related event

e.g., yacht event {cruise, sail, transport, race}

- random

e.g., yacht random {justice, flower, apple, walk}

The full dataset can be downloaded from the bottom of this page.

We recommend that participants who test their models on this data set report results according to the following procedure. First, for each target, pick the nearest neighbour from each semantic relation type (the nearest hyperonym, the nearest part, etc.). Then, report the distribution, across concepts, of the cosines of the selected elements, grouped by semantic relation. A convenient way to report such a distribution is by means of a boxplot. Testing models on all the relations in the data set and cross-relation evaluations are strongly encouraged, but participants can also choose to address one relation only.
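The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the (concept, relation, relatum) triple layout and the two-dimensional toy vectors are hypothetical and are not taken from the dataset or its README, which remains the authoritative description of the file format.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_by_relation(triples, vectors):
    """For each (concept, relation), keep the cosine of the nearest relatum,
    then group those cosines by relation (one value per concept)."""
    best = {}
    for concept, relation, relatum in triples:
        if concept not in vectors or relatum not in vectors:
            continue  # assumption: items missing from the model are skipped
        score = cosine(vectors[concept], vectors[relatum])
        key = (concept, relation)
        if key not in best or score > best[key]:
            best[key] = score
    by_relation = defaultdict(list)
    for (_concept, relation), score in best.items():
        by_relation[relation].append(score)
    return dict(by_relation)


# Hypothetical toy model; real vectors would come from the participant's system.
toy_vectors = {
    "yacht": [1.0, 0.0],
    "boat": [1.0, 0.0],
    "vehicle": [1.0, 1.0],
    "justice": [0.0, 1.0],
    "flower": [-1.0, 0.0],
}
toy_triples = [
    ("yacht", "hyper", "boat"),
    ("yacht", "hyper", "vehicle"),
    ("yacht", "random", "justice"),
    ("yacht", "random", "flower"),
]
toy_result = nearest_by_relation(toy_triples, toy_vectors)
```

Each list in the returned dictionary holds one cosine per concept for that relation, so the lists can be passed directly to a plotting routine (for instance `matplotlib.pyplot.boxplot`) to obtain one box per relation.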

Please see the README file distributed with the dataset for further details.

2. Compositional semantics (Mitchell & Lapata)

The second data set is taken from Mitchell and Lapata's (2008, 2010) research on compositional semantics. In a psycholinguistic experiment, Mitchell and Lapata asked participants to rate the similarity between two adjective-noun combinations, verb-object combinations or compound nouns. For example, the following excerpt of the data shows that participant 2 gave a similarity score of 7 (out of 7) to the adjective-noun combinations "vast amount" and "large quantity", but a 1 to "hot weather" and "elderly lady".

> participant2 adjectivenouns 0 vast amount large quantity 7

> participant2 adjectivenouns 0 hot weather elderly lady 1
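As a rough sketch, lines like the two above could be read as follows. The layout assumed here (eight whitespace-separated fields, two words per phrase) is inferred from the excerpt only; the `Judgment` type and its field names are hypothetical, and the meaning of the third field is not stated in this text, so it is kept verbatim.

```python
from typing import NamedTuple, Tuple


class Judgment(NamedTuple):
    participant: str
    phrase_type: str
    third_field: str          # meaning not given in the excerpt; stored unchanged
    phrase1: Tuple[str, str]
    phrase2: Tuple[str, str]
    rating: int


def parse_judgment(line: str) -> Judgment:
    """Parse one line, assuming 8 whitespace-separated fields as in the excerpt."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    participant, phrase_type, third, w1, w2, w3, w4, rating = fields
    return Judgment(participant, phrase_type, third, (w1, w2), (w3, w4), int(rating))


example = parse_judgment("participant2 adjectivenouns 0 vast amount large quantity 7")
```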

These similarity scores for each combination were then used to evaluate a variety of models of compositional semantics. They can be obtained here:

We recommend that participants, like Mitchell and Lapata (2010), use their models to compute a similarity score for all of the adjective-noun combinations, verb-object combinations and compound nouns, and report the Spearman correlation between these figures and all of Mitchell and Lapata's participants' scores. For example, when the model predicts a value of 1.5 for one data point, and three of Mitchell and Lapata's participants assigned the scores 1, 2 and 2, respectively, three pairs will enter the final correlation analysis: (1.5,1), (1.5,2) and (1.5,2).
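The pairing scheme described above can be illustrated with a small, dependency-free sketch. The function names and the dictionary layout are assumptions made for this example; the Spearman coefficient is computed here as the Pearson correlation of average ranks, which is the usual way of handling ties, and participants may of course prefer an established statistics package.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def expand_pairs(predictions, ratings):
    """Pair each model prediction with every individual participant score."""
    return [
        (predictions[item], score)
        for item, scores in ratings.items()
        for score in scores
    ]


# The data point from the text: one prediction, three participant scores.
pairs = expand_pairs({"item1": 1.5}, {"item1": [1, 2, 2]})
```

With real data, the model scores and participant scores from all expanded pairs would be unzipped into two lists and passed to `spearman` to obtain the figure to report.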

Erratum: earlier we asked participants in this task to compute the correlation between their model's predictions and the mean of Mitchell and Lapata's experimental results. In order to account better for variation across subjects, we have replaced this evaluation metric by the one above. Please adjust your scores accordingly.