Resources

ONLINE SERVICES & DEMOS

Universal Evaluation Service for NLP & IR systems. Users can upload their system outputs and receive a LaTeX report and CSV files with evaluation results for all suitable metrics, including a direct comparison with the best system outputs stored in EvALL. It currently covers ranking (including ranking with diversity), classification, and clustering tasks.

DATASETS

HESML (Half-Edge Semantic Measures Library). An efficient and scalable Java semantic measures library for the general and biomedical domains.

The MC4WEPS (Multilingual Corpus for Web People Search) corpus provides a realistic scenario for training and evaluating systems that disambiguate web people searches. Its two main features are that it includes multilingual results and that it preserves social networking profiles.

Please contact us to download the corpus; we would like to keep track of who has downloaded it.

Two sets of annotations for evaluating the task of entity profiling in microblog posts. The first dataset was created using a pooling methodology: several methods for automatically extracting entity-relevant aspects from tweets were implemented, and human assessors labeled each candidate as relevant or not. The second dataset contains opinion targets: annotators considered individual tweets related to an entity and manually identified whether each tweet is opinionated; if so, they annotated which part of the tweet is subjective and what the target of the sentiment is.

1,000 reviews extracted from booking.com.

5,496 words and 2,190 synsets from WordNet 2.1 labeled with an emotional category.

Test-suite for Information Synthesis studies, made up of 72 manually-generated reports (topic-oriented summaries of large sets of relevant documents).

User logs capturing all the information relevant to user interaction with the search interface during the iCLEF 2008-2009 campaigns.

A corpus testbed for people search algorithms.

CODE

The heterogeneity property of text evaluation measures states that the probability of a real (i.e., human-assessed) similarity increase is directly related to the heterogeneity of the set of automatic similarity measures that corroborate that increase. This script implements a method for combining similarity measures based on the heterogeneity principle. The method is completely unsupervised (it does not use any human assessments of the quality of the measures being combined) and leads to top-performing combined similarity measures in multiple tasks, such as Document Clustering, Textual Entailment, Semantic Text Similarity, and automatic MT and Summarization evaluation.
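To illustrate the idea (this sketch and its function name are ours, not the released script's actual interface): each text pair can be scored by counting, across all competing pairs, how many individual measures agree that it is the more similar one, giving an unsupervised, rank-based combination that rewards agreement among heterogeneous measures.

```python
# Illustrative sketch only: an unsupervised, rank-based combination of
# similarity measures in the spirit of the heterogeneity principle.  The
# combined score of a text pair counts how often the individual measures
# agree that it is more similar than each competing pair.
def combined_similarity(scores):
    """scores maps each text-pair id to a list of values, one per measure."""
    combined = {}
    for p, sp in scores.items():
        votes = 0
        for q, sq in scores.items():
            if p == q:
                continue
            # one vote per measure that ranks pair p above pair q
            votes += sum(x > y for x, y in zip(sp, sq))
        combined[p] = votes
    return combined
```

No human assessments are involved: the combined ranking emerges purely from agreement among the input measures.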

Many Artificial Intelligence tasks cannot be evaluated with a single quality criterion, and some sort of weighted combination is needed to provide system rankings. A problem with weighted combination measures is that slight changes in the relative weights may produce substantial changes in the system rankings. This software implements the Unanimous Improvement Ratio (UIR), a measure that complements standard metric combination criteria (such as van Rijsbergen's F-measure) and indicates how robust the measured differences are to changes in the relative weights of the individual metrics. UIR is meant to elucidate whether a perceived difference between two systems is robust across different weightings of the individual metrics.
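As an illustration, a common formulation of UIR counts the test cases in which one system improves the other on every metric simultaneously (a minimal sketch under that assumption; the function name is ours, not the released software's API):

```python
# Sketch of the Unanimous Improvement Ratio (UIR): the signed proportion
# of test cases in which one system beats the other on ALL metrics at once,
# so the result does not depend on any relative weighting of the metrics.
def uir(metrics_a, metrics_b):
    """Each argument is a list of per-test-case metric tuples."""
    a_wins = b_wins = 0
    for ma, mb in zip(metrics_a, metrics_b):
        if all(x >= y for x, y in zip(ma, mb)) and any(x > y for x, y in zip(ma, mb)):
            a_wins += 1  # A unanimously improves B on this test case
        elif all(y >= x for x, y in zip(ma, mb)) and any(y > x for x, y in zip(ma, mb)):
            b_wins += 1  # B unanimously improves A on this test case
    return (a_wins - b_wins) / len(metrics_a)
```

Values near 1 or -1 indicate differences that hold regardless of how the individual metrics are weighted; values near 0 indicate differences that are sensitive to the choice of weights.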

Some key Information Access tasks -- Document Retrieval, Clustering, Filtering, etc. -- can be seen as instances of a generic "document organization" problem that establishes priority and relatedness relationships between documents. This software implements two complementary evaluation measures -- Reliability and Sensitivity -- for the generic Document Organization task, derived from a set of formal constraints (properties that any suitable measure must satisfy).

For each of the tasks subsumed under the document organization problem, Reliability and Sensitivity satisfy more formal constraints than previously existing evaluation metrics. In addition, their most characteristic feature is their strictness: to reach high Reliability and Sensitivity values, a system must also achieve high values on all standard evaluation measures.

SentiSense Tagger and SentiSense Visualizer are included in the SentiSense Tools package.