

The reading club, led by Davide, is on this paper:
Dance in the World of Data and Objects

Del Vicario, Michela, et al. "The spreading of misinformation online."
Proceedings of the National Academy of Sciences 113.3 (2016): 554-559.

Oana will lead the discussion of this paper:
Linkset Quality Assessment for the Thesaurus Framework LusTRE
by Riccardo Albertoni, Monica De Martino, Paola Podestà
The pdf is attached; the abstract follows below.

I was interested in this paper because it presents a way of assessing the quality improvement gained by adding links to datasets. This is relevant not only to projects doing vocabulary alignment, but also to Xander's link prediction and Oana's crowd enrichment work. Such measures would be nice to report in our evaluations of datasets.

Abstract: Recently, a great number of controlled vocabularies (e.g., thesauri), covering several domains and shared by different communities, have been published and interlinked using the Linked Data paradigm. Remarkable efforts have been spent by data producers to make their thesauri compliant with Linked Data requirements, both in content encoding and in the connections (aka linksets) with other thesauri. In our experience creating the framework of multilingual linked thesauri for the environment (LusTRE), within the EU-funded project eENVplus, developing the interlinking among thesauri required significant effort; thus, evaluating its quality in terms of usefulness and enrichment of information became a critical issue. In this paper, to support our claim, we discuss the results of the quality evaluation of several linksets created in LusTRE. To this purpose, we consider two quality measures, the average linkset reachability and the average linkset importing, which quantify the linkset-accessible information.
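The paper defines its two measures formally; as a loose illustration only (my own toy construction, not the authors' definitions), one can think of linkset reachability as counting how many concepts in a target thesaurus become accessible through each link:

```python
# Toy illustration (not the paper's formal definition): for each link
# source -> target, count the concepts reachable in the target thesaurus
# by following its hierarchy, then average over the linkset.
from collections import deque

def reachable(graph, start):
    """Concepts reachable from `start` via hierarchy edges (BFS)."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def average_linkset_reachability(linkset, target_hierarchy):
    """Mean number of target concepts made accessible per link."""
    counts = [len(reachable(target_hierarchy, tgt)) for _, tgt in linkset]
    return sum(counts) / len(counts)

# Hypothetical target thesaurus (broader -> narrower edges) and linkset.
target = {"water": ["river", "lake"], "river": ["delta"]}
linkset = [("acqua", "water"), ("fiume", "river")]
print(average_linkset_reachability(linkset, target))  # (4 + 2) / 2 = 3.0
```

The real measures also account for importing of linkset-accessible information; this sketch only conveys the flavor of counting what a link makes reachable.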

- The Resource Identification Initiative: A cultural shift in publishing
Up to now, relation extraction systems have made extensive use of features generated by linguistic analysis modules. Errors in these features lead to errors of relation detection and classification. In this work, we depart from these traditional approaches with complicated feature engineering by introducing a convolutional neural network for relation extraction that automatically learns features from sentences and minimizes the dependence on external toolkits and resources. Our model takes advantage of multiple window sizes for filters and pre-trained word embeddings as an initializer on a non-static architecture to improve the performance. We emphasize the relation extraction problem with an unbalanced corpus. The experimental results show that our system significantly outperforms not only the best baseline systems for relation extraction but also the state-of-the-art systems for relation classification.
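The key architectural idea — filters of several window sizes sliding over word embeddings, followed by max-over-time pooling — can be sketched in plain NumPy (the dimensions, filter counts, and random "embeddings" below are invented for illustration, not the paper's setup):

```python
# Sketch of multi-window convolution over a sentence's word embeddings,
# with max-over-time pooling per filter; the pooled features from all
# window sizes are concatenated into a single sentence vector.
import numpy as np

rng = np.random.default_rng(0)
emb_dim, n_filters = 4, 3
sentence = rng.normal(size=(7, emb_dim))          # 7 words, toy embeddings

features = []
for window in (2, 3, 4):                          # multiple window sizes
    filters = rng.normal(size=(n_filters, window, emb_dim))
    # Convolve each filter over every window position in the sentence.
    convs = np.array([
        [np.sum(sentence[i:i + window] * f)
         for i in range(len(sentence) - window + 1)]
        for f in filters
    ])
    features.append(convs.max(axis=1))            # max over time
sentence_vec = np.concatenate(features)           # 3 windows x 3 filters
print(sentence_vec.shape)                         # (9,)
```

In the paper this vector would feed a classification layer; here it simply shows how multiple window sizes yield a fixed-length representation regardless of sentence length.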

In the next reading club, we will discuss this paper:

- Let’s make peer review scientific.

The paper below received the 2014 IJCAI-JAIR Best Paper Prize; it is well worth reading.

E. Gabrilovich and S. Markovitch (2009). "Wikipedia-based Semantic Interpretation for Natural Language Processing." Journal of Artificial Intelligence Research, Volume 34, pages 443-498. doi:10.1613/jair.2669

Winner of the 2014 IJCAI-JAIR Best Paper Prize.

Adequate representation of natural language semantics requires access to vast amounts of common sense and domain-specific world knowledge. Prior work in the field was based on purely statistical techniques that did not make use of background knowledge, on limited lexicographic knowledge bases such as WordNet, or on huge manual efforts such as the CYC project. Here we propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic interpretation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of concepts derived from Wikipedia, the largest encyclopedia in existence. We explicitly represent the meaning of any text in terms of Wikipedia-based concepts. We evaluate the effectiveness of our method on text categorization and on computing the degree of semantic relatedness between fragments of natural language text. Using ESA results in significant improvements over the previous state of the art in both tasks. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.
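As a rough sketch of the ESA idea (the toy "articles" and plain term counts below are my own stand-ins; the paper uses TF-IDF weights over the real Wikipedia):

```python
# ESA-style sketch: each concept (stand-in for a Wikipedia article) gets a
# term-count vector; a text's meaning is its vector of concept activations,
# and relatedness is the cosine between two such vectors.
import math
from collections import Counter

concepts = {                      # hypothetical stand-ins for articles
    "Jaguar (animal)": "jaguar cat predator jungle animal",
    "Jaguar Cars":     "jaguar car engine vehicle speed",
    "Football":        "ball goal team match player",
}
index = {name: Counter(text.split()) for name, text in concepts.items()}

def interpret(text):
    """Map a text to a vector of concept activations."""
    words = text.lower().split()
    return [sum(index[c][w] for w in words) for c in concepts]

def relatedness(a, b):
    """Cosine similarity between the concept vectors of two texts."""
    va, vb = interpret(a), interpret(b)
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(x * x for x in vb))
    return dot / norm if norm else 0.0

print(relatedness("jaguar jungle", "predator cat"))  # high: shared concept
print(relatedness("jaguar jungle", "goal match"))    # 0.0: no shared concept
```

Because the dimensions are named concepts rather than latent factors, the activations themselves explain why two texts are judged related — the property the abstract highlights.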


The best paper of WebSci16: "Understanding Video-Ad Consumption on YouTube" by Mariana Arantes, Flavio Figueiredo, and Jussara M. Almeida.

Discussion points:
- social media data access
- (indirectly) measuring user interaction
- data bias

To this end, I think that two other papers add interesting points to this discussion:

- the extended abstract from Ricardo Baeza-Yates, Data and Algorithmic Bias in the Web
- a paper by Katrin Weller and Katharina E. Kinder-Kurlanda, A Manifesto for Data Sharing in Social Media Research

This is a paper on Linked Data exploration tools that Liliana pointed me to.
Reading it, I thought it was interesting not only for the DIVE team but also for, for example, Niels (Linked Data browsers) and Valentina (Recommendations).

Paper: Survey of linked data based exploration systems
Authors: Nicolas Marie and Fabien Gandon

- Bradi Heaberlin and Simon DeDeo. The Evolution of Wikipedia’s Norm Network.

Self-organizing semantic maps

Self-organized formation of topographic maps for abstract data, such as
words, is demonstrated in this work. The semantic relationships in the
data are reflected by their relative distances in the map. Two different
simulations, both based on a neural network model that implements the
algorithm of the self-organizing feature maps, are given. For both, an
essential, new ingredient is the inclusion of the contexts, in which
each symbol appears, into the input data. This enables the network to
detect the “logical similarity” between words from the statistics of
their contexts. In the first demonstration, the context simply consists
of a set of attribute values that occur in conjunction with the words.
In the second demonstration, the context is defined by the sequences in
which the words occur, without consideration of any associated
attributes. Simple verbal statements consisting of nouns, verbs, and
adverbs have been analyzed in this way. Such phrases or clauses involve
some of the abstractions that appear in thinking, namely, the most
common categories, into which the words are then automatically grouped
in both of our simulations. We also argue that a similar process may be
at work in the brain.

While rather old, this paper does the best job explaining the concepts
of a semantic SOM.
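The core mechanism — grid units competing for each input, with the winner and its neighbors pulled toward that input — can be sketched in a few lines (the word-context vectors, grid size, and schedules below are invented for illustration, not the paper's simulations):

```python
# Minimal self-organizing map sketch: a 1-D grid of units trained on toy
# word-context vectors; words with similar contexts should end up on
# nearby units, mirroring the "logical similarity" the paper describes.
import numpy as np

rng = np.random.default_rng(1)
data = {                                     # toy context vectors
    "dog":  np.array([1.0, 1.0, 0.0, 0.0]),
    "cat":  np.array([1.0, 0.9, 0.1, 0.0]),
    "runs": np.array([0.0, 0.1, 1.0, 0.9]),
    "eats": np.array([0.0, 0.0, 0.9, 1.0]),
}
n_units = 6
weights = rng.normal(scale=0.1, size=(n_units, 4))

def bmu(vec):
    """Best-matching unit: grid index whose weights are closest to vec."""
    return int(np.argmin(np.linalg.norm(weights - vec, axis=1)))

for epoch in range(200):                     # simple SOM training loop
    lr = 0.5 * (1 - epoch / 200)             # decaying learning rate
    for vec in data.values():
        winner = bmu(vec)
        for j in range(n_units):             # Gaussian neighborhood update
            h = np.exp(-((j - winner) ** 2) / 2.0)
            weights[j] += lr * h * (vec - weights[j])

positions = {w: bmu(v) for w, v in data.items()}
print(positions)
# Expect dog/cat to land near each other, and runs/eats near each other.
```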

Time to put your newly-learned knowledge of neural networks to the test!

Our work focuses on recommending for small and medium-sized e-commerce portals, where we face scarcity of explicit feedback, low user loyalty, short visit times, and low numbers of visited objects. In this paper, we present a novel approach that uses specific user behavior as implicit feedback, forming binary relations between objects. Our hypothesis is that if a user selects some object from a list of displayed objects, this expresses his/her binary preference for the selected object over the other shown objects. These relations are expanded based on content-based similarity of objects, forming a partial ordering of objects. Using these relations, it is possible to alter any list of recommended objects or create one from scratch.
We have conducted several off-line experiments with real user data from a Czech e-commerce site, with keyword-based VSM and SimCat recommenders. The experiments confirmed the competitiveness of our method; however, on-line A/B testing should be conducted in future work.
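The central idea — a click on one item in a displayed list implies a binary preference for it over the other shown items, which can then rerank recommendations — can be sketched as follows (the interaction log is invented, and the paper's content-based expansion of relations is omitted):

```python
# Sketch: record "clicked > shown" preference relations from displayed
# lists, then rerank a candidate list by how many such duels each item won.
from collections import defaultdict

wins = defaultdict(int)     # item -> number of preference relations won

def observe(shown, clicked):
    """Each click implies the clicked item is preferred to the others shown."""
    for other in shown:
        if other != clicked:
            wins[clicked] += 1

# Hypothetical interaction log: (displayed list, clicked item).
log = [(["a", "b", "c"], "b"), (["a", "b", "d"], "b"), (["c", "d"], "c")]
for shown, clicked in log:
    observe(shown, clicked)

def rerank(candidates):
    """Reorder a recommendation list by implicit-preference wins."""
    return sorted(candidates, key=lambda item: wins[item], reverse=True)

print(rerank(["a", "b", "c", "d"]))  # ['b', 'c', 'a', 'd'] -- b won 4 duels
```

The paper additionally propagates these relations to unseen objects via content similarity, yielding a partial order rather than raw win counts; this sketch covers only the first step.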
Active Learning from Crowds

Obtaining labels can be expensive or time-consuming, but unlabeled data is often abundant and easier to obtain. Most learning tasks can be made more efficient, in terms of labeling cost, by intelligently choosing specific unlabeled instances to be labeled by an oracle. The general problem of optimally choosing these instances is known as active learning. As it is usually set in the context of supervised learning, active learning relies on a single oracle playing the role of a teacher. We focus on the multiple annotator scenario where an oracle, who knows the ground truth, no longer exists; instead, multiple labelers, with varying expertise, are available for querying. This paradigm posits new challenges to the active learning scenario. We can now ask which data sample should be labeled next and which annotator should be queried to benefit our learning model the most. In this paper, we employ a probabilistic model for learning from multiple annotators that can also learn the annotator expertise even when their expertise may not be consistently accurate across the task domain. We then focus on providing a criterion and formulation that allows us to select both a sample and the annotator/s to query the labels from.
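The two questions the paper raises — which sample to label next, and which annotator to ask — can be illustrated with a simple heuristic (uncertainty sampling plus a per-domain reliability lookup; this is my own simplification, not the paper's probabilistic model):

```python
# Sketch: choose the most uncertain unlabeled sample, then query the
# annotator believed most reliable for that sample's domain. All
# probabilities and reliabilities below are invented for illustration.
def pick_sample(unlabeled):
    """Uncertainty sampling: P(label=1) closest to 0.5 is most informative."""
    return max(unlabeled, key=lambda s: -abs(unlabeled[s] - 0.5))

def pick_annotator(reliability, sample):
    """Choose the labeler with highest estimated accuracy on this domain."""
    return max(reliability, key=lambda a: reliability[a].get(sample[0], 0.0))

# Hypothetical pool: sample id -> the model's current P(label = 1).
unlabeled = {"x1": 0.92, "x2": 0.51, "x3": 0.10}
# Hypothetical per-domain annotator accuracy (domain = first char of id).
reliability = {"ann_a": {"x": 0.70}, "ann_b": {"x": 0.85}}

sample = pick_sample(unlabeled)
annotator = pick_annotator(reliability, sample)
print(sample, annotator)   # x2 ann_b
```

The paper's model learns both quantities jointly from noisy labels instead of assuming them known, which is exactly what makes the multiple-annotator setting harder than classic active learning.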

This is a manuscript written in 1984 by Edsger W. Dijkstra, which discusses the nature of computer science and what it should become. He makes interesting claims, for instance that there will be no distinction between applied and pure computing science, and that it will become a formal branch of mathematics. Furthermore, he raises issues about the academic reward system and unnecessary complexities in academic work. Using this manuscript, I want to discuss these issues and our view of the future of computer science.

We will discuss the paper "The ProteoRed MIAPE web toolkit: A framework to connect and share proteomics standards", proposed by Dena:

I am interested in how the ideas gathered in this paper relate to the visualisation and data analysis ideas in our own projects, mostly DIVE, DSS, and BiographyNet, but I'm sure it has some relation to *your* project as well.

Paper: Linked Open Data Visualization Revisited: A Survey

Abstract: Mass adoption of the Semantic Web’s vision will not become a
reality unless the benefits provided by data published under the
Linked Open Data principles are understood by the majority of users.
As technical and implementation details are far from being interesting
for lay users, the ability of machines and algorithms to understand
what the data is about should provide smarter summarisations of the
available data. Visualization of Linked Open Data proposes itself as a
perfect strategy to ease the access to information by all users, in
order to save time learning what the dataset is about and without
requiring knowledge on semantics. This article collects previous
studies from the Information Visualization and the Exploratory Data
Analysis fields in order to apply the lessons learned to Linked Open
Data visualization. Datatype analysis and visualization tasks proposed
by Ben Shneiderman are also added in the research to cover different
visualization features. Finally, an evaluation of the current
approaches is performed based on the dimensions previously exposed.
The article ends with some conclusions extracted from the research.

It's a short paper - so it should be doable. In case you are interested, here you can find a longer read:

Guha et al. Evolution of Structured Data on the Web.

For next week's reading club, I suggest we discuss the paper below. Here's the abstract:

This article is a position paper about crowdsourced microworking systems, and especially Amazon Mechanical Turk, the use of which has been steadily growing in language processing in the past few years. According to the mainstream opinion expressed in the articles of the domain, this type of on-line working platform allows one to develop very quickly all sorts of quality language resources, for a very low price, by people doing it as a hobby or wanting some extra cash. We shall demonstrate here that the situation is far from being that ideal, be it from the point of view of quality, price, workers' status, or ethics, and bring back to mind already existing or proposed alternatives. Our goal here is threefold: (1) to inform researchers, so that they can make their own choices with all the elements of the reflection in mind; (2) to ask for help from funding agencies and scientific associations, and to develop alternatives; (3) to propose practical and organizational solutions in order to improve new language resources development, while limiting the risks of ethical and legal issues without letting go of price or quality.

The paper I'm suggesting for Dec 7 is "Nanopublication beyond the Sciences".

This is about a reference dataset for time periods, called PeriodO.


My paper next week is "Argument graphs: Literature-Data Integration for Robust and Reproducible ...".

At the KCAP conference I attended a one-day workshop on scientific knowledge capture. Tim Clark presented this work on micropublications, i.e., machine- and human-readable models of scientific publications, which enable representing scientific claims together with the corresponding evidence (data, methods, and citations). The micropublications model can be considered an extended version of the nanopublications model.
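As a loose illustration of the idea (the field names and structure below are my own invention, not the micropublications vocabulary):

```python
# Hedged sketch: a micropublication ties a claim to its supporting
# evidence -- data, methods, and citations -- in a machine-readable way.
micropub = {
    "claim": "Drug X reduces symptom Y",          # hypothetical claim
    "support": [
        {"type": "data",     "ref": "table 2, trial results"},
        {"type": "method",   "ref": "double-blind RCT protocol"},
        {"type": "citation", "ref": "doi:10.xxxx/example"},   # placeholder
    ],
}

def evidence_of(pub, kind):
    """List the supporting items of one kind attached to a claim."""
    return [s["ref"] for s in pub["support"] if s["type"] == kind]

print(evidence_of(micropub, "citation"))  # ['doi:10.xxxx/example']
```

The actual model is expressed in RDF/OWL with a richer argument structure; this sketch only conveys the claim-plus-evidence shape that distinguishes it from a bare nanopublication triple.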

This workshop paper is based on a (longer!) original paper in the Journal of Biomedical Semantics, which will give you more information about the model and its applications.