TransWeb: Data on the Web, Tuesday 19th February, Nantes

The Web has been evolving from a Web of documents to a Web of Data. This Web of Data, also referred to as the Semantic Web, enables complex query processing on the Web. The Semantic Web presents a revolutionary opportunity for deriving insight and value from data. Linked Open Data (LOD) uses the Web to connect data sets that were not connected before, evolving the Web into a global data space. ''LOD can create rich pathways through diverse learning resources, spot previously unseen factors in road traffic accidents, and scrutinise more effectively the operation of our democratic systems.'' (Linked Data book)
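The linking idea can be made concrete with a toy sketch (all URIs and figures below are hypothetical): two independently published triple sets become jointly queryable simply because they reuse the same identifier for Nantes.

```python
# Two hypothetical datasets published by different providers.
geo = {("http://ex.org/Nantes", "population", "290000")}
conf = {("http://ex.org/TransWeb", "locatedIn", "http://ex.org/Nantes")}

# Linking turns them into one global data space.
merged = geo | conf

# "What is the population of the city hosting TransWeb?" is answerable
# only because both sources reuse the same URI for Nantes.
host = next(o for s, p, o in merged
            if s == "http://ex.org/TransWeb" and p == "locatedIn")
pop = next(o for s, p, o in merged
           if s == host and p == "population")
```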

The themes addressed by the GDD, GRIM and COD teams of LINA are:

  • Foundations of distributed systems; privacy and confidentiality in distributed data management; distributed data management in the Web of Data: data integration, social semantic web, linked data. (GDD team)
  • Data processing and management, especially clustering and derived research tracks: summarization, clustering, data integration, indexing and retrieval, querying, ensemble clustering. (GRIM team)
  • Data mining (association rule mining and clustering), machine learning (probabilistic graphical models), knowledge engineering, knowledge visualization (COD team)
Program:
09:15 - 09:30 Welcome
09:30 - 10:10 : Gabriela Montoya, GDD Team, slides
Title: GUN: An Efficient Execution Strategy for Querying a Large Number of RDF Data Sources
Abstract
Mediator-based approaches provide a uniform interface for querying and integrating heterogeneous data sources. When the correspondences between the mediator's global schema and the data sources are expressed using the Local-as-View approach, a mediator query may be rewritten into an exponential number of query rewritings over the data sources. Consequently, the time to produce the first answer may be very high. We propose a query processing technique that speeds up query rewriting execution by maximizing the results obtained from evaluating k rewritings. We formulate the Result-Maximal k-Execution (ReMakE) problem as the problem of maximizing the number of query results obtained from the execution of only k rewritings. We propose a novel query execution strategy called GUN that solves the ReMakE problem for RDF data sources. We empirically compare the performance of GUN with existing query execution techniques in different experimental setups based on synthetic datasets generated with the Berlin SPARQL benchmark. Our experimental results suggest that GUN may outperform existing approaches in environments with a large number of RDF data sources.
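The ReMakE objective itself is easy to state in code. The sketch below uses hypothetical answer sets and brute-force enumeration (not the GUN strategy described in the talk, which avoids executing all rewritings): it picks the k rewritings whose combined answer set is largest.

```python
from itertools import combinations

def remake_exhaustive(rewritings, k):
    """Pick the k rewritings whose combined answer set is largest.

    `rewritings` maps a rewriting id to the set of answers its execution
    would produce -- information only available after execution in
    practice, which is why a smarter strategy than enumeration is needed.
    """
    best, best_answers = None, set()
    for combo in combinations(rewritings, k):
        answers = set().union(*(rewritings[r] for r in combo))
        if len(answers) > len(best_answers):
            best, best_answers = combo, answers
    return best, best_answers

# Hypothetical answer sets for four rewritings.
rewritings = {
    "r1": {"a", "b"},
    "r2": {"b", "c"},
    "r3": {"c", "d", "e"},
    "r4": {"a"},
}
combo, answers = remake_exhaustive(rewritings, 2)
```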

10:10 - 10:50 : Guillaume Raschia, GRIM Team, slides
Title: Assessment and perspectives on open data, from the viewpoint of computer science research

10:50 - 11:05 Coffee Break
11:10 - 11:50 Anthony Coutant, COD Team
Title: Behaviour-based community detection on the web, with Probabilistic Relational Models
Abstract: 
Networks are a preponderant structure in real life and have led to many studies in the literature. Among the related problems, community detection is well known and numerous solutions have been proposed to solve it. However, while many community detection algorithms exist for homogeneous networks, it remains difficult to find solutions for heterogeneous ones. Some solutions do exist in practice, but they are limited to specific problems of small scope, which makes generalization to other network structures difficult.
In this talk, we introduce Probabilistic Relational Models (PRMs), a relational generalization of Bayesian networks. We show how these models can help define a general approach to community detection on heterogeneous networks, using both node and relationship information. Additionally, we propose a model for detecting communities on the web, based on the concept of user behaviour.
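As a minimal intuition for the behaviour-based approach, the toy sketch below performs a single-user Bayesian update of a latent community given an observed behaviour; the PRMs in the talk generalize this to many interlinked users, and every number and label here is invented for illustration.

```python
# Latent community C in {tech, sports}; observed behaviour B in
# {code, match}. All probabilities are made up.
prior = {"tech": 0.5, "sports": 0.5}
likelihood = {                       # P(B | C)
    "tech":   {"code": 0.9, "match": 0.1},
    "sports": {"code": 0.2, "match": 0.8},
}

def posterior(behaviour):
    """P(C | B = behaviour) by Bayes' rule."""
    joint = {c: prior[c] * likelihood[c][behaviour] for c in prior}
    z = sum(joint.values())
    return {c: p / z for c, p in joint.items()}

post = posterior("code")   # a user mostly observed writing code
```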

12:00 - 14:00 Lunch Break

14:00 - 15:00 María-Esther Vidal, Full Professor at the Universidad Simón Bolívar, Venezuela
Bio:
Professor Vidal received a PhD in Computer Science from the Universidad Simón Bolívar (2000). She was awarded Honors for her master's thesis (1991) and her doctoral dissertation (1999), and received the Excellent Graduate Student award (1991), the Best Professor Award in the Associate Professor category (2000), promotion to Full Professor (2012), and the Procter & Gamble Best Scholar-Professor award (2004).

She was an Assistant Researcher at the Institute for Advanced Computer Studies at the University of Maryland (UMIACS) (1995-1999) and a Visiting Professor at UMIACS during the summers (2000-2012). She has also lectured in graduate courses at the Universidad Politecnica de Catalunya (2003) and the Universidad Politecnica de Madrid (2012), and given invited talks at Leipzig University (2011), Schloss Dagstuhl (2012), the University of Athens (2012) and the Mayo Clinic (2012).

She has participated in several international projects supported by the NSF (USA), AECI (Spain) and CNRS (France), and has advised six PhD students and more than 60 master's and undergraduate students. She has published more than 70 papers at international conferences and in journals in the database and Semantic Web areas. She has served as a reviewer and programme committee member for several international journals and conferences, as co-chair of the Workshop on Resource Discovery (2010-2013), as accompanying professor of the On the Move Academy (2009-2012), and as co-organizer and co-lecturer of the tutorial on Adaptive Semantic Data Management Techniques for Linked Data at ESWC 2011, 2012 and 2013.

Title: Challenges for Efficient Semantic Data Management in the Web of Data
Abstract:  
In the Linked Open Data cloud, a large number of huge linked RDF datasets have become available, and this number keeps growing. Simultaneously, scalable RDF engines that follow the traditional optimize-then-execute paradigm have been developed to access RDF data locally, and SPARQL endpoints have been implemented for remote query processing. Although queries against locally stored data can be executed efficiently, remote query executions may frequently be unsuccessful. First, the most efficient RDF engines base their query processing algorithms on locally stored physical access and storage structures; however, because of the size of existing linked datasets, loading the data and their links is not always feasible. Second, remote linked data query processing can be extremely costly because of the lack of query planning; moreover, current techniques do not adapt to unpredictable data transfers or data availability, so executions can be unsuccessful.

In this talk, I will describe both optimize-then-execute techniques and adaptive query processing strategies that have been developed to access RDF data; linked RDF datasets will be used to illustrate the performance of the proposed approaches. I will present existing SPARQL engines for accessing federations of endpoints. In particular, I will describe ANAPSID, an adaptive query engine for SPARQL endpoints that adapts query execution schedules to data availability and run-time conditions when data is accessed remotely. ANAPSID provides physical SPARQL operators that detect when a source becomes blocked or data traffic is bursty, and opportunistically produce results as soon as data arrives from the endpoints. ANAPSID's performance will be compared with state-of-the-art RDF stores and endpoints; experimental results will show that ANAPSID can speed up execution time, in some cases by more than one order of magnitude.
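The core non-blocking idea behind such adaptive operators can be sketched with a textbook symmetric hash join (a simplification, not ANAPSID's actual implementation): a result is emitted as soon as matching tuples have arrived, whichever endpoint delivered its tuple first. All bindings below are hypothetical.

```python
from collections import defaultdict

class SymmetricHashJoin:
    """Non-blocking join on a shared variable: keeps one hash table per
    input, and probes the opposite table on every arrival, so results
    flow as soon as the data allows instead of waiting for a full input."""

    def __init__(self, join_var):
        self.join_var = join_var
        self.tables = (defaultdict(list), defaultdict(list))

    def probe(self, side, mapping):
        """Insert `mapping` arriving from endpoint `side` (0 or 1) and
        return every join result it completes."""
        key = mapping[self.join_var]
        self.tables[side][key].append(mapping)
        return [{**other, **mapping}
                for other in self.tables[1 - side][key]]

join = SymmetricHashJoin("city")
out = []
out += join.probe(0, {"city": "Nantes", "event": "TransWeb"})  # no match yet
out += join.probe(1, {"city": "Paris", "pop": 2200000})        # no match yet
out += join.probe(1, {"city": "Nantes", "pop": 290000})        # completes a result
```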

15:00 - 15:40 Luis Daniel Ibáñez González, GDD Team

Title: "Live Linked Data: Making Linked Data writable with eventual consistency guarantees"
Abstract: 
The Linked Data initiative provides the means for data providers to publish and interconnect their information and knowledge in a way that facilitates querying across distributed sources, notably through the W3C standard query language SPARQL. Recently, the W3C added the SPARQL 1.1 Update language to the standard, which makes it possible to perform updates on the data sources and opens the door to collaboration among them for curation and enrichment purposes. However, updating in a network of autonomous participants raises issues of data consistency.

In this talk we (1) define 'Live Linked Data' as a network of autonomous linked data participants connected by streams of SPARQL 1.1 Update operations, enabling collaboration between them, and (2) show how to use Commutative Replicated Data Types (CRDTs), a recent formalism from the distributed systems field, to guarantee eventual consistency at a reasonable cost.
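As a hedged illustration of how CRDTs yield eventual consistency, the sketch below implements an Observed-Remove Set over triples, a standard construction from the CRDT literature (the talk's actual data type for SPARQL Update streams may differ): replicas that apply the same operations, in any order, converge to the same state.

```python
import uuid

class ORSet:
    """Observed-Remove Set CRDT: inserts tag each element with a unique
    id, and deletes remove only the tags they have observed, so
    concurrent operations commute. Elements here stand for RDF triples."""

    def __init__(self):
        self.entries = set()            # (triple, unique tag) pairs

    def insert(self, triple):
        op = ("ins", triple, uuid.uuid4().hex)
        self.apply(op)
        return op                       # to broadcast to other replicas

    def delete(self, triple):
        tags = frozenset(t for e, t in self.entries if e == triple)
        op = ("del", triple, tags)      # removes only observed tags
        self.apply(op)
        return op

    def apply(self, op):
        kind, triple, tags = op
        if kind == "ins":
            self.entries.add((triple, tags))
        else:
            self.entries -= {(triple, t) for t in tags}

    def value(self):
        return {e for e, _ in self.entries}

a, b = ORSet(), ORSet()
op1 = a.insert((":nantes", ":hosts", ":TransWeb"))
b.apply(op1)                            # op1 arrives at replica b
op2 = b.delete((":nantes", ":hosts", ":TransWeb"))
a.apply(op2)                            # op2 arrives back at replica a
```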

15:40 - 15:55 Coffee Break
15:55 - 16:35 Frédéric Dumonceaux, GRIM Team
Title: "Operating on multiple data partitions: issues and expectations"

16:35 - 17:15 Christophe Thovex, COD Team
Title: Semantic Analysis of Social Networks

Abstract: In 1977, Freeman formalized the first standard measures of Social Network Analysis (SNA). Since then, the social networks of the "2.0" Web have become planet-wide (e.g., Facebook, MSN). We have defined a semantic, non-probabilistic and predictive model for the decision-oriented analysis of professional and institutional social networks. This model, in parallel with Galam's sociophysics, integrates semantic natural language processing and knowledge engineering methods, statistical sociology measures and electrodynamic laws, applied to optimizing economic performance and the social climate. It was developed and experimented with in the Socioprise project, funded by the French Secretariat of State for Forecasting and the Development of the Digital Economy and the European Community Fund (FCE). Our work currently continues by integrating opinion analysis models on the one hand and frequent closed patterns (Frequent Closed Patterns) on the other, and by applying semantic SNA in various contexts such as QHSE knowledge management, decision support in digital tourism, music recommendation and participatory journalism.
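Freeman's standard SNA measures start from simple structural counts. As a small illustration (toy graph, plain Python; Freeman 1977 is best known for betweenness centrality, but degree centrality is the simplest of the standard measures), the sketch below computes the fraction of other actors each actor is directly tied to.

```python
# Toy undirected social graph: actor "a" is tied to everyone.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]

def degree_centrality(edges):
    """Degree centrality: degree of each node divided by the maximum
    possible degree (n - 1 for an n-node simple graph)."""
    nodes = {n for e in edges for n in e}
    deg = {n: 0 for n in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return {n: deg[n] / (len(nodes) - 1) for n in nodes}

cent = degree_centrality(edges)
```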

17:15 Session End