Research Internship for a Master 2 student

Post date: Jan 28, 2016 6:52:34 AM

Federated queries or not federated queries, that’s the question.

Nantes, November 2015.

Research internship carried out in collaboration between the GDD team of the LINA computer science laboratory and the BiRD bioinformatics facility at the Institut du Thorax life science laboratory.

Contacts: Patricia.Serrano-Alvarado@univ-nantes.fr, Pascal.Molli@univ-nantes.fr and Alban.Gaignard@univ-nantes.fr

Keywords: federated queries, SPARQL, query engines, ontologies, RDF, usage control, health and life-science applications.

Required skills: knowledge of semantic web technologies, capacity for abstraction, curiosity. A background in data mining will be valuable.

Stipend: 554.40 euros per month (15% of the French Social Security hourly ceiling).

Context

Linked Data (LD) makes it possible to interlink massive amounts of data across the Web. LD providers range from governments to enterprises, social organizations and research institutions. They publish RDF datasets through SPARQL endpoints. Federated query engines allow data consumers to query the data stored in a federation of SPARQL endpoints transparently, as if it were a single RDF graph.

In the life sciences, and in cancer studies in particular, federated queries make it possible to join heterogeneous datasets and to identify valuable information on biological or pathological mechanisms from multiple perspectives. As an example, this project considers genomic expression data (microarray) acquired from patient tissue samples and in vitro cell lines. Federated queries will be valuable for developing in-silico screening of potential drugs and targets based on linked open data sources (e.g. DrugBank or Bio2RDF datasets) and local biomedical data.

Problem statement

Federated query engines [1,3,4] split a user’s federated query into subqueries that are distributed among endpoints without revealing the whole federated query. Hence, data providers do not know the complete federated query in which they participate, nor which of their data are combined, when, and by whom [2]. An endpoint simply does not know whether a received query is a single query or a subquery that is part of a federated query. Consequently, data providers do not know how their sources are used. Moreover, the federation holds no global meta-information about the queries it processes, information that could be used to materialize data efficiently to speed up joins, improve indexing, predict workload, improve maintenance, ensure usage control, identify valuable data subsets, etc.
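To make the decomposition concrete, here is a minimal Python sketch of how a federated query can be split so that each endpoint only sees its own part. It is not the algorithm of any of the cited engines: the predicate-to-endpoint catalogue, the `ex:` predicates and the endpoint URLs are all hypothetical, and real engines use much richer source selection.

```python
from collections import defaultdict

# Hypothetical catalogue: which endpoint is assumed to answer which predicate.
# Real engines rely on richer source selection (e.g. ASK probes or statistics).
PREDICATE_SOURCES = {
    "ex:targets": "http://endpoint-a.example/sparql",
    "ex:expressedIn": "http://endpoint-b.example/sparql",
}

def decompose(triple_patterns):
    """Group triple patterns by endpoint and build one subquery per endpoint."""
    groups = defaultdict(list)
    for subject, predicate, obj in triple_patterns:
        groups[PREDICATE_SOURCES[predicate]].append((subject, predicate, obj))
    # Each endpoint receives a standalone SELECT over its own patterns only
    # (prefix declarations are omitted to keep the sketch short).
    return {
        endpoint: "SELECT * WHERE { "
        + " ".join(f"{s} {p} {o} ." for s, p, o in patterns)
        + " }"
        for endpoint, patterns in groups.items()
    }

# A toy federated query: which drugs target genes expressed in which tissues?
federated_query = [
    ("?drug", "ex:targets", "?gene"),
    ("?gene", "ex:expressedIn", "?tissue"),
]

subqueries = decompose(federated_query)
for endpoint, query in subqueries.items():
    print(endpoint, "->", query)
```

In this sketch, neither endpoint can tell from the query it receives that the join on `?gene` happens elsewhere, which is the situation described above.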

Objective

The objective of this work is to distinguish, in a federated log, federated subqueries from single queries. One way to do this is to use supervised learning techniques [5], where a set of log entries is used as training input for a classifier. The idea is to train the classifier to learn the characteristics of single queries and federated subqueries. The trained classifier is then used to classify the queries of a federated log.
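As an illustration of the train-then-classify idea, the following Python sketch uses a nearest-centroid classifier, chosen here only because it fits in a few lines; the posting does not prescribe any particular learning algorithm. The feature vectors (number of triple patterns, number of projected variables, presence of a FILTER) and their labels are invented toy values, not measurements from real logs, and whether such features actually separate the two classes is precisely the open question of the internship.

```python
from math import dist

def train_centroids(samples):
    """Compute one mean feature vector (centroid) per label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(centroids, features):
    """Return the label whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Hypothetical labeled log entries:
# (triple patterns, projected variables, has FILTER) -> label.
training_log = [
    ((5, 3, 1), "single"),
    ((6, 4, 1), "single"),
    ((1, 2, 0), "subquery"),
    ((2, 2, 0), "subquery"),
]

centroids = train_centroids(training_log)
print(classify(centroids, (1, 1, 0)))
print(classify(centroids, (6, 3, 1)))
```

Any other supervised classifier could be substituted for the centroid step; only the overall shape (labeled log entries in, trained model out, then prediction on unseen log entries) follows the text above.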

Work plan

    • To characterize queries and subqueries in terms of structural features.
    • To set up the experimental environment.
    • To classify logs produced by different query engines.
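The first step of the work plan, characterizing queries by structural features, could start from something as simple as the Python sketch below. The choice of features (distinct variables and a few keyword counts) is an assumption made for illustration, and the regular expressions are deliberately naive: they would also count keywords that appear inside string literals or IRIs, so a real study would more likely rely on a proper SPARQL parser.

```python
import re

# Keywords counted as structural features (an illustrative, hypothetical choice).
KEYWORDS = ("FILTER", "UNION", "OPTIONAL", "LIMIT")

def structural_features(query):
    """Extract a few purely syntactic features from a SPARQL query string."""
    # Distinct variable names of the form ?name.
    features = {"distinct_variables": len(set(re.findall(r"\?\w+", query)))}
    # Case-insensitive whole-word keyword counts.
    for keyword in KEYWORDS:
        matches = re.findall(rf"\b{keyword}\b", query, flags=re.IGNORECASE)
        features[keyword.lower() + "_count"] = len(matches)
    return features

example = (
    "SELECT ?drug ?gene WHERE { ?drug ex:targets ?gene . "
    "FILTER(?gene != ex:TP53) } LIMIT 10"
)
print(structural_features(example))
```

Feature dictionaries of this kind can be turned into fixed-length vectors and used as input to a classifier.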

This work is funded by the French region Pays de la Loire through the Connect Talent SyMeTRIC project (http://symetric.univ-nantes.fr/).

References

[1] Acosta, M., Vidal, M. E., Lampo, T., Castillo, J., & Ruckhaus, E. (2011). ANAPSID: An adaptive query processing engine for SPARQL endpoints. In The Semantic Web–ISWC 2011 (pp. 18-34). Springer Berlin Heidelberg.

[2] Nassopoulos, G., Serrano-Alvarado, P., Molli, P., & Desmontils, E. (2015). Tracking Federated Queries in the Linked Data. arXiv preprint arXiv:1508.06098.

[3] Saleem, M., & Ngomo, A. C. N. (2014). HiBISCuS: Hypergraph-based source selection for SPARQL endpoint federation. In The Semantic Web: Trends and Challenges (pp. 176-191). Springer International Publishing.

[4] Schwarte, A., Haase, P., Hose, K., Schenkel, R., & Schmidt, M. (2011). FedX: Optimization techniques for federated query processing on linked data. In The Semantic Web–ISWC 2011 (pp. 601-616). Springer Berlin Heidelberg.