2020-2021

Evolution over time of the structure of social graphs

Advisors: Frédéric Giroire and Nicolas Nisse

Emails: frederic.giroire@inria.fr

Laboratory: COATI project - INRIA (2004, route des Lucioles – Sophia Antipolis)

Web Site:

http://www-sop.inria.fr/members/Frederic.Giroire/

Pre-requisites if any:

Knowledge of and/or taste for graph algorithms, big data, network analysis

Description:

The goal of the project is to develop methods to analyse the evolution over time of a social network. As an example, we will consider the graph of scientific collaborations, which can be crawled freely.

The project will have two phases:

- Data collection. In the first phase, the student will use the available bibliographic research tools (SCOPUS, Web of Science, Patstat) to create data sets: one corresponding to the current situation and others corresponding to past moments. The data sets will mainly correspond to networks (annotated graphs) of scientific collaborations.

- Data analysis. In the second phase, the student will analyse these data. First, they will focus on simple metrics (number of publications, number of patent applications...) and compare their evolution over time. Then, if time permits, they will start studying the evolution of the structure of the network and examine whether its clustering evolves due to the emergence of new collaborations.
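If time allows for the structural study mentioned above, the clustering analysis can start from something as simple as the average local clustering coefficient computed on successive snapshots. A minimal sketch in plain Python (the toy graphs and node names below are illustrative only):

```python
# Sketch: compare average local clustering across two collaboration-graph
# snapshots, using plain adjacency sets (toy data, for illustration).

def clustering(adj, v):
    """Local clustering coefficient of node v in adjacency dict adj."""
    neigh = adj[v]
    k = len(neigh)
    if k < 2:
        return 0.0
    # Count edges among the neighbours of v (each pair once).
    links = sum(1 for u in neigh for w in neigh if u < w and w in adj[u])
    return 2.0 * links / (k * (k - 1))

def avg_clustering(adj):
    return sum(clustering(adj, v) for v in adj) / len(adj)

def from_edges(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

# Two toy snapshots: a new collaboration (B, C) closes a triangle in 2021.
g2020 = from_edges([("A", "B"), ("A", "C"), ("C", "D")])
g2021 = from_edges([("A", "B"), ("A", "C"), ("C", "D"), ("B", "C")])
print(avg_clustering(g2020), avg_clustering(g2021))  # clustering increases
```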

The PFE will be part of a larger project on evaluating the impact of funding on scientific research. The project involves researchers in economics, sociology, and computer science.

The PFE can also be done in a group of two students.

The PFE may be followed by an internship for interested students.

Construction and analysis of safe distributed Processes

Co-advisors: Eric Madelaine & Rabea Ameur-Boulifa

Email: eric.madelaine@inria.fr, Rabea.Ameur-Boulifa@telecom-paris.fr

Telephone: 06 87 47 99 80

Place of the project: Inria

Address: Sophia-Antipolis

Team: KAIROS

Web page: https://team.inria.fr/kairos/

Pre-requisites if any:

Java, Eclipse

Affinity for formal/rigorous methods

Description:

Context:

We are developing a platform called VerCors for the efficient design and analysis of distributed systems. At its core is a new model called pNets, which expresses the behavior of such systems in the form of hierarchical automata [1]. This behavior model can express the semantics of various distributed or parallel languages, and serves as an input formalism for various analysis tools. One important analysis technique is model-checking, which consists in checking the validity of requirements, expressed as temporal logic formulas, on the states of the model.
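As a simplified illustration of what model-checking involves (plain Python, not the VerCors tool chain; pNets are hierarchical, whereas this sketch uses a single flat automaton), a safety requirement such as "the error state is unreachable" can be checked by explicit-state reachability:

```python
# Minimal sketch: model-checking the safety property "state 'error' is
# unreachable" by explicit-state reachability over a labelled transition
# system. The toy protocol below is illustrative only.

def reachable(init, transitions):
    """All states reachable from init; transitions: state -> {label: state}."""
    seen, stack = {init}, [init]
    while stack:
        s = stack.pop()
        for nxt in transitions.get(s, {}).values():
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Toy protocol: 'error' is only reachable via a 'timeout' from 'waiting'.
lts = {
    "idle":    {"send": "waiting"},
    "waiting": {"ack": "idle", "timeout": "error"},
}
print("error" in reachable("idle", lts))  # True: the property is violated
```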

Objective:

This internship proposal is around this platform.

As a starting point, to ease the reading, capture and manipulation of such hierarchical automata, we specified their abstract syntax using the Eclipse Modeling Framework (https://www.eclipse.org/modeling/emf). Preliminary work led to a prototype "pNet editor" that includes the definition of a (core) textual language for pNets and a module generating Java code for liaison with the analysis tools.

Schedule and sharing out of the work:

The first weeks (1/2 day per week) will be devoted to understanding the context, exploring the documentation about the proposed technology, and refining the work programme for the project's main phase.

Then the student will target the following goals:

- Extend the existing editor to incorporate feedback from our experience and from research advances. This includes (small) modifications to the pNet description language and adaptations of the semantic backend. The goal here is to build a more usable tool that allows non-specialists to use the analysis tools more easily.

- Make a proposal for a more expressive language built on top of, or as an extension of, pNets, including features describing "parameterised" structures (arrays, sequences, matrices...), allowing realistic use cases to be described (and ultimately analysed).

Bibliography:

[1] Ludovic Henrio, Oleksandra Kulankhina, Siqi Li, Eric Madelaine. Integrated environment for verifying and running distributed components - Extended version. [Research Report] RR-8841, INRIA Sophia-Antipolis, 2015, pp. 24. <hal-01252323>

[2] https://www.eclipsecon.org/france2014/sites/default/files/slides/Xtext_Sirius.pdf

and https://www.infoq.com/presentations/sirius-xtext

Useful information:

This work can potentially be a prequel to a summer internship for a good student, in the same context.

QoE-aware bandwidth sharing for video streaming traffic

Name: Chadi Barakat, Othmane Belmoukadam

Mail: Chadi.Barakat@inria.fr, othmane.belmoukadam@inria.fr

Telephone: +33492387596

Web page: http://team.inria.fr/diana/chadi/

Place of the project: Inria Sophia Antipolis

Address: 2004, route des lucioles, 06902 Sophia Antipolis, France

Team: DIANA

Web page: http://team.inria.fr/diana/

Pre-requisites if any: Knowledge in network and video streaming protocols. Programming skills (C++, scripting).

Description:

Video streaming is a major application in the Internet today. It is at the same time greedy in its consumption of network resources and sensitive to the resources available, in terms of end users' Quality of Experience (QoE) [1]. Further, the operating region of video streaming strongly depends on the characteristics of the user's terminal, which vary widely between small phones, tablets and large screens [2,3]. This poses considerable challenges to network operators, who have to accommodate different requirements with often limited resources, especially in cellular environments. Furthermore, video streaming platforms (YouTube, Netflix) often encrypt their content, making it even harder for operators to apply any differential treatment to video traffic. In this context, the common practice today is to rely on end-to-end DASH and end-to-end TCP to share network resources, with these protocols normally targeting a fair sharing of resources independently of users' QoE and terminal characteristics. Any consideration of terminal characteristics is left to the end systems (video server and video player) running on top of these two protocols.

In a current project at Inria (e.g. [4,5]), we are working towards a framework for optimal sharing of network bandwidth that takes into consideration the QoE of end users and the variability of their terminal characteristics. Such a framework has been developed for a centralized scenario where the allocation of resources for each video streaming flow is calculated and enforced inside the network. With this centralized solution, we managed to improve the total QoE of end users, especially in scenarios with low bandwidth and heterogeneous terminals. We are now working on a distributed version of this framework based on packet marking and differentiated buffer management in the core of the network, leveraging the mathematical foundations used for the centralized approach and the well-known results on how end users' Quality of Experience depends on available network resources for video streaming. This PFE fits within this context, with a roadmap that can clearly go beyond a PFE to a Master internship if there is a good fit between the student and the topic. For now, the initial tasks of the PFE will be:

- Consolidate the bibliography in the domain, with a focus on distributed solutions to improve the end users Quality of Experience of video streaming applications.

- Develop a first high-level version of the solution, and identify the elements that have to be involved, both at the end-system level and at the network level.

- Implement those elements in the ns-3 network simulator [6], together with a traffic model that closely mimics real video streaming traffic.

- Carry out initial simulations proving that these elements work properly.
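To give a flavour of the QoE-aware sharing idea (an illustrative sketch, not the project's actual optimization model), assume each flow i has a concave QoE utility w_i log(b_i), where the weight w_i stands in for terminal characteristics such as screen size or resolution; maximizing the total QoE under a capacity constraint then yields a weighted, rather than fair, share:

```python
# Illustrative sketch of QoE-aware sharing (not the project's actual model):
# maximizing sum_i w_i * log(b_i) subject to sum_i b_i = C gives, in closed
# form, b_i = C * w_i / sum(w) -- a weighted share instead of a fair one.

def qoe_aware_share(capacity, weights):
    total = sum(weights)
    return [capacity * w / total for w in weights]

# A phone (small screen, w=1) and a large display (w=3) sharing 8 Mb/s:
# QoE-aware sharing gives the display more bandwidth than fair sharing would.
print(qoe_aware_share(8.0, [1.0, 3.0]))  # [2.0, 6.0]
```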

The internship roadmap, if any, will build on top of these results to propose a detailed implementation of the distributed solution, a consolidated proof of its convergence, and extensive simulation results proving its efficiency in dynamically managing network resources so that the global QoE of video streaming consumers is improved.

Useful Information/Bibliography:

[1] H. Mao and R. Netravali and M. Alizadeh, “Neural adaptive video streaming with pensieve,” ACM SIGCOMM, 2017.

[2] G. Cermak and M. Pinson and S. Wolf, “The relationship among video quality, screen resolution, and bit rate,” IEEE Transactions on Broadcasting, 2011.

[3] A. Elmnsi and N. Osman and I. Mkwawa, “The impact of mobile device preference on the quality of experience,” International Journal of Computing Information Sciences, 2016.

[4] Othmane Belmoukadam, Muhammad Jawad Khokhar, Chadi Barakat, "On Accounting for Screen Resolution in Adaptive Video Streaming: A QoE-Driven Bandwidth Sharing Framework", in proceedings of the 15th International Conference on Network and Service Management (CNSM), Halifax, Canada, October 2019.

[5] Othmane Belmoukadam, Muhammad Jawad Khokhar, Chadi Barakat, "On excess bandwidth usage of video streaming: when video resolution mismatches browser viewport", in proceedings of the 11th IEEE International Conference on Networks of the Future (NoF 2020), Bordeaux, France, October 2020.

[6] ns3, “network simulation.” http://www.nsnam.org.

Privacy issues with Wi-Fi SSID

Name: Arnaud Legout

Mail: arnaud.legout@inria.fr

Telephone: +33 4 92 38 78 15

Web page: http://www-sop.inria.fr/members/Arnaud.Legout/

Place of the project: Inria Sophia Antipolis

Address: 2004 route des Lucioles

Team: DIANA

Web page: https://team.inria.fr/diana/

Pre-requisites if any: Python, machine learning is a plus

Description:

Wi-Fi access points broadcast their SSID tens of meters around them. However, this information is considered public and can be collected by apps or companies in order to build databases. Such SSIDs may disclose personal information (such as a name), which is usually overlooked by access point owners.

Over the past 4 years, the ElectroSmart project has collected hundreds of millions of SSIDs worldwide. The PFE student will have to analyze the kind of private information disclosed by these SSIDs.

The first step is to understand which natural language processing techniques can be used to extract privacy-sensitive information from SSIDs. The second step is to analyze and describe the kind of personal information disclosed. The third step is to explore whether there are country-specific patterns in the way SSIDs are chosen.
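As a hint of what the first step could look like, here is a deliberately naive heuristic; a real study would rely on proper NLP techniques and multilingual name lists, and all names and SSIDs below are illustrative only:

```python
# Naive sketch: flag SSIDs that may embed a personal first name.
# FIRST_NAMES is a tiny stand-in for a real multilingual name list.
import re

FIRST_NAMES = {"alice", "bob", "julie", "marc"}

def may_disclose_name(ssid):
    tokens = re.split(r"[^a-zA-Z]+", ssid.lower())
    return any(t in FIRST_NAMES for t in tokens)

ssids = ["Livebox-1234", "Julie_WiFi", "chez-marc", "eduroam"]
print([s for s in ssids if may_disclose_name(s)])  # ['Julie_WiFi', 'chez-marc']
```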

The student will have the opportunity to work with unique real-world data.

This PFE can be continued by an internship and a Ph.D. thesis for excellent students.

Time-varying Topology Design for Cross-Silo Federated Learning

#SUPERVISORS

Name: Giovanni Neglia, Alain Jean-Marie, Othmane Marfoq

Mail: {firstname.familyname}@inria.fr

Telephone:

Web page: www-sop.inria.fr/members/Giovanni.Neglia/

Where?

Place of the project: Inria

Address: 2004 route des Lucioles

Team: NEO

Web page: https://team.inria.fr/neo/

Pre-requisites if any: The student should have good analytical skills (mostly algorithms and knowledge about graphs) and basic programming skills in Python.

Knowledge about distributed optimization/machine learning and experience using deep learning frameworks (Tensorflow/Pytorch) is a plus but not mandatory.

# DESCRIPTION

Federated learning (FL) “involves training statistical models over remote devices or siloed data centers, such as mobile phones or hospitals, while keeping data localized” [1] because of privacy concerns or limited communication resources. FedAvg [2] and similar alternatives (e.g., FedProx [3], SCAFFOLD [4], q-FFL [5]) rely on a client-server architecture and were proven to efficiently tackle the federated learning challenges (expensive communication, system heterogeneity, statistical heterogeneity and privacy concerns) in the case of mobile and edge device applications. In the cross-silo setting, data silos (e.g., data centers) are almost always available and enjoy high-speed connectivity comparable to that of the server. A client-server architecture is then potentially inefficient, because it ignores fast inter-silo communication opportunities and makes the orchestrator a candidate for congestion. In that scenario, fully decentralized training [6, 7], where communication with the server is replaced by peer-to-peer communication between individual clients (data silos), is a more suitable alternative to the client-server architecture.

The communication topology has two contrasting effects on training duration, resulting in an error-runtime trade-off. First, a more connected topology leads to faster convergence in terms of iterations or communication rounds, as quantified by convergence bounds in terms of the spectral properties of the topology. Second, a more connected topology increases the duration of a communication round (e.g., it may cause network congestion).

Recent works studied the problem of (optimal) topology design for cross-silo federated learning. For example, the authors of [8] proposed MATCHA, an algorithm that can achieve a win-win in this error-runtime trade-off for any arbitrary network topology by decomposing the topology into matchings and only communicating over (randomly) selected matchings at each round. The matchings selection probabilities are chosen in order to optimize the algebraic connectivity of the expected topology. In [9], the authors (including some of the supervisors of this project) addressed this problem by using the theory of linear systems in the max-plus algebra to compute the system throughput (communication time/delay per iteration), and proposed algorithms that, under the knowledge of measurable network characteristics, find a topology with the largest throughput or with provable throughput guarantees.

Different from [8], where the topology was not restricted to be constantly strongly connected, [9] had a strong connectivity constraint. The goal of this project is to explore whether further improvements on top of those found in [9] can be obtained if this strong connectivity constraint is relaxed. The student should propose and test different approximate algorithms/heuristics and implement them in our existing experimental framework.
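To illustrate the matching-decomposition idea of [8] (a simplified sketch; the real MATCHA algorithm optimizes the activation probabilities via the algebraic connectivity of the expected topology, which is omitted here):

```python
# Sketch of MATCHA-style communication [8], simplified: decompose the
# topology into matchings, then at each round communicate only over
# independently activated matchings.
import random

def greedy_matchings(edges):
    """Partition an edge list into matchings (sets of vertex-disjoint edges)."""
    matchings = []
    for u, v in edges:
        for m in matchings:
            if all(u not in e and v not in e for e in m):
                m.append((u, v))
                break
        else:
            matchings.append([(u, v)])
    return matchings

def sample_round(matchings, p, rng):
    """Edges used this round: each matching is activated with probability p."""
    return [e for m in matchings if rng.random() < p for e in m]

ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
ms = greedy_matchings(ring)
print(len(ms))  # a 4-node ring splits into 2 matchings
active = sample_round(ms, 0.5, random.Random(0))
```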

# REFERENCES

[1] Tian Li, et al. "Federated learning: Challenges, methods, and future directions." IEEE Signal Processing Magazine 37.3 (2020): 50-60.

[2] Jakub Konečný, et al. "Federated optimization: Distributed machine learning for on-device intelligence." arXiv preprint arXiv:1610.02527 (2016).

[3] Tian Li, et al. "Federated optimization in heterogeneous networks." arXiv preprint arXiv:1812.06127 (2018).

[4] Sai Praneeth Karimireddy, et al. "Scaffold: Stochastic controlled averaging for on-device federated learning." arXiv preprint arXiv:1910.06378 (2019).

[5] Tian Li et al. "Fair resource allocation in federated learning." arXiv preprint arXiv:1905.10497 (2019).

[6] Angelia Nedić, Alex Olshevsky, and Michael G. Rabbat. "Network topology and communication-computation tradeoffs in decentralized optimization." Proceedings of the IEEE 106.5 (2018): 953-976.

[7] Konstantinos I. Tsianos, Sean Lawlor, and Michael G. Rabbat. "Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning." 2012 50th annual allerton conference on communication, control, and computing (allerton). IEEE, 2012.

[8] Jianyu Wang, et al. "MATCHA: Speeding up decentralized SGD via matching decomposition sampling." arXiv preprint arXiv:1905.09435 (2019).

[9] Othmane Marfoq, Chuan Xu, Giovanni Neglia, and Richard Vidal, "Throughput-Optimal Topology Design for Cross-Silo Federated Learning." NeurIPS 2020.

Privacy-preserving Techniques in Federated Learning

SUPERVISORS

Names:

Giovanni Neglia (giovanni.neglia@inria.fr), Eitan Altman (eitan.altman@inria.fr), Chuan Xu (chuan.xu@inria.fr)

Web pages:

http://www-sop.inria.fr/members/Giovanni.Neglia/

https://www-sop.inria.fr/members/Eitan.Altman/

https://sites.google.com/view/chuanxu

LOCATION

Inria Sophia-Antipolis Méditerranée

Address: 2004 route des Lucioles, 06902 Sophia Antipolis

Teams:

Neo, Inria (https://team.inria.fr/neo/)

DESCRIPTION

Federated learning (FL) “involves training statistical models over remote devices or siloed data centers, such as mobile phones or hospitals, while keeping data localized” [Li20] because of privacy concerns or limited communication resources.

Federated learning [McM17, Pet19, Li20] naturally offers a certain level of privacy, as the raw data is kept locally by the users and never needs to be sent elsewhere. However, keeping the data local does not by itself provide formal privacy guarantees. The server node or a curious onlooker can still infer some sensitive user information just by looking at the exchanged messages. Moreover, recent research demonstrates that a trained ML model may reveal private data when model inversion techniques are applied [Fred15]. For example, one could recover an individual photo from a trained softmax regression model for facial recognition. Sensitive information is hidden inside the model. Therefore, a privacy-preserving algorithm should prevent the onlooker not only from inferring the training samples, but also from deducing the local model, as the local model probably already contains sensitive information, e.g., about the user's preferences or behaviours.

There exists a huge amount of literature on formal privacy models. Among them, differential privacy [Dwo14] is a popular formal definition of the privacy loss associated with any data release. If the user changes just one training sample, the onlooker should observe almost no difference in the messages he listens to, and should thus not be able to draw conclusions about individual samples. To ensure such privacy in machine learning [Mar16], the intermediate messages between users and server are perturbed with unbiased noise, which compromises the accuracy of the trained model. Besides differential privacy, delta-presence [Ner10] and distance correlation [Gup19] are other privacy definitions that could also be studied in the framework of federated learning. Delta-presence protects the presence of a data record from being identified during training. Distance correlation reduces the correlation between the raw data and the communicated data.

The student should survey the existing literature on privacy models to gain a clear understanding of the main concepts and algorithms in the field. He/she should also implement one of the differential privacy algorithms and illustrate its behavior on a toy example of federated learning.
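As a minimal sketch of the perturbation described above (illustrative only; the calibration of the noise scale to formal (epsilon, delta) guarantees, as in [Dwo14, Mar16], is omitted), a client can clip its model update and add zero-mean Gaussian noise before releasing it:

```python
# Toy sketch of the Gaussian mechanism applied to a model update:
# clip to bound each client's contribution (sensitivity), then add
# unbiased Gaussian noise scaled to the clipping bound.
import math, random

def privatize_update(update, clip, sigma, rng):
    # 1. Clip the update so its L2 norm is at most `clip`.
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip / norm) if norm > 0 else 1.0
    clipped = [x * scale for x in update]
    # 2. Add zero-mean Gaussian noise calibrated to the clipping bound.
    return [x + rng.gauss(0.0, sigma * clip) for x in clipped]

rng = random.Random(42)
noisy = privatize_update([3.0, 4.0], clip=1.0, sigma=0.1, rng=rng)
# The clipped update has norm 1; only the noisy version is released.
```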

PREREQUISITES

We are looking for a candidate with a strong background in probability and statistics.

OTHER INFORMATION

This subject is research-oriented and can lead to a follow-up internship.

REFERENCES

[McM17] McMahan et al, Communication-Efficient Learning of Deep Networks from Decentralized Data, AISTATS 2017, pages 1273-1282

[Pet19] Kairouz Peter et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019

[Fred15] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In 22nd Conference on Computer and Communications Security, pages 1322–1333, 2015.

[Li20] Tian Li et al, Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, pages 50-60, 2020

[Dwo14] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014

[Mar16] Abadi Martin et al. Deep learning with differential privacy. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. Page 308-318, 2016

[Ner10] M. E. Nergiz and C. Clifton. δ-presence without complete world knowledge. IEEE Transactions on Knowledge and Data Engineering, 22:868–883, 2010

[Gup19] Vepakomma, O. Gupta, A. Dubey, and R. Raskar. Reducing leakage in distributed deep learning for sensitive health data. arXiv preprint arXiv:1812.00564, 2019

On the digital footprint of Internet Users

Advisors: Dino Lopez Pacheco <dino.lopez@univ-cotedazur.fr>, Guillaume Urvoy-Keller <urvoy@univ-cotedazur.fr>

Description :

The digital footprint of computer networks, and especially of the Internet (from the cloud to the end user, including all intermediate AS networks), represents 4% of the world's greenhouse gas (GHG) emissions [Shift1]. Forecasts for the coming years estimate that it could reach 8%, equivalent to the total worldwide road traffic.

In France, think tanks like the Shift Project [Shift1] and official organizations like ARCEP [ARCEP] have tackled this issue, producing reports, e.g., on the crucial impact of video traffic [Shift2].

In the SigNet team, we recently started studies aiming at evaluating the digital footprint of end users. We have devised a tool following the idea of the Carbonalyzer [Carbo] Web plugin (also a mobile app). The current version of the tool runs on macOS; with minimal impact on user experience, it captures the user's traffic, determines which networks (national, European, international) conveyed this traffic, each network class having a specific energy footprint, and regularly sends summaries to a centralized server.

The objectives of this PFE are to:

Extend the server-side data visualization capabilities of the tool;

Develop Linux and Windows versions of the client;

Refine the energy models that relate the number of exchanged bytes to the electrical consumption [Modèle];

Carry out longitudinal studies of specific users and mine the results along specific dimensions, e.g. application or service level (which application consumes the most, depending on its servers' localization) or access network (fixed vs. mobile).
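The energy models to be refined can start from a simple linear bytes-to-energy estimate with per-network-class intensities; in the sketch below the kWh/GB figures are placeholders, not the values the study will calibrate:

```python
# Hedged sketch of a linear bytes-to-energy model: each network class
# (national, European, international) has its own energy intensity.
# The kWh/GB values are illustrative placeholders only.
INTENSITY_KWH_PER_GB = {"national": 0.05, "european": 0.10, "international": 0.20}

def energy_kwh(bytes_per_network):
    """bytes_per_network: dict mapping network class -> bytes conveyed."""
    gb = {k: v / 1e9 for k, v in bytes_per_network.items()}
    return sum(gb[k] * INTENSITY_KWH_PER_GB[k] for k in gb)

traffic = {"national": 2e9, "european": 1e9, "international": 0.5e9}
print(round(energy_kwh(traffic), 3))  # 0.3 kWh for this toy breakdown
```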

This PFE will start with a study of the state of the art. Next, the student will test the server and client sides of the tool and carry out initial measurements. In parallel, extensions (Windows version, graphical interface) will be developed.

A 6-month internship is possible after this PFE.

Expected skills: Python and client-server programming. Computer networks.

References:

[Shift1] https://theshiftproject.org/article/pour-une-sobriete-numerique-rapport-shift/

[Shift2] https://theshiftproject.org/article/climat-insoutenable-usage-video/

[ARCEP] https://www.arcep.fr/uploads/tx_gspublication/reseaux-du-futur-empreinte-carbone-numerique-juillet2019.pdf

[Carbo] https://theshiftproject.org/carbonalyser-extension-navigateur/

[Modèle] Coroama, Vlad C., and Lorenz M. Hilty. "Assessing Internet energy intensity: A review of methods and results." Environmental impact assessment review 45 (2014): 63-68.

Cooperative Localization in LoRa Low Power Wide Area Networks

This PFE targets 2 students.

Advisors: Walid Dabbous, Thierry Turletti & Leonardo Lizzi

Mail: walid.dabbous@inria.fr, thierry.turletti@inria.fr, leonardo.lizzi@univ-cotedazur.fr

Telephone: 0492387718

Web pages: https://team.inria.fr/diana/team-members/walid-dabbous/ & https://team.inria.fr/diana/team-members/thierry-turletti/ & https://www.linkedin.com/in/leonardolizzi/

Place of the project: Inria

Address: 2004 route des Lucioles, 06902 Sophia Antipolis

Team: Diana project-team

Web page: https://www.inria.fr/equipes/diana

Pre-requisites if any: Signal processing, RF communication, Matlab.

Description:

The Internet of Things (IoT) is playing an increasingly important role today, and more than half of major new business systems are expected to incorporate IoT elements by 2020. LoRa [1,2] is an emerging communication technology for Low Power Wide Area Networks (LPWAN), known to be particularly efficient for long-range communication links (several kilometers) at very low cost. MIMO techniques applied in the LoRa context have proven beneficial for estimating the angle of arrival (AoA) of the signal [3]. In order to provide full localisation, distance information is needed in addition to the AoA. Ranging techniques can provide distance information based on time of flight, but require costly synchronisation. Another way to obtain distance information is to use RSSI measurements. The goal of this PFE is to explore a way to combine AoA and distance information to provide precise localisation in a collaborative way: MIMO-equipped gateways detect the AoA of the signal coming from a target node and ask a number of relays to collaborate in localising this target node by providing RSSI information. This work is proposed in the context of the I-LL-WIN project, in collaboration with LEAT, to develop intelligent wireless IoT networks capable of self-reconfiguration to optimise the application scenario.
Work plan:

The students will start with a state-of-the-art review on LoRa geolocation and ranging techniques. They will then perform simulations with channel models corresponding to different environments (indoor, outdoor) to evaluate the benefit of collaborative localisation in these environments compared to the use of ranging techniques. The approach(es) will then be evaluated through experiments in both indoor and outdoor environments with different set-ups:

- a MIMO gateway and several B-L072Z-LRWAN1 Discovery kits

- an 868 MHz LoRa board with GPS

- a 2.4 GHz LoRa board.

References:

[1] N. Sornin, M. Luis, T. Eirich, T. Kramp, O. Hersent, "LoRa Specification 1.0," LoRa Alliance standard specification, 2016. https://www.lora-alliance.org/

[2] Augustin, A., Yi, J., Clausen, T., & Townsley, W. M. (2016). A study of LoRa: Long range & low power networks for the internet of things. Sensors, 16(9), 1466. http://www.mdpi.com/1424-8220/16/9/1466/pdf

[3] Mahfoudi MN, Sivadoss G, Korachi OB, Turletti T, Dabbous W. Joint range extension and localization for low-power wide-area network. Internet Technology Letters, 2019. https://doi.org/10.1002/itl2.120

[4] Geolocation for LoRa Low Power Wide Area Network, Othmane Bensouda Korachi, Ubinet internship report, August 2018.

[5] Cooperative Localization in LoRa Low Power Wide Area Network, Florinda Fragassi, Ubinet PFE report, January 2020.

[6] An Introduction to Ranging with the SX1280 Transceiver. Semtech document. https://www.semtech.com/uploads/documents/introduction_to_ranging_sx1280.pdf
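As a minimal illustration of combining the two sources of information (all parameter values are assumptions, e.g. the reference path loss PL0 and the path-loss exponent n of the log-distance model, which would be calibrated per environment):

```python
# Illustrative sketch: combine an AoA estimate with an RSSI-based range
# (log-distance path-loss model) to position a node relative to a gateway.
# PL0 and n below are assumed values, not calibrated measurements.
import math

def rssi_to_distance(rssi_dbm, pl0_dbm=-40.0, n=2.7, d0=1.0):
    """Invert the log-distance model: RSSI = PL0 - 10 n log10(d / d0)."""
    return d0 * 10 ** ((pl0_dbm - rssi_dbm) / (10 * n))

def locate(gateway_xy, aoa_deg, rssi_dbm):
    d = rssi_to_distance(rssi_dbm)
    a = math.radians(aoa_deg)
    return (gateway_xy[0] + d * math.cos(a), gateway_xy[1] + d * math.sin(a))

x, y = locate((0.0, 0.0), aoa_deg=90.0, rssi_dbm=-67.0)
# Under these assumptions, -67 dBm maps to a 10 m range, so the node is
# estimated at roughly (0, 10) relative to the gateway.
```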

Deep learning for Image / DNA strand denoising

Name: Marc Antonini, Eva Gil San Antonio

Mail: am@i3s.unice.fr, gilsanan@i3s.unice.fr

Telephone: -

Web page: http://mediacoding.i3s.unice.fr/index.php/en/membres

Place of the project: I3S/CNRS

Address: Euclide B, 2000 Route des Lucioles, 06900 Sophia Antipolis

Team: Mediacoding (http://mediacoding.i3s.unice.fr/index.php/en/)

Web page: https://oligoarchive.github.io/

Pre-requisites if any: Matlab/Python

Description: Rapid technological advances and the increasing use of social media have caused a tremendous increase in the generation of digital data, a fact that nowadays poses a great challenge for digital data storage due to the short-term reliability of conventional storage devices. Hard disks, flash, tape and even optical storage have a durability of 5 to 20 years, while running data centers also requires huge amounts of energy.

An alternative to hard drives is the use of DNA, life's information-storage material, as a means of digital data storage. Recent works have proven that storing digital data in DNA is not only feasible but also very promising, as DNA's biological properties allow a great amount of information to be stored in an extraordinarily small volume for centuries or even longer with no loss of information. During the past few years, high-throughput DNA sequencing has dramatically reduced the cost of sequencing; however, this technology introduces a considerable amount of noise in the DNA strands, and the full retrieval of the stored information can be at stake.
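As a toy model of the noise the project must cope with (illustrative only: real sequencing noise also includes insertions and deletions and is position dependent), a strand can be passed through an i.i.d. substitution channel:

```python
# Toy sketch of sequencing noise: an i.i.d. substitution channel over DNA
# strands. The substitution rate below is illustrative, not measured.
import random

BASES = "ACGT"

def sequence_with_noise(strand, sub_rate, rng):
    out = []
    for base in strand:
        if rng.random() < sub_rate:
            # Substitute with one of the three other nucleotides.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

rng = random.Random(7)
original = "ACGTACGTACGT"
noisy = sequence_with_noise(original, sub_rate=0.1, rng=rng)
errors = sum(a != b for a, b in zip(original, noisy))
```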

The main goals of this project are:

1. To optimize our encoding solution to increase its robustness against sequencing errors.

2. To explore different deep learning models for image denoising (such as convolutional autoencoders, stacked denoising autoencoders, etc.) and DNA strand denoising (recurrent neural networks) to improve the quality of the decoded data.

Students will be required to work on one of the above topics.

Useful Information/Bibliography:

[1] Dimopoulou, M., Antonini, M., Barbry, P., & Appuswamy, R. (2019, September). A biologically constrained encoding solution for long-term storage of images onto synthetic DNA. In 2019 27th European Signal Processing Conference (EUSIPCO) (pp. 1-5). IEEE.

[2] Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P. A., & Bottou, L. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(12).

[3] Li, H. (2014). Deep learning for image denoising. International Journal of Signal Processing, Image Processing and Pattern Recognition, 7(3), 171-180.

[4] Convolutional Autoencoders for Image Noise Reduction, Medium,

https://towardsdatascience.com/convolutional-autoencoders-for-image-noise-reduction-32fce9fc1763

[5] DCNet — Denoising (DNA) Sequence With a LSTM-RNN and PyTorch, Medium, https://medium.com/@infoecho/dcnet-denoising-dna-sequence-with-a-lstm-rnn-and-pytorch-3b454ff727e7

Impact of containerization on data stream frameworks

Advisor:

Name: Fabrice Huet

Mail: fabrice.huet@unice.fr

Telephone: +33 4 92 94 26 91

Web page: https://sites.google.com/site/fabricehuet/

Place of the project: I3S Laboratory, Sophia Antipolis

Address: 2000 route des lucioles

Team: Scale

Web page: https://scale-project.github.io/

Description:

The advent of Big Data has given birth to a large number of models and environments to process large amounts of data, such as MapReduce [1] and its implementations [2]. More recently, a lot of work has focused on processing data streams [3,4], or a mix of batches and streams [5] in so-called Lambda architectures [6].

The large number of frameworks, each with its own dependencies and tools, has increased the complexity of deploying and running data stream frameworks. These issues have been encountered in other fields, and two solutions have been proposed. The first relies on virtualization [7] as a way to abstract the hardware and construct a portable image which can easily be deployed. A lighter-weight solution relies on containers [8], which bring most of the benefits of virtualization at a much lower cost.

In this PFE we want to investigate how data stream frameworks can be deployed inside containers, and what the performance cost is. The work should mainly focus on already published articles but, if time permits, experiments could be conducted on clusters with either Storm [9] or Flink [10].

References

[1] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008.

[2] Hadoop. https://hadoop.apache.org/.

[3] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, and Dmitriy Ryaboy. Storm@twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD ’14, pages 147–156, New York, NY, USA, 2014. ACM.

[4] Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja. 2015. Twitter Heron: Stream Processing at Scale. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 239-250. DOI=http://dx.doi.org/10.1145/2723372.2742788

[5] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud'10, pages 10–10, Berkeley, CA, USA, 2010. USENIX Association.

[6] Lambda architecture. http://lambda-architecture.net/.

[7] Barham P, Dragovic B, Fraser K, Hand S, Harris T, Ho A, Neugebauer R, Pratt I, Warfield A. Xen and the art of virtualization. ACM SIGOPS operating systems review. 2003 Oct 19;37(5):164-77.

[8] Bernstein D. Containers and cloud: From lxc to docker to kubernetes. IEEE Cloud Computing. 2014 Sep;1(3):81-4.

[9] https://storm.apache.org/

[10] https://flink.apache.org/

Simulating resource discovery in the OMNeT++ discrete-event simulator

Name: Luigi Liquori

Mail: Luigi.Liquori@inria.fr

Telephone: 06 78 35 80 88

Web page: https://cv.archives-ouvertes.fr/lliquori

Place of the project: Inria SAM

Address: 2004 route des Lucioles

Team: EPC KAIROS

Web page: https://team.inria.fr/kairos/

Pre-requisites if any: Experts in C++ are warmly welcome!

Description:

The objective of the present internship is to *simulate*, using the OMNeT++ discrete-event simulator, the different components and services involved in the current oneM2M resource discovery process. A discovery starts with a request expressed in a dedicated language, with the objective of retrieving the requested “thing” that matches the characteristics indicated in the query, according to the query language used.

In oneM2M, this maps to finding the oneM2M resources matching the query, expressed by means of filter criteria and/or ontology-based characteristics. The request, initiated by a so-called originator, is routed through the tree-like oneM2M structure that contains the resources, down to the destination.

In oneM2M, queries are always generated by an AE (Application Entity) and addressed to one or more CSEs (Common Services Entities) that are known a priori by that AE. The queries are expressed in the SPARQL language, extended to fit the oneM2M resources and topology structure, and currently require the query targets (resources/CSEs) to be indicated explicitly in order to delimit the query's applicability domain.

Communication takes place over the Mca and Mcc interfaces, and routing follows the tree-like structure of the oneM2M Service Provider (SP), whose root (the IN-CSE) is potentially connected to other SP infrastructures in a mesh-like topology. This means that only the SPARQL query language is used; the rest of the SPARQL protocol suite is replaced by the oneM2M communication system.

The oneM2M Access Control Policies are used to regulate the right to access the information. In particular, they describe who can discover and who can access the resources, and under which conditions, so that privacy and access rights are properly respected.

To support such queries, oneM2M resources can be associated with a semantic descriptor pointing to the corresponding semantic description. Moreover, to overcome the limitation of having to indicate specific nodes in the topology, resources that need to be discoverable without pointing to their hosting CSE are implicitly assumed to be announced to the CSEs that may be targets of the queries, typically the IN-CSE.

With Advanced Semantic Discovery (ASD), it is expected that several oneM2M functionalities may need to be enhanced, in particular:

• It is expected that an overlay-network-based routing mechanism for the queries will need to be introduced, to ensure that each query is properly propagated in the oneM2M system (intra- and inter-SP) so as to match the conditions and the logical and topological indications expressed in the query;

• It is expected that the query language will need to be extended (in the language itself or in the associated parameters) to cope with the specific IoT oneM2M architecture and functionality; in particular, it should become possible to address a query while knowing only the immediate CSE with which the query-raising AE is registered, without requiring any other knowledge of the whole oneM2M system (e.g., knowledge of the other CSEs to which the query should be addressed), and even to add non-topological indications such as a maximum response time;

• It is expected that the Access Control Policies (ACP) may need to be improved to cope with the enhanced overlay-network discovery capability; in particular, the case of inter-SP discovery may need to be addressed, specifying ACP extensions to support proper Semantic Discovery Agreements;

• It is expected that the registration, announcement, notification and ontologies of oneM2M resources would need to be extended to cope with the specific overlay network-based features;

• It is expected that, on Mcc’, the system will be able to interact with external systems adopting the standard SPARQL protocols.
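As a first feel for the problem, the routing of a discovery query through a tree of CSEs can be mocked in a few lines of plain Python. This is only an illustrative toy, not an OMNeT++ model; the class, the node names and the substring-matching rule are all invented for this sketch:

```python
# Toy model (plain Python, not OMNeT++) of discovery-query routing in a
# tree of CSEs. A query raised at a leaf CSE escalates towards the root
# (the IN-CSE) until a matching resource is found somewhere in a subtree.
class CSE:
    def __init__(self, name, parent=None):
        self.name, self.parent, self.children = name, parent, []
        self.resources = set()
        if parent:
            parent.children.append(self)

    def search_down(self, wanted):
        """Collect matching resources in this node and its whole subtree."""
        found = [(self.name, r) for r in self.resources if wanted in r]
        for child in self.children:
            found += child.search_down(wanted)
        return found

    def discover(self, wanted, hops=0):
        """Search the local subtree; escalate to the parent if nothing matches."""
        found = self.search_down(wanted)
        if found:
            return found, hops
        if self.parent:                      # escalate towards the IN-CSE
            return self.parent.discover(wanted, hops + 1)
        return [], hops

in_cse = CSE("IN-CSE")
mn1, mn2 = CSE("MN-CSE-1", in_cse), CSE("MN-CSE-2", in_cse)
mn2.resources.add("temperatureSensor42")

# An AE registered at MN-CSE-1 asks for a temperature sensor:
matches, hops = mn1.discover("temperature")
print(matches, hops)   # found under the sibling CSE after one escalation
```

In the real internship, each escalation would become an OMNeT++ message exchange whose latency and load can be measured.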

The student will *join* a team of two senior researchers from Inria/UCA and two Polytech students.

Keywords: IoT, simulation, network protocols, C++

Useful Information/Bibliography:

https://www.onem2m.org/

https://www.onem2m.org/images/files/IIC_oneM2M_Whitepaper_final_2019_12_12.pdf

https://www.onem2m.org/images/files/onem2m-executive-briefing_A4.pdf

Generation of automata for controlling critical systems

Name:

Sid Touati and Frédéric Mallet

Mail: Sid.Touati@inria.fr

Telephone:

Web page:http://www-sop.inria.fr/members/Sid.Touati/

Place of the project:

INRIA Sophia Antipolis

Address:

2004 Route des Lucioles, Sophia Antipolis

Team:

KAIROS

Web page:

Pre-requisites if any:

This internship is for students who have strong background in computer science (automata, compilation, formal languages).

Description :

Formal methods for modeling critical systems propose precise languages to describe their behaviour, whose semantics are richer than those of imperative programming languages. Thanks to formal methods, it is possible to guarantee or to check certain properties required for the correct execution of an application.

In order to motivate designers to use formal methods, we will focus our research efforts over the next few years on designing and implementing automatic and efficient code-generation methods from a high-level description such as CCSL (Clock Constraint Specification Language). The generated code will be in an imperative language such as C or C++, and will then be optimised by a compiler. The generated code must be compiler-friendly: reasonably clear to the compiler and amenable to low-level code-optimisation techniques.

This internship aims to learn the CCSL formal language and to understand how to build an automaton from a formal CCSL description, in order to generate C or C++ code from it. CCSL does not describe an application entirely, but only the behaviour of its clocks. A clock is used to trigger (or not) the execution of a function or task at a given time. CCSL is not a programming language for expressing algorithmic functions; it only expresses when they execute. There will be two main tasks:

1. Understand the formal specification with CCSL: reading and understanding articles and reports.

2. Understand how to build an automaton from a CCSL description. This automaton must describe all the possible valuations of the clocks.

The student will also be asked to use our own software to understand the behaviour of synchronous systems.
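To give a concrete flavour of task 2, here is a small Python sketch (our own illustration, not the team's tooling) of the automaton induced by a single clock constraint, "a causes b" (non-strict precedence: at every instant, a must have ticked at least as often as b). States record the difference count(a) − count(b), artificially bounded here to keep the automaton finite; real constructions handle the unbounded case differently.

```python
# States of the automaton are delta = count(a) - count(b), kept in
# [0, BOUND] for this finite sketch; transitions are labelled by the
# set of clocks that tick at a given instant.
BOUND = 3
ALPHABET = [{"a"}, {"b"}, {"a", "b"}]   # which clocks tick at a step

def step(delta, ticking):
    """Next state, or None if the step leaves [0, BOUND].

    delta < 0 would violate count(a) >= count(b); delta > BOUND is only
    an artifact of the bound chosen for this finite illustration."""
    new = delta + ("a" in ticking) - ("b" in ticking)
    return new if 0 <= new <= BOUND else None

# Build the explicit transition table of the automaton.
transitions = {(d, frozenset(t)): step(d, t)
               for d in range(BOUND + 1) for t in ALPHABET
               if step(d, t) is not None}

# From the initial state (delta = 0), b alone is forbidden:
print((0, frozenset({"b"})) in transitions)   # False
print(transitions[(0, frozenset({"a"}))])     # 1
```

The generated C/C++ code would essentially walk such a table, rejecting or delaying clock ticks that have no outgoing transition.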

Useful Information/Bibliography:

- Frédéric Mallet, Charles André and Robert de Simone. CCSL: specifying clock constraints with UML/Marte. Innovations in Systems and Software Engineering 4(3): 309-314, Springer, 2008. https://hal.inria.fr/inria-00371371/

- Frédéric Mallet and Robert de Simone. Correctness issues on MARTE/CCSL constraints. Science of Computer Programming, 106 : 78-92, Elsevier, 2015. https://hal.inria.fr/hal-01257978/

Deep learning in healthcare for cancer patients and Covid19

Name: Barlaud Michel et Thierry Pourcher

Mail: barlaud@i3s.unice.fr

Telephone: 0492942732

Web page:

Place of the project: I3S, 2000 route des Lucioles

Address:

Team: Mediacoding

Web page:

Pre-requisites if any: Machine learning, deep learning

Description:

Machine learning for healthcare is an emerging topic that benefits from unique local expertise. Mathematicians, biologists, and clinicians are working on joint innovative projects in healthcare. Our goals are new tools for diagnosis, prognosis and theragnosis of several cancers and Covid-19, aiming at personalized medicine. Our approaches are based on metabolomics analyses using mass spectrometry. The main issue is to develop new deep learning methods for high-dimensional metabolomic data with limited patient numbers, in order to obtain accurate predictive capability.

Variational autoencoders (VAEs) have found widespread application in learning latent distributions for high-dimensional data. However, classical VAEs assume Gaussian distributions, which results in a poor approximation of the latent distribution. The I3S team has recently developed an efficient new encoder method relaxing the Gaussian assumption [1]. Our method involves a new non-parametric supervised autoencoder.

The Master project aims to adapt our new autoencoder for selecting metabolomic signatures and improving accuracy in the diagnosis, prognosis and theragnosis of several cancers as well as Covid-19. The student will provide Python code and compare it with previous approaches. The student will work with both the I3S laboratory and the TIRO laboratory.
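As a rough illustration of the kind of objective involved (a linear toy, not the team's non-parametric method), a supervised autoencoder combines a reconstruction loss on the high-dimensional input with a classification loss computed on the latent code. The dimensions and data below are invented:

```python
import numpy as np

# Toy linear supervised autoencoder: the loss couples reconstruction of
# the input with class separation in a low-dimensional latent space.
rng = np.random.default_rng(0)
n, d, k = 20, 50, 2                     # 20 patients, 50 features, 2-D latent
X = rng.normal(size=(n, d))             # fake metabolomic measurements
y = rng.integers(0, 2, size=n)          # fake binary diagnosis labels

W_enc = rng.normal(scale=0.1, size=(d, k))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(k, d))   # decoder weights
w_clf = rng.normal(scale=0.1, size=k)        # linear classifier on the code

Z = X @ W_enc                  # latent codes
X_hat = Z @ W_dec              # reconstruction
logits = Z @ w_clf             # class scores from the latent space

recon = np.mean((X - X_hat) ** 2)
p = 1.0 / (1.0 + np.exp(-logits))
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
loss = recon + bce             # the supervised term shapes the latent space
print(float(loss))
```

Training would minimise this combined loss by gradient descent; the point of the sketch is only the coupling of the two terms, which is what lets a small-sample latent space remain discriminative.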

Useful Information/Bibliography:

[1] Michel Barlaud, Frédéric Guyard, and Zhiyun Xu. A non-parametric supervised autoencoder for discriminative and generative modeling. HAL 02937643, July 2020.

Alignment of ontologies with contextual representations of words

Name: Catherine Faron-Zucker

Mail: faron@i3s.unice.fr

Web page: http://www.i3s.unice.fr/~faron/index.html

Place of the project:

Laboratoire I3S

Polytech' Nice Sophia

930 Route des Colles,

Team: SPARKS

Web page: http://sparks.i3s.unice.fr/

Description

This PFE will be carried out as part of the PREMISSE project, which aims to propose an approach for automatically suggesting provider recommendations for claims.

This PFE is the continuation of work in the field of ontology alignment [1] carried out within the Wimmics team at INRIA and I3S. In this work, we proposed a new method [3] for finding correspondences between the concepts of two ontologies based on vector representations of words [2]. In our approach, we used FastText as the model to generate the vector representations of the words. In recent years, several more sophisticated context-sensitive models have emerged, such as CamemBERT [4].

The aim of this PFE is to improve the approach already implemented by using these contextual models, and to compare the new results with those obtained with FastText.
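The core of the embedding-based alignment step can be sketched as follows, with tiny hand-made vectors standing in for real FastText or CamemBERT embeddings (the `embed` function, the concept labels and the similarity threshold are all invented for illustration):

```python
import numpy as np

# Toy vectors in place of real word embeddings; in the project, embed()
# would query FastText or a contextual model such as CamemBERT.
VECS = {
    "car":     np.array([0.9, 0.1, 0.0]),
    "vehicle": np.array([0.8, 0.2, 0.1]),
    "person":  np.array([0.0, 0.9, 0.3]),
    "human":   np.array([0.1, 0.8, 0.4]),
}

def embed(label):
    return VECS[label]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def align(onto_a, onto_b, threshold=0.9):
    """Match each concept of ontology A to its most similar concept in B."""
    pairs = []
    for a in onto_a:
        best = max(onto_b, key=lambda b: cosine(embed(a), embed(b)))
        if cosine(embed(a), embed(best)) >= threshold:
            pairs.append((a, best))
    return pairs

print(align(["car", "person"], ["vehicle", "human"]))
# → [('car', 'vehicle'), ('person', 'human')]
```

With a contextual model, `embed` would additionally take the concept's surrounding text (labels of neighbouring concepts, definitions) into account, which is precisely what this PFE would evaluate against the FastText baseline.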

Bibliography:

[1] Euzenat, J., Shvaiko, P., et al.: Ontology Matching, vol. 18. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-49612-03

[2] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013).

[3] Dhouib, M. T., Zucker, C. F., and Tettamanzi, A. G. (2019). An ontology alignment approach combining word embedding and the radius measure. In International Conference on Semantic Systems, pages 191–197. Springer.

[4] Martin, L., Muller, B., Ortiz Suárez, P. J., Dupont, Y., Romary, L., de la Clergerie, É., Seddah, D., and Sagot, B. (2020). CamemBERT: a tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219, Online. Association for Computational Linguistics.

Automatic labelling of hierarchical clusters

Name: Catherine Faron-Zucker

Mail: faron@i3s.unice.fr

Web page: http://www.i3s.unice.fr/~faron/index.html

Place of the project:

Laboratoire I3S

Polytech' Nice Sophia

930 Route des Colles,

Team: SPARKS

Web page: http://sparks.i3s.unice.fr/

Description:

This PFE will be carried out as part of the PREMISSE project, which aims to propose an approach to suggest automatic recommendations of providers matching service requests.

This work is the continuation of work carried out in the field of automatic ontology construction within the Wimmics team at INRIA and I3S.

In this work [1], we started by proposing an approach for modelling unstructured knowledge into an ontology. This approach is based on machine learning techniques, more specifically hierarchical clustering. The objective of this PFE is to improve this construction method by automatically associating labels with the nodes of the clusters formed [2].
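As an illustration of the kind of labelling targeted, a simplified version of the idea in [2] scores candidate terms by how much more frequent they are inside a cluster than in its parent node. The documents and the scoring rule below are invented for this sketch:

```python
from collections import Counter

# Toy cluster labelling: pick the term whose relative frequency inside
# the cluster exceeds its frequency in the parent node by the most.
parent_docs = ["steel supplier europe", "plastic molding supplier",
               "steel beams construction", "office cleaning services"]
cluster_docs = ["steel supplier europe", "steel beams construction"]

def term_freqs(docs):
    """Relative frequency of each whitespace token over a set of documents."""
    counts = Counter(w for d in docs for w in d.split())
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def label(cluster, parent, top=1):
    tf_c, tf_p = term_freqs(cluster), term_freqs(parent)
    # Excess frequency over the parent: generic terms score near zero.
    score = {w: f - tf_p[w] for w, f in tf_c.items()}
    return sorted(score, key=score.get, reverse=True)[:top]

print(label(cluster_docs, parent_docs))   # "steel" stands out
```

Reference [2] uses richer statistics (phrase features, external resources) for the same purpose; this sketch only shows why contrasting a node against its parent discriminates cluster-specific vocabulary.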

Bibliography:

[1] Dhouib, M., Zucker, C. F., and Tettamanzi, A. (2018). Construction d'ontologie pour le domaine du sourcing. In 29es Journées Francophones d'Ingénierie des Connaissances, IC 2018, pages 137–144.

[2] Treeratpituk, P., & Callan, J. (2006, May). Automatically labeling hierarchical clusters. In Proceedings of the 2006 international conference on Digital government research (pp. 167-176).