SELECTED PROJECTS

A collection of ideas and projects led by Vero (in reverse-chronological order). Some of them are long-term endeavors that produced several achievements.  Projects appearing at the top of the list are work-in-progress. 

DISTRIBUTED SYSTEMS AND REPLICATION

Work-in-progress: To be updated soon...

DECENTRALIZED & PRIVACY-ENHANCED STORAGE 

Institution: UiS, EPFL

Vero's role: Principal investigator

Research collaborations with scientists at: Delft, UiS

Keywords: plausibly deniable storage, IPFS, Swarm, p2p, Merkle trees, DAGs, erasure codes, alpha entanglement codes, simple entanglements, incentives, tokens

A brief summary of publications. Some papers are under peer-review. 

More information coming soon💾

ALGORITHMS FOR TRUSTWORTHY DATA STORAGE

Institution: UNINE, UCSC, UiS, EPFL

Vero's role: Principal investigator

Keywords: erasure codes, information theory, practical codes, alpha entanglement codes, simple entanglements, RAID-like systems, replication

Research collaborations with scientists at: University of Houston, University of Santa Clara, UiS

Research outcomes: simple entanglements (IPCCC 2016), alpha entanglement codes (DSN 2018), entangled Merkle trees (Middleware 2021), and a PhD thesis (UNINE 2017)

Year: 2012 - ongoing

Alpha entanglement codes create interdependencies between data. They are a crucial component for building fault-tolerant storage systems. The encoder algorithm weaves content into the entanglement lattice, an abstraction that represents the relations between files or data chunks. The lattice can be used at any time to recover missing content. This algorithm for protecting data against failures can be used in datacenters, in p2p networks, and in any other RAID-like storage system.
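To make the weaving idea concrete, here is a minimal Python sketch of a single-strand ("simple") entanglement chain. It is an illustration only, not the alpha entanglement construction from the papers; the helper names and block sizes are hypothetical.

```python
# Illustrative sketch only: a single-strand entanglement chain.
# Each parity p[i+1] = d[i] XOR p[i], so every parity accumulates
# information about all earlier data blocks.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def entangle(data_blocks, block_size=4):
    """Return the parity chain for a list of equally sized data blocks."""
    parities = [bytes(block_size)]             # p0: all-zero seed parity
    for d in data_blocks:
        parities.append(xor(d, parities[-1]))  # p_{i+1} = d_i XOR p_i
    return parities

# Encode four toy blocks, then rebuild a missing data block from the two
# parities that surround it: d_i = p_i XOR p_{i+1}.
blocks = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
p = entangle(blocks)
i = 2
assert xor(p[i], p[i + 1]) == blocks[i]
```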

Entanglement codes offer an elegant approach to building systems that are robust against transient, permanent, crash, or byzantine failures. They are an alternative to replication and to classical erasure codes. Replication is too expensive in terms of storage space, and classical erasure coding techniques are too expensive in terms of bandwidth and I/O. A replica does not use space efficiently: it is only useful to recover the particular replicated content and nothing else. Classical erasure codes require k blocks to repair data from a single failure, making them impractical for the 90% of datacenter errors that are transient, single failures. Although erasure codes are space optimal in theory, they are generally restricted to cold data in datacenters. Erasure codes are also problematic in decentralized environments, especially in a p2p network in which any peer may contribute to repair content. Check the decentralized and privacy-enhanced storage project tab for more information.
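As a rough back-of-the-envelope comparison (the numbers are illustrative assumptions, not figures from the publications), the following Python snippet contrasts the reads needed to repair one lost block under the three approaches mentioned above:

```python
# Illustrative repair cost for a single lost block (hypothetical layouts).
# Reed-Solomon-style (k, m) codes must read k surviving blocks to rebuild one;
# 3-way replication reads 1; a single-strand entanglement chain reads the 2
# adjacent elements (one data block and one parity).

k, m = 10, 4   # hypothetical RS(10, 4) layout
print("replication: storage 3.0x, reads to repair 1 block:", 1)
print(f"RS({k},{m}):  storage {(k + m) / k:.1f}x, reads to repair 1 block:", k)
print("entanglement (single strand): storage ~2.0x, reads to repair 1 block:", 2)
```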

The interdependencies built by the entanglement codes not only address the problems mentioned above but can also mitigate censorship and improve data integrity in a system. While data is being stored, the redundant elements continue to propagate information through the system. A multi-way recovery structure emerges from the entanglement lattice: any data chunk can be thought of as the center of a structure in which other chunks can be combined deterministically to recover or validate content. In other words, there are multiple ways to recover data, to prove availability, and to prove redundancy.
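A minimal sketch of that multi-way property on a toy single-strand chain (illustrative only; the real lattice combines several strands per block, which multiplies the available repair paths):

```python
# Toy multi-way repair on a single entanglement strand (illustrative only).
# A missing parity p_i can be rebuilt from its left neighbours (d_{i-1}, p_{i-1})
# or from its right neighbours (d_i, p_{i+1}); both paths yield the same bytes,
# which is also the basis for cross-checking (validating) content.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # data blocks
p = [bytes(4)]                             # p0 = zero seed
for block in d:
    p.append(xor(block, p[-1]))            # p_{i+1} = d_i XOR p_i

i = 2                                      # pretend p[2] is lost
from_left = xor(d[1], p[1])                # d_{i-1} XOR p_{i-1}
from_right = xor(d[2], p[3])               # d_i XOR p_{i+1}
assert from_left == from_right == p[i]
```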

Vero's first novel idea was proposed in 2012 under the name helical entanglement codes. This work resulted in several follow-up publications, some of them chapters of her PhD thesis. Many people have found the idea of entanglement codes intriguing and worth exploring, including several students who produced capstone projects and bachelor, master, and PhD theses based on this line of research. Vero continues to study new entanglement algorithms as well as their applications, especially in distributed and decentralized systems.

This research project was partially funded by the Swiss project Trustworthy Cloud Storage (SNSF Sinergia Grant No. 136318). In addition, Vero received further funding from SNSF Doc.Mobility 162014 and other small grants⛓️

DIGITAL TRANSFORMATION IN HEALTHCARE

Institution: Quality of Life Technologies Laboratory, Human Computer Interaction Sector, DIKU, University of Copenhagen

Vero's role: Data engineer and data scientist (postdoc)

Keywords: Health informatics, Data sharing, GDPR, Wearable devices, Life tracking, Quality of life, FAIR data

Research collaborations: Open Human Foundation

Year: 2018

Publications: OHA's vision, Collecting, exploring, and sharing personal data, Open Humans Platform

PUBLICATIONS SUMMARY:

Background

Health, quality of life and well-being are closely related. The World Health Organization (WHO) indicates that measuring quality of life can provide valuable information in medical practice, for improving the doctor-patient relationship, as well as for assessing the effectiveness and merits of treatments, in health service evaluation, in research and in policy making. Quality of life (QoL) is multidimensional. Collecting qualitative and quantitative data on biological, physiological, psychological and social-environmental factors may reveal unprecedented information about the data subject. Forging meaning from these factors in order to achieve a better quality of life is undoubtedly a good reason to gather data. Health-related data analysis plays an important role in self-knowledge, disease prevention, diagnosis, and quality of life assessment. Concerns appear when we start discussing who controls or who processes these data collections. This dilemma raises privacy and security tensions.

Many enthusiastic life trackers have collected data for years. The use of non-medical devices to maintain or restore health reinforces the importance of patient-driven and consumer health care models that include mobile health (mHealth) applications to track lifestyle habits, wrist-band trackers, or other kinds of IoT/wearable devices.

With the advent of data-driven solutions, a myriad of apps and Internet of Things (IoT) devices (wearables, home medical sensors, etc.) facilitate data collection and provide cloud storage with a central administration. Arguably, the cloud is not the ideal place for personal health information.

Access to data is essential to conduct data-driven research, but academic labs often lack the resources to gather data on a large geographical scale. A remedy could be to conduct research in association with an industry partner to gain access to big data repositories. Unfortunately, this strategy introduces the risk of statistical bias, since data collected by users of a single product brand, e.g. Apple smartwatches or Fitbit trackers, may not represent the general population. Aside from that, many vendors are reluctant to share information that they used to share before the introduction of GDPR. Another problem is that research projects involving personal health data require ethical approvals. From our own experience, and that of others, the application process can be complex, time-consuming and not necessarily successful.

Centralized or decentralized solutions are not magic bullets 

Centralized solutions, administered and regulated by third parties, are not the only possible architecture for storing personal health information. In the healthcare domain, however, other alternatives had rarely been explored until blockchain became a popular option. In fact, there are many different solutions that can give more control to the data subject. Data could be stored locally (at home) on equipment owned and administered by the data subject. It is worth mentioning that storing data in more decentralized environments is not an impediment to extracting valuable information from these resources. Researchers have shown that it is possible to apply Bayesian predictive models and deep learning methods in more collaborative and distributed environments to reduce privacy concerns.

Completely decentralized solutions based on blockchain or other distributed ledgers to store health data are trying to find their way in the complex, regulated healthcare space. When storing data in blockchain-based storage systems, the first question to ask is whether the blockchain is permissionless (public) or permissioned (private/consortium). Permissionless blockchains distribute data across all nodes in the system. Nodes contribute resources, e.g., storage, bandwidth, computation and electricity (often in exchange for a fee).

Neither centralized nor decentralized solutions are a magic bullet for data-driven innovation if individual, community and societal values are ignored.

GDPR and challenges

The level of data protection legislation around the globe differs from country to country. The General Data Protection Regulation (GDPR) - (EU) 2016/679 is a positive step to protect individuals within the European Union and the European Economic Area. GDPR imposes obligations on data controllers and data processors, among other protections, to defend the individual's rights to access, move and forget data. However, at this stage, there is not much experience with this new regulation nor with its implementation. As stated by the Open Data Institute (ODI) in its guidance: "Data-related activity can be unethical but still lawful."

GDPR was designed with centralised architectures in mind; therefore, implementing GDPR can be quite complex if data is distributed across many systems. It is necessary to evaluate all sources of personally identifiable information, e.g. legacy systems, backups, etc. To remediate and facilitate implementation, an apparently reasonable piece of advice is to eliminate unnecessary sources of personally identifiable information and centralise the solutions even more. Cloud companies might find it easier to implement GDPR than traditional companies. Such an approach may lead to even more massive data collections administered by a few companies.

Keeping personal data in a public blockchain conflicts with GDPR privacy requirements. Encrypted data is considered under the umbrella of GDPR, since the data subject can be indirectly identified. In permissioned blockchains, an administrator can control and grant permissions. These are still alternatives to traditional databases, but the data subject may become powerless again due to the centralised administration. The combination of blockchain and off-chain storage is becoming a common answer to address privacy concerns. Personal data can be stored in a separate location (where data can be modified and deleted) and linked to a blockchain with a pointer. Given the many implementations, this topic deserves careful consideration and further research. 
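As an illustration of the off-chain pattern described above, here is a hypothetical Python sketch, not tied to any specific blockchain or product: the chain stores only an immutable hash pointer, while the personal record lives in a mutable off-chain store where it can be corrected or erased.

```python
# Hypothetical sketch of blockchain + off-chain storage (assumptions: the
# "chain" holds only hash pointers; personal data stays in a mutable store).
import hashlib, json

off_chain_store = {}   # mutable: data can be updated or deleted here
chain = []             # append-only: holds pointers (hashes), never raw data

def publish(record: dict) -> str:
    payload = json.dumps(record, sort_keys=True).encode()
    pointer = hashlib.sha256(payload).hexdigest()
    off_chain_store[pointer] = payload     # personal data stays off-chain
    chain.append({"pointer": pointer})     # only the hash goes "on-chain"
    return pointer

def erase(pointer: str) -> None:
    off_chain_store.pop(pointer, None)     # right to erasure: the on-chain
                                           # hash remains but no longer resolves

ptr = publish({"subject": "alice", "resting_hr": 58})
erase(ptr)
assert ptr not in off_chain_store and chain[0]["pointer"] == ptr
```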

Individual, community, and society values

The value of personal data goes beyond the individual. There are countless examples of the benefits of personal data to communities and the society as a whole. Unfortunately, lack of transparency erodes trust and the willingness of individuals to share sensitive personal data such as health data. 

Modernising this area – with improved data access, sharing and long-term archiving methods – requires careful consideration. As a starting point, the Organization for Economic Cooperation and Development (OECD) proposes considering the following aspects: (a) technological, (b) financial and budgetary, (c) legal and policy, (d) institutional and managerial, and (e) cultural and behavioural factors.

Open Health Archive 

The Open Health Archive (OHA) is a concept for a collaborative platform to manage and archive personal health information. Such a platform would support individual, community and societal needs by facilitating collecting, exploring and sharing personal health and QoL data.

In order to have control over personal data collections, individuals should participate in the system administration and, ideally, have physical access to the storage devices that hold their data repositories. A user-controlled and open infrastructure built with Free/Libre and Open Source Software (FLOSS) can provide more control and more options to customize solutions to an individual's specific health and life conditions. Furthermore, FLOSS transparency facilitates transdisciplinary research dialog.

Our vision for OHA is to emphasize the creation of value in a pyramid structure that benefits the individual, communities and society. To generate sustainable benefits for communities and societies, we propose bottom-up innovation that emerges from empowered individuals and transparent management of personal data. Little has been done to provide individuals with tools and resources that empower them with independence and self-sufficiency; in other words, to become more active with their health-related data.

Everyone can contribute and customize tools that help individuals run their own longitudinal self-quantification experiments, which may help to achieve low participation attrition. Gradual openness helps to protect and respect privacy according to individual preferences while easing the path to sharing data whenever the data subject takes that decision, including consent options for posthumous donations.

OHA moves the focus of attention from big data to rich data. Rich data is constructed by aggregating inputs from different sources (or connectors) that create personal health information.

In brief, OHA is a computational stakeholder powered by algorithms developed by the community to help humans with the management and archival of personal health information. 

Open Humans platform

Open Humans is a community-based platform that enables personal data collections across data streams (e.g., personal genetic data, wearable activity monitors, GPS location records, and continuous glucose monitor data), giving individuals more access to their personal data and more control over consent (sharing authorizations), and enabling academic research as well as patient-led projects. The Open Humans project aims at providing a path to share data, such as genetic, activity, or social media data, with researchers. Data is ingested using open source connectors to data providers. Users can store data privately to archive and analyse it, or opt to contribute data to research projects proposed by the community. The website and other tools are open source and the storage service is free. Datasets are stored on Amazon Web Services infrastructure in the US.

Open Humans highlights how a community-centric ecosystem can be used to aggregate personal data from various sources, as well as how these data can be used by academic and citizen scientists through practical, iterative approaches to sharing that strive to balance considerations of participant autonomy, inclusion, and privacy.

MACHINE LEARNING APPLIED TO CYBERSECURITY

Institution: Graduate School of Interdisciplinary Information Science, University of Tokyo

Vero's role: Cybersecurity researcher

Dataset source: Network packets captured at the main gateway of the university campus, DARPA 98 training set, KDD Cup 99. The analysis involved 6.5 TB of compressed binary tcpdump data, representing 12 hours of network traffic.

Keywords: Artificial neural networks, libpcap, Wireshark, Bro IDS, Snort IDS, ML, KDD benchmarks, tcpdump, deep-packet analysis, network security 

Years: 2008-2011

INSIGHTS

Vero went down the rabbit hole into the characterization of anomalies in a large network environment during her first research experience abroad. The following text is slightly adapted from the abstract of her master thesis. 

Intrusion Detection Systems (IDS) are used to detect exploits or other attacks and raise alarms. Some IDSs are able to detect another category of events that indicate that something is wrong but may not be an attack. These anomalous events usually receive less attention than attack alarms, causing them to be frequently overlooked by security administrators. Real network traffic captured in large environments exhibits diverse protocol behaviors and contains "the crud": unwanted or unnecessary data that intrusion detection systems such as Bro (now Zeek) and Snort encounter during monitoring. Additionally, the complexities involved in the capture process are significant, and flaws are often present in datasets. The handling of these flaws can significantly impact measurement accuracy, and ultimately the detection capabilities of intrusion detection systems are greatly influenced by system configurations.

Given the rise in stealthy attacks and the ongoing threat of botnets, anomalous events observed in traffic should be regarded with increased seriousness. Observing this activity is crucial for understanding network traffic characteristics. While abnormal behaviors might be legitimate, such as misinterpreted protocols or malfunctioning network equipment, they might also be engineered by attackers crafting packets to bypass monitoring systems. 

ML algorithms could increase the detection power of standard IDSs deployed in operational network environments. Despite the enthusiasm among researchers for applying automatic or semi-automatic learning methods to security logs, these techniques have seen limited acceptance in real-world network environments. A critical factor for the success of machine learning (ML)-based systems in operational environments is a deeper understanding of domain-specific data. Much of the literature on ML applied to intrusion detection relies on outdated datasets derived from simulated networks within limited settings, e.g., the KDD benchmark dataset. The KDD dataset is limited to 41 features that describe the traffic generated by user connections, with the assumption that they can describe real Internet traffic. New Internet traffic characteristics are not considered.

This project demonstrates that the most important part of pre-processing is auditing (1) real, (2) up-to-date, and (3) large amounts of data before selecting reliable features for an IDS. It provides a reference framework for understanding anomalous behavior, addressing questions such as the frequency and causes of these events. Thus, it assists in better identifying and addressing the nuances of anomalous traffic, which are often overlooked but could indicate significant security threats. Focusing on the goal of achieving more efficient use of ML, priority is given to gaining insights from deep packet analysis based on a large dataset captured at the main university campus gateway.

Overall, this study addresses problems encountered in data preprocessing that are undocumented elsewhere. It evaluates the quality of the captured data to improve understanding of major traffic characteristics and to identify potential explanations for observed anomalies. Anomalies found in operational network environments may indicate evasion attacks, application bugs, and a wide variety of other factors that strongly influence intrusion detection performance. One of the main contributions is the identification of diverse pathologies at different levels of the network stack, covering a wide variety of cases such as fragmentation, bad packet checksums, TCP anomalies, and application anomalies, which account for 40% of the total anomalous events triggered by Bro IDS. Using the KDD dataset as a reference, this work identifies 5 behaviors that are missing from the KDD dataset. These behaviors can constitute new features, or can be used to rule out data that may distort statistics and consequently reduce the detection power of machine learning algorithms. The characteristics are: bugs in applications, detection of experimental systems (a typical presence in academic networks), servers found running on non-standard ports, protocol violations, and indicators that may alert about evasion techniques🔎
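For flavor, a hypothetical Python sketch that tallies a few of the pathologies listed above from a small capture. It uses scapy, which is not the tooling of the thesis (that work relied on Bro, Snort, and libpcap-based analysis over much larger traces); the file name and port list are placeholders.

```python
# Illustrative sketch only: count a few traffic pathologies in a small pcap.
from collections import Counter
from scapy.all import rdpcap, IP, TCP   # pip install scapy

COMMON_PORTS = {22, 25, 53, 80, 110, 143, 443}   # illustrative "standard" services
tally = Counter()

for pkt in rdpcap("sample.pcap"):                # small trace, loaded fully in memory
    if IP not in pkt:
        continue
    ip = pkt[IP]
    if ip.frag > 0 or ip.flags.MF:               # IP fragmentation
        tally["fragmented"] += 1
    if TCP in pkt:
        tcp = pkt[TCP]
        if tcp.flags == "S" and tcp.dport not in COMMON_PORTS:
            tally["syn_to_nonstandard_port"] += 1   # possible server on an odd port
        if tcp.dataofs is not None and tcp.dataofs < 5:
            tally["bad_tcp_header_length"] += 1     # malformed TCP header

print(dict(tally))
```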

Technical Report: Master Thesis, January 14th 2011, University of Tokyo

SUSTAINABLE FISHERIES IN ARGENTINA'S EEZ

Institution: National Fisheries Authority

Vero's role: Chief Information Officer and Responsible for the Fishing Vessel Monitoring System (VMS) for Argentina's Exclusive Economic Zone (EEZ)

Stakeholders: Argentina's federal government, Organization of American States, Naval prefecture, Navy, National Institute for Fisheries Research and Development, shipowners, judicial system, communication providers, Commission for the Conservation of Antarctic Marine Living Resources, National Audit Office, etc.

Dataset source: GPS vessel positions transmitted via Globalstar and Inmarsat-C

Keywords: Geographic information systems (GIS), sustainable fishing, satellite imagery, Inmarsat-C packets, fishery policy, MapInfo, Vessel monitoring system (VMS), GPS, PostGIS, ISO 27001, computer audits

Years: 2004-2006

Applies to SDG Goals: 14 - Life Below Water

Background image: Visualization of the vessels in Argentina's EEZ and closed fishing areas. The visualization subsystem, created 20 years ago for transparency purposes, remains active today (Last accessed: 30-April-2024).

SUMMARY:

Effective management of all fisheries is an essential part of the United Nations sustainable development agenda. Argentina's commercial fishing industry produces $2.7 billion (US dollars). As part of the activities to ensure conservation, protection, and proper use of marine resources, the federal government mandates that large-capacity fishing vessels operating in Argentina's Exclusive Economic Zone use a fishing vessel monitoring system (VMS). Fishing vessels must be fitted with a transmitter carrying a built-in GPS receiver in compliance with fishing legislation. Data analysis and visualization in a geographic information system facilitate decision and policy making. Navigation patterns are classified and may trigger alarms. In some cases, e.g., an alleged infringement, the authority orders the vessel's captain to return immediately to port.
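As a toy illustration of how a closed-area alarm could be raised from a reported position (hypothetical coordinates and vessel IDs; the production VMS relied on proper GIS tooling such as MapInfo and PostGIS):

```python
# Illustrative sketch only: alarm when a GPS position falls inside a closed area.
from shapely.geometry import Point, Polygon   # pip install shapely

# Hypothetical closed fishing area defined as a lon/lat polygon
closed_area = Polygon([(-60.0, -45.0), (-58.0, -45.0), (-58.0, -47.0), (-60.0, -47.0)])

def check_position(vessel_id: str, lon: float, lat: float) -> None:
    if closed_area.contains(Point(lon, lat)):
        print(f"ALARM: {vessel_id} reported a position inside a closed area")

check_position("AR-1234", -59.1, -46.2)   # inside the toy polygon -> alarm
check_position("AR-5678", -55.0, -40.0)   # outside -> no alarm
```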

Vero was hired to strengthen the technical capacity of the federal government to control fishing and promote sustainable fishing. Her duties included building and operating the national VMS for the national fisheries authority. At the time Vero took on this responsibility, the previous system, known as MONPESAT, was not working. Without any controls in place, the fishery was depleted and in a state of emergency.

She developed and implemented the new satellite VMS used by the national authority and the online visualization resource consulted by the general public. She led the daily surveillance operations and collaborated closely with Argentina's Coast Guard, the Navy, the National Institute for Fisheries Research and Development, and relevant local authorities in the coastal provinces. She trained technical operators to ensure the quality of the service. In addition, she conducted audits based on ISO 27001 to assess the information standards of the communication service providers operating in the system.

Vero's efforts generated positive changes immediately. Argentina's National Audit Office appraised the new system. Upon her first work anniversary, she took over all the fishing information systems used by the national authority and became the Chief Technology Officer🐟

ISO 9001 CERTIFICATION FOR THE ENERGY SECTOR

Institutions: Tecna Estudios y Proyectos de Ingeniería S.A. (1974-2023), Flargent S.A. 

Industry: Oil and Gas

Vero's role: Quality Assurance Office Manager

Stakeholders: all internal departments, company suppliers, auditors

Keywords: DIN ISO 9001 certification, quality management system, quality assurance, continuous improvement, documentation, internal audit, preventive actions, corrective actions, risk management, non-conformity  

Years: 1999-2000

Vero was integral in achieving ISO 9001 certification for both Tecna SA and Flargent SA, two companies within the same holding group specializing in engineering services for the energy sector. As the Quality Assurance Office Manager, she was responsible for overseeing the extensive documentation process required for certification. This included developing and maintaining quality manuals, process maps, signature registers, nomenclature, a document versioning system, and compliance records to meet stringent international standards.

Her role involved coordinating the standardization of procedures and implementing robust quality management systems across both companies. Vero organized detailed training sessions for staff to ensure understanding and adherence to new quality protocols. She also managed the preparation for internal and external audits, which involved scheduling, document review, and facilitating communications between auditors and company departments. 

Through her diligent management and strategic planning, both Tecna SA and Flargent successfully met the ISO 9001 standards, highlighting Vero's expertise in handling complex multi-company projects and her ability to foster a culture of continuous improvement and compliance🥇