McGill DISCS Lab

Data-Intensive Storage & Computer Systems

DISCS centers around efficient storage systems.

Our goal is to understand how new storage technologies (persistent memory, NVMe drives, RDMA) will impact the future of computer systems and shape future data-intensive applications, such as machine learning, data science, and edge computing applications.

An important part of the DISCS vision is training researchers who can recognize the full-system impact of new technologies. We are a diverse group and are always looking to recruit talented and motivated students.

📰 News

See the latest events in our group

[May '24] 💰 Happy to share the launch of our new NSERC CREATE program on Sustainable Data Systems, funded with $1.65M. This is a 6-year collaboration between colleagues at McGill, Concordia, the University of Toronto, and the University of Waterloo: Bettina Kemme, who led the effort, Essam Mansour, Natalie Enright Jerger, Hans-Arno Jacobsen, and Semih Salihoğlu. I look forward to seeing how this work will shape the outlook of our graduate students when designing systems for data science and machine learning.
[May '24] 💰 Our group received an FRQNT Research Support for New Academics grant. Thank you, FRQNT!
[May '24] 🎉 Jiaxuan Chen won an FRQNT Scholarship for graduate studies. Congratulations!
[Apr '24] 📰 Our group and collaborators have several papers and a poster in workshops associated with EuroSys'24. Congratulations to the entire team, especially to the lead student authors Rahma Nouaji, Yuqin Yan, Pritish Mishra, Vinicius Dantas de Lima Melo, and Myles Thiessen! Check out the EdgeSys and EuroMLSys programs, and stop by to say hello if you are attending EuroSys.
[Apr '24] 💰 Our group received a cloud credits sponsorship from AWS! Thank you, Amazon, and congratulations to Stella Bitchebe for leading the proposal.
[Jan '24] 🎉 Zach Doucet was selected for the Summer@EPFL 2024 internship program. Congratulations!

See all news.

Research

Data powers everything we do, and we are collecting it at unprecedented rates. The goal driving research at DISCS is to create a storage infrastructure that enables us to gain insights from this data in a fast and energy-conscious manner. See details on our three main research directions below.

See our publications here.
We open-source all our code. Have a look here.

Systems for Data Science & Machine Learning

Explore how storage can support ML & data science workloads in real time, on TB-scale datasets.

This research is done in collaboration with MLCommons.

Data Science and ML workloads are ubiquitous. From taking care of our health to running businesses to managing our energy systems and transport planning, we leverage learning to make more informed decisions. We obtain these insights through a combination of algorithms and vast amounts of data. The way data is stored and accessed strongly influences how fast the algorithms can provide us with useful insights. Inefficient data management can unfortunately slow down the entire pipeline.

The needs of ML and Data Science workloads are poorly met by current general-purpose storage systems. To obtain fast results, existing systems rely on heuristics or use stale information. At DISCS, we are designing new tools and storage systems that (1) scale with the TB-scale datasets used by Data Science and ML, (2) ingest and clean incoming data at high throughput, and (3) serve data with low latency.

This challenging goal entails many research directions, such as identifying opportunities to reduce data movement, designing adaptable data structures that harmonize with Data Science workloads, and making creative use of new storage resources (e.g., NVRAM and fast SSDs).

Storage Building Blocks for Fast Devices

Redesign caching, file systems, and indexes for new hardware and real systems.

Emerging storage technologies are challenging fundamental assumptions in computer systems design. One major assumption is the significant performance gap between memory and persistent storage access. This gap is now bridged by byte-addressable persistent memory. Another assumption is that I/O bandwidth is the main bottleneck in storage systems. This too has changed with the development of new fast drives (e.g., Intel Optane NVMe SSDs), which shift the bottleneck to the CPU. In addition, the storage stack is getting deeper and more heterogeneous. A typical server that developers and system administrators have to manage will likely contain RAM, persistent memory, and different types of SSDs and hard disks.

These hardware advances provide an opportunity to redesign the basic storage building blocks, such as file systems, caching policies, key-value stores, and relational databases, and to revisit the appropriate level of support that the operating system should provide.

Ultimately, given that the hardware and the workloads keep evolving, our long-term vision is to create a framework that automatically generates storage systems which meet the desired performance requirements, given the workload profile and a set of generic hardware characteristics as inputs.

Efficient Data Management for Edge Computing

Shape data management for IoT devices, which will be the world’s largest data producers by 2025.

This research is done as a part of a DND IDEaS micro-net, in collaboration with Profs. Eyal de Lara and David Lie from the University of Toronto, Prof. Aastha Mehta from UBC, and Prof. Julien Gascon-Samson from ETS Montreal.

The Internet of Things (IoT) is a fast-growing field that produces vast amounts of data. In fact, it is estimated that the data produced by IoT workloads alone in 2025 will be larger than all of the data produced in 2020. Naturally, this is an excellent opportunity for storage research.

IoT poses serious challenges in terms of resource management. Numerous IoT settings make use of battery-powered devices with limited energy, low storage and data processing capacities, and unreliable connectivity. An interesting direction is determining at what granularity such systems should store data at the sensor, edge, and cloud levels, while developing energy-efficient schemes for data filtering and data movement between these layers. The nature of the collected data raises compelling questions as well. One possible avenue is designing data layouts that are suitable for storing vast amounts of noisy data, which may also contain high levels of redundancy (e.g., in video surveillance systems).