Research

My research interests lie in the development of techniques that enable efficient data storage and dissemination. Below are a few topics I am currently working on.

DNA-Based Storage

DNA is an attractive potential medium for digital data storage for many reasons, including its unparalleled density (as noted by Feynman) and its durability. Today’s tape storage devices can store about 10 GB/mm³, whereas recent optical discs boast a storage density of approximately 100 GB/mm³. DNA has a theoretical storage density of 1 EB/mm³. In other words, storing a single zettabyte using existing technologies would require hundreds of cubic feet, whereas a zettabyte could be stored in DNA using roughly one cubic centimeter. In addition to its high density, DNA is remarkably durable. The world’s oldest genome was recently sequenced from 700,000-year-old horse DNA. If stored properly, DNA can retain recoverable information for hundreds of thousands of years.
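The volume comparison above follows directly from the quoted densities. Here is a quick check, assuming decimal units (1 GB = 10⁹ bytes, 1 EB = 10¹⁸ bytes, 1 ZB = 10²¹ bytes); the constant and function names are illustrative only.

```python
# Decimal byte units (an assumption; the text does not say decimal vs. binary).
GB = 10**9
EB = 10**18
ZB = 10**21

# Storage densities quoted in the text, in bytes per cubic millimetre.
DENSITY_BYTES_PER_MM3 = {
    "tape": 10 * GB,
    "optical": 100 * GB,
    "dna": 1 * EB,
}

MM3_PER_CM3 = 10**3
MM3_PER_FT3 = 304.8**3  # one foot is exactly 304.8 mm


def volume_mm3(num_bytes, medium):
    """Volume in mm^3 needed to hold num_bytes at the quoted density."""
    return num_bytes / DENSITY_BYTES_PER_MM3[medium]


if __name__ == "__main__":
    for medium in DENSITY_BYTES_PER_MM3:
        v = volume_mm3(ZB, medium)
        print(f"{medium:8s} {v / MM3_PER_CM3:14.1f} cm^3 {v / MM3_PER_FT3:10.4f} ft^3")
```

With these figures, optical media come out at a few hundred cubic feet per zettabyte and tape at roughly ten times that, while DNA needs 1,000 mm³, that is, one cubic centimeter.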

Despite the potential benefits of a DNA storage system, many obstacles to its widespread adoption exist. In particular, the cost to manufacture or synthesize strands of DNA containing user data remains prohibitively high, especially when compared to existing storage technologies. In addition, the readout or sequencing process presents unique challenges. Existing sequencers, such as the Illumina platforms, are bulky and are only able to read relatively short strands of DNA. Oxford Nanopore recently developed a portable nanopore sequencing architecture, but initial devices reported high error rates. The sequencing reliability issue is further exacerbated by the fact that the errors are predominantly deletions and insertions, for which no efficient and practical solutions are known. Our work includes the design of a portable DNA-based storage system that overcomes these challenges through a novel combination of special-purpose alignment methods, reconstruction algorithms and context-dependent deletion correction codes.

The encoding stage. This stage involves compression, representation conversion, encoding into DNA, and subsequent synthesis.
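To make the first steps of the encoding stage concrete, here is a toy sketch of compression followed by representation conversion into nucleotides. It assumes a fixed two-bits-per-base mapping (00→A, 01→C, 10→G, 11→T) and zlib compression, both chosen purely for illustration; it is not the encoder used in this work, and it includes no error correction and no synthesis constraints such as limits on repeated bases.

```python
import zlib

# Hypothetical mapping: each pair of bits becomes one base.
BASES = "ACGT"  # 00 -> A, 01 -> C, 10 -> G, 11 -> T


def bytes_to_dna(data: bytes) -> str:
    """Convert bytes to a base string, four bases per byte, high bits first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bytes(strand: str) -> bytes:
    """Invert bytes_to_dna; the strand length must be a multiple of four."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


def encode(payload: bytes) -> str:
    """Compress, then convert to a base string."""
    return bytes_to_dna(zlib.compress(payload))


def decode(strand: str) -> bytes:
    """Convert back to bytes, then decompress."""
    return zlib.decompress(dna_to_bytes(strand))
```

A round trip such as `decode(encode(b"user data"))` returns the original bytes as long as the strand is read back without errors, which is exactly the assumption that insertion and deletion errors violate in practice.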

Emerging Memory Devices

The atomic unit in a flash memory device is the floating gate cell. Information is stored by modulating the number of electrons within a cell. While it is relatively easy to increase the number of electrons, reducing the number of electrons is a block operation that affects about 10⁶ flash cells. Therefore, reducing the number of electrons in a cell (or decreasing a cell’s level) is a time-consuming operation, which degrades the lifetime of the device and induces asymmetric error patterns. In the context of flash memory, our work aims to extend the lifetime of these devices by developing (a) custom error-correction codes that target the asymmetric error patterns commonly found in flash and (b) coding schemes that delay the need to decrease a cell’s level.
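The idea behind (b) can be illustrated with the classic two-write write-once memory (WOM) code of Rivest and Shamir, used here only as a well-known stand-in and not as one of the schemes developed in this work. Three binary cells store two bits of data, and the data can be rewritten once by only raising cell levels, so the block erase is postponed. The function names are illustrative.

```python
# First-generation codewords: at most one cell is raised.
FIRST = {0b00: (0, 0, 0), 0b01: (1, 0, 0), 0b10: (0, 1, 0), 0b11: (0, 0, 1)}
# Second-generation codewords: bitwise complements of the first generation.
SECOND = {d: tuple(1 - b for b in cw) for d, cw in FIRST.items()}


def decode(cells):
    """Read the two data bits stored in three binary cells."""
    cells = tuple(cells)
    table = FIRST if sum(cells) <= 1 else SECOND
    for data, codeword in table.items():
        if codeword == cells:
            return data
    raise ValueError("cell state is not a valid codeword")


def write(cells, data):
    """Return a cell state storing `data`, reached by only raising levels.

    Raises RuntimeError when no such state exists, i.e. when a block
    erase would be needed before the new data can be stored.
    """
    cells = tuple(cells)
    for target in (FIRST[data], SECOND[data]):
        if all(t >= c for t, c in zip(target, cells)):
            return target
    raise RuntimeError("block erase needed")
```

Starting from erased cells `(0, 0, 0)`, writing `0b01` and then `0b10` moves the cells through `(1, 0, 0)` and `(1, 0, 1)` without any erase; only a further change of the data forces one.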

NAND flash memory cell

Synchronization

This work is motivated by the following scenario. Suppose we have some data (a file, for instance) on Host A and some related data on Host B, and Host B should obtain all the information held on Host A. One naive approach would be for Host A to simply transmit all of its data to Host B; however, if the two data sets are similar, this approach wastes valuable network resources. Another approach is to compute hashes on the data and then compare the hashes to iteratively determine the difference. With this approach, many rounds of communication are required. In adverse network conditions, the more rounds of communication required, the greater the stress placed on already limited network resources. Our work in this area focuses on the development of bounds and coding schemes for synchronizing data that require a small number of communication rounds.

This scenario is shown in the figure to the right, where two databases are synchronizing rows in a table and updates are highlighted in red. In this setup, data is never deleted from either database, but elements in the symmetric difference are frequently the result of small updates.
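The hash-comparison approach described above can be sketched as follows, to show why it needs many rounds. This is a simplified model, not one of the schemes developed in this work: it assumes both hosts hold lists of the same length, uses SHA-256 as an arbitrary choice of hash, and counts one round each time the hosts exchange hashes for all ranges still in question.

```python
import hashlib


def range_hash(blocks, lo, hi):
    """Hash blocks[lo:hi], length-prefixing each block to avoid ambiguity."""
    h = hashlib.sha256()
    for block in blocks[lo:hi]:
        h.update(len(block).to_bytes(8, "big"))
        h.update(block)
    return h.digest()


def find_differences(a, b):
    """Return (indices where a and b differ, number of rounds used).

    Both lists run in one process here; in a real protocol each call to
    range_hash on `a` or `b` would happen on a different host.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length lists")
    pending = [(0, len(a))]
    diffs, rounds = [], 0
    while pending:
        rounds += 1  # one exchange of hashes covering every pending range
        next_pending = []
        for lo, hi in pending:
            if range_hash(a, lo, hi) == range_hash(b, lo, hi):
                continue  # range is identical on both hosts
            if hi - lo == 1:
                diffs.append(lo)  # narrowed down to a single block
            else:
                mid = (lo + hi) // 2
                next_pending += [(lo, mid), (mid, hi)]
        pending = next_pending
    return diffs, rounds
```

In this model, locating even a single changed row among n rows takes about log₂ n + 1 rounds, which is the cost that schemes with a small number of communication rounds aim to avoid.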

Two databases synchronizing their data