Research

My research is focused on the application of coding theory and combinatorics to problems that arise in the field of polymer based data storage. Modern digital data storage systems are facing fundamental storage density limits and to address the emerging needs for large data volume archiving, it is of great importance to identify new nanoscale recording media.

Recently proposed molecular storage paradigms offer storage densities that are order of magnitudes higher than those of flash and optical recorders but the systems often come with a prohibitively high cost and slow and error-prone read/write platforms.

My work focuses on two such molecular storage platforms.

Cartoon of a digital polymer is presented on top of the figure. Molecules representing 0 and 1 are drawn as circles, and a chain with such circles represents a digital polymer. The polymer is then given as a input to a tandem mass spectromer. The cartoon of a tandem mass spectrometer is given by a rectangular box. An illustration of the output of the spectrometer is given at the bottom of the image. It is a peak series.

Reading a digital polymer

Polymer-based storage systems

In these platforms, two molecules that significantly differ in their masses represent the bits 0 and 1, respectively, and are used as building blocks in the process of synthesizing user-defined information content. The synthetic polymers are read by tandem mass (MS/MS) spectrometers. A mass spectrometer breaks multiple copies of the polymer uniformly at random, thereby fragmenting the string into substrings. The masses of these substrings are reported as the output of the system.

We propose a new family of codes that allows for both unique string reconstruction and correction of multiple mass errors.

Noting that these polymers are stored and read together, we design codebooks with asymptotically optimal rates such that multiple strings can be reconstructed based on their joint MS/MS readout. [arXiv 2003.02121]

A cartoon of a double helix DNA is represented on the left side of the image. Ten locations on one strand of the DNA are identified. Arrows from the identified locations lead to the rows of a 10 cross 8 matrix. The columns represent different registers. Each register has a 0 or a 1 in its row corresponding to whether or not the DNA location was left as is or nicked.

Writing on DNA

A DNA-based storage system

Group testing was recently shown to increase the storage density of topological DNA-based data storage. In such a system, nanoscopic holes are punched into the sugar-phosphate backbone of one strand of a double-stranded DNA molecule. A "hole" indicates the value 1 while the absence of a hole indicates the value 0. Multiple copies of the same DNA strands are punched to bear different patterns. These are mixed together and stored in a single microwell. The mixture with various punch patterns are then sequenced and read together.

However, multiple holes punched in close proximity results in poorer quality outputs leading to a constraint on the runlength of 0s between pairs of 1s. Thus, we aim to construct a binary code such that multiple strings can be sequenced and read simultaneously with precision.

We address this problem by introducing a new runlength limited group testing paradigm, a simplification of the actual mixture identification problem. [DOI 10.1109/ISIT44484.2020.9174502]

Besides the two alternatives to the digital storage systems described, I also worked on other problems motivated by bioinformatics.

A cartoon of the Parallel Search Algorithm.

Parallel Search Algorithm for SemiQuantitative GT

Group testing for Covid-19

In this collaborative work, we developed a comprehensive guide that highlights 1) the issues specifically posed by the Sars-Cov-2 virus, and 2) features of the RT-PCR testing mechanism and provide innovative algorithms to tackle the problem of testing under limited resources. In particular, we focus on the semiquantitative nature of the RT-PCR tests: not only do these tests detect the presence of the virus in the sample, these tests also provide partial information about the viral loads present in the sample. [arXiv:2011.05223]

Profile HMMs

Identifiability of Profile HMMs

Profile Hidden Markov Models (HMMs) are graphical models that can be used to produce finite length sequences from a distribution, with multiple applications, including protein structure and function prediction, and multiple sequence alignment. Although statistical identifiability is a fundamental aspect of any statistical model, identifiability of profile HMMs was not investigated previously. In our work, we reported on preliminary results towards characterizing the statistical identifiability of profile HMMs in one of the standard forms used in bioinformatics. [DOI 10.1109/TCBB.2019.2933821]

I intend to study interdisciplinary problems, with focus on areas where coding and combinatorial techniques are pivotal. Presently, I am drawn towards certain problems in bioengineering and bioinformatics that serve as promising future research directions. My secondary interests are aligned along theoretical sequential learning and social learning [DOI 10.1109/ICASSP.2018.8462324 ]. I am also open to broadening my research scope, and would like to study and work on other problems in information theory, bioinformatics and learning theory.

Google Sites

Report abuse