Program

Schedule

All times are in Eastern Standard Time (EST).

9:45am - 10:00am, Opening Remarks

10:00am - 12:30pm, Theory

  • 10:00-10:30: Amir Abboud, Weizmann Institute of Science

Fine-Grained Complexity of Analyzing Compressed Data: Quantifying Improvements over Decompress-And-Solve

Abstract

Can we analyze data without decompressing it? As our data keeps growing, understanding the time complexity of problems on compressed inputs, rather than in convenient uncompressed forms, becomes more and more relevant. Suppose we are given a compression of size n of data that originally has size N, and we want to solve a problem with time complexity T(⋅). The naive strategy of "decompress-and-solve" gives time T(N), whereas "the gold standard" is time T(n): to analyze the compression as efficiently as if the original data was small.

We restrict our attention to data in the form of a string (text, files, genomes, etc.) and study the most ubiquitous tasks. While the challenge might seem to depend heavily on the specific compression scheme, most methods of practical relevance (the Lempel-Ziv family, dictionary methods, and others) can be unified under the elegant notion of Grammar Compressions. A vast literature, across many disciplines, has established this as an influential notion for algorithm design.

We introduce a framework for proving (conditional) lower bounds in this field, allowing us to assess whether decompress-and-solve can be improved, and by how much.

Our results include almost-tight lower bounds for problems such as Longest Common Subsequence, Pattern Matching with Wildcards, and Context-Free Grammar Parsing under popular complexity assumptions such as the Strong Exponential Time Hypothesis.

This talk is based on two papers (FOCS’17 and NeurIPS’20) coauthored with Arturs Backurs, Karl Bringmann, and Marvin Künnemann.
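
To make the grammar-compression setting above concrete, here is a minimal Python sketch, purely illustrative and not taken from the talk: a straight-line program with made-up rules, where the compressed size n is the total number of rule symbols and N is the length of the derived string.

    # Minimal sketch (not from the talk) of a grammar compression / straight-line
    # program: a context-free grammar with one rule per nonterminal that derives
    # exactly one string. The rules below are a made-up toy example.
    rules = {
        "S": ["A", "A", "B"],   # S -> A A B
        "A": ["B", "B", "a"],   # A -> B B a
        "B": ["a", "b"],        # B -> a b
    }

    def expand(symbol):
        """Recursively expand a symbol into the string it derives."""
        if symbol not in rules:                      # terminal character
            return symbol
        return "".join(expand(s) for s in rules[symbol])

    print(expand("S"))   # 8 rule symbols derive the length-12 string "ababaababaab"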

Locally-Decodable Data Compression

Abstract

Motivated by database applications, we study the following data structure problem: an encoder wishes to store a collection of jointly-distributed *correlated* files X := (X_1, X_2, ..., X_n) ~ μ (where H(X) << Σ_i H_μ(X_i)), using as little (expected) memory as possible, such that each individual file X_i can be recovered quickly with few memory accesses (IOs) on the RAM. This problem (LDDC) generalizes the classic Dictionary problem, which was shown to admit *succinct* data structures in the case of independent files (Dodis-Patrascu-Thorup, STOC'10). A natural question is which statistical properties of the distribution μ enable efficient LDDC schemes, with space s = O(H(X)) and fast decoding time t = polylog(n)?

As a natural starting point, we focus on compressing Markov Chains, in particular, storing random walks on (directed or undirected) graphs. We show that there is a data structure which stores a length-n random walk on any graph G with only +O(log n) extra bits beyond the information-theoretic minimum space, such that any vertex along the walk can be decoded in O(1) time on a word-RAM. For strongly-connected graphs, the extra space can be reduced to only +5 bits (!).

Based on joint work with Huacheng Yu and Emanuele Viola.
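
As a back-of-the-envelope illustration of the space benchmark in this abstract (not the talk's data structure), the Python sketch below compares the naive n·log2|V| bits for a length-n walk against an encoding that records one neighbor index per step; the graph and walk are made up for the example.

    import math

    # Toy baseline (not the data structure from the talk): encode a walk by its
    # start vertex plus, at each step, the index of the chosen neighbor. This uses
    # roughly log2|V| + sum_i log2(deg(v_i)) bits, essentially the information-
    # theoretic minimum for a uniform random walk with a uniform start, but
    # decoding the i-th vertex requires replaying the whole prefix, unlike the
    # O(1)-time local decoding described in the talk.
    graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}   # made-up graph
    walk = [0, 1, 3, 1, 2, 0]                              # a walk in `graph`

    naive_bits = len(walk) * math.ceil(math.log2(len(graph)))
    indexed_bits = math.log2(len(graph)) + sum(math.log2(len(graph[v])) for v in walk[:-1])
    print(f"naive: {naive_bits} bits, neighbor-index encoding: {indexed_bits:.1f} bits")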

  • 11:00-11:30: Moses Ganardi, Max Planck Institute for Software Systems

Balancing in Grammar-Based Compression

Abstract

In grammar-based compression a string is represented by a context-free grammar, also called a straight-line program (SLP), that generates only that string. The hierarchical structure of SLPs makes them amenable to algorithms that work directly on the compressed representation, without decompressing the string first. I will present a recent balancing result stating that one can transform an SLP of size g in linear time into an equivalent SLP of size O(g) and height O(log N), where N is the length of the represented string. The question of whether such a depth reduction is possible while keeping the size linear had been a long-standing open problem in the area of grammar-based compression. Its solution enables a wide range of algorithmic applications on SLPs, with either significantly simplified proofs or improved space/time bounds.

This is joint work with Artur Jeż and Markus Lohrey.
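
The following Python sketch illustrates why the height of an SLP matters algorithmically; it is not the balancing construction itself. It reuses a toy grammar like the one sketched earlier and extracts a single character of the derived string by walking one derivation path, so the work per access is proportional to the grammar's height.

    from functools import lru_cache

    # Toy random access to the string derived by an SLP: store the expansion
    # length of every symbol, then descend along a single derivation path. The
    # cost per access is proportional to the height of the grammar, which is why
    # an O(log N)-height guarantee is algorithmically useful.
    rules = {"S": ["A", "A", "B"], "A": ["B", "B", "a"], "B": ["a", "b"]}

    @lru_cache(maxsize=None)
    def length(sym):
        return 1 if sym not in rules else sum(length(s) for s in rules[sym])

    def access(sym, i):
        """Return the i-th character (0-based) of the expansion of `sym`."""
        while sym in rules:              # descend until a terminal is reached
            for child in rules[sym]:
                n = length(child)
                if i < n:                # the target position lies inside this child
                    sym = child
                    break
                i -= n
        return sym

    text = "ababaababaab"                # full expansion of S, for checking
    assert all(access("S", i) == text[i] for i in range(len(text)))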

The Ring: Worst-Case Optimal Graph Joins in Compressed Space

Abstract

A recent development in the efficient processing of database queries is that of worst-case optimal (wco) join algorithms, which process join queries in time proportional to the AGM bound: the maximum possible output size produced by a join query over a relational database of a certain size. Such algorithms can be strictly better than any traditional query plan using pairwise joins. Leapfrog-TrieJoin (LTJ) is a seminal join algorithm that can be wco if we index the attributes of the relations in every possible order. For the important case of labeled (or RDF) graphs, this means indexing all the 3! = 6 possible orders of the (subject,predicate,object) triples. This blowup in space prevents using LTJ in large graph databases.

In this talk I will show how regarding the triples as cyclic bidirectional strings and indexing them in a BWT-like form enables the wco LTJ algorithm within the space of the raw data, and even less. Our ring index, consisting of wavelet trees representing the BWT transformed set, gives the needed support for all the LTJ operations. I will show experimentally that the ring competes with state-of-the-art graph database managers while using a fraction of their space. I will also show that, on tables of d attributes, the ring requires indexing far less than d! orders, making wco algorithms realistic for much wider tables.
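
As a rough illustration of the LTJ primitive mentioned above, here is a small Python sketch of leapfrog-style intersection of sorted lists. It conveys only the flavor of the algorithm; the ring index described in the talk supplies equivalent operations over a BWT-based representation rather than plain sorted lists.

    from bisect import bisect_left

    # Simplified sketch of the core intersection step behind Leapfrog TrieJoin:
    # repeatedly jump every iterator forward to the largest key currently seen,
    # reporting a result whenever all iterators agree.
    def leapfrog_intersect(lists):
        lists = [sorted(l) for l in lists]
        pos = [0] * len(lists)
        out = []
        while all(p < len(l) for p, l in zip(pos, lists)):
            hi = max(l[p] for p, l in zip(pos, lists))
            for k, l in enumerate(lists):               # seek to first key >= hi
                pos[k] = bisect_left(l, hi, pos[k])
            if all(p < len(l) and l[p] == hi for p, l in zip(pos, lists)):
                out.append(hi)                          # all iterators agree on hi
                pos = [p + 1 for p in pos]
        return out

    print(leapfrog_intersect([[1, 3, 4, 7, 9], [3, 4, 8, 9], [0, 4, 9, 10]]))  # [4, 9]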

  • 12:00-12:30: Shay Golan, University of California, Berkeley

Locally Consistent Parsing for Text Indexing in Small Space

Abstract

The Locally Consistent Parsing technique was introduced by Sahinalp and Vishkin [STOC'94] as a way of partitioning a given text into blocks. The partitioning has the property that if a string appears as a substring of the text multiple times, then all these occurrences are composed of exactly the same blocks, except possibly for small margins. In this talk we will describe how the Locally Consistent Parsing technique can be used to solve two closely related text indexing problems in sub-linear working space.

The first problem is Sparse Suffix Tree (SST) construction, where a text S is given in read-only memory, along with a set of suffixes B, and the goal is to construct the compressed trie of all these suffixes ordered lexicographically, using only O(|B|) words of space. The second problem is the Longest Common Extension (LCE) problem, where again a text S of length n is given in read-only memory, together with a trade-off parameter 1 ≤ τ ≤ n, and the goal is to construct a data structure that uses O(n/τ) words of space and can compute, for any pair of suffixes, the length of their longest common prefix.

We show how to use ideas based on the Locally Consistent Parsing technique in some non-trivial ways in order to improve the known results for the above problems under the space constraints. We introduce the first almost-linear, O(n polylog n) time, deterministic constructions for both problems, whereas all previous algorithms take at least Ω(min{n|B|, n^2/|B|}) time. We also introduce the first linear-time Las Vegas algorithms for both problems, achieving O(n) construction time with high probability. This improves upon the result of Gawrychowski and Kociumaka [GK17], who obtained O(n) time for a Monte Carlo algorithm and O(n √(log |B|)) time with high probability for a Las Vegas algorithm.

The talk is based on a SODA'20 paper with Or Birenzwige and Ely Porat.
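
For context on the baselines mentioned in the abstract, here is a deliberately naive Python sketch, not the paper's algorithm: it sorts a sparse set B of suffix positions by direct character comparison and computes neighboring LCP values by scanning, which can cost on the order of n|B| character comparisons and makes no attempt at the stated space bounds.

    # Naive baseline: sort the suffixes in B of a text S by direct comparison and
    # record longest-common-prefix values between lexicographic neighbors. Each
    # comparison or LCE scan may touch Theta(n) characters, roughly the behavior
    # that the constructions in the talk improve upon.
    def lce(S, i, j):
        """Length of the longest common prefix of S[i:] and S[j:] by scanning."""
        k = 0
        while i + k < len(S) and j + k < len(S) and S[i + k] == S[j + k]:
            k += 1
        return k

    def sparse_suffix_array(S, B):
        order = sorted(B, key=lambda i: S[i:])              # naive comparison sort
        lcp = [lce(S, a, b) for a, b in zip(order, order[1:])]
        return order, lcp

    S = "abracadabra"
    print(sparse_suffix_array(S, [0, 3, 5, 7]))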

12:30pm - 1:30pm, Break: Informal Discussion

1:30pm - 4:00pm, Bioinformatics

Genome Assembly with Variable Order de Bruijn Graphs

Abstract

The nodes of a de Bruijn graph (DBG) of order k correspond to the set of k-mers occurring in a set of reads, and an edge is added between two nodes if the corresponding k-mers overlap by k-1 characters. When using a DBG for genome assembly, the choice of k is a delicate issue: if k is too small, the DBG is tangled, making graph traversal ambiguous, whereas choosing k too large makes the DBG disconnected, resulting in more and shorter contigs. The variable order de Bruijn graph (voDBG) has been proposed as a way to avoid fixing a single value of k. A voDBG represents DBGs of all orders in a single data structure and (conceptually) adds edges between the DBGs of different orders to allow increasing and decreasing the order. Whereas unitigs are well defined for a fixed-order DBG, no properly defined notion of contig or unitig exists for voDBGs.

Here we give the first rigorous definition of contigs for voDBGs. We show that voDBG nodes whose frequency in the input read set lies in an interval [l,h], for some h and l > h/2, represent an unambiguous set of linear sequences, which we call the set of (l,h)-tigs. By establishing connections between the voDBG and the suffix trie of the input reads, we give an efficient algorithm for enumerating (l,h)-tigs in a voDBG using compressed suffix trees. Our experiments on real and simulated HiFi data show that a prototype implementation of our approach achieves contiguity and accuracy better than or comparable to other DBG-based assemblers.

This is joint work with Diego Díaz-Domínguez, Taku Onodera, and Simon J. Puglisi.
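
To unpack the fixed-order definition at the start of the abstract, here is a tiny Python sketch of a de Bruijn graph built from reads. It handles a single fixed k only, with edges witnessed by consecutive k-mers in the reads; the variable-order structure and (l,h)-tigs from the talk are not shown.

    from collections import defaultdict

    # Fixed-order de Bruijn graph from reads: nodes are the k-mers occurring in
    # the reads, and an edge u -> v is added when two consecutive k-mers in a
    # read overlap by k-1 characters.
    def de_bruijn_graph(reads, k):
        edges = defaultdict(set)
        for r in reads:
            for i in range(len(r) - k):
                u, v = r[i:i + k], r[i + 1:i + k + 1]   # consecutive k-mers
                edges[u].add(v)
        return edges

    reads = ["ACGTACG", "GTACGTT"]
    for u, vs in sorted(de_bruijn_graph(reads, 3).items()):
        print(u, "->", ", ".join(sorted(vs)))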

Minimizer Space de Bruijn Graphs for Pangenomics

Abstract

DNA sequencing continues to progress toward longer and more accurate reads. Yet primary analyses, such as genome assembly and pangenome graph construction, remain computationally challenging. We recently introduced the concept of minimizer-space sequencing analysis, expanding the alphabet of DNA sequences to atomic tokens made of fixed-length words (minimizers). This leads to orders-of-magnitude improvements in speed and memory usage for human genome and metagenome assembly, and enables, for the first time, a representation of a pangenome made of 661,405 bacterial genomes. In the context of the Compression + Computation workshop, we will discuss in particular how minimizers are used in our work to achieve a controlled yet revertible loss of information, with versatile applications within bioinformatics and beyond.
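
For readers unfamiliar with minimizers, the Python sketch below shows the standard (w, k) window-minimizer scheme that minimizer-space analysis builds on; it is a simplification and not the talk's pipeline. The sequence, k, and w are made up, and lexicographic order stands in for the random hash functions real tools typically use.

    # Standard (w, k) window minimizers: in every window of w consecutive k-mers,
    # keep the smallest k-mer (here, smallest lexicographically). Consecutive
    # windows often share their minimizer; we keep a single copy, so the output
    # is a shorter token sequence over a smaller alphabet.
    def minimizers(seq, k, w):
        chosen = []
        for start in range(len(seq) - k - w + 2):
            window = [seq[i:i + k] for i in range(start, start + w)]
            m = min(window)
            if not chosen or m != chosen[-1]:   # drop consecutive repeats (a simplification)
                chosen.append(m)
        return chosen

    print(minimizers("ACGTACGTGGTACG", k=3, w=4))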

Judging a k-mer by The Company it Keeps: Compressing the de Bruijn Graph and Signals That Live On It

Abstract

The de Bruijn graph (dBG) and its variants, such as the colored dBG, the compacted dBG, and the reference dBG, have become indispensable in modern computational genomics. The applications of these structures span the gamut from genome assembly, to pangenome analysis, to the indexing of large corpora of sequencing data. Yet, as the size of the data upon which we wish to build and process such dBG variants grows large, naive representations of the graph and the data with which it is annotated cease to be feasible.

I will describe several related but distinct approaches to compressing the de Bruijn graph, and signals living on the graph. These approaches make use of variations on the idea that information about the value associated with a node (vertex) of the dBG can be gleaned from the node's local neighborhood. This concept can be extended to the position associated with a k-mer, its set of colors (i.e., the references or samples in which it appears), and even to the manner in which unitigs (non-branching paths) of the de Bruijn graph tile reference genomes. Crucially, the approaches to "referentially" compressing information that I will discuss admit efficient computation, as signals can be locally reconstructed, often with constant-bounded work.
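
The following Python sketch is only a cartoon of the referential idea, not any of the schemes described in the talk: along a non-branching path, the reference position of each successive k-mer is usually its predecessor's position plus one, so it suffices to store the first position plus the rare exceptions, and any position can be reconstructed locally.

    # Toy referential compression of per-k-mer positions along a unitig: store the
    # first position, the count, and only the positions that break the "+1" rule.
    def compress_positions(positions):
        exceptions = {i: p for i, p in enumerate(positions)
                      if i > 0 and p != positions[i - 1] + 1}
        return positions[0], len(positions), exceptions

    def decompress_positions(first, n, exceptions):
        out = [first]
        for i in range(1, n):
            out.append(exceptions.get(i, out[-1] + 1))   # default: predecessor + 1
        return out

    positions = [100, 101, 102, 103, 500, 501, 502, 110, 111]
    packed = compress_positions(positions)
    assert decompress_positions(*packed) == positions
    print(packed)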

Connecting Genome Graphs with String Compression

Abstract

The size of a genome graph (the space required to store the nodes, their labels, and edges) affects the efficiency of operations performed on it. The size of the graph also affects the size of the graph index that is used to speed up sequence-to-graph alignment. This raises the need to construct space-efficient genome graphs. We point out similarities between the string encoding approaches of genome graphs and the external pointer macro (EPM) compression model. Supported by these similarities, we present a pair of linear-time algorithms that transform between genome graphs and EPM-compressed forms. We show that the algorithms result in an upper bound on the size of the genome graph constructed based on an optimal EPM compression. In addition to the transformation, we show that equivalent choices made by EPM compression algorithms may result in different sizes of genome graphs. As a proof of concept, we introduce RLZ-Graph, a genome graph constructed based on the relative Lempel-Ziv EPM compression algorithm. These results are a first step toward adapting general-purpose compression algorithms for challenges to which genome graphs are currently applied.

This is joint work with Yutong Qiu and appeared in ISMB 2021.
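
To illustrate the compression side of this pairing, here is a minimal Python sketch of relative Lempel-Ziv (RLZ) factorization, the EPM-style compressor that RLZ-Graph builds on; the greedy quadratic-time parse below is for clarity only (practical implementations use suffix structures over the reference), and the sequences are made up.

    # Greedy relative Lempel-Ziv parse: the target is split into phrases, each the
    # longest prefix of the remaining suffix that occurs in the reference, falling
    # back to a single literal character when no match exists.
    def rlz_factorize(target, reference):
        phrases, i = [], 0
        while i < len(target):
            length = 0
            while length < len(target) - i and target[i:i + length + 1] in reference:
                length += 1
            if length == 0:
                phrases.append(("literal", target[i]))
                i += 1
            else:
                phrases.append((reference.index(target[i:i + length]), length))
                i += length
        return phrases

    reference = "ACGTACGTGA"
    target = "ACGTGATTACGT"
    print(rlz_factorize(target, reference))   # e.g. [(4, 6), (3, 1), (3, 5)]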

Scalable Bait Design for DNA Enrichment

Abstract

Bait-enriched sequencing is a relatively new sequencing protocol that is becoming increasingly ubiquitous, as it has been shown to successfully amplify regions of interest in metagenomic samples. In this method, a set of synthetic probes ("baits") is designed, manufactured, and applied to fragmented metagenomic DNA. The probes bind to the fragmented DNA and any unbound DNA is rinsed away, leaving the bound fragments to be amplified for sequencing. This effectively enriches the DNA for which the probes were designed. Most recently, Metsky et al. (Nature Biotech 2019) demonstrated that bait-enrichment is capable of detecting a large number of human viral pathogens within metagenomic samples.

In this work, we formalize the problem of designing baits by defining the Minimum Bait Cover problem, which aims to find the smallest possible set of bait sequences that cover every position of a set of reference sequences under an approximate matching model. We show that the problem is NP-hard, and that it remains NP-hard under very restrictive assumptions. This indicates that, unless P = NP, no polynomial-time exact algorithm exists for the problem, and that the problem is intractable even for small and deceptively simple inputs. In light of this, we design an efficient heuristic that takes advantage of succinct data structures. We refer to our method as Syotti.

The running time of our method scales linearly in practice, running at least an order of magnitude faster than state-of-the-art methods, including the recent method of Metsky et al. At the same time, our method produces bait sets that are smaller than those produced by the competing methods, while also leaving fewer positions uncovered. Lastly, we show that our method requires only 25 minutes to design baits for a dataset comprising 3 billion nucleotides from 1000 related bacterial substrains, whereas the method of Metsky et al. shows clearly super-linear running time and fails to process even 8% of the data in 24 hours.
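
The Python sketch below conveys the set-cover flavor of the bait design problem under exact matching with no mismatches; it is a simplification and not Syotti, which handles approximate matching and relies on succinct data structures. The candidate baits, bait length L, and reference sequences are made up for the example.

    # Greedy sketch of bait cover under exact matching: candidate baits are the
    # length-L substrings of the references; repeatedly pick the candidate that
    # covers the most still-uncovered positions. Ties are broken arbitrarily.
    def greedy_baits(references, L):
        uncovered = {(r, i) for r, seq in enumerate(references) for i in range(len(seq))}
        candidates = {seq[i:i + L] for seq in references for i in range(len(seq) - L + 1)}

        def covered_by(bait):
            cov = set()
            for r, seq in enumerate(references):
                start = seq.find(bait)
                while start != -1:
                    cov.update((r, i) for i in range(start, start + L))
                    start = seq.find(bait, start + 1)
            return cov

        baits = []
        while uncovered:
            best = max(candidates, key=lambda b: len(covered_by(b) & uncovered))
            gained = covered_by(best) & uncovered
            if not gained:            # remaining positions cannot be covered
                break
            baits.append(best)
            uncovered -= gained
        return baits

    refs = ["ACGTACGTTT", "TTTACGTACG"]
    print(greedy_baits(refs, L=4))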

4:00pm - 5:00pm, Break: Informal Discussion

5:00pm - 6:00pm, Discussion panel: Gaps and Bridges (Theory to Practice)