Research

Broad Interests

- Scalable Algorithms and Data Structures for Compressing, Organizing, Indexing, and Searching High-throughput Genomic Data

Fast, Parallel, and Cache‐friendly Suffix Array Construction [WABI 2023]
Jamshed Khan, Tobias Rubel, Erin Molloy, Laxman Dhulipala, and Rob Patro

We present CAPS‐SA, a simple and scalable parallel algorithm for constructing the suffix array and the LCP array. It has excellent memory‐locality and achieves strong performance on modern multicore systems. We show that CAPS‐SA outperforms existing state‐of‐the‐art parallel suffix array and LCP‐array construction algorithms. [AMB]

CapS-SA is available under an open-source license in GitHub.

Fulgor: A fast and compact k‐mer index for large‐scale matching and color queries [WABI 2023]
Jason Fan, Jamshed Khan, Noor Pratap Singh, Giulio Eramnno Pibiri, and Rob Patro

We present Fulgor, a fast and compact colored de Bruijn graph data structure—built with a dictionary and a compressed inverted index—to determine the subset of references from a given collection that are likely to contain a query nucleotide sequence. We demonstrate that compared to a state‐of‐the‐art index, Fulgor indexes take 2–3.8x less space, is at least 2x fast for queries, and is 2–6x faster to construct. [AMB]

Fulgor is available under an open-source license in GitHub.

Spectrum Preserving Tilings Enable Sparse and Modular Reference Indexing [RECOMB 2023]
Jason Fan, Jamshed Khan, Giulio Ermanno Pibiri, and Rob Patro

We introduce the spectrum preserving tiling (SPT), a general representation of genomic sequence collections that specifies how a set of tiles repeatedly occur to spell out the constituent sequences. We introduce a class of sampling schemes for SPTs that trade off speed to reduce the size of the tile-to-reference mapping, and implement a practical index with these sampling schemes in the tool pufferfish2. [LNCS]

Pufferfish2 is available under an open-source license in GitHub.

Scalable, Ultra-fast, and Low-memory Construction of Compacted de Bruijn Graphs with Cuttlefish 2 [Genome Biology, 2022]
Jamshed Khan, Marek Kokot, Sebastian Deorowicz, and Rob Patro

We present a fast and memory-frugal algorithm for constructing compacted de Bruijn graphs, Cuttlefish 2, applicable both on raw sequencing short-reads and assembled references, that can scale to very large datasets. It builds upon the novel idea of modeling de Bruijn graph vertices as DFA from Cuttlefish—modifying the model and generalizing the algorithm, so as to facilitate all applicable forms of input.

Cuttlefish 2 is available under an open-source license in GitHub, and also through conda.

An Incrementally Updatable and Scalable System for Large‐Scale Sequence Search using the Bentley‐Saxe Transformation [Oxford Bioinformatics, 2022]
Fatemeh Almodaresi, Jamshed Khan, Sergey Madaminov, Michael Ferdman, Rob Johnson, Prashant Pandey, and Rob Patro

We show how to build a scalable and updatable exact RNA sequence‐search index. Specifically, we extend the Mantis index using the Bentley‐Saxe transformation to support efficient updates. We demonstrate Mantis’ scalability by constructing an index of ≈40K samples from the SRA (Sequence Read Archive) by adding samples one at a time to an initial index of 10K samples.

Dynamic Mantis is available under an open-source license in GitHub.

Cuttlefish: Fast, Parallel, and Low‐memory Compaction of de Bruijn Graphs from Large‐scale Genome Collections [ISMB/ECCB 2021]
Jamshed Khan and Rob Patro

We introduce a new algorithm, Cuttlefish, to construct the (colored) compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel approach of modeling de Bruijn graph vertices as finite‐state automata, and constrains these automata’s state‐space to enable tracking their transitioning states with very low memory usage. It is also fast and highly parallelizable. [Oxford Bioinformatics]

Cuttlefish is available under an open-source license in GitHub, and also through conda.