32nd International Symposium on String Processing and Information Retrieval
September 08 -11, London, UK
September 08 -11, London, UK
Scaling genomic reuse: hypothesis and algorithms for k-mer collections
The rapid growth of genomic sequencing has created an urgent demand for methods that not only store massive amounts of data efficiently but also support its effective reuse across analyses. At the heart of many such methods lies the concept of k-mer sets, compact representations of sequence content that enable fast querying, indexing, and comparison. This talk will explore recent advances in data structures and algorithms designed to manage large collections of k-mer sets, and will discuss scalability, dynamic updates, and query efficiency. We will highlight innovations that bridge theory and practice, from algorithmic breakthroughs to real-world applications in areas such as RNA-seq analysis in clinical research.
New perspectives on the Burrows–Wheeler Transform
The Burrows–Wheeler Transform (BWT) is a mapping from a given string to the string obtained by concatenating the last symbols of all rotations of the string, in the lexicographic order of the rotations. BWT is well known for its countless applications in data compression and text indexing. The “magic” of BWT stems from two important properties: (1) the reversibility of the transform (albeit only up to rotations of the string), and (2) the so-called “clustering effect” in which the transformed string is somehow “easier to compress”.
It is important to note, however, that the BWT is not a bijection: Reversing a BWT image requires additional information, typically in the form of a special end-of-string symbol. Furthermore, there exist strings that are not BWT images in the first place. The bijective BWT (BBWT) addresses this issue, and can be described as the extended BWT (eBWT) applied to the Lyndon factors of the Lyndon factorization of the input string. BBWT can be considered as a generalization of BWT as they are equivalent for strings that are lexicographically smallest rotations.
In this talk, we discuss recent results on the BWT from the perspectives of bijectivity (via BBWT) and compressiveness of the transformed string, showing that BBWT retains many properties of the BWT while intro- ducing additional structure that enables novel compression schemes as well as new analyses on quantifying the clustering effect of the transform in terms of repetitiveness measures based on dictionary compression.
Succinct Dynamic Data Structures (25 years on)
Succinct data structures (SDS) store data in a compact way, often using information-theoretically optimal space (to within lower-order terms). In addition, they support queries on the data, often in time comparable to their classical (non-succinct) counterparts. This theoretical performance is often reflected in practice; indeed, due to their much smaller memory footprint, SDS can often considerably outperform classical data structures in terms of speed. This win-win situation has led to numerous real-world applications of SDS.
Unfortunately, the previous characterization applies only to static SDS. In other words, the entire data is given in advance, is pre-processed, and is then queried — if the data changes, the pre-processing must be performed anew. While this is acceptable for many applications, it is not the most typical use case, where updates to data are interspersed among the queries. To deal with the more general case, many authors have published papers on Dynamic SDS (DSDS). Unfortunately, DSDS run into time lower bounds that make both query and update operations slower than their classical counterparts, usually by a multiplicative logarithmic factor. This logarithmic factor is not only theoretically unpleasant, it is also a major factor in ensuring that the speed of DSDS in practice is not competitive with their classical counterparts (typically, DSDS are unable to fully utilize data locality). In this talk, I will survey the 25-year his- tory of DSDS, and discuss approaches to ensuring that DSDS can indeed close the speed gap with their classical counterparts