Dominik Kempa: Homepage

Suffix/LCP Array Construction

pSAscan

A suffix array construction algorithm that successfully incorporates parallelism into external memory computation, resulting in a record-breaking performance. Using the algorithm we computed the suffix array of a 1TiB file on a modest machine in a little over one week. The largest processed instance previously reported in the literature was 80GiB.

Code

Published in:

Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi: Parallel External Memory Suffix Sorting, CPM 2015 DOI | Slides

fSAIS

External memory suffix array construction based on the inducing principle. fSAIS is currently the fastest and most space-efficient general suffix array construction running in O(sort(n)) I/O complexity. The key improvements, compared to previous implementations are using our own implementation of radix-heap as the main priority queue, and our own disk-block management, which drastically reduces disk space usage.

Code

Published in:

Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi, Bella Zhukova: Engineering External Memory Induced Suffix Sorting, ALENEX 2017 DOI

LCPscan

The first external memory algorithm for constructing the LCP array given the text and its suffix array as input. The only previous way to compute the LCP array for data that is bigger than the RAM is to use a suffix array construction algorithm with complex modifications to produce the LCP array as a by-product. Compared to the best prior method, our algorithm needs much less disk space (by more than a factor of three) and is significantly faster.

Code

Published in:

Juha Kärkkäinen, Dominik Kempa: LCP Array Construction in External Memory, ACM J. Exp. Algor. 2016 DOI
Juha Kärkkäinen, Dominik Kempa: LCP Array Construction in External Memory, SEA 2014 DOI

EM-SPhi, EM-SI

Two external memory algorithms, constructing the LCP array given the text and its suffix array. The new algorithms are around two times faster than LCPscan. In addition, both algorithms are parallelized (the parallel versions improve the speed up to another factor of two), both are in-place (i.e., do not need any extra disk space, except to accommodate the output), and both work for inputs over large alphabets.

Code

Published in:

Juha Kärkkäinen, Dominik Kempa: Better External Memory LCP Array Construction, ACM J. Exp. Algor. 2019 DOI
Juha Kärkkäinen, Dominik Kempa: Engineering External Memory LCP Array Construction: Parallel, In-Place and Large Alphabet, SEA 2017 DOI
Juha Kärkkäinen, Dominik Kempa: Faster External Memory LCP Array Construction, ESA 2016 DOI | Slides

Algorithms for Lempel-Ziv Parsing

LZ77 parsing

A collection of our fast algorithms computing the LZ77 parsing in internal memory. The presented algorithms include the currently fastest (KKP3, KKP2) as well as some of the most space-efficient algorithms (ISA9, ISA6s).

Code

Published in:

Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi: Lazy Lempel-Ziv Factorization Algorithms, ACM J. Exp. Algor. 2016 DOI
Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi: Linear Time Lempel-Ziv Factorization: Simple, Fast, Small, CPM 2013 DOI | Arxiv | Slides
Dominik Kempa, Simon J. Puglisi: Lempel-Ziv Factorization: Simple, Fast, Practical, ALENEX 2013 DOI
Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi: Lightweight Lempel-Ziv Parsing, SEA 2013 DOI | Arxiv

EM LZ77 parsing

Three external memory algorithms computing the LZ77 factorization, varying in speed and space consumption, including the currently most scalable, fully-external memory implementation of LZ77 parsing in external memory (EM-LPF).

Code

Published in:

Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi: Lempel-Ziv Parsing in External Memory, DCC 2014 DOI | Arxiv | Poster | Slides

LZ-End toolkit

An algorithm that constructs the LZ-End parsing (a variation of LZ77) of a given string of length n in O(n log l) time and only O(z + l) space (z = number of phrases, l = length of the longest phrase). The algorithm runs in the streaming fashion, using only a single (with high probability) pass over the input.

Code

Published in:

Dominik Kempa, Dmitry Kosolobov: LZ-End Parsing in Compressed Space, DCC 2017 DOI | Arxiv

EM LZ77 decoding

External memory decoder of LZ77 parsing. Our implementation is the first external memory algorithms for LZ77 decoding. It has an optimal I/O complexity, and it is very fast in practice, only about three times slower than in-memory decoding (when reading input and writing output is included in the time).

Code

Published in:

Djamal Belazzougui, Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi: Lempel-Ziv Decoding in External Memory, SEA 2016 DOI | Arxiv

Converting LZ77 to grammar

An implementation of Rytter's algorithm converting the input LZ77 parsing into (an AVL) grammar. In addition to providing the first highly optimized implementation of Rytter's algorithm, we propose a series of improvements that significantly reduce the peak RAM usage of the algorithm and the size of the final grammar.

Code

Published in:

Dominik Kempa, Ben Langmead: Fast and Space-Efficient Construction of AVL Grammars from the LZ77 Parsing, ESA 2021 DOI | Arxiv | Slides | Video

Compressed Indexing

Faster-Minuter index

The implementation of the FM-index based on the fixed-block boosting technique. The new index is consistently fast and small relative to the state-of-the-art, and thus makes a good "off-the-shelf" choice for most applications.

Code

Published in:

Simon Gog, Juha Kärkkäinen, Dominik Kempa, Matthias Petri, Simon J. Puglisi: Fixed Block Compression Boosting in FM-Indexes: Theory and Practice, Algorithmica 2019 DOI
Simon Gog, Juha Kärkkäinen, Dominik Kempa, Matthias Petri, Simon J. Puglisi: Faster, Minuter, DCC 2016 DOI

Hybrid bitvector

An implementation of a hybrid encoding of compressed bitvectors, which divides the bitvector into blocks and then chooses the encoding of each block separately from a number of different encoding methods. The resulting combination achieves all-around good performance in terms of time and space.

Code

Published in:

Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi: Hybrid Compression of Bitvectors for the FM-Index, DCC 2014 DOI | Slides