Suffix/LCP Array Construction
pSAscan
A suffix array construction algorithm that successfully incorporates parallelism into external memory computation, resulting in a record-breaking performance. Using the algorithm we computed the suffix array of a 1TiB file on a modest machine in a little over one week. The largest processed instance previously reported in the literature was 80GiB.
Published in:
Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi: Parallel External Memory Suffix Sorting, CPM 2015 DOI | Slides
fSAIS
External memory suffix array construction based on the inducing principle. fSAIS is currently the fastest and most space-efficient general suffix array construction running in O(sort(n)) I/O complexity. The key improvements, compared to previous implementations are using our own implementation of radix-heap as the main priority queue, and our own disk-block management, which drastically reduces disk space usage.
Published in:
Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi, Bella Zhukova: Engineering External Memory Induced Suffix Sorting, ALENEX 2017 DOI
LCPscan
The first external memory algorithm for constructing the LCP array given the text and its suffix array as input. The only previous way to compute the LCP array for data that is bigger than the RAM is to use a suffix array construction algorithm with complex modifications to produce the LCP array as a by-product. Compared to the best prior method, our algorithm needs much less disk space (by more than a factor of three) and is significantly faster.
Published in:
Juha Kärkkäinen, Dominik Kempa: LCP Array Construction in External Memory, ACM J. Exp. Algor. 2016 DOI
Juha Kärkkäinen, Dominik Kempa: LCP Array Construction in External Memory, SEA 2014 DOI
EM-SPhi, EM-SI
Two external memory algorithms, constructing the LCP array given the text and its suffix array. The new algorithms are around two times faster than LCPscan. In addition, both algorithms are parallelized (the parallel versions improve the speed up to another factor of two), both are in-place (i.e., do not need any extra disk space, except to accommodate the output), and both work for inputs over large alphabets.
Published in:
Juha Kärkkäinen, Dominik Kempa: Better External Memory LCP Array Construction, ACM J. Exp. Algor. 2019 DOI
Juha Kärkkäinen, Dominik Kempa: Engineering External Memory LCP Array Construction: Parallel, In-Place and Large Alphabet, SEA 2017 DOI
Juha Kärkkäinen, Dominik Kempa: Faster External Memory LCP Array Construction, ESA 2016 DOI | Slides
Algorithms for Lempel-Ziv Parsing
LZ77 parsing
A collection of our fast algorithms computing the LZ77 parsing in internal memory. The presented algorithms include the currently fastest (KKP3, KKP2) as well as some of the most space-efficient algorithms (ISA9, ISA6s).
Published in:
Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi: Lazy Lempel-Ziv Factorization Algorithms, ACM J. Exp. Algor. 2016 DOI
Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi: Linear Time Lempel-Ziv Factorization: Simple, Fast, Small, CPM 2013 DOI | Arxiv | Slides
Dominik Kempa, Simon J. Puglisi: Lempel-Ziv Factorization: Simple, Fast, Practical, ALENEX 2013 DOI
Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi: Lightweight Lempel-Ziv Parsing, SEA 2013 DOI | Arxiv
EM LZ77 parsing
Three external memory algorithms computing the LZ77 factorization, varying in speed and space consumption, including the currently most scalable, fully-external memory implementation of LZ77 parsing in external memory (EM-LPF).
Published in:
Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi: Lempel-Ziv Parsing in External Memory, DCC 2014 DOI | Arxiv | Poster | Slides
LZ-End toolkit
An algorithm that constructs the LZ-End parsing (a variation of LZ77) of a given string of length n in O(n log l) time and only O(z + l) space (z = number of phrases, l = length of the longest phrase). The algorithm runs in the streaming fashion, using only a single (with high probability) pass over the input.
Published in:
EM LZ77 decoding
External memory decoder of LZ77 parsing. Our implementation is the first external memory algorithms for LZ77 decoding. It has an optimal I/O complexity, and it is very fast in practice, only about three times slower than in-memory decoding (when reading input and writing output is included in the time).
Published in:
Djamal Belazzougui, Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi: Lempel-Ziv Decoding in External Memory, SEA 2016 DOI | Arxiv
Converting LZ77 to grammar
An implementation of Rytter's algorithm converting the input LZ77 parsing into (an AVL) grammar. In addition to providing the first highly optimized implementation of Rytter's algorithm, we propose a series of improvements that significantly reduce the peak RAM usage of the algorithm and the size of the final grammar.
Published in:
Compressed Indexing
Faster-Minuter index
The implementation of the FM-index based on the fixed-block boosting technique. The new index is consistently fast and small relative to the state-of-the-art, and thus makes a good "off-the-shelf" choice for most applications.
Published in:
Simon Gog, Juha Kärkkäinen, Dominik Kempa, Matthias Petri, Simon J. Puglisi: Fixed Block Compression Boosting in FM-Indexes: Theory and Practice, Algorithmica 2019 DOI
Simon Gog, Juha Kärkkäinen, Dominik Kempa, Matthias Petri, Simon J. Puglisi: Faster, Minuter, DCC 2016 DOI
Hybrid bitvector
An implementation of a hybrid encoding of compressed bitvectors, which divides the bitvector into blocks and then chooses the encoding of each block separately from a number of different encoding methods. The resulting combination achieves all-around good performance in terms of time and space.
Published in: