Suffix/LCP Array Construction


pSAscan

A suffix array construction algorithm that successfully incorporates parallelism into external memory computation, resulting in a record-breaking performance. Using the algorithm we computed the suffix array of a 1TiB file on a modest machine in a little over one week. The largest processed instance previously reported in the literature was 80GiB.

Code

Published in:


fSAIS

External memory suffix array construction based on the inducing principle. fSAIS is currently the fastest and most space-efficient general suffix array construction running in O(sort(n)) I/O complexity. The key improvements, compared to previous implementations are using our own implementation of radix-heap as the main priority queue, and our own disk-block management, which drastically reduces disk space usage.

Code

Published in:


LCPscan

The first external memory algorithm for constructing the LCP array given the text and its suffix array as input. The only previous way to compute the LCP array for data that is bigger than the RAM is to use a suffix array construction algorithm with complex modifications to produce the LCP array as a by-product. Compared to the best prior method, our algorithm needs much less disk space (by more than a factor of three) and is significantly faster.

Code

Published in:


EM-SPhi, EM-SI

Two external memory algorithms, constructing the LCP array given the text and its suffix array. The new algorithms are around two times faster than LCPscan. In addition, both algorithms are parallelized (the parallel versions improve the speed up to another factor of two), both are in-place (i.e., do not need any extra disk space, except to accommodate the output), and both work for inputs over large alphabets.

Code

Published in:

Algorithms for Lempel-Ziv Parsing


LZ77 parsing

A collection of our fast algorithms computing the LZ77 parsing in internal memory. The presented algorithms include the currently fastest (KKP3, KKP2) as well as some of the most space-efficient algorithms (ISA9, ISA6s).

Code

Published in:


EM LZ77 parsing

Three external memory algorithms computing the LZ77 factorization, varying in speed and space consumption, including the currently most scalable, fully-external memory implementation of LZ77 parsing in external memory (EM-LPF).

Code

Published in:


LZ-End toolkit

An algorithm that constructs the LZ-End parsing (a variation of LZ77) of a given string of length n in O(n log l) time and only O(z + l) space (z = number of phrases, l = length of the longest phrase). The algorithm runs in the streaming fashion, using only a single (with high probability) pass over the input.

Code

Published in:


EM LZ77 decoding

External memory decoder of LZ77 parsing. Our implementation is the first external memory algorithms for LZ77 decoding. It has an optimal I/O complexity, and it is very fast in practice, only about three times slower than in-memory decoding (when reading input and writing output is included in the time).

Code

Published in:



Converting LZ77 to grammar

An implementation of Rytter's algorithm converting the input LZ77 parsing into (an AVL) grammar. In addition to providing the first highly optimized implementation of Rytter's algorithm, we propose a series of improvements that significantly reduce the peak RAM usage of the algorithm and the size of the final grammar.

Code

Published in:

Compressed Indexing


Faster-Minuter index

The implementation of the FM-index based on the fixed-block boosting technique.  The new index is consistently fast and small relative to the state-of-the-art, and thus makes a good "off-the-shelf" choice for most applications.

Code

Published in:


Hybrid bitvector

An implementation of a hybrid encoding of compressed bitvectors, which divides the bitvector into blocks and then chooses the encoding of each block separately from a number of different encoding methods. The resulting combination achieves all-around good performance in terms of time and space.

Code

Published in: